The Indexing Pipeline: Index
From Search and Retrieval

| Page metadata | |
|---|---|
| First created | Jun 7, 2026 |
| Last edited | Jun 7, 2026 |
Everything in the retrieval territories quietly assumes the index already exists: the postings lists, the IDF table, the document vectors. This territory is about where those come from, and the shift in mindset that took me the longest to internalize is that building them is not a search problem at all. It is a data-engineering problem, a batch job over the whole corpus with the same concerns any large offline job has: memory footprint, runtime, the schema of what it produces, how often it reruns, and how it hands its output to the live system without breaking it. The math the retrieval side cares about is settled. The pipeline is where the actual difficulty of running this in production lives, and it is the part most write-ups skip entirely.
Why the statistics are offline, not at query time
The first thing to be clear about is the split between what happens once, ahead of time, and what happens on every query. Computing document frequencies and IDF means counting, across the entire corpus, how many documents contain each term. That is a pass over every document, and it cannot happen while a user waits. So it is an offline batch job that runs on a schedule, produces a table, and the serving path only ever looks up the precomputed table. The same holds for the inverted index and the embeddings: built offline, read at query time. The query path is fast precisely because the expensive corpus-wide work already happened.
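The split can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function names, the toy corpus, and the smoothed formula `log(1 + N/df)` are all assumptions chosen for the example.

```python
import math
from collections import Counter

def build_idf_table(docs):
    """Offline: one pass over the whole corpus, counting documents per term."""
    df = Counter()
    num_docs = 0
    for tokens in docs:
        num_docs += 1
        df.update(set(tokens))  # a document counts once per term it contains
    # Assumed IDF variant for illustration: log(1 + N/df).
    return {term: math.log(1 + num_docs / count) for term, count in df.items()}

def query_weights(idf_table, query_tokens):
    """Online: dictionary lookups only, no pass over the corpus."""
    return {t: idf_table[t] for t in query_tokens if t in idf_table}

# Toy corpus (hypothetical data).
corpus = [["red", "fox"], ["red", "dog"], ["blue", "dog"], ["red", "cat"]]
idf = build_idf_table(corpus)                            # runs on a schedule
weights = query_weights(idf, ["red", "cat", "unicorn"])  # runs per query
```

The expensive loop lives entirely in `build_idf_table`; the query path never touches `corpus`. What to do with a term like `unicorn` that the table has never seen is a contract question, not a lookup detail.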
This is why the pipeline is a separate discipline from retrieval. Retrieval is latency-bound and runs per query; the pipeline is throughput-bound and runs per rebuild. They have different failure modes, different performance concerns, and often different languages and runtimes, and the contract between them is an artifact on disk that one writes and the other reads.
The memory swing nobody warns you about
The decision that surprised me most, because it is invisible until it hurts, is that the peak memory of a corpus-statistics job depends enormously on how you process the data, not just on how much data there is. The same arithmetic, computed two ways, can differ by an order of magnitude in memory, which is the difference between a job that fits and a job that dies.
I measured it to be sure I was not exaggerating. I computed document frequencies over a synthetic corpus two ways: one that materializes the whole corpus and all the per-document structures in memory before aggregating, and one that streams a single document at a time, updates the running counter, and discards the document. Same documents, identical DF result:
| strategy | peak memory | DF result |
|---|---|---|
| materialize all in RAM, then aggregate | 201.80 MB | identical |
| stream one document at a time | 0.19 MB | identical |
| ratio (materialize / stream) | 1036x | |
A thousand-fold difference for the same answer. The materialize approach holds the entire corpus plus a token set per document all at once; the streaming approach holds one document plus the counter. My materialize version is deliberately maximal, so the real-world gap between a careless and a careful implementation is usually smaller, but the class of the problem is exactly this: corpus-stats jobs are sensitive to processing strategy, and the wrong strategy turns a tractable job into one that exhausts memory on a large columnar file. The language and runtime choice moves the same dial, because an eager-by-default, row-materializing runtime and a lazy, columnar-streaming one have very different peak footprints over the same Parquet file. “Will this job fit in memory” is a real, upfront design question, not an afterthought, and it is one of the genuine difficulties the day job hands you that no IR textbook mentions.
The whole measurement is two versions of the same DF loop, with tracemalloc reading off the peak:

```python
import tracemalloc
from collections import Counter

# A. materialize: hold the whole corpus AND all per-doc token sets at once
tracemalloc.start()
all_docs = [gen_doc() for _ in range(20_000)]  # entire corpus in RAM
all_sets = [set(d) for d in all_docs]          # every per-doc token set in RAM
df = Counter()
for s in all_sets:
    for t in s:
        df[t] += 1
peak_materialize = tracemalloc.get_traced_memory()[1]  # ~201.80 MB
tracemalloc.stop()

# B. stream: one doc in scope at a time, discarded each iteration
tracemalloc.start()
df = Counter()
for _ in range(20_000):
    doc = gen_doc()  # one doc
    for t in set(doc):
        df[t] += 1
    # doc and its set go out of scope -> never accumulated
peak_stream = tracemalloc.get_traced_memory()[1]  # ~0.19 MB
tracemalloc.stop()
# same df both ways; peak_materialize / peak_stream ~ 1036x
```

The arithmetic is identical; only what is held in memory at once differs. The full script with the synthetic Zipf-ish corpus generator is in the experiment workbench beside this page.
The three real decisions
Past the offline-versus-online framing and the memory question, this territory has three concrete decisions, and each is a leaf.
The first is the contract: what exactly the batch job produces and what the serving layer expects to read. The minimum is a table from term to weight, but a real artifact can carry more for traceability. Which columns the consumer actually reads, what happens for a term the table has never seen, and whether one artifact covers the whole collection or there is one per dataset are all real questions. That is the IDF artifact, and it is where this pipeline meets the IDF-variant agreement contract from the lexical side: the builder and the consumer have to agree on the formula and the schema.
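One way to picture that contract is a small sketch in which the artifact carries its own formula name and fallback weight, and the consumer refuses anything it does not recognize. The JSON layout, the field names, the `FORMULA` string, and the "treat unseen terms as df = 1" fallback are all hypothetical choices for illustration, not the schema this pipeline uses.

```python
import json
import math
import os
import tempfile

FORMULA = "log(1 + N/df)"  # hypothetical identifier both sides agree on

def write_idf_artifact(path, df, num_docs):
    """Builder side: term -> weight plus the metadata the consumer checks."""
    payload = {
        "schema_version": 1,
        "formula": FORMULA,
        "num_docs": num_docs,
        # Assumed fallback policy: score an unseen term as if df == 1.
        "default_weight": math.log(1 + num_docs),
        "weights": {t: math.log(1 + num_docs / c) for t, c in df.items()},
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)

def load_idf_artifact(path, expected_formula=FORMULA):
    """Consumer side: refuse an artifact built with a different formula."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    if payload["formula"] != expected_formula:
        raise ValueError(f"IDF formula mismatch: {payload['formula']!r}")
    return payload

def idf_weight(payload, term):
    """Lookup with the artifact's own fallback for terms it has never seen."""
    return payload["weights"].get(term, payload["default_weight"])

path = os.path.join(tempfile.mkdtemp(), "idf.json")
write_idf_artifact(path, {"red": 3, "cat": 1}, num_docs=4)
artifact = load_idf_artifact(path)
```

Putting the fallback inside the artifact means the builder, which knows the corpus size, decides what an unseen term is worth, and the consumer never has to guess.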
The second is cadence: rebuild the whole thing from scratch on a schedule, or update it incrementally as documents change. Full rebuild is simple and correct but expensive and stale between runs; incremental is cheaper and fresher but harder to keep consistent. That is batch versus incremental, and it connects to how the inverted index handles updates at all.
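A small sketch shows why the incremental path is harder to keep consistent. The class name, the per-document term map, and the `log(1 + N/df)` formula are assumptions for the example; the point is only that adding one document changes N, which moves the IDF of terms that document does not even contain.

```python
import math
from collections import Counter

class IncrementalStats:
    """Running document frequencies that can be updated per document."""

    def __init__(self):
        self.df = Counter()
        self.num_docs = 0
        self.doc_terms = {}  # doc_id -> terms, needed to undo a doc later

    def add(self, doc_id, tokens):
        if doc_id in self.doc_terms:  # an update is a remove plus an add
            self.remove(doc_id)
        terms = frozenset(tokens)
        self.doc_terms[doc_id] = terms
        self.df.update(terms)
        self.num_docs += 1

    def remove(self, doc_id):
        terms = self.doc_terms.pop(doc_id)
        self.df.subtract(terms)
        for t in terms:
            if self.df[t] == 0:
                del self.df[t]  # keep the table free of dead terms
        self.num_docs -= 1

    def idf(self, term):
        if term not in self.df:
            raise KeyError(term)
        return math.log(1 + self.num_docs / self.df[term])

stats = IncrementalStats()
stats.add("d1", ["red", "fox"])
stats.add("d2", ["red", "dog"])
before = stats.idf("fox")
stats.add("d3", ["blue", "cat"])  # does not mention "fox" at all
after = stats.idf("fox")          # still shifts, because N changed
```

Any weights already baked into a served artifact are stale after that third `add`, which is part of why a full immutable rebuild is the simpler thing to reason about.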
The third is the handoff: how the freshly built artifact gets into the location the live serving layer reads, without the serving layer ever seeing a half-written index. The answer is to build into a staging location and promote atomically, and the ordering and safety of that promotion is its own problem. That is staging to serving, and it leads directly into the operational safety of serving and migration.
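On a local POSIX filesystem, the stage-then-promote idea can be sketched with timestamp-named build directories and a symlink swap. The directory layout, the `_SUCCESS` marker, and the function names are hypothetical, and object stores or other storage targets need a different mechanism; this only illustrates the shape of the handoff.

```python
import os
import tempfile
from pathlib import Path

def build_into_staging(staging_root, build_id, weights):
    """Write one build into its own directory; the marker goes last."""
    build_dir = staging_root / build_id
    build_dir.mkdir(parents=True)
    lines = [f"{term}\t{w}\n" for term, w in sorted(weights.items())]
    (build_dir / "idf.tsv").write_text("".join(lines), encoding="utf-8")
    (build_dir / "_SUCCESS").touch()  # hypothetical completion marker
    return build_dir

def latest_complete_build(staging_root):
    """Discover the newest finished build by its timestamp-style name."""
    done = [d for d in staging_root.iterdir() if (d / "_SUCCESS").exists()]
    return max(done, key=lambda d: d.name)

def promote(build_dir, serving_link):
    """Repoint the serving symlink with one atomic rename (POSIX)."""
    tmp_link = serving_link.with_name(serving_link.name + ".tmp")
    if tmp_link.is_symlink():
        tmp_link.unlink()
    os.symlink(build_dir, tmp_link, target_is_directory=True)
    os.replace(tmp_link, serving_link)  # readers see old or new, never half

root = Path(tempfile.mkdtemp())
staging, serving = root / "staging", root / "serving"
build_into_staging(staging, "20260607T010000Z", {"red": 0.8})
build_into_staging(staging, "20260607T020000Z", {"red": 0.9})
(staging / "20260607T030000Z").mkdir()  # half-written build: no marker yet
promote(latest_complete_build(staging), serving)
promoted = (serving / "idf.tsv").read_text(encoding="utf-8")
```

The half-written third build is newer by name but is skipped because its marker is missing, and the serving path only ever changes through the single rename in `promote`.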
The build, release, refresh shape
One framing that ties the leaves together, drawn from how production data jobs actually look: a corpus pipeline tends to expose a small set of commands rather than one monolithic run. A build step computes the artifact into staging. A release step promotes it into the serving location. A refresh step tells the serving layer to pick up the new artifact. The same job runs on a schedule and on demand, and it is configured per environment, so the development, staging, and production runs write to different storage targets and can be exercised independently. The three leaves below are the hard parts inside that shape: what the build produces, how often it runs, and how the release and refresh happen safely.
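As a rough sketch of that shape, the three commands can hang off one entry point with the environment selecting the storage targets. The program name, the environment names, the paths, and the command bodies (which only return a description here) are placeholders, not a real tool's interface.

```python
import argparse

# Hypothetical per-environment storage targets.
ENVIRONMENTS = {
    "dev": {"staging": "/data/dev/staging", "serving": "/data/dev/serving"},
    "staging": {"staging": "/data/stg/staging", "serving": "/data/stg/serving"},
    "prod": {"staging": "/data/prod/staging", "serving": "/data/prod/serving"},
}

def cmd_build(cfg):
    """Compute the artifact into the staging location."""
    return f"build -> {cfg['staging']}"

def cmd_release(cfg):
    """Promote the latest staged build into the serving location."""
    return f"release {cfg['staging']} -> {cfg['serving']}"

def cmd_refresh(cfg):
    """Tell the serving layer to pick up the new artifact."""
    return f"refresh {cfg['serving']}"

COMMANDS = {"build": cmd_build, "release": cmd_release, "refresh": cmd_refresh}

def make_parser():
    parser = argparse.ArgumentParser(prog="corpus-pipeline")
    parser.add_argument("--env", choices=sorted(ENVIRONMENTS), default="dev")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, fn in COMMANDS.items():
        sub.add_parser(name, help=fn.__doc__).set_defaults(func=fn)
    return parser

def main(argv):
    args = make_parser().parse_args(argv)
    return args.func(ENVIRONMENTS[args.env])

print(main(["--env", "dev", "build"]))
```

Keeping the commands separate means a scheduler can run `build` and `release` on a cadence while an operator reruns only `refresh` on demand, each against its own environment.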
The leaves
The IDF artifact
The contract between the batch job and the serving layer. The minimal term-to-weight schema versus a richer traceable one, the fallback for unseen terms, term capping at scale, and the granularity question of one artifact or one per dataset.
Batch versus incremental
Full recompute on a schedule versus event-driven incremental update. Why immutable-rebuild is the simple default, what incremental buys and costs, and how it relates to immutable index segments.
Staging to serving
The atomic handoff. Building into a staging location, discovering the latest staged build, promoting it into the serving location so the live system never reads a partial index, and reusing existing assembly and promotion logic rather than inventing it.
Up to the whole map. The statistics it computes are consumed by lexical retrieval and, for vectors, semantic retrieval. The live system it feeds, and the migration safety around the handoff, is serving at scale.
Index
- The IDF Artifact. The contract between the batch job that computes corpus statistics and the serving layer that consumes them. The minimal term-to-weight schema versus a richer traceable one, the fallback weight for terms the table has never seen, capping the table at scale, and whether one artifact covers the whole collection or one per dataset.
- Batch versus Incremental. How often to rebuild the index and statistics, and from scratch or in place. Why full immutable rebuild is the simple correct default, what incremental update buys in freshness and costs in consistency, the trap that corpus-wide statistics shift under any single change, and how this connects to immutable index segments.
- Staging to Serving. The atomic handoff. How a freshly built index reaches the location the live serving layer reads without the serving layer ever seeing a half-written index: build into staging, discover the latest staged build, and promote it atomically. The ordering question, why discovery is by timestamp not hardcoded path, and reusing existing promotion logic.