
Batch versus Incremental

The index and statistics are built once and then read for as long as they stay current, but real corpora change: documents are added, edited, removed. So there is a cadence question underneath the whole pipeline, and it has two parts that are easy to run together and worth separating. How often do you rebuild, and do you rebuild from scratch or patch the existing index in place? This leaf is about that fork, and the thing that makes it less obvious than it looks is that corpus-wide statistics are not local: one document changing can, in principle, nudge a number that every query reads.

The simple default: full immutable rebuild

The simplest correct thing is to rebuild everything from scratch on a schedule. Run the batch job over the entire current corpus, produce a fresh IDF artifact and a fresh index, and replace the old ones wholesale. Daily, hourly, whatever the freshness need and the job’s runtime allow.

The appeal is that it is obviously correct. Every statistic is computed over a consistent snapshot of the corpus, so there is no drift, no partial state, no question of whether an old number is stale. There are two costs. It is expensive, because you reprocess the whole corpus every time even if only one document changed. And it is stale between runs: a document added just after a rebuild is invisible until the next one. For a corpus that changes slowly relative to how often you can afford to rebuild, full rebuild is simply the right answer, and the simplicity is worth a great deal because the failure modes of the alternative are subtle.

The “immutable” part matters as much as the “full” part. Replacing the whole artifact rather than editing it in place means the old version stays intact and consistent until the new one is fully ready, which is exactly what makes the staging-to-serving handoff safe. An immutable rebuild is a clean swap of one consistent state for another, never a live edit of a state queries are reading.
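A minimal sketch of the clean swap, under stated assumptions: the artifact is a single JSON file, tokenization is a whitespace split, and IDF is the unsmoothed log(N / df). The names build_artifact and publish are hypothetical, not part of any pipeline described here. The point it illustrates is that the new artifact is written in full beside the old one and only then renamed over it, so a reader sees either the old state or the new one.

```python
import json
import math
import os
import tempfile
from collections import defaultdict


def build_artifact(corpus):
    """Compute postings and IDF over one consistent snapshot of the corpus."""
    postings = defaultdict(list)
    for doc_id in sorted(corpus):
        for term in sorted(set(corpus[doc_id].lower().split())):
            postings[term].append(doc_id)
    n = len(corpus)
    # Assumed IDF form: unsmoothed log(N / df).
    idf = {term: math.log(n / len(ids)) for term, ids in postings.items()}
    return {"doc_count": n, "postings": dict(postings), "idf": idf}


def publish(artifact, serving_path):
    """Write the new artifact beside the old one, then swap it in."""
    directory = os.path.dirname(os.path.abspath(serving_path))
    # The staging file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".staging")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(artifact, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites the destination; on POSIX the rename is atomic.
        os.replace(tmp_path, serving_path)
    except BaseException:
        # A failed build leaves the serving artifact untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling publish(build_artifact(corpus), path) on a schedule is the whole update story in this model: nothing ever edits the file that queries are reading.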

What incremental buys, and what it costs

Incremental update is the optimization: when documents change, update only the affected parts of the index rather than rebuilding everything. Add a document and you add its terms to the relevant postings lists; remove one and you remove them. The payoff is real: lower cost per change and fresher results, because the new document is searchable in near real time instead of at the next rebuild.

The cost is consistency, and it has a specific shape that took me a moment to see. The postings are local: adding a document touches only the lists for its terms. But the statistics are global. Document frequency is a count over the whole corpus, and the document count N is the corpus size. Adding or removing a single document shifts the document frequency of every term that document contained, and it also changes N, which under the usual definitions enters the IDF of every term in the vocabulary, so every weight a query reads moves at least slightly. A strictly correct incremental update would have to ripple a single document’s change out into the weights of all terms, not only its own. In practice systems do not recompute IDF on every document change, because the shift from one document in a large corpus is tiny. They let the statistics drift slightly between periodic recomputations and accept that the IDF table is approximate in exchange for not rebuilding it constantly. That approximation is usually fine, but it is an explicit choice to be slightly wrong for a while, and it is the kind of choice that needs to be made on purpose rather than discovered later.
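To put a rough size on that drift, here is a small worked example. It assumes the unsmoothed form IDF = log(N / df), which may not be the variant a given pipeline uses, and the corpus numbers are made up for illustration. It computes how far one term’s IDF moves when a single document is added, both for a term the new document contains and for one it does not.

```python
import math


def idf(doc_count, doc_freq):
    """Unsmoothed IDF, log(N / df). An assumed form; smoothed variants differ."""
    return math.log(doc_count / doc_freq)


def idf_shift_on_add(doc_count, doc_freq, doc_contains_term):
    """Change in one term's IDF when a single document is added to the corpus."""
    new_df = doc_freq + 1 if doc_contains_term else doc_freq
    return idf(doc_count + 1, new_df) - idf(doc_count, doc_freq)


# Hypothetical corpus: 1,000,000 documents, and a term found in 1,000 of them.
contained = idf_shift_on_add(1_000_000, 1_000, doc_contains_term=True)
absent = idf_shift_on_add(1_000_000, 1_000, doc_contains_term=False)

print(f"term in the new document:     {contained:+.2e}")
print(f"term not in the new document: {absent:+.2e}")
```

Under this formula a term the document contains moves by roughly 1/df and every other term by roughly 1/N, so both shifts are nonzero and both are small. The drift only becomes material once many changes have accumulated, which is what the periodic recompute bounds.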

So the honest framing of incremental is this: the index incrementalizes cleanly because postings are local, and the statistics incrementalize awkwardly because they are global. Most systems split the difference: incremental for the index so new documents are findable fast, periodic full recompute for the statistics so the weights do not drift forever. The freshness you actually get is the freshness of the index; the statistics lag.
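The split can be sketched as a toy in-memory index. Everything here is an assumption made for illustration: the class name SplitCadenceIndex, whitespace tokenization, the log(N / df) form of IDF, and the choice to give a term missing from the snapshot a weight of 0.0. Removal and edits are left out to keep the two cadences visible.

```python
import math
from collections import defaultdict


class SplitCadenceIndex:
    """Postings update per document; IDF is a snapshot refreshed on demand."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_ids = set()
        self.idf = {}                     # frozen snapshot, may lag the postings

    def add_document(self, doc_id, text):
        # Local update: touches only the lists for this document's terms.
        self.doc_ids.add(doc_id)
        for term in set(text.lower().split()):
            self.postings[term].add(doc_id)

    def recompute_statistics(self):
        # Global update: one pass over every term, meant for a slower cadence.
        n = len(self.doc_ids)
        self.idf = {t: math.log(n / len(ids)) for t, ids in self.postings.items()}

    def search(self, term):
        # Matching reads live postings; weighting reads the lagging snapshot.
        # A term newer than the snapshot gets 0.0 here, an arbitrary placeholder.
        weight = self.idf.get(term, 0.0)
        return sorted((doc_id, weight) for doc_id in self.postings.get(term, ()))
```

A document added after the last recompute_statistics call shows up in search results immediately, but it is weighted with IDF values computed before it existed. That gap is the lag described above, made concrete.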

Immutable segments, the middle path

There is a structural pattern that gets most of incremental’s freshness while keeping immutability’s safety, and it is worth naming because it is how the major lexical engines actually work. Rather than editing one big index in place or rebuilding it whole, the index is kept as a set of immutable segments. New and changed documents go into a new small segment, which is built and added without touching the existing ones, so writes are cheap and never corrupt what queries are reading. A query searches all the segments and merges the results. Periodically, in the background, small segments are merged into larger ones to keep the segment count manageable, and deletions are handled by marking documents as removed and dropping them at merge time rather than editing a segment.

This is the pattern Lucene uses, and it is a genuine middle path: each segment is immutable, so the safety of the immutable-rebuild model holds at the segment level, but you add a small segment for new content instead of rebuilding everything, so you get incremental freshness. It is also where the inverted index’s update story connects to the pipeline: an index is not edited, it is grown by adding immutable segments and reshaped by merging them, which is incremental-on-the-outside and immutable-on-the-inside.
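The shape of the pattern fits in a few lines. This is a toy illustration, not Lucene’s API or on-disk format: Segment, build_segment, and SegmentedIndex are hypothetical names, document ids are assumed unique (re-adding a deleted id is not handled), and merge collapses all segments into one for brevity rather than merging a few small ones at a time.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Segment:
    """An immutable mini-index: a read-only map of term -> tuple of doc ids."""
    postings: MappingProxyType
    doc_ids: frozenset


def build_segment(docs):
    postings = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            postings.setdefault(term, []).append(doc_id)
    frozen = {term: tuple(ids) for term, ids in postings.items()}
    return Segment(MappingProxyType(frozen), frozenset(docs))


class SegmentedIndex:
    def __init__(self):
        self.segments = []    # immutable segments, oldest first
        self.deleted = set()  # tombstones: ids marked removed, not yet dropped

    def add_documents(self, docs):
        # New content becomes a new small segment; existing ones are untouched.
        self.segments = self.segments + [build_segment(docs)]

    def delete(self, doc_id):
        # Mark only; no segment is edited.
        self.deleted.add(doc_id)

    def search(self, term):
        # Query every segment, filter tombstones, merge the results.
        hits = []
        for segment in self.segments:
            hits.extend(d for d in segment.postings.get(term, ())
                        if d not in self.deleted)
        return sorted(hits)

    def merge(self):
        # Build one new segment from the live documents, then swap it in.
        merged, live = {}, set()
        for segment in self.segments:
            live |= segment.doc_ids - self.deleted
            for term, ids in segment.postings.items():
                kept = [d for d in ids if d not in self.deleted]
                if kept:
                    merged.setdefault(term, []).extend(kept)
        frozen = {term: tuple(sorted(ids)) for term, ids in merged.items()}
        self.segments = [Segment(MappingProxyType(frozen), frozenset(live))]
        self.deleted = set()
```

Writes only ever append a segment or a tombstone, and merge replaces the segment list with a freshly built one, so no existing segment is mutated at any point.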

Which to pick

The fork resolves to a few honest considerations. If the corpus changes slowly and the rebuild is cheap enough to run as often as freshness requires, full immutable rebuild is the right default, because it is correct by construction and has no drift to reason about. If freshness requirements outpace what full rebuild can deliver, segment-based incremental indexing buys near-real-time findability while keeping immutability’s safety, at the cost of segment-merge machinery. And whatever the index does, the global statistics almost always recompute on a slower full cadence, because they are not local and incrementalizing them exactly is more trouble than the tiny per-document drift is worth.

The principle I keep returning to is that “how fresh” and “how correct” trade against “how expensive,” and the trade is different for the local parts of the index than for the global statistics. Treating them as one decision is how you end up either rebuilding everything too often or trusting drifted statistics too long. Treating them separately, fast incremental index plus periodic statistics recompute, is the configuration most production systems converge on.

Up to the indexing pipeline. The structure that incrementalizes via immutable segments is the inverted index; the global statistics that resist incrementalizing are IDF, stored in the artifact. The safe swap of one build for the next is staging to serving.