From The Indexing Pipeline

The IDF Artifact

The IDF computation produces a thing, and that thing is a file: a table mapping each term to its weight, written by the offline batch job and read by the serving layer at query time. I had thought of IDF as a formula, and the formula is the easy part. The artifact is where the formula becomes a contract between two pieces of software that have to agree, and the agreement is what actually breaks or holds. This leaf is about the design of that artifact: what goes in it, what the consumer reads, and the few decisions that look trivial and are not.

The minimal contract

At its smallest the artifact is two columns: term, and weight. The serving layer’s query path looks up each query term in the table, gets back its IDF weight, and feeds that weight into the BM25 scorer. That is the entire functional contract. A table from string to number.

The thing that makes even this minimal version a contract rather than a convenience is the agreement requirement that runs through the whole lexical side. The weight in the table is computed by one IDF formula, and the consumer that reads it has to expect a weight computed by that formula. If the builder writes Lucene-style non-negative IDF and the consumer assumes the textbook form, the lookups return numbers the scorer interprets wrongly, and as the variants leaf measured, a mismatch on the wrong pair of formulas can reverse the weighting on common terms with no error anywhere. The artifact is the seam where that agreement is either kept or silently violated, so even the two-column version carries an implicit third fact: which formula produced these numbers.
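A minimal sketch of how two formulas can disagree on the same counts. It assumes that "the textbook form" means the Robertson/Sparck Jones probabilistic IDF and that the Lucene-style form is the BM25 variant with a +1 inside the logarithm; the function names and the corpus size are hypothetical, chosen only for illustration.

```python
import math

def idf_textbook(df: int, n_docs: int) -> float:
    """Assumed textbook form: goes negative once df exceeds half the corpus."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def idf_lucene(df: int, n_docs: int) -> float:
    """Assumed Lucene-style form: the +1 inside the log keeps it non-negative."""
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

N_DOCS = 1000  # hypothetical corpus size

# Same counts, two formulas. The rare end roughly agrees; the common end does not.
for df in (1, 100, 500, 900):
    print(df, round(idf_textbook(df, N_DOCS), 3), round(idf_lucene(df, N_DOCS), 3))
```

For a term in 900 of 1000 documents the first function returns a negative weight and the second a small positive one, so a consumer expecting one and handed the other would score common terms with the wrong sign or magnitude, and nothing in the lookup itself would fail.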

The richer, traceable version

That implicit fact is the argument for a richer schema. Instead of just term and weight, the artifact can carry the underlying counts and provenance:

```txt
minimal:     term, weight
traceable:   term, weight, document_frequency, document_count, analyzer_name
```

Each extra column earns its place by making a failure debuggable. Storing the raw document_frequency and the document_count lets a consumer or an auditor recompute the weight and confirm the formula, rather than trusting an opaque number. Storing the analyzer_name records which tokenizer produced these terms, which matters because the terms in the table are only valid against documents and queries tokenized the same way, and a mismatch there is the other silent-corruption path. The traceable schema turns “the scores look subtly wrong” from an unfalsifiable suspicion into something you can check column by column.
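The column-by-column check could look something like the sketch below. It assumes rows are available as dictionaries keyed by the traceable column names, and that the agreed formula is the Lucene-style one; `audit_rows`, the tolerance, the analyzer name, and the sample rows are all hypothetical.

```python
import math

def lucene_idf(df: int, n_docs: int) -> float:
    """Assumed agreed formula (Lucene-style, non-negative)."""
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

def audit_rows(rows, formula, tol=1e-6):
    """Return the terms whose stored weight differs from the recomputed weight."""
    mismatched = []
    for row in rows:
        expected = formula(row["document_frequency"], row["document_count"])
        if abs(expected - row["weight"]) > tol:
            mismatched.append(row["term"])
    return mismatched

# Hypothetical rows: the first is consistent, the second carries a weight
# that the agreed formula could not have produced from its own counts.
rows = [
    {"term": "refund", "weight": lucene_idf(40, 5000),
     "document_frequency": 40, "document_count": 5000,
     "analyzer_name": "standard_v1"},
    {"term": "the", "weight": -2.0,
     "document_frequency": 4900, "document_count": 5000,
     "analyzer_name": "standard_v1"},
]

print(audit_rows(rows, lucene_idf))
```

With only the minimal schema this check is impossible: there is nothing to recompute the weight from.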

The fork is real: the minimal schema is smaller and simpler, the traceable schema is larger and self-documenting. Which to ship depends partly on a question that is easy to get backwards, which is what the consumer actually reads. Producing columns the consumer ignores is harmless waste; failing to produce a column the consumer needs is a broken contract. So the schema is not “everything that might be useful,” it is “exactly what the consumer ingests, plus what a human needs to debug it,” and confirming which columns the consumer reads is a real task rather than an assumption.

The unseen term

The decision that is invisible until a query hits it is what happens for a term the table does not contain. A user can always type a word that never appeared in the corpus, so its document frequency is zero, and it has no row in the table. The lookup misses. The serving layer cannot return nothing, so it needs a fallback weight, and choosing that fallback is a small design question with a real consequence.

There are two clean options: a hardcoded default weight applied at lookup time when a term is absent, or a sentinel row in the artifact carrying a chosen weight, often the maximum IDF in the table, on the reasoning that an unseen term is maximally rare and so maximally discriminating. The two are not equivalent. A hardcoded consumer-side default keeps the artifact pure but puts the policy in the serving code; a sentinel row puts the policy in the artifact, where it travels with the data and can be set by the build job. Whether a given collection even needs a sentinel is itself a question, because some consumers handle the miss themselves and some expect the artifact to. The unseen-term behavior is part of the contract, and like the schema it is something to confirm rather than guess, because a wrong fallback silently mis-weights every out-of-vocabulary query term.
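The two fallback placements can be sketched side by side. This assumes the table is loaded as a plain dictionary from term to weight; the reserved key `__UNSEEN__`, the function names, and the sample weights are hypothetical, not a convention of any particular consumer.

```python
SENTINEL = "__UNSEEN__"  # hypothetical reserved key for the sentinel row

def lookup_with_default(table: dict, term: str, default_weight: float) -> float:
    """Consumer-side policy: the serving code decides the fallback weight."""
    return table.get(term, default_weight)

def add_max_idf_sentinel(table: dict) -> dict:
    """Build-side policy: the artifact carries its own fallback (max IDF here)."""
    out = dict(table)
    out[SENTINEL] = max(table.values())
    return out

def lookup_with_sentinel(table: dict, term: str) -> float:
    """Artifact-side policy: a miss resolves to the sentinel row."""
    if term in table:
        return table[term]
    # A KeyError here means the artifact shipped without the agreed sentinel.
    return table[SENTINEL]

table = {"the": 0.02, "refund": 4.8}          # hypothetical weights
shipped = add_max_idf_sentinel(table)

print(lookup_with_default(table, "zzyzx", 1.0))  # policy lives in serving code
print(lookup_with_sentinel(shipped, "zzyzx"))    # policy lives in the artifact
```

The same unseen term gets a different weight under each policy, which is why the choice has to be agreed rather than left to whichever side happens to implement it.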

Capping the table at scale

On a small corpus the table holds every term and that is fine. At very large corpus sizes the vocabulary itself gets large enough that the artifact’s size becomes a concern, and the standard response is to cap the table to the N most common terms, dropping the long tail of rare terms whose rows would dominate the file. This sounds backwards, because rare terms are the high-IDF, high-signal ones, until you remember the unseen-term fallback: a rare term dropped from the table simply falls through to the fallback weight, which for a maximum-IDF sentinel is close to what its real weight would have been anyway. So capping keeps the common terms, whose weights genuinely differ from the fallback and matter most for getting the common-word contributions right, and lets the rare tail be approximated by the fallback. The cut is usually made by approximate quantiles over the document-frequency distribution rather than an exact sort, because at that scale an exact sort is itself expensive.
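A small sketch of the capping argument. Two things here are assumptions rather than the pipeline's method: the cut uses an exact `heapq.nlargest` instead of approximate quantiles, which is only reasonable at toy scale, and the fallback is taken as the maximum IDF over the full vocabulary before the cut. The formula, names, and counts are hypothetical.

```python
import heapq
import math

def lucene_idf(df: int, n_docs: int) -> float:
    """Assumed Lucene-style IDF."""
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

def build_capped_table(doc_freqs: dict, n_docs: int, max_terms: int):
    """Keep the max_terms most common terms; return (table, fallback_weight)."""
    # Fallback computed before the cut, so it reflects the rarest term overall.
    fallback = max(lucene_idf(df, n_docs) for df in doc_freqs.values())
    # Exact top-N by document frequency (a stand-in for approximate quantiles).
    kept = heapq.nlargest(max_terms, doc_freqs.items(), key=lambda kv: kv[1])
    table = {term: lucene_idf(df, n_docs) for term, df in kept}
    return table, fallback

N_DOCS = 10_000  # hypothetical corpus size
doc_freqs = {"the": 9500, "and": 9000, "refund": 300,
             "invoice": 120, "chargeback": 3, "escheatment": 1}

table, fallback = build_capped_table(doc_freqs, N_DOCS, max_terms=4)

# Compare what a dropped rare term gets with what it would have had.
true_weight = lucene_idf(doc_freqs["chargeback"], N_DOCS)
print(sorted(table), round(fallback, 3), round(true_weight, 3))
```

The dropped term lands within about one unit of its true weight, while a common term like "the" sits several units away from the fallback, which is the asymmetry that makes dropping the rare tail the cheaper error.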

At smaller corpus sizes, like a help-center-scale collection of a few thousand documents, this capping is unnecessary: the whole table fits comfortably. But knowing it is a deliberate omission rather than an oversight matters, because it is the kind of thing that is invisible when the corpus is small and becomes load-bearing when it grows, and a system that never capped will hit the size wall without warning.

One artifact or many

The last decision is granularity. Is there one IDF artifact for the entire collection, or one per dataset or per logical sub-collection? It depends on whether the sub-collections have genuinely different term statistics. If a single corpus is homogeneous, one artifact is simpler and the statistics are pooled over more documents, which makes them more stable. If the system serves distinct collections with different vocabularies, where a term that is rare in one is common in another, a shared artifact averages away the distinction and weights terms wrongly for both, and a per-collection artifact keeps each collection’s rarity signal honest at the cost of more artifacts to build, promote, and keep consistent. This is the same “what counts as the corpus” question that the chunking discussion raised one level down, now asked across collections instead of within a document: the unit over which you pool statistics determines what the statistics mean.
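The averaging effect is easy to see with one term and two collections. This is an illustrative sketch with made-up counts and collection names, again assuming a Lucene-style formula; pooling is modeled as simply summing document counts and document frequencies.

```python
import math

def lucene_idf(df: int, n_docs: int) -> float:
    """Assumed Lucene-style IDF."""
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

# Hypothetical collections: "kernel" is common in one and rare in the other.
collections = {
    "os_docs":      {"n_docs": 1000, "df": {"kernel": 800}},
    "cooking_docs": {"n_docs": 1000, "df": {"kernel": 5}},
}

def per_collection_idf(term: str) -> dict:
    """One artifact per collection: each keeps its own rarity signal."""
    return {name: lucene_idf(c["df"].get(term, 0), c["n_docs"])
            for name, c in collections.items()}

def pooled_idf(term: str) -> float:
    """One shared artifact: statistics pooled over all collections."""
    total_docs = sum(c["n_docs"] for c in collections.values())
    total_df = sum(c["df"].get(term, 0) for c in collections.values())
    return lucene_idf(total_df, total_docs)

separate = per_collection_idf("kernel")
shared = pooled_idf("kernel")
print({k: round(v, 3) for k, v in separate.items()}, round(shared, 3))
```

The pooled weight falls between the two per-collection weights, so it overstates the term's rarity in the collection where it is common and understates it in the collection where it is rare.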

The thread through all of these is that the artifact is the meeting point of the build side and the serve side, and every one of its design choices is really a clause in the contract between them. The formula has to match, the schema has to match what the consumer reads, the unseen-term fallback has to be agreed, the cap has to be understood as deliberate, and the granularity has to match what counts as a corpus. Get the arithmetic right and any of these wrong, and the table is full of correct numbers that the consumer uses incorrectly.

Up to the indexing pipeline. The formula whose agreement this artifact carries is in IDF variants; the weights it stores feed BM25 and the WAND bounds. How often it rebuilds is batch versus incremental; how it reaches the live system is staging to serving.