From Search and Retrieval

The Whole Pipeline: From Query to Ranked Results

The previous pages built the ingredients one at a time. This page steps back to the system those ingredients live in. It follows one of them, the IDF weight, all the way from “what is this number” to “how do you actually compute it over a real collection and hand it to a live search engine.” Along the way it names the design choice at each stage, because a retrieval pipeline is mostly a sequence of decisions about what to compute slowly and ahead of time versus what to compute fast and live.

IDF: weighting a word by how rare it is

The lexical side of the pipeline rests on inverse document frequency. The job of an IDF score is to rank how rare a word is across the corpus: rare words get high scores, words that turn up in many documents get low scores. A word that appears in almost every document, like “the”, or “policy” in a corpus of policies, carries almost no signal about which document you want, so it should barely move the ranking. A rare, specific word does almost all the work of finding the right document, so it should dominate. (The formula itself was covered on the previous page and isn’t relitigated here.)

The subtle design choice is what you count. IDF is computed from document frequency, the number of documents that contain the term, not from collection frequency, the total number of times the term appears across the whole corpus. These come apart exactly when a word is repeated heavily inside a single document:

A word that appears 500 times, but all 500 inside one document, is still only useful for finding that one document. Across the corpus it is rare, and IDF should treat it as rare.
A word spread thinly across 500 different documents is genuinely common. It can’t tell those 500 apart, and IDF should treat it as common.

Document frequency captures the thing we actually care about: how good is this word at narrowing the field of documents? Collection frequency would let a single verbose document inflate a word’s apparent commonness, punishing a term that is actually a strong discriminator. The repetition of a word within a document is a separate signal, that’s term frequency, handled elsewhere in the score. IDF’s only question is across-document spread.

Before any of the design choices make sense, it helps to see where this piece lives in a working system, because it is easy to picture “search” as one program and miss that it is really two.

Imagine a website with a search box. A user types something, and a fraction of a second later a ranked list of pages comes back. That live, fast part is the online system, and it has a hard budget: it has to answer in milliseconds, so it cannot afford to read the entire collection of documents and do heavy math on every keystroke. To rank results it leans on numbers that were worked out ahead of time. One of those numbers is the IDF weight for each term, the rarity score from the earlier pages.

Working out those IDF weights is a separate program that runs on its own schedule, not while anyone is waiting. That is the offline system. It reads the whole collection of documents in bulk, counts how many documents each term appears in, turns those counts into IDF scores, and writes the result to a file. Later, the online search system loads that file and just looks numbers up in it. Nothing in this page is the search engine itself; it is the bulk job that prepares one of the ingredients the search engine consumes.

So the mental model is: a slow, thorough builder that runs occasionally and produces a file, and a fast consumer (the live search) that reads that file. Almost every decision below comes from taking that split seriously: the builder can be slow but must be correct and must fit in the machine it runs on, and whatever it writes must be in a form the consumer can use without surprises.

The thing you are producing: a lookup table

The output of the builder is a lookup table. Picture a two column list: on the left a term, on the right its IDF score. The live search, when it processes a query, takes each term the user typed and looks it up in this table to find how much weight that term should carry. That is the entire interface between the two systems, one file, looked up by term.

A few decisions shape that table.

A “term” is not always a single word. It can be one token (a unigram, like token) or two adjacent tokens (a bigram, like access token). Bigrams earn their place because some ideas are only specific as a pair. On its own, access is everywhere and token is everywhere, but access token points at a narrow set of documents. Deciding which terms to include, and how a search later balances single words against pairs, is a question for the search side. The builder’s job is narrower: produce terms and give each one a score.

One row is special: the fallback. A user will eventually type a term that is not in the table at all. You do not want the system to crash, and you usually do not want to silently throw that term away either, because a term you have never seen is often a rare and meaningful one. So the table carries a single sentinel entry that says “if you look up a term and it is missing, use this score instead.” A reasonable choice for that fallback is the highest IDF already present in the table: treat an unknown term as at least as rare as the rarest thing you actually saw. That keeps unknown terms influential rather than ignored, while staying tied to real numbers from your data rather than a value picked out of the air. (The cost of this choice: a typo or a junk token also gets treated as maximally important, which is something to be aware of and is usually softened by the cleaning step described next.)

The offline and online sides must agree on what a “term” is

Raw text does not become terms by magic. The text first runs through a tokenizer, also called an analyzer, which is just a cleaning and splitting step. A typical one will lowercase everything, strip accents, cut tiny filler words like “the” and “of,” reduce words to a common root so that “refunds” and “refunding” collapse to “refund,” and emit those bigrams. Whatever steps you pick, that pipeline defines what counts as a term.

This sets up the single most important rule connecting the two systems: the tokenizer the builder uses must be the same tokenizer the live search uses. The reason is direct. The builder writes the table using its idea of terms. At query time the live search cleans the user’s words with its own tokenizer and looks the results up in that table. If the two disagree, even slightly, the query terms will not match the keys that were written, every lookup will miss, every miss will fall back to the sentinel, and all the careful counting was for nothing. Agreement on tokenization is what makes a file built offline actually usable online.

Choosing which text to count

Real collections are rarely stored as neat whole documents. More often each document has been pre-split into smaller pieces so the system can handle and serve them, and each piece carries some shared descriptive fields of its parent document alongside its own slice of body text. That gives the builder a genuine choice: when counting term occurrences, do you read all of the body text across every piece, or only the per-document descriptive fields?

Counting only the descriptive fields is often the better call, and it works like this. Gather every piece that belongs to the same document (they share a document identifier), notice that the descriptive fields repeat identically across those pieces, keep just one copy, and count terms from that copy alone. You skip the body text entirely. Two payoffs. First, less noise: descriptive fields are dense, on topic, and deliberately written, while body text sprawls and repeats, so a table built from them discriminates better. Second, far less work: you are reading a tiny fraction of the bytes.

There is a consequence worth saying out loud, because it changes what your IDF actually means. Since you count one descriptive unit per document, the “document” that document frequency is counting is that single unit. A term sitting in a document’s descriptive fields counts exactly once for that document, no matter how many pieces the document was split into. That is the behavior you want, but only if you are deliberate about it. In practice this means the builder has to be explicit about which fields it reads, pulling in the descriptive fields and leaving the body out on purpose, rather than grabbing whatever columns happen to be present.

Two different things both called “chunks”

The word “chunk” shows up in two unrelated places here, and conflating them is the fastest way to get confused, so pin them apart:

The pieces the data arrives in. Whole documents are often too large to store or serve as single units, so the collection comes pre-split into pieces, many rows per document. This is just the shape of your input. You did not choose it; you receive it that way.
The batches you process in. When the builder runs, it does not swallow everything at once. It walks the data in groups, folding each group into running totals as it goes. These groups are an internal device for controlling how much sits in memory at any moment. They have nothing to do with how the documents were split; they are about how your program paces itself through the data.

One is how the data is laid out. The other is how your computation moves through it. Same word, different worlds.

Why you stream: the wall is memory, not time

The obvious way to build the table is: load all the data into memory, group the rows by document, count, write the file. The trouble is memory, not time. These table files are compressed on disk, and compression hides their true size. A file that is one gigabyte on disk can balloon to several times that once it is unpacked into live objects in a program. If the machine running the job has a fixed memory ceiling (a continuous integration runner with, say, sixteen gigabytes is a common case), the unpacked data simply does not fit. And this is not the kind of problem you can wait out by letting the job run longer overnight. Time is not the constraint. The data does not fit at one instant, and no amount of patience changes that.

The fix is to stream: never hold the whole thing at once. The loop is

Read one batch.
Process its rows, and as each document finishes, fold it into the global counters.
Throw that batch’s data away and move to the next.

At any moment you are holding one batch plus the running counters, which is small and constant no matter how large the full collection grows. The job fits under the memory ceiling by construction.

There is one assumption that makes this clean, and it is worth stating because it can quietly break. Streaming a single pass works only if all the rows for a given document arrive together, back to back, in the stream. When they do, you know a document is finished the instant the document identifier changes: you fold it in once, free its rows, and carry on. If instead a document’s rows were scattered throughout the stream, you could not tell when any document was truly done, so you would have to keep partial state for every document seen so far, which is exactly the pile of memory streaming was meant to avoid. So the builder leans on the data arriving grouped by document. If you cannot guarantee that upstream (by sorting, for instance), you either have to sort first or accept holding more state, and it is the kind of assumption that should be checked rather than hoped for, because if it is violated the same document can be counted twice and the IDF numbers come out wrong.

A note on the file format

The files in and out of this job are usually in a columnar binary format (Parquet is the common one). The plain way to think about it: it is like a CSV, a table of rows and columns, but stored in a compact binary form instead of text, compressed, and with a known type for each column. That makes it much better than plain text for large data, and two of its properties show up directly in the choices above. Its compression is exactly what creates the memory surprise from the streaming section, small on disk, large once unpacked. And because it stores each column separately, asking for only the descriptive-field columns and skipping the body is a cheap, native operation, not a read-everything-then-discard one. The format quietly rewards the metadata-only choice.

Where this leaves us

At this point one ingredient is fully prepared: a clean, offline IDF table, built to fit in memory, agreeing with the live tokenizer, counting the right unit of text. The live search engine can now load it and look terms up in constant time. What the search engine does with that weight, combining it with how often a term appears in a given document to produce a single relevance score for ranking, is the next concept, and it is where the BM25 formula finally earns its keep.