From Lexical Retrieval

The Inverted Index

The naive way to search a pile of documents is to read all of them. For each query, walk every document, check whether it contains the query words, keep the ones that do. It is the obvious thing and it is what I would write first, and the reason it does not survive contact with a real corpus is that it does work proportional to the size of the whole collection on every single query. A million documents means a million documents read per search, most of which share not one word with the query.

The fix is one idea, and once I had really looked at it I understood why it has been reinvented over and over for eight hundred years: turn the storage inside out. Do not keep, for each document, the list of words in it. Keep, for each word, the list of documents that contain it. That is the inverted index. The name says exactly what it is, an inversion of the natural document-to-words map into a words-to-documents map, and the payoff is that a query word stops being something you search for and becomes something you look up.

Where this came from

I keep coming back to the fact that the inverted index predates the computer by centuries. The first one I know of was a concordance of the Bible finished in Paris in 1239, organized under Hugh of Saint-Cher with help from a large number of Dominican friars. A concordance is an alphabetical list of words, and under each word, every place in the text it appears, given by book and chapter. A preacher who needed every passage about mercy looked up mercy and got the list, instead of rereading scripture. That is the data structure exactly: term, then postings. The friars even hit the engineering problems that come with it, including how finely to locate each hit, and they answered by splitting each chapter into lettered parts so a reference could point inside it.

Nothing about the idea needed electricity. What the computer added was scale and speed, not insight. When full-text retrieval systems were built in the second half of the twentieth century, the inverted index was the natural structure to reach for, and it still sits under Lucene, under Elasticsearch and Solr, under essentially every lexical engine in production. The leaf you are reading is the modern version of a medieval index card.

Watching it invert

The intuition is easy to assert and easy to nod along to without actually seeing it work, so I wrote out a tiny corpus and inverted it by hand. Three documents:

txt

D1: the cat sat on the mat
D2: the dog sat on the log
D3: the cat chased the dog

The natural representation, the one the documents already are, is the forward index: each document points to its words.

txt

D1 -> the, cat, sat, on, the, mat
D2 -> the, dog, sat, on, the, log
D3 -> the, cat, chased, the, dog

To answer cat against this, I have to read all three documents and check each. To invert it, I go term by term and record which documents each term shows up in. Walk D1 first: the appears in D1, cat appears in D1, sat in D1, and so on. Then D2 adds its terms, then D3. Collecting by term gives the inverted index:

txt

cat     -> D1, D3
chased  -> D3
dog     -> D2, D3
log     -> D2
mat     -> D1
on      -> D1, D2
sat     -> D1, D2
the     -> D1, D2, D3

Now cat is a single lookup that returns D1 and D3, and the documents that do not contain cat are never touched. A two-word query like cat dog becomes two lookups, cat giving {D1, D3} and dog giving {D2, D3}, and then a set operation on the two lists: their union {D1, D2, D3} if any word matching is enough, their intersection {D3} if a document must contain both. The whole engine of Boolean retrieval, AND and OR and NOT over queries, is set operations over these lists.

Two things jumped out at me from this small example. The first is that the points at every document, which is the inverted-index view of the problem that IDF exists to solve: a term whose postings list is the entire corpus carries no information about which document is relevant, and its list is also the longest and most expensive one to process. The second is that the index lost the original word order and the repetition. the appears twice in D1, but the bare list the -> D1, D2, D3 does not say so. Whether that loss matters is the first real design fork.

What goes inside a posting

In the toy version a postings list is just a list of document ids. That is enough for Boolean matching, but it throws away everything a ranking function needs. The moment you want to score rather than merely match, each entry in the list, each posting, has to carry more than an id.

The first thing to add back is the term frequency: how many times the term appears in that document. the -> D1, D2, D3 becomes the -> (D1, 2), (D2, 2), (D3, 2), recording that the appears twice in each. BM25 needs this number, because its whole job is to turn within-document frequency into a score with the right saturation curve. Storing term frequency in the posting is what lets the scorer avoid re-reading the document at all.

The next thing, if the system needs phrase or proximity queries, is positions: not just that cat is in D1 but that it is the second token. With positions, a phrase query like “sat on” can confirm the two terms are adjacent, and a proximity query can ask for them within some window. Positions are expensive. They can dominate the size of the index, because a long document has one position entry per token occurrence, so a system that does not need phrase search often leaves them out. That is a real fork: positional index versus a frequency-only one, paying storage and build time for the ability to answer phrase and proximity queries.

So a posting is a small bundle, and the schema of that bundle is a design decision:

txt

minimal:        docID
for ranking:    docID, termFrequency
for phrases:    docID, termFrequency, [position, position, ...]

Each step up the ladder buys query capability and costs index size. I had assumed the inverted index was a single fixed thing; it is really a family of structures that differ in how much they remember about each occurrence.

The fork that makes it fast at scale

The reason an inverted index is fast is the lookup, but there is a second speed problem hiding inside it: some postings lists are enormous. A common term in a large corpus can have a postings list with millions of entries, and a query that combines a common term with a rare one should not have to walk all million.

The structural answer is to keep postings lists sorted by document id and add skip pointers, forward links that let the reader jump ahead over chunks of the list instead of stepping through every entry. When intersecting cat (short list) with the (giant list), the reader advances through the short list and uses the skips on the long list to leap toward the matching ids, skipping the runs in between. This is the same instinct as the friars splitting chapters into lettered parts: make the locations navigable, not just present.

This is also exactly where ranking and the index structure start to feed each other. WAND uses per-term upper-bound scores, derived from IDF, to skip whole stretches of postings that cannot possibly contribute to the top-k. The skip machinery in the index and the score bounds from IDF combine so that a good lexical retriever never scores most of the documents a query technically matches. Sorted postings with skip pointers are what make that skipping mechanically possible.

Cost, and the choice it forces

Inverting is not free. The forward representation is the documents themselves, already on disk; the inverted index is a second copy of the corpus reorganized by term, and with term frequencies and positions it can rival or exceed the size of the original text. Building it is a pass over the whole corpus that tokenizes every document and accumulates postings, which is memory-hungry at scale and is one of the reasons corpus-wide work belongs in the indexing pipeline as a batch job rather than anything done at query time.

That cost sets up the standing fork of lexical search: you pay storage and build time up front, in exchange for query time that scales with how many documents match the query rather than with the size of the collection. For search, where queries are constant and the corpus is read far more often than it changes, that trade is almost always right, which is why the structure has survived from parchment to Lucene essentially unchanged in its core idea.

One layer down from the logical structure is the byte-level encoding that keeps the giant postings lists small, and the trick rests on a property the toy hides: postings are stored sorted by document id, so consecutive ids differ by small gaps. Instead of storing the ids [3, 17, 18, 25] you store the first id and then the gaps, [3, 14, 1, 7], which is delta encoding, and small gaps take few bits. Those small numbers are then packed with a variable-length scheme, variable-byte coding or the block-oriented PForDelta and its successors, so a common term’s million-entry list compresses to a fraction of its naive size. This is why sorting postings by id is not just convenient for the skip-pointer intersection but load-bearing for storage: the same ordering that lets you skip also makes the gaps small enough to compress.

The other thing the toy hides is that real corpora change, and the index above is built once and read forever. The resolution most engines use is not to edit postings in place at all but to keep the index as immutable segments: new documents go into a new small segment, queries search all segments and merge, and segments are periodically merged in the background. That keeps writes cheap and never corrupts what queries are reading, and it is exactly the batch-versus-incremental tradeoff handled at the pipeline level.

The code

The inversion that produced the worked example is a handful of lines: walk the documents, and for each term record the document it appeared in. Add term frequencies to the posting and a Boolean query becomes a set operation over the lists.

python

from collections import defaultdict, Counter

corpus = {
    "D1": "the cat sat on the mat".split(),
    "D2": "the dog sat on the log".split(),
    "D3": "the cat chased the dog".split(),
}

# build the inverted index: term -> {docID: term_frequency}
index = defaultdict(dict)
for doc_id, tokens in corpus.items():
    for term, tf in Counter(tokens).items():
        index[term][doc_id] = tf

index["cat"]          # -> {'D1': 1, 'D3': 1}   (a single lookup, untouched docs never read)
index["the"]          # -> {'D1': 2, 'D2': 2, 'D3': 2}  (postings list = the whole corpus)

# Boolean retrieval is set ops over the postings keys:
def search(a, b, op):
    A, B = set(index[a]), set(index[b])
    return A | B if op == "OR" else A & B

search("cat", "dog", "OR")    # -> {'D1', 'D2', 'D3'}
search("cat", "dog", "AND")   # -> {'D3'}

That {docID: tf} posting is exactly what TF-IDF and BM25 read to score without reopening the documents.

Up to the Lexical map. Sideways to the structure that gives each posting its weight, TF-IDF. The hidden dependency under all of it, what actually counts as a token, is tokenization and analysis.