Search and Retrieval: Index

From Search Infrastructure and Software Engineering at Shopify
Search and Retrieval
Page metadata
First created May 28, 2026
Last edited Jun 7, 2026

The oldest search engine I know of was finished in 1239, ran on about five hundred Dominican friars, and indexed exactly one book. They were building a concordance of the Bible: a list, for every word in scripture, of every place that word appeared, by book and chapter. A preacher who needed every passage containing mercy could now find them without rereading the Bible. No quotations, just locations. It took nine years by hand.

That is an inverted index. The data structure at the center of every lexical search engine running today is seven centuries old, and the thing that struck me when I found this out is that the friars had already understood the whole trick: do not store, per document, the words it contains; store, per word, the places it contains it. The computers came later and changed nothing about the idea. So I wanted to start the map there, because most of what I have been learning turns out to be a long elaboration on a medieval indexing problem, and the parts that are genuinely new are easier to see once you know which parts are not.

The naive version of search is the friars’ version. There is a pile of documents, someone gives you some words, and you hand back the documents that contain those words. For a surprising fraction of real queries that is still enough. What I found interesting is the catalogue of specific ways it breaks, because each break is where some piece of the modern machine actually came from.

The first break: words are not meaning

It fell over for me the first time on a query with no shared words at all. Someone searches how do I get my money back and the document that answers them says refund policy, and there is not one word in common, so pure word-matching returns nothing. The answer was sitting right there and the user walks away thinking it did not exist.

The fix that the field landed on was to stop matching words and start matching meaning. You run each document through a model that turns it into a point in a high-dimensional space, arranged so that things which mean similar things land near each other, and then finding the answer becomes finding the nearby points. That is the semantic arm. I still find it slightly uncanny to watch get my money back pull up a document that never says any of those words.

The lineage here is shorter than the inverted index but older than I expected. Gerard Salton, who more or less founded information retrieval as a field, was already representing documents and queries as vectors in his SMART system at Cornell in the 1960s. The idea of a document as a point in space was there decades before the embedding models that made it work well.

The second break: meaning is not enough either

What surprised me more was the opposite failure. The meaning-matching arm goes vague exactly where it looks strongest. Ask it for an error code, an API name, a product SKU, a function signature, a rare proper noun, and it hands back something thematically adjacent and useless. Those tokens carry their meaning in their literal form, not in some smooth neighborhood of nearby concepts. ENOENT has no synonyms. Someone who types it wants the document with that exact string, and a system that only matches meaning cannot give it to them.

So the arm everyone is tempted to throw away after discovering embeddings turns out to be the one that catches what embeddings miss. Lexical retrieval, the friars’ idea, but sharpened. The sharpening that matters most is a rarity signal. The naive count treats every shared word as equal, so a document matching the scores the same as a document matching ENOENT, which is obviously wrong: the is in everything and discriminates nothing, while ENOENT is in almost nothing and is therefore almost the entire signal. Weighting rare matched terms far above common ones is what turns word-matching from a toy into a real retriever. That weight is IDF, and the part I did not expect is how much harder it is to get exactly right in a running system than the formula makes it look.

The shape of the whole machine

Once I had both arms in my head the rest of the machine arranged itself around them. A document on disk becomes a ranked result through a path like this:

txt
OFFLINE  (build the indexes, ahead of any query)

  docs ──▶ analyze ──┬──▶ inverted index + IDF   (lexical)
                     └──▶ embed ──▶ vector index  (semantic)


ONLINE  (answer one query)

  query ──▶ analyze ──┬──▶ lexical retrieve ──┐
                      └──▶ semantic retrieve ─┴──▶ fuse ──▶ rerank ──▶ results

Two things in that picture took me a while to register. The first is that almost all the expensive work happens before anyone searches: analyzing text, computing corpus statistics, building indexes, embedding documents. The query path is only fast because the offline path already did the heavy lifting. The second is the one I kept underestimating. The same analysis step appears twice, once on the documents and once on the query, and if the two do not produce identical tokens the lexical arm scores against the wrong vocabulary and nothing tells you. No error, no crash, just quietly wrong results. That parity requirement turned out to be the single largest correctness risk I ran into, and it hides behind one bland line in a design doc.

The remaining boxes are all consequences of running this at scale and keeping it alive. Two ranked lists come out now, lexical and semantic, on score scales that have nothing to do with each other, so something has to fuse them into one. The fused list is fast but its ordering is rough, so a slower and more precise model reranks just the top handful. The corpus statistics the lexical arm depends on are a batch data-engineering job with real memory and runtime decisions, not anything that happens at query time. And serving all of it to live traffic, often while migrating from one search backend to another without regressing what users see, is its own discipline with its own failure modes and safety valves.

The six territories

Each box below is its own map. Breadth here, depth there. Start wherever the question that brought you here points.

Lexical retrieval

The friars’ arm, sharpened. The inverted index, TF-IDF, BM25, the IDF variants and why the choice between them matters, and the tokenization and analysis pipeline that everything silently depends on. This is the deepest territory in the tree, because it is the one the day job has handed me in the most detail and the one most public write-ups abandon right after quoting the BM25 formula.

Semantic retrieval

The meaning-matching arm. Turning text into vectors so that similar meanings become nearby points, and the approximate-nearest-neighbor tricks that make finding nearby points feasible when there are millions of them.

Why you run both arms and how you combine their ranked lists into one. The trick is fusing ranks rather than scores, because rank position is comparable across retrievers even when their raw scores live on wildly different scales.

Reranking

The two-stage idea: retrieve a rough big candidate set fast, then re-score the top of it slowly and precisely. Why a model that reads the query and document together beats one that compared them as separate points, and why you can only afford it on the final handful.

The indexing pipeline

Search as a data-engineering problem. Corpus statistics are an offline batch job, not a query-time computation, and that job has real decisions: full rebuild versus incremental update, the memory and runtime choices on large columnar files, the contract for the artifacts the serving layer consumes, and the atomic handoff from staging to serving so the live system never reads a half-written index.

Serving at scale and migration

Keeping search alive under real traffic. Two backends coexisting during a migration, measuring relevance so you can define “don’t regress” concretely before you flip traffic, and the killswitch and failover paths that let a bad index get rolled back without taking search down.

Edges I have not mapped yet

Two real stages sit just off this map, and I am leaving them off deliberately until I have worked them properly rather than sketching them badly.

The first is query rewriting and expansion: rewriting a malformed query, or fanning one query out into several variants, before retrieval runs. It belongs at the very front of the online path, ahead of both arms. The open question is whether it earns its own territory or rides along inside serving, and I would rather decide that after writing it than guess now.

The second is what happens after ranking. Hydration is the step where the index returns matching document ids and a separate fetch pulls the actual text, title, and URL to show the user. Past that, the ranked results often become the context for an LLM in a retrieval-augmented-generation setup. Both are downstream of everything mapped here, and both are their own subject; for now this map stops at the ranked list.

Index

  • Lexical Retrieval. Word-matching done right. The inverted index, TF-IDF and BM25, the IDF variants and why the choice matters, and the tokenization pipeline everything silently depends on.
  • Semantic Retrieval. The meaning-matching arm. Turning text into vectors so that things that mean similar things land near each other, the cosine that measures near, and the approximate-nearest-neighbor structures that make finding near feasible at millions of vectors. Where it beats lexical, where it quietly fails, and why it is only half of a real system.
  • Hybrid Search. Running both arms and combining them. Why lexical and semantic are complementary rather than competing, why you cannot just add their scores, and the rank-based fusion that merges two ranked lists living on incompatible scales into one. The case for both, and the pointer into how the fusion actually works.
  • Reranking. The two-stage idea: retrieve a rough big candidate set fast, then re-score the top of it slowly and precisely. Why a model that reads the query and document together beats one that compared them as separate points, why you can only afford it on the final handful, and the cost arithmetic that forces the split.
  • The Indexing Pipeline. Search as a data-engineering problem. Corpus statistics are an offline batch job, not a query-time computation, and that job has real decisions: the artifact contract the serving layer consumes, the memory and runtime choices on large columnar files, full rebuild versus incremental update, and the atomic staging-to-serving handoff so the live system never reads a half-written index.
  • Serving at Scale and Migration. Keeping search alive under real traffic, and moving it from one backend to another without users noticing. Two engines coexisting during a migration, measuring relevance so "do not regress" is a number you can check before flipping traffic, and the killswitch and failover paths that let a bad index get rolled back without taking search down.