Learning Modern Search and Retrieval
1. What problem is “search” even solving?
The basic situation: there’s a big pile of documents, and someone types a question. You want to hand back the few documents that actually answer it. Sounds simple, but “actually answers it” is the hard part. The old way (match the words in the query to the words in the document) breaks the moment someone phrases their question differently than the document is written. Somebody searches “how do I get my money back”, the doc says “refund policy”, and the two don’t share a single word.
So modern search is really about matching meaning, not words.
TBD: where keyword matching still beats meaning matching, and why systems use both.
2. Embeddings: turning text into points in space
The key trick. You take a piece of text and convert it into a long list of numbers (a vector). The clever part: texts that mean similar things end up as points that are close together in this space, even if they share no words. “Get my money back” and “refund policy” land near each other.
A model does this conversion. You run every document through it once (ahead of time), and you run the query through it when someone searches. Then “find relevant documents” becomes “find the document-points closest to the query-point.”
The individual numbers aren’t interpretable by a human (a vector might be 3000+ numbers long). The only thing that matters is how close two points are to each other.
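To make “closest points” concrete, here is a minimal sketch that uses hand-made 3-number vectors in place of real embeddings. The vectors, the document names, and the choice of cosine similarity as the closeness measure are all assumptions for illustration; a real embedding model would produce much longer vectors, and which measure to use is still an open question in these notes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" written by hand. A real model would compute these.
doc_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "password reset": [0.0, 0.1, 0.9],
}
# Stands in for the embedding of "how do I get my money back".
query_vector = [0.8, 0.2, 0.1]

def nearest(query, vectors, top_k=1):
    """Rank every document by similarity to the query (exhaustive scan)."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:top_k]

best = nearest(query_vector, doc_vectors)
print(best)
```

Note that the query and the winning document share no words here; only the positions of the points decide the match.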
TBD: what “close” actually means mathematically (cosine? dot product?) and why one over the other. TBD: why the dimension count (e.g. 3072) matters and what tradeoff it represents.
3. The needle-in-a-haystack problem: ANN
Once everything’s a point in space, search = “find the nearest points to my query point.” Easy if there are 100 documents. Expensive if there are 100,000+ chunks: checking the query against every single point costs time proportional to the size of the corpus, and you pay that cost on every search.
So instead of finding the exact nearest points, you find the approximately nearest ones, much faster, by being clever about not checking everything. That’s Approximate Nearest Neighbor (ANN). You give up a little accuracy (occasionally a true nearest point gets missed) for a large speed win.
The trick is usually: pre-group the points into clusters, and at search time only look inside the few clusters nearest the query instead of scanning all points. There are also tricks to compress the vectors so they take less space and compare faster.
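The cluster idea can be sketched in a few lines. This is a toy under stated assumptions: the 2-D points and the three centroids are hand-picked (real systems learn the cluster centers from the data), and the names `build_index`, `ann_search`, and `n_probe` are hypothetical, not taken from any particular library.

```python
import math

def build_index(points, centroids):
    """Group each point under its nearest centroid (done once, ahead of time)."""
    clusters = {i: [] for i in range(len(centroids))}
    for point_id, vec in points.items():
        nearest_c = min(
            range(len(centroids)),
            key=lambda i: math.dist(vec, centroids[i]),
        )
        clusters[nearest_c].append(point_id)
    return clusters

def ann_search(query, points, centroids, clusters, n_probe=1, top_k=2):
    """Scan only the n_probe clusters whose centroids are closest to the query."""
    probe = sorted(
        range(len(centroids)),
        key=lambda i: math.dist(query, centroids[i]),
    )[:n_probe]
    candidates = [pid for c in probe for pid in clusters[c]]
    candidates.sort(key=lambda pid: math.dist(query, points[pid]))
    return candidates[:top_k]

points = {
    "a": (1.0, 1.0), "b": (1.2, 0.9), "c": (0.8, 1.1),
    "d": (9.0, 9.0), "e": (9.1, 8.8), "f": (5.0, 5.2),
}
centroids = [(1.0, 1.0), (9.0, 9.0), (5.0, 5.0)]  # hand-picked for the example
clusters = build_index(points, centroids)

# Only cluster 0 (points a, b, c) is scanned; d, e, and f are never compared.
result = ann_search((1.1, 1.0), points, centroids, clusters, n_probe=1, top_k=2)
print(result)
```

Raising `n_probe` scans more clusters, which costs more comparisons but makes it less likely that a true nearest point sitting just across a cluster boundary is missed.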
TBD: how the clustering actually works (what IVF is doing). TBD: what “quantization” / compression is trading away, and why it’s usually fine. TBD: the accuracy-vs-speed knob and where you’d set it.
4. First-stage retrieval vs. reranking: why two steps?
Here’s the thing that confused me at first: why not just take the top results from the vector search and call it done?
Because vector search is fast but rough. Comparing two points in space (embedding similarity) is a crude measure of “does this document answer this query.” It’s good enough to pull, say, the top 100 candidates out of 100,000. But the ordering within those 100 is unreliable.
So the pattern is two stages:
- Retrieve — fast, rough, pulls a big candidate set (top ~100) out of the whole corpus.
- Rerank — slow, precise, carefully re-scores just those ~100 and picks the real best handful.
You can’t run the slow precise method on all 100,000 (too expensive), and you can’t trust the fast method’s ordering. So you use the fast one to narrow, the slow one to sort.
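The narrow-then-sort pattern can be sketched as one small function. Both scoring functions below are toy stand-ins based on word overlap, not real embedding similarity or a real reranking model, and the corpus and the names (`two_stage_search`, `fast_score`, `slow_score`) are made up for illustration.

```python
def two_stage_search(query, corpus, fast_score, slow_score,
                     n_candidates=100, n_final=5):
    # Stage 1 (retrieve): cheap score on every document, keep the top n_candidates.
    candidates = sorted(
        corpus,
        key=lambda doc_id: fast_score(query, corpus[doc_id]),
        reverse=True,
    )[:n_candidates]
    # Stage 2 (rerank): expensive score, but only on the surviving candidates.
    reranked = sorted(
        candidates,
        key=lambda doc_id: slow_score(query, corpus[doc_id]),
        reverse=True,
    )
    return reranked[:n_final]

def fast_score(query, text):
    """Crude stand-in: how many distinct query words appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def slow_score(query, text):
    """Stand-in for a precise model: share of the text made up of query words."""
    words = text.lower().split()
    query_words = set(query.lower().split())
    return sum(1 for w in words if w in query_words) / len(words)

corpus = {
    "d2": "our company history and mission and values mention refund and order once",
    "d1": "refund your order here",
    "d3": "shipping times and delivery windows",
    "d4": "order tracking page",
}

final = two_stage_search("refund order", corpus, fast_score, slow_score,
                         n_candidates=3, n_final=2)
print(final)
```

In this toy run the first stage ties `d2` with `d1` and lists `d2` first, while `d3` never reaches the second stage at all. The second stage then moves `d1` to the top, which is the division of labor described above.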
TBD: roughly how much better reranking makes the final results, and when it’s worth the cost.
5. Cross-encoders: the precise reranker
The fast retrieval step embeds the query and the document separately and compares the two points. The document never “sees” the query.
A cross-encoder is different: it takes the query and a document together, at the same time, and reads them jointly to produce a relevance score. Because it looks at them together, it can catch things the separate-embedding approach misses (“does this specific document actually answer this specific question”). That’s why it’s more accurate.
The catch: it’s slow, because you have to run the model fresh for every (query, document) pair. That’s exactly why it’s the second stage, used only on the ~100 candidates, never on the whole corpus.
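The cost argument comes down to counting model runs per search. A back-of-the-envelope sketch, assuming one model run per embedded text and one per scored (query, document) pair; the function name is hypothetical and the numbers (100,000 chunks, 100 candidates) are the illustrative ones used earlier in these notes, not measurements.

```python
def model_runs_per_query(corpus_size, n_candidates):
    """Count model runs needed at search time under each approach."""
    return {
        # Document vectors were computed ahead of time, so only the query is embedded.
        "separate_embedding_retrieval": 1,
        # One joint (query, document) run per candidate that survived stage one.
        "cross_encoder_on_candidates": n_candidates,
        # One joint run per document if the cross-encoder scanned everything.
        "cross_encoder_on_whole_corpus": corpus_size,
    }

runs = model_runs_per_query(corpus_size=100_000, n_candidates=100)
for approach, count in runs.items():
    print(f"{approach}: {count}")
```

The separate-embedding approach can precompute its document side because documents never depend on the query; the cross-encoder cannot, because its input is the pair.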
TBD: the intuition for why seeing them together is more accurate than comparing two points. TBD: what a cross-encoder’s score actually represents.
6. Combining multiple searches: RRF
Sometimes you run more than one search for a single question. Maybe you rephrase the query a few different ways, or you search a few different document collections. Now you have several ranked lists and you need to merge them into one.
Reciprocal Rank Fusion (RRF) is the simple, popular way to do this. The intuition: a document that shows up near the top of several lists is probably really relevant, so it should rank high in the combined list. RRF scores each document by where it placed in each list (higher placement = more points), adds up the scores, and re-sorts. Because a document’s points from every list are summed into one score, the same document appearing in multiple lists is handled naturally (dedup).
The appealing thing: it doesn’t need the scores from the different lists to be on the same scale. It only uses rank position, which is comparable across lists even when the raw scores aren’t.
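A minimal sketch of RRF, using the form in which it is commonly published: each list contributes 1 / (k + rank) to a document’s score, where rank starts at 1 and k is a smoothing constant (60 is the value usually quoted). The function name and the example lists are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the position is used; the raw scores of each list never appear.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

lists = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(lists)
print(fused)
```

Here `b` wins because it is first in two lists and second in the third, `a` follows because it appears in all three, and `d`, seen only once at the bottom, lands last. Each document shows up exactly once in the output.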
TBD: the actual formula and why “reciprocal” (1/rank) is the shape it uses. TBD: where RRF should sit in the pipeline — before or after reranking — and why ordering matters.
7. Query rewriting / expansion
People type bad queries — too short, ambiguous, missing context. Before searching, you can have a model rewrite the query into something better, or expand it into several variations that catch different angles of what the person might mean. Then you search with the improved version(s).
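The expansion step can be sketched as “original query plus variants, with duplicates removed”. The `toy_rewrite` function below is a hypothetical stand-in for the model call (it is just a hand-written lookup table), and keeping the original query in the list is an assumption of this sketch, not something these notes have settled.

```python
def expand_query(query, rewrite):
    """Return the original query plus rewritten variants, without duplicates."""
    variants = [query] + list(rewrite(query))
    seen = set()
    unique = []
    for variant in variants:
        key = variant.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(variant.strip())
    return unique

def toy_rewrite(query):
    """Hypothetical stand-in for a model: a hand-written lookup table."""
    table = {
        "money back": ["refund policy", "how to request a refund", "Money back"],
    }
    return table.get(query.lower(), [])

variants = expand_query("money back", toy_rewrite)
print(variants)
```

Each variant would then be searched separately, which produces several ranked lists for one question.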
TBD: what kinds of rewrites actually help vs. add noise. TBD: where this should live in the pipeline.
8. The ingestion side: ETL, chunking, indexing
Before any of the above can happen, the documents have to get into the system. That’s a pipeline that:
- pulls the documents,
- splits each long document into smaller pieces (“chunks”), because you want to retrieve the relevant paragraph, not the whole 20-page doc,
- runs each chunk through the embedding model,
- stores the resulting vectors in the search index.
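The pipeline steps above can be sketched as follows. Everything here is an assumption for illustration: chunks are cut by a fixed word count (one simple choice among many), `toy_embed` is a fake stand-in for the embedding model, the `doc_id#n` chunk ID scheme is invented, and the “index” is a plain Python list.

```python
def chunk_text(text, chunk_size=50):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model. NOT a real embedding."""
    return [float(len(text)), float(len(text.split()))]

def ingest(documents, embed, chunk_size=50):
    """Build one index record per chunk, each pointing back to its parent document."""
    index = []
    for doc_id, text in documents.items():
        for n, chunk in enumerate(chunk_text(text, chunk_size)):
            index.append({
                "chunk_id": f"{doc_id}#{n}",  # invented ID scheme for this sketch
                "doc_id": doc_id,             # link back to the parent document
                "vector": embed(chunk),
                # Kept for readability; whether text lives on the index is a separate choice.
                "text": chunk,
            })
    return index

documents = {
    "doc1": "one two three four five six seven",
    "doc2": "alpha beta",
}
index = ingest(documents, toy_embed, chunk_size=3)
for record in index:
    print(record["chunk_id"], "->", record["text"])
```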
TBD: why chunking matters and how chunk size changes results. TBD: how chunks relate back to their parent document. TBD: what has to happen when documents change (re-embedding, re-indexing cadence).
9. Hydration: getting the actual content back
Subtle one I want to understand. The search index might return which chunks matched (by ID), but not the actual text/title/URL of those documents. So there’s sometimes a separate step that takes the matching IDs and fetches the real content to hand back to the user (or to the LLM). That’s “hydration.”
Some systems store the content right on the index and skip this; others keep the index lean and hydrate separately. There’s a tradeoff.
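A minimal sketch of the separate hydration step, assuming the index returned only ranked chunk IDs and the content lives in a key-value store. The store here is a plain dict, and the IDs, titles, and URLs are made up for illustration.

```python
def hydrate(chunk_ids, content_store):
    """Fetch the full record for each matched chunk ID, preserving ranked order."""
    results = []
    for chunk_id in chunk_ids:
        record = content_store.get(chunk_id)
        if record is None:
            # The index and the store can drift apart; skip IDs with no content.
            continue
        results.append({"chunk_id": chunk_id, **record})
    return results

# Hypothetical content store keyed by chunk ID.
content_store = {
    "doc1#0": {
        "title": "Refund policy",
        "url": "https://example.com/refunds",
        "text": "Refunds are issued to the original payment method.",
    },
    "doc2#0": {
        "title": "Shipping times",
        "url": "https://example.com/shipping",
        "text": "Orders usually ship within two business days.",
    },
}

ranked_ids = ["doc2#0", "doc9#4", "doc1#0"]  # what the index returned, best first
hydrated = hydrate(ranked_ids, content_store)
for item in hydrated:
    print(item["chunk_id"], item["title"], item["url"])
```

The lookup keeps the ranking the index produced and tolerates IDs that no longer have content, which is one of the situations a separate store makes possible.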
TBD: the tradeoff between storing content on the index vs. hydrating separately.
10. How this all feeds an LLM (RAG)
The reason any of this matters for the AI agents: the LLM doesn’t actually know the help docs. So when someone asks a question, you retrieve the relevant documents first, then hand them to the LLM as context, and the LLM writes its answer grounded in what you retrieved. “Retrieval-Augmented Generation.” The quality of the final answer is capped by the quality of the retrieval — if you fetch the wrong docs, the LLM confidently answers wrong.
So all the retrieval/ranking work above is really the part that determines whether the AI gives a good answer. The retrieval is the bottleneck.
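The “hand them to the LLM as context” step is mostly string assembly. A minimal sketch, assuming retrieved chunks arrive as dicts with `title` and `text` fields; the prompt wording, the `build_prompt` name, and the `max_chunks` limit are illustrative choices, and the actual model call is left out.

```python
def build_prompt(question, retrieved_chunks, max_chunks=3):
    """Assemble a prompt that grounds the answer in the retrieved chunks."""
    if not retrieved_chunks:
        context = "(no documents were retrieved)"
    else:
        context = "\n\n".join(
            f"[{i}] {chunk['title']}\n{chunk['text']}"
            for i, chunk in enumerate(retrieved_chunks[:max_chunks], start=1)
        )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

chunks = [
    {"title": "Refund policy", "text": "Refunds are issued to the original payment method."},
    {"title": "Shipping times", "text": "Orders usually ship within two business days."},
    {"title": "Password reset", "text": "Use the reset link on the sign-in page."},
]
prompt = build_prompt("How do I get my money back?", chunks, max_chunks=2)
print(prompt)
```

Whatever ends up in the context section is all the model has to work with for this question, which is why the ranking that feeds it matters so much.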
TBD: how much retrieval quality actually moves final-answer quality. TBD: failure modes (retrieved the wrong thing, retrieved nothing, retrieved too much).
Reading order I’m thinking
Embeddings (2) → ANN (3) → two-stage retrieval (4) → cross-encoders (5) → RRF (6) → then the ingestion/hydration plumbing (8, 9) → then how it all serves RAG (10). Query rewriting (7) slots in wherever.
TBD: revise this order once I know which parts I’m actually touching.