Semantic Retrieval: Index

First created Jun 7, 2026. Last edited Jun 7, 2026.

Lexical retrieval matches words. Its hard limit, the one that motivates this entire arm, is that it cannot match meaning when the words differ. Someone asks “how do I get my money back” and the document that answers them is titled “refund policy”; the two share not one content word, so word-matching returns nothing. The answer is right there and the user is told it does not exist. The semantic arm exists to close exactly that gap, and before writing a word of how it works I wanted to confirm the gap is as wide as I remembered.

I took five query/answer pairs that any human would call a match and measured their lexical overlap J: the content words the two texts share, as a fraction of the content words that appear in either.

J=0.00   "how do i get my money back"      ~  "refund policy and returns"
J=0.00   "the app keeps crashing on launch" ~  "application fails to start"
J=0.00   "change my email address"          ~  "update account contact information"
J=0.00   "reset forgotten password"         ~  "recover login credentials"
J=0.00   "cancel my subscription"           ~  "end recurring membership billing"

Zero, every one. Not “low,” zero shared content words. Pure lexical search returns nothing for any of these, and they are not contrived edge cases, they are how people actually phrase things. That is the hole. The fix is to stop comparing the words and start comparing the meanings, and the way the field made “meaning” into something a computer can compare is the subject of this territory.
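The measurement above can be reproduced in a few lines. This is a minimal sketch, assuming J is a Jaccard score over content words and that "content words" means tokens left after dropping a small stopword list. The stopword list and the function names are illustrative choices, not the ones used for the table.

```python
import re

# Hypothetical stopword list, just large enough for these examples.
STOPWORDS = {"a", "an", "and", "do", "get", "how", "i", "my", "of", "on", "the", "to"}

def content_words(text: str) -> set[str]:
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard(a: str, b: str) -> float:
    """Shared content words divided by all content words in either text."""
    wa, wb = content_words(a), content_words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

PAIRS = [
    ("how do i get my money back", "refund policy and returns"),
    ("the app keeps crashing on launch", "application fails to start"),
    ("change my email address", "update account contact information"),
    ("reset forgotten password", "recover login credentials"),
    ("cancel my subscription", "end recurring membership billing"),
]

for query, doc in PAIRS:
    print(f"J={jaccard(query, doc):.2f}   {query!r}  ~  {doc!r}")
```

No choice of stopword list changes the result for these five pairs, because the two sides share no tokens at all beyond function words.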

The idea: meaning as a location

The move is to represent each piece of text as a point in a high-dimensional space, a vector of a few hundred to a couple thousand numbers, arranged by a learned model so that texts which mean similar things land near each other. “Refund” and “money back” end up close not because they share letters but because the model placed them close. Once text is a point, “find the relevant document” becomes “find the nearby points,” and relevance becomes geometry.

To feel the shape without a real model, I mapped those same five pairs into a tiny hand-built concept space and measured the cosine between them. The cosines came back between 0.93 and 1.00, the same pairs that scored zero lexically. The hand-set vectors are a stand-in, not a learned embedding, so the numbers are illustrative rather than real, but the shape is the whole point: lexical similarity zero, geometric similarity near one. Matching meaning recovers exactly the matches that matching words threw away.
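A hand-built space of this kind might look like the sketch below. The concept axes and word weights are invented here for illustration and are not the ones behind the numbers quoted above, so the exact cosines will differ. What carries over is the pattern: matched pairs score near one, an unrelated pair scores zero.

```python
import math

# Hypothetical concept axes; each text becomes one weight per axis.
AXES = ["money", "reversal", "software", "failure", "startup"]

# Hand-set word-to-concept weights (a stand-in for a learned embedding).
WORD_CONCEPTS = {
    "money": {"money": 1.0},
    "back": {"reversal": 1.0},
    "refund": {"money": 1.0, "reversal": 1.0},
    "returns": {"reversal": 1.0},
    "app": {"software": 1.0},
    "application": {"software": 1.0},
    "crashing": {"failure": 1.0},
    "fails": {"failure": 1.0},
    "launch": {"startup": 1.0},
    "start": {"startup": 1.0},
}

def embed(text: str) -> list[float]:
    """Sum the concept weights of every known word; unknown words add nothing."""
    vec = [0.0] * len(AXES)
    for word in text.lower().split():
        for axis, weight in WORD_CONCEPTS.get(word, {}).items():
            vec[AXES.index(axis)] += weight
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

q1, d1 = "how do i get my money back", "refund policy and returns"
q2, d2 = "the app keeps crashing on launch", "application fails to start"

print(f"matched:  {cosine(embed(q1), embed(d1)):.2f}")
print(f"matched:  {cosine(embed(q2), embed(d2)):.2f}")
print(f"crossed:  {cosine(embed(q1), embed(d2)):.2f}")
```

The query and the document go through the same `embed` function, which is what makes their vectors comparable at all.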

This idea is older than the models that made it work. Gerard Salton was representing documents and queries as vectors in the SMART system at Cornell in the 1960s, the same vector-space thinking that underlies TF-IDF. What changed recently is not the idea of a document as a point; it is that learned models now place the points well enough that geometric nearness genuinely tracks meaning. The geometry was always the plan. The good coordinates are the new part.

The two questions this territory answers

Making semantic retrieval real comes down to two problems, and they are the two leaves below.

The first is where the vectors come from. A model has to turn arbitrary text into a point such that distance means dissimilarity of meaning, and it has to place the query and the documents in the same space so they are comparable. That is embeddings: what an embedding is, how the model learns to put paraphrases near each other, the cosine metric that measures near, and the choice between a single vector per document and finer-grained representations.

The second is that finding the nearest points is expensive. With millions of document vectors, comparing the query to every one of them is the same scan-everything waste the inverted index fixed for lexical search, and the fix is structurally similar: build an index that lets you find near neighbors without examining them all. The catch is that exact nearest-neighbor search in high dimensions is genuinely hard, so the field trades a little accuracy for a lot of speed and accepts approximate answers. That is approximate nearest neighbor search: the IVF and HNSW index structures, the quantization tricks that shrink the vectors, and the recall-for-speed dial that every one of them exposes.
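The recall-for-speed dial is easiest to see in a toy. The sketch below is an IVF-style index in plain Python over random vectors, under simplifying assumptions: a few rounds of k-means, squared Euclidean distance, and hypothetical names throughout. It is not how a production library implements IVF. It only shows the structure: cluster the vectors into lists, then scan just the `nprobe` lists nearest the query.

```python
import random

def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def assign(vectors, centroids):
    """Put each vector's index into the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        best = min(range(len(centroids)), key=lambda c: dist2(centroids[c], v))
        lists[best].append(idx)
    return lists

def build_ivf(vectors, n_lists, iters=5, seed=0):
    """A few rounds of k-means; returns (centroids, inverted lists)."""
    centroids = [list(v) for v in random.Random(seed).sample(vectors, n_lists)]
    for _ in range(iters):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # an emptied cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*(vectors[i] for i in members))]
    return centroids, assign(vectors, centroids)

def search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are nearest the query."""
    order = sorted(range(len(centroids)), key=lambda c: dist2(centroids[c], query))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    top = sorted(candidates, key=lambda i: dist2(vectors[i], query))[:k]
    return top, len(candidates)

rng = random.Random(1)
DIM, N, N_LISTS, K = 8, 1000, 16, 10
vectors = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
query = [rng.gauss(0, 1) for _ in range(DIM)]
centroids, lists = build_ivf(vectors, N_LISTS)

# Probing every list scans every vector, so it is the exact answer.
exact, _ = search(query, vectors, centroids, lists, K, nprobe=N_LISTS)
results = {}
for nprobe in (1, 4, N_LISTS):
    approx, scanned = search(query, vectors, centroids, lists, K, nprobe)
    recall = len(set(approx) & set(exact)) / K
    results[nprobe] = (scanned, recall)
    print(f"nprobe={nprobe:2d}  scanned={scanned:4d}/{N}  recall@{K}={recall:.2f}")
```

Raising `nprobe` can only add candidates, so recall never drops as the scan gets more expensive. Setting it to the number of lists degenerates to the brute-force scan.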

Where it is strong and where it quietly fails

The arm everyone wants to switch to entirely has a failure mode as sharp as lexical’s, in exactly the opposite place. That is the single most important thing to hold onto about it.

Semantic search is strong exactly where lexical is weak: paraphrase, synonymy, the same idea in different words, the money back / refund gap above. It is weak exactly where lexical is strong. Ask it for an error code, an API name, a product SKU, a function signature, a rare proper noun, and it hands back something thematically adjacent and useless, because those tokens carry their meaning in their exact literal form, not in a smooth neighborhood of nearby concepts. ENOENT has no synonyms. Someone who types it wants the document with that exact string, and a model that embeds it into a region of “error-ish things” will return plausible neighbors that are all wrong.

So the conclusion this territory builds toward is not that semantic replaces lexical. It is that the two are complementary in a precise, almost mirror-image way: each catches what the other drops. That complementarity is why real systems run both arms and combine them, which is hybrid search, and it is why the rarity signal I spent the whole lexical territory getting right still matters even in a world with good embeddings. The semantic arm does not make the lexical arm obsolete. It makes the lexical arm’s blind spots survivable, and the lexical arm returns the favor.
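One common way to combine the two arms is reciprocal rank fusion, sketched below. It is brought in here as an assumption for illustration. Nothing above commits to a particular fusion method, and the document ids and rankings are made up. Each arm contributes 1 / (k + rank) for every document it returns, so a document that both arms rank reasonably well beats one that only a single arm likes.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list adds 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from each arm for a query that mentions ENOENT.
lexical = ["errno-reference", "enoent-troubleshooting", "file-api"]
semantic = ["file-not-saving-faq", "enoent-troubleshooting", "disk-full-errors"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

The constant `k=60` is a conventional default for this method rather than a tuned value. It damps the gap between adjacent ranks so that no single arm's top hit dominates.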

The leaves

Embeddings

Text to vector. What an embedding is, how a model learns to place paraphrases near each other while placing rare literals poorly, the cosine similarity that measures nearness, and the representation forks: one vector per document versus per passage, and the dimensionality trade-off.

Approximate nearest neighbor search

Finding near points without checking them all. Why exact search in high dimensions is hard, the IVF and HNSW index structures, product and int8 quantization to shrink the vectors, and the recall-for-speed dial they all expose.

Up to the whole map. The other arm, whose blind spots this one covers and whose strengths it lacks, is lexical retrieval. The combination of the two is hybrid search, and the precise final re-scoring over the combined candidates is reranking.

Index

Embeddings. Turning text into a point in space so that nearness means similar meaning. What an embedding is, how a model learns to place paraphrases together and exact literals badly, the cosine similarity that measures near, and the representation forks: one vector per document or per passage, and what the dimensions buy.
Approximate Nearest Neighbor Search. Finding the nearby vectors without comparing the query to all of them. Why exact nearest-neighbor in high dimensions is hard, the IVF and HNSW index structures, the quantization tricks that shrink the vectors, and a measured recall-versus-cost curve showing you buy back most of the recall for a fraction of the work.