From Semantic Retrieval

Approximate Nearest Neighbor Search

Embeddings turn the relevance question into a geometric one: find the document vectors nearest the query vector. The obvious way to answer it is to compute the distance from the query to every document vector and keep the closest, and that obvious way is the same scan-everything waste the inverted index removed for lexical search. With millions of vectors it is too slow per query. This leaf is about the structures that find near neighbors without checking them all, and the uncomfortable fact that, unlike the lexical case, the speedup here comes with a real and measurable loss of accuracy that you have to choose on purpose.

Why exact is hard, and why we give up on it

For lexical search the inverted index gives an exact speedup: you look up only the documents that contain a query term, and you miss nothing. High-dimensional vectors do not offer the same clean escape. There is no equivalent of “contains the word” to prune on; every vector is some distance from the query and you cannot rule any out a priori. Worse, distance itself behaves badly as dimensions rise. In high dimensions the distances between random points concentrate, so the nearest and the farthest neighbor end up almost equally far, and the structure that would let you skip points gets thin. This is the curse of dimensionality, and it is why exact nearest-neighbor search in high dimensions tends to degrade toward scanning everything anyway.

So the field made a trade that the lexical side never had to make: accept approximate answers. An approximate nearest neighbor index returns vectors that are probably among the true nearest, trading a small, tunable chance of missing a true neighbor for a large speedup. The measure of how often it gets the true neighbors is recall, the fraction of the true top-k it actually returns, and recall against cost is the dial that defines this whole territory.

IVF: search only the nearby neighborhoods

The first family is the inverted file index, IVF, and its idea is the most intuitive. Cluster all the document vectors into cells, each with a centroid, ahead of time. At query time, find the few centroids nearest the query and search only the vectors in those cells, ignoring the rest. The number of cells you search is nprobe, and it is the dial: probe one cell and you are fast and lossy, probe many and you are slow and accurate. The name echoes the lexical inverted index because the structure is similar in spirit, a map from cell to its member vectors, searched selectively.

I built a toy IVF to watch the dial move, because the tradeoff is the kind of thing I would rather measure than assert. Four thousand vectors, clustered into sixty-four cells, top-ten search, recall measured against exact brute force:

$Line chart of ANN recall versus search cost as a fraction of brute force$

Toy IVF index: recall@10 against cost, as a fraction of exact brute-force search. A few probes buy most of the recall; searching all cells costs more than brute force. The sweet spot is the middle of the curve, well above the y=x line.

nprobe	avg cost	cost vs exact	recall@10
1	127	3%	16.4%
2	189	5%	24.9%
4	315	8%	40.6%
8	565	14%	59.0%
16	1063	27%	78.6%
32	2062	52%	95.9%
64	4064	102%	100.0%

The dial is right there. Probing one cell costs three percent of brute force and recovers only a sixth of the true neighbors, too lossy. Probing sixteen of the sixty-four cells costs about a quarter of brute force and recovers nearly eighty percent of the true top-ten. And probing all sixty-four gives perfect recall but costs more than brute force, because you pay to compare against every centroid and then every vector. That last row is the honest reminder that the index is not free: searched exhaustively it loses to the thing it was meant to beat, and its value lives entirely in the middle of the curve, where a few probes buy most of the recall for a fraction of the work.

One caveat I want to be straight about. My toy vectors were random, and random high-dimensional vectors have almost no cluster structure, which is the worst case for IVF, so the recall here climbs slower than it would on real embeddings. Real embedding spaces have genuine clusters, paraphrases really do bunch together, so IVF exploits structure that my random data did not have and does considerably better in practice. The curve’s shape, more probes for more recall at rising cost with a sweet spot in the middle, is the durable lesson; the exact recall numbers are pessimistic because the test data was deliberately structureless.

HNSW: navigate a graph of neighbors

The other dominant family is the hierarchical navigable small world graph, HNSW, and it takes a different route to the same goal. Instead of cells, build a graph where each vector links to some of its near neighbors, with a few layers: a sparse top layer for big jumps across the space and denser lower layers for fine local steps. To search, start at an entry point in the top layer, greedily walk to whichever neighbor is closer to the query, drop a layer, repeat, until you settle into the local neighborhood of the query in the bottom layer. You touch a tiny fraction of the vectors, just the ones along the path. HNSW generally gives excellent recall at high speed and is the default in many vector databases, at the cost of more memory for the graph links and a more involved build. Its dial is the breadth of the search, how many candidates you keep in flight as you walk, with the same recall-for-speed character as IVF’s nprobe.

The choice between IVF and HNSW is a real fork. IVF is simpler, cheaper to build, and pairs naturally with quantization for memory savings, but its recall at low cost is weaker. HNSW gives stronger recall per unit of search but costs more memory and build time. Which wins depends on whether the constraint is memory, build time, or query latency, and like the rest of these dials it is settled by measuring recall and latency on the real corpus, not by preference.

Quantization: make the vectors smaller

Orthogonal to which index you use is how much each vector costs to store and compare, and quantization is the lever. A raw embedding is a few hundred to a couple thousand floats, and at scale the vectors themselves dominate memory. Two common compressions:

Scalar / int8 quantization. Store each dimension as an eight-bit integer instead of a 32-bit float, a four-times shrink, accepting a small rounding error in each distance. Cheap, simple, and usually nearly lossless for recall.

Product quantization, PQ. More aggressive: split each vector into sub-vectors, cluster each sub-space into a small codebook, and store each sub-vector as the index of its nearest codebook entry. A vector becomes a handful of small codes, often a ten-times-plus shrink, and distances are approximated from precomputed codebook tables. PQ trades more accuracy for much more compression and is what makes billion-scale vector search fit in memory at all.

Quantization stacks with the index: an IVF index with PQ-compressed cells is a standard billion-scale configuration. Every step of compression is another small recall sacrifice on top of the approximation the index already makes, which is why the honest way to tune a vector search system is to hold a target recall fixed and minimize cost, or hold cost fixed and maximize recall, never to pretend the approximation is not there.

Where this sits

The output of all this is a fast, approximate top-k of document vectors for the query, the semantic arm’s candidate list. It is approximate by construction, which is fine, because it is not the final answer: it is one of two candidate lists that get fused with the lexical arm’s and then reranked by a slower, exact model over the top handful. The approximation here is affordable precisely because a later stage cleans up the ordering. The job of ANN is to get the right documents into the candidate set cheaply; getting them in the right order is someone else’s job downstream.

The code

The toy IVF is just k-means clustering plus “only search the nearest few cells,” and the experiment is the comparison of its results against exact brute force. The skip is the whole idea:

python

import math

# build: cluster all vectors into cells (k-means), keep each cell's members
# search: rank cells by centroid proximity, only scan the nprobe nearest cells

def exact_topk(q, vectors, k=10):
    scored = sorted(range(len(vectors)), key=lambda i: dot(q, vectors[i]), reverse=True)
    return scored[:k], len(vectors)              # cost = compare against ALL vectors

def ivf_topk(q, centroids, cells, nprobe, k=10):
    near = sorted(range(len(centroids)),         # which cells are closest to the query?
                  key=lambda c: dot(q, centroids[c]), reverse=True)[:nprobe]
    cand = [i for c in near for i in cells[c]]    # only the vectors in those cells
    cost = len(centroids) + len(cand)             # centroid comparisons + candidate comparisons
    scored = sorted(cand, key=lambda i: dot(q, vectors[i]), reverse=True)
    return scored[:k], cost

# recall@k = |ivf_topk ∩ exact_topk| / k, swept across nprobe -> the curve in the figure

The full script generates the vectors, runs the k-means build, and sweeps nprobe to produce the recall-vs-cost table and figure; it is in the experiment workbench beside this page. (The vectors there are random, the worst case for IVF; real embeddings cluster and do better, as the page notes.)

Up to semantic retrieval. The vectors it searches come from embeddings. Its approximate candidate list joins the lexical arm’s in hybrid search and gets cleaned up by reranking. The lexical analogue of skipping work it cannot afford is WAND.