Reranking: Index

Page metadata
First created Jun 7, 2026
Last edited Jun 7, 2026

Everything upstream of here (lexical, semantic, and the hybrid fusion of the two) produces a candidate list quickly and roughly. Reranking is the admission that quick and rough is not good enough for the very top of the results, where the difference between the first and third result is most of what the user experiences. The fix is a second stage: take the top of the fast candidate list and re-score it with a slower, more accurate model. This territory is about why that second model is more accurate, why it is too slow to use as the first stage, and why splitting the work into a cheap wide pass and an expensive narrow pass is the only way to get both.

The two-stage shape, and why it has to be two stages

The naive wish is to use the most accurate scorer on every document and skip the staging entirely. The reason you cannot is arithmetic, and I worked it out to make the constraint concrete rather than just assert “too slow.”

The accurate scorer, a cross-encoder, runs a full model over each query-document pair at query time. Put a rough cost on it: if comparing one document the cheap way (a precomputed vector dot product) costs one unit, running the cross-encoder on one pair costs a few thousand; the figures below use 5,000. Now price the two strategies over a million-document corpus:

cross-encode ALL 1,000,000 docs:              5,000,000,000 units   (infeasible)
bi-encode 1,000,000 + cross-encode top-1,000:     6,000,000 units
bi-encode 1,000,000 + cross-encode top-100:       1,500,000 units
bi-encode 1,000,000 + cross-encode top-10:        1,050,000 units

Cross-encoding the whole corpus is five billion units, hopeless per query. But cheaply retrieving over the million and cross-encoding only the top hundred costs one and a half million units, roughly three thousand times less, because the expensive model only ever sees the handful that the cheap stage already narrowed to. That ratio is the entire justification for the architecture. The first stage’s job is recall: get the right documents into the candidate set, cheaply, even if their order is rough. The second stage’s job is precision: order the top of that set correctly, expensively, on few enough documents that the expense is affordable. Neither stage can do the other’s job, which is why there are two.
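The arithmetic behind the table is short enough to write down. This is a minimal sketch that reproduces those totals; the unit costs are the illustrative figures from the text (the 5,000-unit cross-encoder cost is the value the table's totals imply), not measurements, and the function names are made up for the example.

```python
# Illustrative unit costs from the table above, not benchmarks.
CORPUS_SIZE = 1_000_000
BI_ENCODER_COST = 1          # one precomputed-vector dot product
CROSS_ENCODER_COST = 5_000   # one full model pass over a query-document pair (assumed)


def cross_encode_all_cost() -> int:
    """Cost of running the expensive scorer on every document in the corpus."""
    return CORPUS_SIZE * CROSS_ENCODER_COST


def two_stage_cost(rerank_depth: int) -> int:
    """Cost of scoring the whole corpus cheaply, then cross-encoding the top `rerank_depth`."""
    return CORPUS_SIZE * BI_ENCODER_COST + rerank_depth * CROSS_ENCODER_COST


if __name__ == "__main__":
    full = cross_encode_all_cost()
    print(f"cross-encode all: {full:,} units")
    for depth in (1_000, 100, 10):
        cost = two_stage_cost(depth)
        print(f"top-{depth}: {cost:,} units ({full / cost:,.0f}x cheaper)")
```

Past a few hundred candidates the cheap pass over the corpus stops dominating and the rerank depth starts to set the bill, which is why depth is the knob worth tuning.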

What the precise model sees that the cheap one cannot

The cheap retriever and the precise reranker are not just fast and slow versions of the same thing. They differ in what information they can use, and that difference is the reason the reranker is worth its cost.

The cheap retriever is a bi-encoder. It turns the query into a vector and each document into a vector independently, then compares the two vectors with a dot product. Because the document vectors do not depend on the query, they can be computed offline once and reused for every query, which is exactly what makes retrieval fast and what the embeddings and ANN index are built on. The catch is that the comparison happens between two summaries made in isolation. The document was compressed to a vector before it ever saw the query, so any detail that the compression dropped is gone before the comparison starts.
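The offline/online split can be sketched in a few lines. This assumes a hypothetical `embed` function that is only a term-count vector over a fixed vocabulary, standing in for a learned encoder, and a brute-force scan standing in for an ANN index; `BiEncoderIndex` is a name invented for the example.

```python
from collections import Counter


def embed(text: str, vocab: list[str]) -> list[float]:
    """Hypothetical stand-in for a learned encoder: term counts over `vocab`."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocab]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class BiEncoderIndex:
    def __init__(self, docs: list[str], vocab: list[str]) -> None:
        self.docs = docs
        self.vocab = vocab
        # Offline step: every document is encoded once, with no query in sight.
        self.doc_vectors = [embed(doc, vocab) for doc in docs]

    def search(self, query: str, k: int) -> list[tuple[int, float]]:
        # Online step: encode only the query, then compare against stored vectors.
        query_vector = embed(query, self.vocab)
        scores = [dot(query_vector, vec) for vec in self.doc_vectors]
        ranked = sorted(range(len(self.docs)), key=lambda i: scores[i], reverse=True)
        return [(i, scores[i]) for i in ranked[:k]]
```

Nothing in `__init__` takes a query, which is the whole point: whatever `embed` leaves out of the vector is unavailable to every later search.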

The reranker is a cross-encoder. It feeds the query and the document into the model together, so the model can let query tokens and document tokens interact, attending across both at once, and produce a score from the joint view rather than from two separate summaries. That joint view carries information the independent summaries cannot. I built a toy to see the gap: a query for “small fast” against two documents that both contain the words small and fast somewhere. A bag-of-words bi-encoder scores them identically: both contain both query terms, so it cannot tell them apart. An interaction-aware scorer distinguishes them, because in one document both attributes describe the same subject and in the other they are scattered across different subjects, and only a model that looks at query and document together can check which document term each query term actually aligns with. The bi-encoder threw that alignment away when it summarized the document in isolation. The cross-encoder keeps it because it never summarizes in isolation.
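A reconstruction of that kind of toy, not the original one, might look like the following. The two example documents are invented, and `interaction_score` is a hand-written proximity rule that only stands in for what a real cross-encoder learns through attention; it is an assumption for illustration, not how a cross-encoder is implemented.

```python
def bag_of_words_score(query: str, doc: str) -> int:
    """Count query terms present anywhere in the document; word positions are ignored."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)


def interaction_score(query: str, doc: str, window: int = 3) -> int:
    """Most distinct query terms found inside any `window` consecutive document tokens.

    A crude proxy for "these attributes describe the same subject". It needs the
    document's token positions, which a single summary vector does not keep.
    """
    query_terms = set(query.lower().split())
    tokens = doc.lower().split()
    best = 0
    for start in range(len(tokens)):
        span = set(tokens[start:start + window])
        best = max(best, len(query_terms & span))
    return best


QUERY = "small fast"
SAME_SUBJECT = "a small fast boat left the harbor"
SCATTERED = "a small harbor with one fast boat"
```

Both documents get the same bag-of-words score, while the proximity rule scores the first higher than the second, because only the first puts both attributes next to the same noun.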

That is the trade in one line: the bi-encoder is fast because it can precompute, and it can precompute because it ignores the query when encoding the document, which is exactly the information the cross-encoder spends its cost to use. Speed and precision are in tension here for a structural reason, not an incidental one, and the two-stage pipeline resolves the tension by using each model where its strength fits.
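Put together, the pipeline is two sorts with a cut between them. This is a sketch under stated assumptions: `cheap_score` and `precise_score` are placeholder callables for the bi-encoder and cross-encoder, the parameter names are invented, and the first stage scans every document in a loop where a real system would query an ANN index.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    candidates: int = 100,
    top_k: int = 10,
) -> list[str]:
    # Stage 1 (recall): score everything cheaply and keep a wide shortlist.
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:candidates]
    # Stage 2 (precision): the expensive scorer runs only on the shortlist.
    reranked = sorted(shortlist, key=lambda d: precise_score(query, d), reverse=True)
    return reranked[:top_k]
```

The expensive scorer is called at most `candidates` times per query regardless of corpus size, which is the property the cost table depends on. It also means a document the first stage misses can never be recovered by the second.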

The leaf

Cross-encoders

The precise reranker in full: how a cross-encoder scores a query-document pair by joint attention, the bi-encoder versus cross-encoder distinction and exactly what information each can and cannot represent, the latency budget that caps how many documents you can rerank, and the forks: how deep to rerank and when a cheaper reranker is enough.

A note on fusion-as-reranking

One thing worth separating: reciprocal rank fusion (RRF) is sometimes called a reranker, because it reorders the retrieved lists. It is a fusion, not a model: it combines existing rankings by their rank positions and reads neither the query nor the document text. The reranking this territory is about is a learned model that reads the actual text of the query and document together. Both reorder results, but one merges rankings cheaply by position and the other re-scores documents expensively by content, and conflating them obscures that the expensive content-aware step is where the real precision comes from.
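To make the contrast concrete, here is a minimal sketch of reciprocal rank fusion. The function name and the document ids are illustrative, and `k=60` is the commonly used default constant rather than anything this page prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list.

    Each document earns 1 / (k + rank) from every list it appears in. Only
    positions are used; no query text or document text is read.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The function's inputs are lists of ids and nothing else, so it has no way to notice that a highly ranked document is actually off-topic. That gap is what the content-reading reranker exists to fill.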


Up to the whole map. The candidate list it re-scores comes from hybrid fusion of the lexical and semantic arms. The bi-encoder it improves on is built from embeddings.

Index

  • Cross-Encoders. The precise reranker. How scoring a query and document together by joint attention beats comparing two independently-made vectors, exactly what information the bi-encoder discards when it summarizes a document without the query, the latency budget that caps how deep you can rerank, and the forks on reranker depth and size.