
Cross-Encoders

The reranking territory established the shape: retrieve cheaply and widely, then re-score the top precisely. This leaf is the precise re-scorer, the cross-encoder. What I wanted to understand exactly is why it is more accurate, because “it reads them together” is the kind of phrase I can repeat without understanding. The answer is about what information survives each model’s encoding, and once I had it the speed-versus-precision trade stopped looking like a tuning choice and started looking like a law.

Two ways to compare a query and a document

Put the two architectures side by side, because the difference is the whole leaf.

A bi-encoder encodes the query into a vector and the document into a vector separately, with no interaction between them, then scores the pair by the cosine similarity or dot product of the two vectors. The document encoding does not depend on the query at all. That independence is the source of the bi-encoder’s speed: because a document’s vector is the same for every query, you compute it once, offline, store it in the ANN index, and at query time you only encode the query and do fast vector comparisons. The entire retrieval stage runs on this property.
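A minimal sketch of that property, under loud assumptions: `embed` is a toy bag-of-words stand-in for a learned encoder, and a linear scan stands in for the ANN index. Neither is how a real system does it; the only thing the sketch is meant to show is that `doc_vectors` is built once with no query in sight and then reused for every query.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [
    "the car is small and fast",
    "the delivery is slow",
    "a recipe for bread",
]

# Offline: each document is encoded once, independently of any query.
doc_vectors = [embed(doc) for doc in corpus]

def search(query: str) -> list[tuple[float, str]]:
    # Query time: one query encoding, then cheap vector comparisons
    # against the stored document vectors (linear scan instead of ANN).
    q = embed(query)
    scored = [(cosine(q, vec), doc) for vec, doc in zip(doc_vectors, corpus)]
    return sorted(scored, reverse=True)

results_a = search("small fast car")
results_b = search("slow delivery")
print(results_a[0][1])
print(results_b[0][1])
```

Both searches read the same `doc_vectors`; nothing about the documents is recomputed between queries.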

A cross-encoder encodes the query and the document together, as a single joined input, and runs the full model over the pair so that every query token can attend to every document token and back. The output is a single relevance score for that specific query against that specific document. There is no reusable document vector, because the document’s representation now depends on the query it is paired with. That dependence is the source of the cross-encoder’s precision and also of its cost: nothing can be precomputed, so the full model runs once per query-document pair at query time.

What the joint view keeps that two summaries drop

The precision gap comes down to one thing: a bi-encoder compares two summaries that were each written without knowledge of the other, and a cross-encoder never writes a summary in isolation. Whatever a document’s vector had to leave out to fit into a few hundred numbers is gone before the query is ever considered, and the query cannot ask the document about a detail the document already discarded.

I built a small example to make the loss visible rather than abstract. The query is “small fast,” asking for two attributes that should both describe the same thing. Two documents both contain the words small and fast, but in one document both words describe the target subject and in the other they are scattered across different subjects, one attribute on the subject and the other on something else entirely. A bag-of-words bi-encoder, which is the crude stand-in for “summarize each side independently and compare,” scored both documents identically, because both contain both query terms somewhere. It genuinely could not tell them apart. An interaction-aware scorer, the stand-in for joint encoding, distinguished them, because it could check which document term each query term actually aligns with and see that only one document has both attributes attached to the right subject.

document (query: “small fast”)	both query words present?	bi-encoder score	cross-encoder score
docA: car is small and slow, delivery is fast	yes (scattered)	2	1
docB: car is small and fast, delivery is slow	yes (both on the car)	2	2

The bi-encoder gives both a 2 and cannot rank them; the cross-encoder separates them because it sees which subject each attribute attaches to.

That toy is cruder than a real cross-encoder, which uses learned attention rather than my hand-coded alignment check, but the structural point is exactly the real one. The information that separates the two documents, whether the attributes the query asks for attach to the same subject, is a fact about the query and document jointly. The document alone cannot know which of its attachments a future query will care about, so that fact cannot be counted on to survive a document summary made before the query arrives. The bi-encoder is not failing to try hard enough; it is structurally unable to represent the distinction, because it committed to a query-independent document vector. The cross-encoder can represent it because it refused to commit early. The cost of refusing to commit is that nothing is reusable, which is the whole price.

The latency budget

Because a cross-encoder runs the full model per pair, the only question that governs its use in production is how many pairs you can afford, and that is a latency budget. A reranker has some milliseconds of the total query latency to spend, the model takes some time per document, and the quotient is how many documents you can rerank. That number is small, typically tens to a few hundred, which is exactly why reranking is a second stage over a narrowed candidate set and never a first stage over the corpus. The cost arithmetic in the code at the end of this leaf made this concrete: cross-encoding a whole corpus is hopeless, cross-encoding the top hundred is cheap, and the gap is a factor of thousands.
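The quotient is simple enough to write down. This is a sketch with made-up numbers: the 100 ms budget and the two per-document timings are hypothetical, and it assumes pairs are scored one after another with a constant cost each, which ignores batching on real hardware.

```python
def rerank_depth(budget_ms: float, ms_per_doc: float) -> int:
    """How many query-document pairs fit in the reranker's share of the latency."""
    return int(budget_ms // ms_per_doc)

# Hypothetical numbers, for illustration only.
budget_ms = 100.0
for name, ms_per_doc in (("small model", 0.5), ("large model", 4.0)):
    print(name, rerank_depth(budget_ms, ms_per_doc))
# small model 200
# large model 25
```

The same budget buys a deep candidate set for a cheap model and a shallow one for an expensive model, which is why the depth lands in the tens to a few hundred.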

This sets up a recall constraint that ties the whole pipeline together. The reranker can only reorder documents it is given, so if the right document was not in the top-k that retrieval and fusion handed up, the reranker never sees it and no amount of precision recovers it. The depth of the candidate set is therefore a joint decision between the cheap stage’s recall and the expensive stage’s budget: retrieve deep enough that the right answer is almost always in the set, but not so deep that reranking blows its latency. What reranking can deliver is bounded above by retrieval recall, which is a good reason the cheap first stage still matters even when a strong reranker sits behind it.
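The bound can be checked with an idealized reranker. In this sketch the rankings and relevance labels are invented, each query is assumed to have exactly one relevant document, and `perfect_rerank` is an oracle that no real model matches; the point is only that even the oracle’s top-1 accuracy equals the first stage’s recall at the same depth.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries whose relevant document is among the top-k candidates."""
    hits = sum(relevant[q] in docs[:k] for q, docs in rankings.items())
    return hits / len(rankings)

def perfect_rerank(candidates, relevant_doc):
    """Oracle reranker: moves the relevant document first if it is present."""
    return sorted(candidates, key=lambda d: d != relevant_doc)

def top1_after_perfect_rerank(rankings, relevant, k):
    """Top-1 accuracy when the oracle only sees the top-k candidates."""
    hits = sum(
        perfect_rerank(docs[:k], relevant[q])[0] == relevant[q]
        for q, docs in rankings.items()
    )
    return hits / len(rankings)

# Hypothetical first-stage rankings and relevance labels.
rankings = {
    "q1": ["d3", "d1", "d7", "d2"],
    "q2": ["d5", "d9", "d4", "d8"],
    "q3": ["d6", "d2", "d1", "d0"],
    "q4": ["d1", "d2", "d3", "d4"],
}
relevant = {"q1": "d1", "q2": "d8", "q3": "d6", "q4": "d9"}

for k in (2, 4):
    print(k, recall_at_k(rankings, relevant, k),
          top1_after_perfect_rerank(rankings, relevant, k))
# 2 0.5 0.5
# 4 0.75 0.75
```

Query q4 never has its relevant document retrieved, so no depth within these candidates and no reranker recovers it.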

The forks

How deep to rerank. Rerank more documents and you are more likely to have the right one in the set and to order the top correctly, at higher latency. Rerank fewer and you are faster but riskier on recall. The right depth is set against a relevance measure and the latency budget together, the same measure-don’t-guess discipline that tunes BM25 and the ANN dial.

How big a reranker. Cross-encoders come in sizes, and a larger model is more precise per document and slower per document, which shrinks how many documents fit in the latency budget. There is a real trade between a big reranker over a shallow candidate set and a smaller reranker over a deeper one, and which wins depends on whether the corpus’s hard cases are about deep recall or fine ordering.

Whether you need one at all. A reranker earns its latency only if the cheap stage’s ordering is actually leaving relevance on the table near the top. If hybrid fusion already orders the top results well for your corpus, a cross-encoder is cost without benefit. The honest version of this fork is to measure the relevance lift from reranking against its latency cost and keep it only if the lift is real, rather than adding it because the architecture diagram has a box for it.
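One way to make that measurement concrete is to compare a ranking metric before and after reranking on a labeled query set. The sketch below uses mean reciprocal rank, which is my choice of metric rather than anything this leaf prescribes; the rankings, the `MIN_LIFT` threshold, and the latency figures are all hypothetical.

```python
def mrr(rankings, relevant):
    """Mean reciprocal rank of the relevant document; 0 for a query that misses it."""
    total = 0.0
    for q, docs in rankings.items():
        if relevant[q] in docs:
            total += 1.0 / (docs.index(relevant[q]) + 1)
    return total / len(rankings)

# Hypothetical labeled queries: first-stage order vs. order after reranking.
relevant = {"q1": "d1", "q2": "d8", "q3": "d6"}
first_stage = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d5", "d9", "d8"],
    "q3": ["d6", "d2", "d1"],
}
reranked = {
    "q1": ["d1", "d3", "d7"],
    "q2": ["d5", "d8", "d9"],
    "q3": ["d6", "d1", "d2"],
}

MIN_LIFT = 0.05           # hypothetical: smallest MRR gain worth paying for
LATENCY_BUDGET_MS = 100   # hypothetical: what the reranker may add per query
added_latency_ms = 40     # hypothetical: measured cost of the reranking stage

lift = mrr(reranked, relevant) - mrr(first_stage, relevant)
keep_reranker = lift >= MIN_LIFT and added_latency_ms <= LATENCY_BUDGET_MS
print(round(lift, 3), keep_reranker)
```

If the first stage already put the relevant documents near the top, `lift` would sit near zero and the same check would say to drop the reranker.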

The thing I will keep from this leaf is that bi-encoder and cross-encoder are not fast-versus-accurate versions of one design. They are two different bets about when to look at the query: the bi-encoder looks late, after the document is already a fixed vector, which is what lets it precompute; the cross-encoder looks early, while it still has the full text of both, which is what lets it be precise. The pipeline uses each where its bet pays off, the bi-encoder to narrow the field cheaply and the cross-encoder to settle the top precisely.

The code

Two small experiments. The cost model prices the two strategies; the information toy shows the bi-encoder’s blind spot. First the cost, with one cross-encoder pair forward pass costed at ~5000 units and one precomputed dot product at 1:

python

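N, DOT, PAIR = 1_000_000, 1, 5000              # corpus size, cost per dot product, cost per cross-encoder pair
full = N * PAIR
print(full)                                      # cross-encode ALL -> 5,000,000,000  (infeasible)
for topk in (1000, 100, 10):
    staged = N * DOT + topk * PAIR               # dot-product score all + cross-encode top-k
    print(topk, staged, round(full / staged))    # depth, total cost, how many times cheaper than cross-encoding all
# top-1000 -> 6,000,000 (~833x cheaper)   top-100 -> 1,500,000 (~3333x)   top-10 -> 1,050,000 (~4762x)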

Then the information toy. A bag-of-words bi-encoder counts query terms present anywhere; an interaction-aware scorer checks the terms against the target subject only:

python

query = {"small", "fast"}
docA = {"text": {"small","slow","car","fast","delivery"}, "car_attrs": {"small","slow"}}
docB = {"text": {"small","fast","car","slow","delivery"}, "car_attrs": {"small","fast"}}

def bi_encoder(doc):    return len(query & doc["text"])        # present anywhere -> both score 2
def cross_encoder(doc): return len(query & doc["car_attrs"])   # on the car -> docA 1, docB 2

# bi_encoder(docA) == bi_encoder(docB) == 2   -> cannot tell them apart
# cross_encoder(docA)=1, cross_encoder(docB)=2 -> the joint view separates them

The real cross-encoder learns the alignment with attention rather than my hand-coded car_attrs check, but the structural lesson is the same. Both scripts are in the experiment workbench beside this page.
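A slightly less hand-coded variant of the toy, as a sketch only. It assumes documents are written as simple “subject is attr and attr” clauses separated by commas, which is a format I am inventing for the illustration, and it scores a document by the most query attributes found on any single subject rather than on the car specifically. It also makes the bi-encoder’s blind spot explicit: the two documents have identical bags of words, so any score computed from those bags alone must tie.

```python
from collections import Counter

docA = "car is small and slow, delivery is fast"
docB = "car is small and fast, delivery is slow"
query = {"small", "fast"}

def tokens(text: str) -> list[str]:
    return text.replace(",", " ").split()

def bag_of_words_score(query: set[str], doc: str) -> int:
    # Independent summary: only term counts survive, not what attaches to what.
    bag = Counter(tokens(doc))
    return sum(1 for term in query if bag[term] > 0)

def joint_score(query: set[str], doc: str) -> int:
    # Joint view: the most query attributes attached to any one subject.
    best = 0
    for clause in doc.split(","):
        _subject, _, attrs = clause.strip().partition(" is ")
        best = max(best, len(query & set(attrs.split(" and "))))
    return best

print(Counter(tokens(docA)) == Counter(tokens(docB)))                # True: identical summaries
print(bag_of_words_score(query, docA), bag_of_words_score(query, docB))  # 2 2
print(joint_score(query, docA), joint_score(query, docB))                # 1 2
```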

Up to reranking. The bi-encoder it improves on is built from embeddings and searched by ANN. The candidate set it re-scores is produced by hybrid fusion. The recall it depends on traces all the way back to the lexical and semantic arms.