Reciprocal Rank Fusion

Hybrid search needs to merge the lexical arm’s ranked list and the semantic arm’s ranked list into one, and the obstacle is that the two arms score on scales that have nothing to do with each other. This leaf is the merge done right. Reciprocal Rank Fusion is almost embarrassingly simple once you see it, which made me suspicious of it until I ran it against the alternatives and watched it do the thing the alternatives could not.

The formula

RRF assigns each document a fused score by summing, over every ranked list it appears in, the reciprocal of a constant plus its rank in that list:

```txt
rrf_score(d) = sum over lists L of  1 / (k + rank_L(d))
```

The symbols, in plain language. rank_L(d) is where document d sits in list L: one for the top result, two for the next, and so on. k is a constant, conventionally 60, that softens how steeply the value drops as rank worsens. The reciprocal is the whole idea: a document at rank one contributes 1 / (k + 1), a document at rank two contributes a little less, a document deep in the list contributes almost nothing, and a document absent from a list contributes zero from that list. Sum across the lists and a document that ranks high in either arm, or decently in both, accumulates a high fused score.
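
As a quick arithmetic check of the formula, here is a minimal sketch with the conventional k = 60. The helper name `contribution` and the use of `None` for "absent from this list" are assumptions made for illustration, not part of the original script.

```python
# Hypothetical helper for illustration: one list's contribution to a document's RRF score.
def contribution(rank, k=60):
    """Return 1 / (k + rank), or 0.0 when the document is absent from the list (rank is None)."""
    return 0.0 if rank is None else 1.0 / (k + rank)

top = contribution(1)        # 1/61, the most any single list can give
second = contribution(2)     # 1/62, slightly less
deep = contribution(100)     # 1/160, noticeably less than the top
absent = contribution(None)  # not in the list at all: contributes zero

# A document ranked 1st in one list and missing from the other,
# versus a document ranked 3rd in both lists.
one_arm_only = contribution(1) + contribution(None)   # 1/61
decent_in_both = contribution(3) + contribution(3)    # 2/63

print(f"{top:.5f} {second:.5f} {deep:.5f} {absent:.5f}")
print(f"{one_arm_only:.5f} {decent_in_both:.5f}")
```

At k = 60, third place in both lists (2/63) outscores first place in a single list with no appearance in the other (1/61), so showing up in both arms counts for a lot.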

The thing that is not in the formula is the thing that matters: there is no score. Nowhere does a BM25 value or a cosine value appear. RRF reads only rank position, and rank position is the one quantity that means the same thing across retrievers: rank one is “this retriever’s best” whether the retriever’s scores top out at thirty or at one. By using rank and only rank, RRF makes the scale mismatch irrelevant by construction. That is why it is the default in so many hybrid systems despite being four lines of code: it cannot be broken by a weird score distribution, because it never looks at the scores.

Watching it beat the alternatives

I did not want to take RRF’s superiority on faith, so I set up two retrievers returning the same six documents on deliberately mismatched scales: lexical scores from zero to thirty, semantic scores from zero to one. Document A was lexical’s clear favorite and document B was semantic’s clear favorite. I then compared three fusions.

Naive score addition, summing the raw lexical and semantic scores, produced a fused ranking identical to the lexical ranking alone, with B sunk near the bottom. The cosine scores were rounding error next to the BM25 scores, so the semantic arm contributed nothing. Adding scores on mismatched scales is not fusion.

Min-max normalizing each retriever’s scores into a common range and then adding did better: both arms counted, and B rose into contention. But min-max is anchored to the single highest and lowest score in each set, so one outlier rescales the whole range, and the normalized values are not stable from query to query. It works until a query with an odd score spread quietly distorts it.
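
The min-max fusion can be sketched as follows, using the same six hand-built score pairs as the experiment (document names shortened to single letters). The helper name `min_max` and the guard for all-equal scores are my own choices for this sketch, not necessarily what the full script does.

```python
# Same hand-built data as the experiment: doc -> (lexical_score, semantic_score).
docs = {
    "A": (28.0, 0.31),
    "B": (3.0, 0.94),
    "C": (15.0, 0.70),
    "D": (22.0, 0.10),
    "E": (1.0, 0.88),
    "F": (4.0, 0.25),
}

def min_max(values):
    """Rescale a doc -> score dict into [0, 1] using that dict's own min and max."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # assumption: avoid dividing by zero when all scores are equal
    return {d: (v - lo) / span for d, v in values.items()}

lex_norm = min_max({d: s[0] for d, s in docs.items()})
sem_norm = min_max({d: s[1] for d, s in docs.items()})

# Add the normalized scores and sort best-first.
combined = {d: lex_norm[d] + sem_norm[d] for d in docs}
minmax_ranking = sorted(docs, key=lambda d: combined[d], reverse=True)
print(minmax_ranking)  # B lands third, behind A and C
```

Because `lo` and `hi` come from whatever scores a given query happens to return, the same raw score can normalize to very different values on the next query, which is the instability described above.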

Across the three fusion methods, the telling number is where each puts the semantic arm’s own favorite, document B (semantic rank 1, lexical rank 5):

fusion method	final rank of semantic’s #1 (of 6)
naive score-sum	#5 — the larger lexical scale dominates
min-max normalized sum	#3 — better, but fragile to outliers
RRF	#2 — right next to lexical’s own favorite

RRF, fusing the ranks, pulled both A and B to the top together:

RRF score	doc	lexical rank	semantic rank
0.03202	A	#1	#4
0.03178	B	#5	#1
0.03175	C	#3	#3
0.03128	D	#2	#6
0.03128	E	#6	#2
0.03101	F	#4	#5

A and B sit at the top within a hair of each other, exactly right: each is some retriever’s number one, and RRF gives both number-ones the same reciprocal boost no matter what their underlying scores were. C, which was solidly third in both arms, lands third overall, only 0.00003 behind B, by being good everywhere rather than spiky anywhere. The whole ordering reflects rank standing across the two lists with the scales completely out of the picture. This is the merge naive addition could not produce and normalization could only approximate fragilely.
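
The scores in the table can be reproduced from the two rank columns alone. This is a minimal sketch assuming k = 60; the variable names are illustrative.

```python
# (lexical rank, semantic rank) per document, copied from the table above.
ranks = {
    "A": (1, 4), "B": (5, 1), "C": (3, 3),
    "D": (2, 6), "E": (6, 2), "F": (4, 5),
}
K = 60  # the conventional constant

# Each document's fused score is the sum of 1 / (K + rank) over its two ranks.
rrf_scores = {d: sum(1.0 / (K + r) for r in pair) for d, pair in ranks.items()}

for d in sorted(rrf_scores, key=rrf_scores.get, reverse=True):
    print(f"{rrf_scores[d]:.5f}  {d}  {ranks[d]}")
```

D and E tie exactly, since their ranks are mirror images (2 and 6 versus 6 and 2) and RRF does not care which arm a rank came from.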

What k actually tunes, and an honest limit

The constant k is the one knob, and I had a tidy story in my head about it that the experiment partly corrected, which is worth reporting because the correction is the real understanding.

The clean part of the story is true: k controls how sharply rank one outweighs the ranks below it. With small k, the gap between 1/(k+1) and 1/(k+2) is large, so being number one is worth far more than being number two, and the top ranks dominate the fused score. With large k, the curve flattens, the difference between adjacent ranks shrinks, and documents deeper in the lists keep contributing meaningfully. So k is a dial between “reward only the very top” and “let the whole list count,” and 60 is the conventional middle.
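
To put numbers on that dial, the sketch below compares a rank-one contribution with rank two and rank ten at a few values of k. The particular k values are arbitrary picks for illustration.

```python
def contribution(rank, k):
    """One list's contribution to the fused score: 1 / (k + rank)."""
    return 1.0 / (k + rank)

# Ratio of the rank-1 contribution to the rank-2 and rank-10 contributions.
ratios = {}
for k in (1, 10, 60, 1000):
    r1, r2, r10 = (contribution(r, k) for r in (1, 2, 10))
    ratios[k] = (r1 / r2, r1 / r10)
    print(f"k={k:>4}  rank1/rank2 = {r1 / r2:.3f}  rank1/rank10 = {r1 / r10:.3f}")
```

At k = 1, rank one is worth 1.5 times rank two and 5.5 times rank ten. At k = 60 those ratios fall to 62/61 and 70/61, roughly 1.016 and 1.148, which is the flattening described above.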

The part I had wrong was assuming k therefore reorders results dramatically. I set up a tension case, one document spiky (rank one in one arm, rank five in the other) against one steady (rank three in both), and scanned k from one to two thousand to find where the winner flips. It never flipped. The spiky document’s rank-one advantage held at every k; the gap to the steady document shrank as k grew, but it did not reverse. The reason is the convexity of 1/(k+rank): a rank-one placement is worth so much more than the deeper ranks that, for these positions, no amount of flattening lets two middling ranks overtake one top rank plus one poor one. Concretely, the two documents have the same rank sum (1 + 5 = 3 + 3), and the spiky-minus-steady gap works out to 8 / ((k+1)(k+3)(k+5)), which is positive for every k and only shrinks toward zero. A flip is possible only when the spiky document’s second rank is deep enough that its rank sum is worse than the steady document’s, and even then the crossover is gentle. So the honest characterization is that k mostly adjusts score gaps, and thus tie-breaking and genuinely close contests, more than it swings clear winners. It is a real knob, but its effect is subtle, and I would not expect retuning k to rescue a bad ranking. These are tiny hand-built lists, so the absolute numbers are illustrative, but the convexity argument is general.
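
The sweep can be sketched as follows. The function names are illustrative, and the second call is an added contrast case that was not part of the original experiment: there the spiky document’s other rank is ten instead of five, so its rank sum is worse than the steady document’s.

```python
def rrf_score(ranks, k):
    """RRF score of one document, given its rank in each list."""
    return sum(1.0 / (k + r) for r in ranks)

def first_flip(spiky, steady, k_values):
    """Return the first k at which the steady document outscores the spiky one, else None."""
    for k in k_values:
        if rrf_score(steady, k) > rrf_score(spiky, k):
            return k
    return None

sweep = range(1, 2001)

# The tension case from the text: ranks (1, 5) against (3, 3). Rank sums are equal.
flip_equal_sum = first_flip((1, 5), (3, 3), sweep)

# Added contrast case (an assumption, not from the original experiment):
# ranks (1, 10) against (3, 3), where the spiky document's rank sum is worse.
flip_deeper = first_flip((1, 10), (3, 3), sweep)

print(flip_equal_sum, flip_deeper)  # None 3
```

The tension case never flips anywhere in the sweep, while the contrast case flips almost immediately, at k = 3. Between the usual small and large choices of k, neither pair changes order.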

What RRF gives up

RRF’s strength, ignoring scores, is also exactly what it sacrifices, and being clear about the cost is the point of understanding it rather than just using it. By keeping only rank, RRF discards the confidence in each ranking. A retriever that is wildly confident its number one is correct and a retriever that barely prefers its number one over its number two produce the same rank-one signal, and RRF treats them identically. The margin between adjacent scores, which often carries real information about how sure the retriever is, is thrown away the moment you reduce a score list to a rank list. A score-aware fusion that normalized carefully could in principle keep that confidence signal. RRF trades it for robustness, and on most corpora the trade is worth it, because robustness against scale mismatch is the bigger practical problem than the lost confidence margin. But it is a trade, and a system where the score margins are genuinely meaningful and the scales are made comparable is a system where a score-aware fusion could beat RRF.
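
A small illustration of the lost margin, with made-up scores for two hypothetical retrievers: one is very sure of its top hit, the other barely separates its candidates.

```python
def to_ranking(scores):
    """Reduce a doc -> score dict to an ordered list of doc ids, best first."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical scores for illustration only.
confident = {"x": 0.98, "y": 0.21, "z": 0.05}   # large margin between #1 and #2
hesitant = {"x": 0.52, "y": 0.51, "z": 0.50}    # near tie between #1 and #2

# Both reduce to the same rank list, so RRF receives identical input from each.
confident_ranking = to_ranking(confident)
hesitant_ranking = to_ranking(hesitant)
print(confident_ranking, hesitant_ranking, confident_ranking == hesitant_ranking)
```

Once the scores are reduced to positions, the 0.77 margin and the 0.01 margin are indistinguishable, which is exactly the information a carefully normalized score-aware fusion could keep.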

The deeper reason RRF wins in practice anyway is that the fused list is not the final answer. It is a rough, fast candidate ordering that gets handed to reranking, where a slow precise model reads the top documents against the query and produces the real ordering. RRF only has to get the right documents near the top of the candidate set, not order them perfectly, and for that job its robustness matters more than the confidence information it drops. The precision it gives up is exactly the precision the next stage supplies.

The code

RRF is four lines. The toy that exposed naive-fusion’s failure is a pair of retrievers returning the same six documents on mismatched scales, lexical from 0 to 30 and semantic from 0 to 1. The data and the three fusions:

```python
# doc -> (lexical_score 0..30, semantic_score 0..1). Hand-built so the disagreement is sharp:
docs = {
    "A_exact_term_match": (28.0, 0.31),   # lexical's #1, semantic lukewarm
    "B_paraphrase_match": (3.0,  0.94),   # semantic's #1, lexical lukewarm
    "C_both_ok":          (15.0, 0.70),
    "D_lexical_only":     (22.0, 0.10),
    "E_semantic_only":    (1.0,  0.88),
    "F_weak":             (4.0,  0.25),
}
lex = sorted(docs, key=lambda d: docs[d][0], reverse=True)   # lexical ranking
sem = sorted(docs, key=lambda d: docs[d][1], reverse=True)   # semantic ranking

# naive: add the raw scores -> identical to lexical alone (cosine 0..1 is noise next to BM25 0..30)
naive = sorted(docs, key=lambda d: docs[d][0] + docs[d][1], reverse=True)

# RRF: sum of 1/(k+rank) over both rankings. NO score ever enters the calculation.
def rrf(rankings, k=60):
    score = {d: 0.0 for d in docs}
    for ranking in rankings:
        for rank, d in enumerate(ranking, start=1):
            score[d] += 1.0 / (k + rank)
    return sorted(score, key=lambda d: score[d], reverse=True)

fused = rrf([lex, sem])     # A and B both rise to the top; scale is irrelevant
```

The full script also runs the min-max-normalized fusion and the k-sweep that showed k mostly adjusts score gaps rather than flipping clear winners; it is in the experiment workbench beside this page.

Up to hybrid search. The two ranked lists it fuses come from lexical retrieval and semantic retrieval. The stage that cleans up the ordering it produces is reranking.