Hybrid Search: Index

First created Jun 7, 2026. Last edited Jun 7, 2026.

By the time I had worked through both arms, the case for hybrid search had already made itself. The lexical arm fails on paraphrase, where the query and the answer share no words. The semantic arm fails on literals such as error codes, API names, and SKUs, where the exact word is the whole point. They fail on largely disjoint sets of queries. That is the condition under which running both and combining them beats running either alone: each one catches what the other drops, so the combined candidate set covers queries that either arm alone would miss. Hybrid search is the recognition that the two arms are not rivals to choose between but halves of one retriever.

This territory is small and has one real mechanism, so it is a short map with one deep leaf. The mechanism is how you combine the two ranked lists, and it is less obvious than it sounds.

Why you cannot just add the scores

My first instinct for combining two rankers was to add their scores: take each document’s lexical score and its semantic score and sum them. It does not work, and seeing exactly why it fails is the whole motivation for the fusion method the field actually uses.

The problem is that the two scores live on incompatible scales. A BM25 score might run from zero to thirty, with no fixed ceiling; a cosine similarity is bounded between minus one and one, and in practice often lands between zero and one. Add them and the BM25 score dominates entirely, because thirty swamps anything of magnitude one or less. I checked this rather than assume it: I built two retrievers returning the same six documents on those two scales, where document A was lexical’s clear favorite and document B was semantic’s clear favorite, and summed the raw scores. The fused ranking came out identical to the lexical ranking alone, and B, the document semantic loved most, sank near the bottom. The semantic arm contributed essentially nothing. Naive score addition is not a fusion; it is the larger-scaled retriever wearing the other as a decoration.
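A minimal sketch of that failure, using toy scores invented here for illustration (they are not the numbers from the experiment described above). The document ids, the score values, and the helper `rank_by` are all assumptions; only the shape of the setup follows the text: six documents, A favored by the lexical arm, B favored by the semantic arm.

```python
# Toy scores invented for illustration; NOT the measurements from the text.
# BM25-like scores: no fixed ceiling, here reaching almost thirty.
bm25 = {"A": 28.0, "C": 21.0, "D": 15.0, "E": 9.0, "F": 6.0, "B": 4.0}
# Cosine-like scores: bounded, here between zero and one.
cosine = {"B": 0.92, "C": 0.80, "A": 0.55, "F": 0.40, "E": 0.35, "D": 0.30}


def rank_by(scores):
    """Return document ids ordered from highest score to lowest."""
    return sorted(scores, key=scores.get, reverse=True)


# Naive fusion: add the raw scores with no rescaling.
naive = {doc: bm25[doc] + cosine[doc] for doc in bm25}

lexical_order = rank_by(bm25)
semantic_order = rank_by(cosine)
naive_order = rank_by(naive)

print("lexical :", lexical_order)
print("semantic:", semantic_order)
print("naive   :", naive_order)
```

With these particular numbers the naive ordering matches the lexical ordering exactly, because the gaps between BM25 scores are all larger than one and no cosine difference can close them.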

The obvious patch is to normalize the scores first, rescaling each retriever’s scores to a common range before adding. In the same experiment, min-max normalizing both score sets and then summing did fix the domination: both arms started counting and B rose. But normalization is fragile in a way that matters in production. Min-max is defined by the single highest and lowest score, so one outlier query or one anomalous document rescales everything, and the normalized scores are not stable across queries. Score normalization works until a weird score distribution breaks it, and then it breaks silently.
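The fragility can be sketched with the same kind of toy. Everything below is an assumption made for illustration: the scores, the outlier value of 280, and the helper names `min_max` and `fuse_normalized`. The thing to watch is documents B and C, whose own scores never change between the two runs.

```python
def min_max(scores):
    """Rescale scores linearly so the lowest maps to 0 and the highest to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: nothing to spread out
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_normalized(lexical, semantic):
    """Sum the min-max normalized scores and return ids, best first."""
    lex, sem = min_max(lexical), min_max(semantic)
    total = {doc: lex[doc] + sem[doc] for doc in lexical}
    return sorted(total, key=total.get, reverse=True)


# Toy scores invented for illustration; NOT measurements from the text.
bm25 = {"A": 28.0, "C": 21.0, "D": 15.0, "E": 9.0, "F": 6.0, "B": 4.0}
cosine = {"B": 0.92, "C": 0.80, "A": 0.55, "F": 0.40, "E": 0.35, "D": 0.30}

typical = fuse_normalized(bm25, cosine)

# Same documents, but A now carries an anomalously large BM25 score.
bm25_outlier = dict(bm25, A=280.0)
skewed = fuse_normalized(bm25_outlier, cosine)

print("typical:", typical)
print("skewed :", skewed)
```

In the first run C lands ahead of B; in the second, B lands ahead of C, even though only A's score moved. The outlier stretches the lexical range, compresses every other lexical score toward zero, and quietly changes how much the lexical arm counts for everyone else.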

Fuse ranks, not scores

The move that sidesteps the whole scale problem is to throw the scores away and fuse the ranks. A document’s rank in each retriever’s list (first, second, third) is comparable across retrievers even when the raw scores are not, because rank position means the same thing everywhere: rank 1 is “this retriever’s best,” whether the retriever is lexical or semantic, whether its scores top out at thirty or at one. Rank is the common currency the scores refused to be.

The standard method is Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (k + rank) over the lists it appears in. A position near the top of either list contributes a lot, a position far down contributes little, and scale never enters the calculation. In the same toy, RRF pulled both A (lexical’s number one) and B (semantic’s number one) to the top together, because each retriever’s top pick earns the same reciprocal-rank boost regardless of its underlying score, and the document that was solidly good in both arms won outright. That is the behavior you want from a fusion: reward being near the top of either list, reward being decent in both, and never let one retriever’s score units dominate the other’s. How RRF works in detail, the k knob and what it tunes, and where it has limits, is the leaf: Reciprocal Rank Fusion.
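The formula is short enough to sketch directly. This is a minimal version under stated assumptions: ranks are 1-based, k defaults to 60 (a commonly used value, not one the text specifies), and the two rankings are toy lists invented for illustration rather than the lists from the experiment.

```python
def rrf_scores(ranked_lists, k=60):
    """Sum 1 / (k + rank) for each document over the lists it appears in.

    Ranks are 1-based. A document missing from a list simply gets no
    contribution from that list.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores


def rrf(ranked_lists, k=60):
    """Return document ids ordered by fused RRF score, best first."""
    scores = rrf_scores(ranked_lists, k)
    return sorted(scores, key=scores.get, reverse=True)


# Toy rankings invented for illustration: A tops lexical, B tops semantic,
# and C is second in both.
lexical = ["A", "C", "D", "E", "F", "B"]
semantic = ["B", "C", "A", "F", "E", "D"]

fused = rrf([lexical, semantic])
print(fused)
```

Note that the function never sees a score, only positions, so a retriever whose scores top out at thirty and one whose scores top out at one are treated identically. With these toy lists, each arm's top pick and the document ranked second in both all land in the fused top three.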

The weighting fork

Fusing ranks does not mean the two arms must count equally. You can weight them, giving the lexical or the semantic list more influence in the fused score, and the right weighting is a property of the corpus and the queries. A corpus full of exact identifiers and error codes leans lexical; a corpus of conversational questions over prose leans semantic. The weighting is a dial to tune against a relevance measure, the same offline evaluation machinery that tunes BM25’s parameters and the ANN recall dial, and it connects to the broader point that every knob in this system is set by measurement, not preference. Tune it wrong and you have two arms but the judgment of one.
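One common way to add that dial, assumed here rather than taken from the text, is to multiply each list's reciprocal-rank term by a per-list weight. The function name `weighted_rrf`, the weight values, and the toy rankings are all hypothetical; in practice the weights would come out of an offline evaluation, not be picked by hand as they are below.

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """RRF variant where each list's contribution is scaled by a weight.

    A document at 1-based position `rank` in a list with weight w
    contributes w / (k + rank) to its fused score.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Toy rankings invented for illustration.
lexical = ["A", "C", "D", "E", "F", "B"]
semantic = ["B", "C", "A", "F", "E", "D"]

# Hypothetical weight settings: one leaning lexical, one leaning semantic.
lexical_heavy = weighted_rrf([lexical, semantic], weights=[3.0, 1.0])
semantic_heavy = weighted_rrf([lexical, semantic], weights=[1.0, 3.0])

print("lexical-heavy :", lexical_heavy)
print("semantic-heavy:", semantic_heavy)
```

The same two lists produce different fused orders depending on the weights: the lexical-heavy setting places A ahead of B, and the semantic-heavy setting places B ahead of A. Equal weights reduce this to plain RRF.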

Where hybrid sits in the path

Hybrid fusion produces a single ranked candidate list from the two arms, and that list is still not the final answer the user sees. It is fast and its ordering is rough, because both underlying scorers are cheap and the fusion is cruder than a model that reads the query and document together. So the fused list feeds the last stage, reranking, which re-scores the top of it with a slow, precise model. Hybrid’s job is to get the right documents into the candidate set from both arms; ordering the very top of that set correctly is reranking’s job. The pipeline is retrieve from both arms, fuse, rerank, and hybrid is the middle join that makes the two retrievers into one input.
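The shape of that pipeline can be sketched in a few lines. Every name here is hypothetical: `hybrid_search`, the `candidates` and `rerank_top` parameters, and the three `fake_*` stand-ins, which return canned results in place of a real lexical index, vector index, and reranking model. Plain RRF with k = 60 is assumed as the fusion step.

```python
def rrf(ranked_lists, k=60):
    """Fuse ranked lists by summing 1 / (k + rank); return ids, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def hybrid_search(query, lexical, semantic, rerank, candidates=50, rerank_top=10):
    """Retrieve from both arms, fuse, then rerank only the head of the list."""
    fused = rrf([lexical(query, candidates), semantic(query, candidates)])
    head, tail = fused[:rerank_top], fused[rerank_top:]
    # The slow, precise model sees only the top of the fused list.
    return rerank(query, head) + tail


# Stand-ins with canned results; real retrievers and a real reranker go here.
def fake_lexical(query, n):
    return ["A", "C", "D", "E", "F", "B"][:n]


def fake_semantic(query, n):
    return ["B", "C", "A", "F", "E", "D"][:n]


def fake_rerank(query, docs):
    return sorted(docs)  # placeholder ordering, not a relevance model


result = hybrid_search("example query", fake_lexical, fake_semantic,
                       fake_rerank, candidates=6, rerank_top=3)
print(result)
```

The design point is the split at `rerank_top`: fusion decides which documents make the candidate set, and the reranker is only trusted, and only paid for, on the few at the top.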

The leaf

Reciprocal Rank Fusion

The fusion mechanism in full: the 1 / (k + rank) formula and what each piece does, why fusing ranks beats fusing scores even after normalization, what the k constant tunes and the honest limits of what it can reorder, and where RRF gives up information that a score-aware fusion would keep.

Up to the whole map. The two arms it joins are lexical retrieval and semantic retrieval. The stage it feeds is reranking.

Index

Reciprocal Rank Fusion. The standard way to merge two ranked lists on incompatible score scales into one. The 1/(k+rank) formula and what each piece does, a measured comparison against naive and normalized score fusion, what the k constant actually tunes, an honest look at when it does and does not reorder, and the information RRF throws away.