From Lexical Retrieval

BM25

TF-IDF got the big things right. Weight rare terms above common ones, reward a document for actually using a query term, and multiply the two so a worthless word cannot win on volume. When I worked it through I came out convinced it was most of the way to a real ranker. Then I looked at the two specific places it still misbehaves, and BM25 is exactly the repair of those two places, nothing more and nothing less. That framing helped me a lot: BM25 is not a new idea, it is TF-IDF with two patches, and if I keep the patches in view the formula stops looking like something memorized off a slide and starts looking like an argument.

This page is me walking the two failures, deriving the fix for each, and then running BM25 against raw TF-IDF on the same real corpus to watch the fixes change actual rankings. The patches are easy to state and the second one rescued a document I would have wanted rescued, which is when I trusted it.

Where this came from

The name is the part I did not expect and it is worth knowing because it deflates the formula nicely. BM25 came out of the Okapi information retrieval system at City University London, built in the early 1990s by Stephen Robertson and Steve Walker and colleagues, growing out of a probabilistic theory of relevance Robertson and Spärck Jones had been developing since the 1970s. “BM” stands for Best Match, and the 25 is a version number. It was the twenty-fifth weighting scheme they tried, and it is the one that worked well enough to stick. The default ranking function of lexical search for the last two decades, the thing under Lucene and Elasticsearch and Solr, is a numbered experiment that happened to land. I find that oddly reassuring. Nobody derived it from first principles in one shot; they iterated, and the twenty-fifth try was good.

What they were fixing was real. By the early 1990s TF-IDF was well understood and its weaknesses were too. BM25 is the form that packaged the fixes into a function with two interpretable knobs, and the interpretability is most of why it won over fancier alternatives.

The first failure: term frequency never stops counting

Raw term frequency grows without limit. A document that uses a query word ten times scores ten times higher on that word than a document that uses it once. I had nodded along to that in TF-IDF, and then I thought about what it actually claims: that a page mentioning gradient a hundred times is a hundred times more about gradients than a page that mentions it once. That is plainly false. The first occurrence tells you the document is on topic. The fifth confirms it. The fiftieth tells you almost nothing the first did not. The signal should rise fast and then level off, and raw TF rises forever in a straight line.

The fix is to run term frequency through a saturating function, one that climbs steeply at first and then approaches a ceiling. BM25’s choice is

txt

tf_factor = tf * (k1 + 1) / (tf + k1)

and the only new symbol is k1, a tuning constant. Let me say what each piece does before trusting it. The numerator tf * (k1 + 1) grows with tf. The denominator tf + k1 also grows with tf, and because both top and bottom grow, the ratio cannot run away; it flattens toward a ceiling. As tf gets very large the factor approaches k1 + 1, the asymptote. The constant k1 sets how fast you get there: it is roughly the term frequency at which the curve reaches half its climb, so a small k1 saturates quickly (the second occurrence already barely adds) and a large k1 saturates slowly (occurrences keep mattering for longer). Typical values sit around 1.2. Push k1 to infinity and the saturation disappears, the factor becomes proportional to tf again, and you are back to raw TF. That last fact is the tell that this really is TF-IDF with a knob: turn the knob to one extreme and the new behavior vanishes.

I did not want to take the shape on faith, so I computed the factor across a range of term frequencies with k1 = 1.2:

BM25 term-frequency factor (k1=1.2) climbs steeply then flattens toward its ceiling of k1+1=2.2, while raw TF (dashed) climbs forever. The hundredth occurrence buys almost nothing; the second buys a lot.

raw tf	1	2	3	5	10	20	50	100	∞
tf-factor	1.000	1.375	1.571	1.774	1.964	2.075	2.148	2.174	2.200
gain over previous	—	+0.375	+0.196	+0.203	+0.190	+0.111	+0.073	+0.026	→ 0

That is the whole argument in a column of numbers. Going from one occurrence to two adds 0.375. Going from fifty to a hundred adds 0.026. The hundredth occurrence is worth less than a tenth of what the second was worth, and the curve is visibly crawling toward its ceiling of 2.2. Raw TF-IDF, on the same document, would have scored the hundred-occurrence term roughly fifty times higher than the two-occurrence one. The saturation curve says: yes, more is more, but with sharply diminishing returns, which is what relevance actually does.

The second failure: long documents win by being long

The other thing TF-IDF cannot see is length. A long document accumulates higher term frequencies simply because it has more words, so it racks up score without being any more relevant. On the toy three-line corpus this was invisible because every document was the same size. On real text it is glaring, and it is the failure I most wanted to watch break, because length bias is the kind of thing that quietly ruins a ranking without ever looking like a bug.

The corpus I used makes the point on its own. The fifty-eight documents average about six hundred and ninety tokens, but the shortest is essentially empty and the longest is over three thousand. A raw score that grows with raw term frequency systematically favors the three-thousand-token end. The fix is to normalize a document’s term frequencies against its length, so that being long is neither a reward nor, taken too far, a penalty. BM25 folds the normalization into the denominator of the saturation term:

txt

                              tf * (k1 + 1)
score(t, d) = idf(t) * --------------------------------------
                        tf + k1 * (1 - b + b * (dl / avgdl))

Two new symbols, dl and b, and one quantity carried in from the corpus, avgdl. dl is this document’s length in tokens. avgdl is the average document length across the corpus, which is why length normalization needs a corpus-wide statistic and is not something a single document can compute about itself. The ratio dl / avgdl is how this document compares to typical: greater than one for a long document, less than one for a short one. b is the knob that controls how much that ratio matters, and it runs from 0 to 1. At b = 0 the whole length term collapses to 1 and length is ignored entirely, which recovers pure saturation with no normalization. At b = 1 length is applied in full. The usual setting is around 0.75, partway on, because fully penalizing length tends to over-correct and punish genuinely thorough documents.

The mechanism, once I traced it, is that a long document inflates its own denominator. For a document longer than average, dl / avgdl > 1, so the denominator grows, so each term’s contribution shrinks. The long document has to use the term proportionally more, not just absolutely more, to keep the same score. That is the right correction: it asks for density, not bulk.

Watching the patches change real rankings

The formula is one thing and a reranking is another, so I scored the same real corpus both ways, raw TF-IDF and full BM25 with k1 = 1.2, b = 0.75, and looked for queries where they disagree. The saturation fix I had already seen in the curve. The length fix is the one that produced the result I wanted. Here is the top of the embedding query under each, with each document’s length in tokens:

rank	TF-IDF (no length norm)	BM25 (b=0.75)
1	1715 tok	1715 tok
2	2510 tok	2510 tok
3	2724 tok	385 tok (“What Search Is”, was rank 7)

On the query embedding, raw TF-IDF’s top three are all long documents, between roughly seventeen hundred and twenty-seven hundred tokens, the kind of sprawling pages that mention embeddings a lot because they mention everything a lot. BM25 keeps the two genuinely on-topic long pages but pulls a three-hundred-and-eighty-five-token page, the short intro titled “What Search Is,” up into the top three from seventh place. A short, focused document that uses the term densely beats long documents that use it merely often. That is b doing exactly its job, on real text, and it is the behavior I would want as a user: the tight relevant answer should not lose to the long one just because the long one had more room to repeat the word.

The same pattern showed up on other queries. On naive bayes BM25 promoted a seven-hundred-token page over longer ones; on gradient descent it reshuffled a fifteen-hundred-token page up past where length alone had placed it. In every case the reordering favored density over size, which is the whole point of the normalization.

Then the sanity check, the one that convinced me BM25 really is TF-IDF plus two knobs rather than a different animal. I set k1 to ten thousand, effectively disabling saturation, and b to zero, disabling length normalization, and reran. The top ten results came back in the identical order as raw TF-IDF, ten out of ten positions matching. With both patches turned off, BM25 degenerates to the thing it was patching. That is the cleanest possible evidence that the two knobs are the entire difference.

These numbers are from a small pure-Python implementation I ran on the wiki, not from anything production, and the analyzer was crude on purpose. But they are real output, and the value was watching the short document win the embedding query. I believed length normalization mattered before; I understood it after seeing it move a specific page.

The knobs as a fork

k1 and b are not constants handed down from above. They are a tuning surface, and choosing them is the standing design fork of BM25.

k1, saturation speed. Low k1, around 1.0 or below, saturates fast: after the first couple of occurrences a term stops gaining, which suits corpora where a single mention already signals relevance, like short documents or titles. High k1 keeps rewarding repetition longer, which can suit long-form content where genuine term density carries meaning. The cost of getting it wrong is subtle, a slow drift in relevance rather than an outright break, which is exactly why it is worth tuning against a real relevance measure rather than eyeballing.

b, length sensitivity. b = 0 ignores length, which can be right when documents are uniform or when length genuinely correlates with quality. b = 1 normalizes fully, which over-penalizes thorough documents. The default 0.75 is a compromise that has held up across many corpora, but “the default holds up” is a statement to verify on your own data, not assume.

Tuning these is not guesswork in a mature system; it is an offline optimization against a labeled set of queries and judged-relevant documents, the same evaluation machinery that migrations lean on to prove a change does not regress relevance. The knobs exist because the right saturation and length behavior genuinely differ across corpora, and BM25’s durability comes largely from exposing exactly those two degrees of freedom and no more.

What BM25 still does not do

It is worth being clear about the edge of this. BM25 fixes saturation and length. It is still a bag-of-words function: it has no idea that running and ran are the same word, that NYC and New York refer to the same place, or that a query and a document can mean the same thing with no shared terms at all. The first of those, collapsing word forms so the statistics line up, is the job of the analyzer, and BM25 silently assumes it has already happened. The last of those, matching meaning across different words, is precisely what BM25 cannot do and what the semantic arm exists to cover. BM25 is the best the lexical side gets, and the lexical side has a ceiling.

There is also the operational payoff that the rarity weighting buys, which I had not connected at first. Because BM25’s per-term IDF gives an upper bound on how much any single term can contribute, a retriever can compute the largest score a not-yet-fully-scored document could possibly reach and skip it if that ceiling falls below the current top-k. That is WAND, and it means good IDF and good BM25 weights make retrieval both more relevant and faster at the same time.

The code

The whole scorer is a few lines, and the sanity check that BM25 is TF-IDF plus two knobs falls right out of it: set k1 enormous and b zero and the saturation and length terms both vanish.

python

import math

def idf(t, N, df):                               # the rarity weight, term t in df docs of N
    return math.log(1 + (N - df + 0.5) / (df + 0.5))   # lucene non-negative form

def bm25(tf, dl, idf_t, avgdl, k1=1.2, b=0.75):
    # tf      : term frequency in this document
    # dl      : this document's length in tokens
    # avgdl   : average document length across the corpus
    norm = 1 - b + b * dl / avgdl                # length term: >1 for long docs, shrinks their score
    return idf_t * (tf * (k1 + 1)) / (tf + k1 * norm)

# saturation alone (the curve in the figure): idf factored out, b=0, dl=avgdl
def tf_factor(tf, k1=1.2):
    return (tf * (k1 + 1)) / (tf + k1)           # -> k1+1 as tf -> infinity

# the sanity check: k1 huge + b=0 reproduces raw TF-IDF ordering exactly
#   bm25(tf, dl, idf, avgdl, k1=10_000, b=0.0)  ~  tf * idf

Running both scorers over the 58-document corpus and counting where they reorder the results, plus the k1=10000, b=0 degeneracy check that matched raw TF-IDF on all top-10 positions, is in the experiment workbench beside this page.

Up to the Lexical map. Back to the function BM25 patches, TF-IDF. The IDF inside the BM25 formula has its own variants and a correctness contract, IDF variants. The tokenizer BM25 assumes ran first is tokenization and analysis, and the skip-based retrieval its weights enable is WAND.