Counting Words Smarter: TF, Length Normalization, and IDF

When word-overlap actually wins

Counting how many words a query and a document share looks naïve, but it has one regime where it is genuinely the right tool: rare, specific tokens that have one meaning. Product SKUs, error codes (ECONNREFUSED), function names, library names, course codes, people’s names. When the user types one of these, they almost certainly want the document containing that exact string. Meaning-based search (“embeddings”) can fumble these — it might helpfully return documents about similar-sounding error codes — while exact match nails them.

This is why production retrieval systems typically do not choose between word-overlap and embeddings. The two methods fail in opposite directions: exact match misses paraphrases, embeddings miss rare tokens. Running both and combining the scores is what “hybrid retrieval” means.
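The text above does not say how the two scores are combined, so the following is only one plausible sketch. It assumes each retriever returns a dictionary of document IDs to scores, rescales both to [0, 1] with min-max normalization, and takes a weighted sum. The function names, the alpha weight, and the toy scores are all hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so two scorers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are identical, so there is nothing to rank on.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(lexical, semantic, alpha=0.5):
    """Weighted sum of normalized scores (alpha is an assumed tuning knob)."""
    lex, sem = min_max(lexical), min_max(semantic)
    docs = set(lex) | set(sem)
    # A document missing from one retriever contributes 0 for that retriever.
    return {d: alpha * lex.get(d, 0.0) + (1 - alpha) * sem.get(d, 0.0) for d in docs}


# Made-up scores: doc_a has the exact token, doc_b is only semantically close.
lexical = {"doc_a": 3.0, "doc_b": 0.0, "doc_c": 1.0}
semantic = {"doc_a": 0.5, "doc_b": 0.9, "doc_c": 0.1}

combined = hybrid_scores(lexical, semantic)
best = max(combined, key=combined.get)
print(best, combined)
```

With these toy numbers, doc_a ranks first because it does well on both signals. Other combination schemes exist (rank-based fusion, for instance); the weighted sum is used here only because it is the shortest to show.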

Wrinkle 1: term frequency is not enough

Naïve word-overlap counts how many times a query term appears in a document. Under that scheme, any two documents that each mention “refund” twice get the same score, regardless of how long they are. But two mentions in a 50-word document is dense (the document is probably about refunds); two mentions in a 5,000-word document is incidental.

The fix is length normalization: divide the term count by something related to document length, so long documents don’t win automatically just because they contain more of every word. The exact shape of that normalization is one of the things BM25 tunes carefully. For now, the principle is: count matters, but discount long documents.
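As a minimal sketch of that principle, the snippet below divides the term count by the document's total token count. This is the simplest possible normalization and is not the form BM25 uses; the function name and the synthetic documents are assumptions made for illustration.

```python
def term_density(term, tokens):
    """Occurrences of `term` divided by the document's token count."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


# Two synthetic documents, each mentioning "refund" exactly twice.
short_doc = ["refund"] * 2 + ["filler"] * 48      # 50 tokens
long_doc = ["refund"] * 2 + ["filler"] * 4998     # 5,000 tokens

short_density = term_density("refund", short_doc)  # 2 / 50
long_density = term_density("refund", long_doc)    # 2 / 5000

print(short_density, long_density)
```

The raw counts are identical, but the density of the short document is 100 times higher, which matches the intuition that it is probably about refunds.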

Wrinkle 2: not all words are equal

Even with length normalization, a second bug remains. Suppose the query is “the refund policy”. One document contains “refund” once and “the” 47 times; another contains “refund” three times and “the” twelve times. If we just sum up term counts, the first document wins on raw volume (48 matches against 15), but it’s clearly the wrong answer. “The” appears in virtually every English document, so knowing a document contains “the” tells you almost nothing. Knowing a document contains “refund” is enormously informative, because “refund” only appears in a small slice of the corpus.
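The arithmetic in that example can be checked with a few lines. This sketch assumes the documents are already tokenized and contain nothing except the words being counted; the function name is hypothetical.

```python
from collections import Counter


def raw_count_score(query_terms, doc_tokens):
    """Sum the raw occurrence counts of every query term in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[term] for term in query_terms)


query = ["the", "refund", "policy"]

# Token lists built to match the counts in the example above.
doc_one = ["refund"] * 1 + ["the"] * 47
doc_two = ["refund"] * 3 + ["the"] * 12

score_one = raw_count_score(query, doc_one)  # 47 + 1 + 0
score_two = raw_count_score(query, doc_two)  # 12 + 3 + 0

print(score_one, score_two)
```

The first document scores 48 and the second 15, even though the second mentions “refund” three times as often. Nearly all of the first document's score comes from “the”.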

So each word needs a per-word weight: words that appear in almost every document are nearly worthless, words that appear in only a few documents are highly valuable.

That weight is called inverse document frequency (IDF), and computing it for every term in a corpus is the second of three ingredients in BM25.
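The text does not give a formula for this weight, so the sketch below uses one common textbook form, log(N / df), where N is the number of documents and df is the number of documents containing the term. BM25 uses a smoothed variant of this idea. The four-document corpus, the whitespace tokenization, and the function name are assumptions for illustration.

```python
import math


def inverse_document_frequency(corpus):
    """Return {term: log(N / df)} for every term in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        # set() so a term counts once per document, however often it repeats.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy corpus: "the" is in every document, "refund" in only one.
corpus = [
    "the refund policy covers the first thirty days".split(),
    "the shipping times vary by the region".split(),
    "the store opens at nine".split(),
    "the warranty covers the battery".split(),
]

idf = inverse_document_frequency(corpus)
print(idf["the"], idf["covers"], idf["refund"])
```

Here “the” gets a weight of log(4/4) = 0, “covers” gets log(4/2), and “refund” gets log(4/1), so the rarer the term, the more a match on it counts.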