From Lexical Retrieval

IDF Variants and the Agreement Contract

When I first wrote down IDF as log(N / df) I thought of it as the formula, singular. It is not. It is a family of formulas that agree about rare terms and disagree about common ones, and the disagreements are not cosmetic. One of them can hand a common word a negative weight. Another cannot. They are within a rounding error of each other in one place and point in opposite directions in another. I went into this page assuming the choice between them was a detail and came out understanding it is the place a lexical system most quietly breaks, because the breakage leaves no error behind, only wrong scores.

The reason this matters in production, and the reason it gets its own leaf, is the agreement contract. The formula that builds the IDF table and the formula the serving layer expects when it reads that table have to be the same formula. If they are not, every score comes out subtly off and nothing tells you. I wanted to know how subtly, so I measured it on a real corpus, and the answer turned out to depend entirely on which two variants disagree. That measurement is the spine of this page.

Why there is more than one formula

The clean log(N / df) is the teaching form, and it has two embarrassments the moment you run it on real data instead of a toy.

The first is the unseen term. If a query word never appears in the corpus, df = 0, and N / df divides by zero. The serving path cannot just crash on a word it has not indexed; it needs a defined answer, usually a fallback weight. So real formulas put a +1 or a +0.5 in the denominator to keep it finite when df is zero or tiny.

The second is the very common term. The textbook form stays non-negative as long as df <= N, which on a single corpus it always is, so on this corpus log(N/df) never actually goes negative. But the probabilistic derivation of IDF that BM25 grew out of produces a different expression,

```txt
idf_classic = log( (N - df + 0.5) / (df + 0.5) )
```

and this one genuinely can go negative. When a term is in more than half the documents (df > N/2), the numerator N - df + 0.5 drops below the denominator df + 0.5, the ratio falls below one, and the log turns negative. A negative IDF means matching the term actively lowers a document’s score, which is a defensible thing to say about a word so common it is evidence of nothing, and a dangerous thing to feed an engine that assumes weights are non-negative. So Lucene and the systems descended from it use a third form,

```txt
idf_lucene = log( 1 + (N - df + 0.5) / (df + 0.5) )
```

where the 1 + inside the log keeps its argument above one, so the weight can never cross zero. And the scikit-learn lineage uses yet another,

```txt
idf_smoothed = log( N / (df + 1) ) + 1
```

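with a +1 outside to keep weights comfortably positive and a +1 inside for the zero-df case. (scikit-learn’s own smooth_idf setting also adds 1 to N in the numerator; the form above, which is the one behind the numbers on this page, leaves N alone.) Four formulas, four different sets of design decisions about the two edges, all claiming to be IDF.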
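To make the two edges concrete, here is a minimal sketch that evaluates the four forms at df = 0, on either side of N/2, and at df = N, with N = 58 as on this page. The function names and the `safe` helper are mine, for illustration only; the formulas are the ones written above.

```python
import math

N = 58  # corpus size used on this page

def textbook(df, n): return math.log(n / df)
def classic(df, n):  return math.log((n - df + 0.5) / (df + 0.5))
def lucene(df, n):   return math.log(1 + (n - df + 0.5) / (df + 0.5))
def smoothed(df, n): return math.log(n / (df + 1)) + 1

def safe(fn, df, n):
    """Evaluate fn, returning None where the formula is undefined."""
    try:
        return fn(df, n)
    except ZeroDivisionError:
        return None

# df = 0 (unseen term), df = N/2 and N/2 + 1 (sign boundary), df = N (in every document)
edges = {}
for df in (0, N // 2, N // 2 + 1, N):
    edges[df] = {fn.__name__: safe(fn, df, N)
                 for fn in (textbook, classic, lucene, smoothed)}
    print(df, edges[df])
```

Only the textbook form is undefined at df = 0, and only the classic form changes sign, exactly at the step from df = 29 to df = 30.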

Where they agree and where they do not

I computed all four on the real corpus, the same fifty-eight ML-wiki documents I have been using, with N = 58. The pattern is the thing to internalize:

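| term | df | textbook | classic BM25 | lucene | smoothed |
|---|---|---|---|---|---|
| the | 32 | 0.595 | −0.204 | 0.596 | 1.564 |
| gradient | 15 | 1.352 | 1.032 | 1.337 | 2.288 |
| transformer | 6 | 2.269 | 2.089 | 2.206 | 3.115 |
| bayes | 2 | 3.367 | 3.118 | 3.161 | 3.962 |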

Read down the rare end first. For bayes, in two of fifty-eight documents, the four variants sit between 3.12 and 3.96, close enough that the choice barely matters. For transformer, in six, they spread a little wider but still cluster. The variants broadly agree about rare terms, which is reassuring, because rare terms are where most of the ranking signal lives.

Now read the common end. For “the”, in thirty-two of fifty-eight documents, the variants scatter: classic BM25 gives -0.204, textbook and lucene give +0.595 and +0.596, smoothed gives +1.564. The same word, a spread of about 1.8 between the lowest and highest weight, and one of them negative. The disagreement between IDF variants is concentrated almost entirely on common terms, exactly the terms a good ranker mostly wants to ignore. That is a strange and useful fact: the formulas fight over the words that matter least, which is part of why you can get away with the wrong one for a long time before noticing.
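The table can be reproduced from nothing but the df column and N = 58. This is a small sketch; the dictionary names are mine, and the formulas are the four given earlier.

```python
import math

N = 58
DF = {"the": 32, "gradient": 15, "transformer": 6, "bayes": 2}  # df column of the table

VARIANTS = {
    "textbook": lambda df, n: math.log(n / df),
    "classic":  lambda df, n: math.log((n - df + 0.5) / (df + 0.5)),
    "lucene":   lambda df, n: math.log(1 + (n - df + 0.5) / (df + 0.5)),
    "smoothed": lambda df, n: math.log(n / (df + 1)) + 1,
}

# term -> {variant name -> weight}
rows = {term: {name: fn(df, N) for name, fn in VARIANTS.items()}
        for term, df in DF.items()}

for term, row in rows.items():
    cells = " ".join(f"{value:7.3f}" for value in row.values())
    print(f"{term:12s} {DF[term]:3d} {cells}")
```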

To see that the negative case is not a fluke, I listed every term whose classic-BM25 IDF comes out negative on this corpus. There were eleven, all stop words, all with df above N/2 = 29: “and” and “the” at -0.204; “to”, “in”, “of” and “for” at -0.136; and so on. The classic form is doing something deliberate, pushing the weight of near-ubiquitous words below zero, and a consumer that was not expecting negative weights would be quietly poisoned by exactly these terms.

The agreement contract, measured

Here is the part I got wrong, which is the part worth reporting. I assumed that if the builder used one variant and the consumer used another, the resulting weights would differ enough to scramble rankings, and that this was the whole danger. I decided to measure it instead of assert it. The clean way to ask “how different are two weightings of the same query” is to treat each query’s per-term IDF values as a vector and take the cosine between the builder’s vector and the consumer’s vector. Cosine near one means the same direction, so any per-query normalization would wash the difference out and the mismatch would be nearly harmless. Cosine well below one, or negative, means the relative weighting across the query’s terms is genuinely wrong, not just globally rescaled.

Figure (grouped bars of weight-vector cosine for two kinds of formula mismatch): Builder/consumer formula mismatch, measured as the cosine of the two weight vectors. The close pair (lucene vs textbook) stays at 1.0; the sign-flipping pair (lucene vs classic) collapses, reaching -0.91 on an all-stopword query.

First I compared the builder using the Lucene form against a consumer assuming the textbook form, on a handful of queries:

| query | lucene vs textbook |
|---|---|
| the gradient descent | 1.0000 |
| naive bayes the | 0.9999 |
| transformer embedding attention | 1.0000 |
| the is of | 1.0000 |

Cosine one, everywhere. My prediction was wrong. Lucene and textbook are numerically close wherever both are positive: look back at the table, where “the” is 0.596 versus 0.595 and “gradient” is 1.337 versus 1.352. Swapping one for the other rotates almost nothing, so that particular mismatch is close to harmless after normalization. The +1 and the +0.5 that distinguish them barely move the numbers.
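One way to see why this pair is so close: 1 + (N - df + 0.5) / (df + 0.5) simplifies to (N + 1) / (df + 0.5), so the Lucene form is the textbook form with both counts nudged by a constant. The sketch below checks that identity numerically and prints the gap to the textbook form at the four df values from the table; `lucene_simplified` is a name I made up for the rewritten expression.

```python
import math

N = 58

def textbook(df, n):
    return math.log(n / df)

def lucene(df, n):
    return math.log(1 + (n - df + 0.5) / (df + 0.5))

def lucene_simplified(df, n):
    # 1 + (n - df + 0.5) / (df + 0.5) == (n + 1) / (df + 0.5)
    return math.log((n + 1) / (df + 0.5))

# Largest difference between the two ways of writing the Lucene form
max_identity_err = max(abs(lucene(df, N) - lucene_simplified(df, N))
                       for df in range(1, N + 1))

# Lucene minus textbook at the df values from the table
gaps = {df: lucene(df, N) - textbook(df, N) for df in (2, 6, 15, 32)}

print(max_identity_err)
print(gaps)
```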

Then I compared the Lucene builder against a consumer assuming the classic form, the one that goes negative:

| query | lucene vs classic | what happens |
|---|---|---|
| transformer embedding attention | 0.9999 | no stop words, variants agree |
| naive bayes the | 0.9795 | one common term mixed in |
| the gradient descent | 0.9326 | common + rare, weighting bends |
| the is of | −0.9090 | all common, vector nearly reverses |

There it is. The pure-content query, “transformer embedding attention”, with no stop words, stays at cosine one, because the variants agree about rare terms. But the moment a query mixes a common term with rare ones, as in “the gradient descent”, the cosine drops to 0.93. And the all-stopword query, “the is of”, comes back at cosine -0.909, a weight vector pointing almost exactly the wrong way. The mismatch does not merely rescale these queries. On the common terms it flips the sign, so the consumer weights the stop words in the opposite direction from what the builder intended, and on a query made mostly of common words that is the difference between up and down.

So the contract is sharper than “use the same formula.” The danger is specifically a mismatch where one variant changes the sign of common-term weights and the other does not. Two non-negative variants that differ only in their smoothing constants are nearly interchangeable after normalization. A non-negative variant paired with a sign-flipping one is a silent catastrophe that hides on content-heavy queries and surfaces on common-word ones. I would not have predicted which mismatch was which without computing it, and the lesson generalizes: when a system has a builder writing an artifact and a consumer reading it, the formula on both ends is a contract to verify, and “verify” means check the edge behavior, not just the happy path.
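Checking the edge behavior can be as simple as sweeping df from 1 to N and flagging every value where the builder's formula and the consumer's formula disagree in sign. This is a sketch of one possible check, not the script behind the measurements; `sign_mismatches` is a hypothetical helper name.

```python
import math

def textbook(df, n): return math.log(n / df)
def classic(df, n):  return math.log((n - df + 0.5) / (df + 0.5))
def lucene(df, n):   return math.log(1 + (n - df + 0.5) / (df + 0.5))

def sign_mismatches(builder, consumer, n):
    """Return the df values in 1..n where exactly one of the two weights is negative."""
    bad = []
    for df in range(1, n + 1):
        b, c = builder(df, n), consumer(df, n)
        if (b < 0) != (c < 0):
            bad.append(df)
    return bad

N = 58
print(sign_mismatches(lucene, textbook, N))  # -> []            (the harmless pairing)
print(sign_mismatches(lucene, classic, N))   # -> [30, ..., 58] (every df above N/2)
```

A check like this fails loudly on the dangerous pairing while letting the smoothing-only difference through, which is the distinction the cosine measurement drew.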

These cosines are real output from a small pure-Python script over the wiki, not production numbers, and the analyzer was crude. But the shape, agreement on rare terms and sign-divergence on common ones, is a property of the formulas, not of this corpus.

Choosing a variant

Given all that, the fork is which variant to standardize on, and the answer is mostly forced by the consumer.

Match the engine. If the serving layer is Lucene or anything descended from it, build the table with the Lucene non-negative form, because that is what the scorer’s own BM25 implementation uses internally and the artifact has to agree with it. The choice is not really yours; it is dictated by the contract.

Non-negative unless you mean it. A formula that can go negative is occasionally what you want, when you genuinely want ubiquitous terms to penalize rather than merely not help. But most retrieval stacks assume non-negative weights in their skip-list and upper-bound logic, so a negative IDF can break WAND-style optimizations that rely on a term’s maximum contribution being a non-negative bound. Pick a sign-flipping variant only with eyes open.

Smoothing for the zero-df edge. Whatever the main form, it needs a defined answer for a query term the corpus has never seen, either a +1 in the denominator or an explicit fallback weight supplied at lookup time. This connects to the artifact’s design: the table can carry a sentinel row for the unknown-term case, or the consumer can apply a default. Either way the unseen term must resolve to something, and that something is part of the contract too. The shape of that artifact, term to weight plus whatever traceability columns it carries, is covered under the IDF artifact.
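A minimal sketch of the sentinel-row option, using the Lucene form. The `<unk>` key, the `build_table` and `lookup` helpers, and the choice to weight an unseen term as if df = 0 are all assumptions made for illustration; a real artifact might pick a different default.

```python
import math

N = 58
UNKNOWN = "<unk>"  # hypothetical sentinel key for terms the corpus never saw

def lucene(df, n):
    return math.log(1 + (n - df + 0.5) / (df + 0.5))

def build_table(df_by_term, n):
    """Builder side: term -> weight, plus one sentinel row for unseen terms."""
    table = {term: lucene(df, n) for term, df in df_by_term.items()}
    table[UNKNOWN] = lucene(0, n)  # assumption: treat an unseen term as df = 0
    return table

def lookup(table, term):
    """Consumer side: never raises on an unindexed term, falls back to the sentinel."""
    return table.get(term, table[UNKNOWN])

table = build_table({"the": 32, "gradient": 15, "transformer": 6, "bayes": 2}, N)
print(lookup(table, "bayes"))   # a term in the table
print(lookup(table, "zeugma"))  # a term the corpus has never seen
```

Putting the fallback in the table rather than in the consumer keeps the unseen-term weight under the same formula as every other row, so there is one less thing for the two sides to disagree about.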

The deeper point is the one the measurement made concrete. The IDF math has been settled for fifty years. What is not settled, and what actually breaks systems, is making two pieces of software agree on which settled formula they are both using. The variant is easy. The agreement is the work.

The code

The four variants are four one-liners, and the whole danger lives in the difference between them at the common-term edge. The cosine test that distinguished the harmless mismatch from the catastrophic one is the only non-trivial part, and it is only a few lines.

```python
import math

def textbook(df, N):  return math.log(N / df)                          # never negative for 0 < df <= N; undefined at df = 0
def classic(df, N):   return math.log((N - df + 0.5) / (df + 0.5))     # CAN go negative when df > N/2
def lucene(df, N):    return math.log(1 + (N - df + 0.5) / (df + 0.5)) # +1 keeps it non-negative
def smoothed(df, N):  return math.log(N / (df + 1)) + 1                # sklearn-style

def cosine(a, b):     # how much two weightings of the same query agree in DIRECTION
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Corpus statistics the snippet needs. N and df["the"] come from the table above.
# df["of"] = 31 is the value that reproduces the -0.136 quoted earlier.
# df["is"] = 30 is an ASSUMPTION: it is not stated on this page, but it is
# consistent with the reported cosine.
N = 58
df = {"the": 32, "is": 30, "of": 31}

# builder writes one variant, consumer assumes another -> compare the weight vectors
query = ["the", "is", "of"]
v_builder  = [lucene(df[t], N)  for t in query]
v_consumer = [classic(df[t], N) for t in query]
print(round(cosine(v_builder, v_consumer), 3))   # -> -0.909 : the all-stopword query nearly reverses
```

The full script lists every term whose classic IDF goes negative, runs the cosine over many queries for both the close and the dangerous pairing, and produces the figures; it sits in the experiment workbench beside this page.


Up to the Lexical map. The formula this varies is derived in TF-IDF, and it sits inside BM25. The non-negativity it protects is what WAND relies on. The artifact where builder and consumer actually meet is the IDF artifact, and the tokenizer that has to match on both ends, the other half of the agreement problem, is tokenization and analysis.