IDF Variants and the Agreement Contract
When I first wrote down IDF as log(N / df) I thought of it as
the formula, singular. It is not. It is a family of formulas that agree about rare terms and disagree
about common ones, and the disagreements are not cosmetic. One of them can hand a common word a
negative weight. Another cannot. They are within a rounding error of each other in one place and
point in opposite directions in another. I went into this page assuming the choice between them was a
detail and came out understanding it is the place a lexical system most quietly breaks, because the
breakage leaves no error behind, only wrong scores.
The reason this matters in production, and the reason it gets its own leaf, is the agreement contract. The formula that builds the IDF table and the formula the serving layer expects when it reads that table have to be the same formula. If they are not, every score comes out subtly off and nothing tells you. I wanted to know how subtly, so I measured it on a real corpus, and the answer turned out to depend entirely on which two variants disagree. That measurement is the spine of this page.
Why there is more than one formula
The clean log(N / df) is the teaching form, and it has two embarrassments the moment you run it on
real data instead of a toy.
The first is the unseen term. If a query word never appears in the corpus, df = 0, and N / df
divides by zero. The serving path cannot just crash on a word it has not indexed; it needs a defined
answer, usually a fallback weight. So real formulas put a +1 or a +0.5 in the denominator to keep
it finite when df is zero or tiny.
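A minimal sketch of that zero-df edge, assuming nothing beyond the log(N / df) form and the padding constants just mentioned. The helper name `padded` is mine, purely for illustration; it is not any library's API.

```python
import math

N = 58  # corpus size used on this page

def textbook(df, n):
    """Teaching form: log(N / df). Undefined when df == 0."""
    return math.log(n / df)

def padded(df, n, pad):
    """Same ratio with a constant added to the denominator (pad is 1 or 0.5 here)."""
    return math.log(n / (df + pad))

try:
    textbook(0, N)
    unseen_ok = True
except ZeroDivisionError:
    unseen_ok = False  # the unpadded form cannot score an unseen term

# With padding, an unseen term gets a large but finite weight.
weights_at_zero = {pad: padded(0, N, pad) for pad in (1, 0.5)}
print(unseen_ok, weights_at_zero)
```

Whether a large finite weight is the right answer for an unseen term is a separate design choice; the only point here is that the padded forms return something at all.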
The second is the very common term. The textbook form stays non-negative as long as df <= N, which on a
single corpus it always is, so on this corpus log(N/df) never actually goes negative. But the
probabilistic derivation of IDF that BM25 grew out of produces a different expression,

    idf_classic = log( (N - df + 0.5) / (df + 0.5) )

and this one genuinely can go negative. When a term is in more than about half the documents, the
numerator N - df drops below the denominator df, the ratio falls below one, and the log turns
negative. A negative IDF means matching the term actively lowers a document’s score, which is a
defensible thing to say about a word so common it is evidence of nothing, and a dangerous thing to
feed an engine that assumes weights are non-negative. So Lucene and the systems descended from it use
a third form,

    idf_lucene = log( 1 + (N - df + 0.5) / (df + 0.5) )

where the outer 1 + lifts the whole curve up so it can never cross zero. And the scikit-learn
lineage uses yet another,
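
    idf_smoothed = log( N / (df + 1) ) + 1

with a +1 outside to keep weights comfortably positive and a +1 inside for the zero-df case. (scikit-learn's own smoothed default, log((1 + N) / (1 + df)) + 1, also adds one to N; the version here leaves N bare.) Four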
formulas, four different sets of design decisions about the two edges, all claiming to be IDF.
Where they agree and where they do not
I computed all four on the real corpus, the same fifty-eight ML-wiki documents I have been using, with
N = 58. The pattern is the thing to internalize:
| term | df | textbook | classic BM25 | lucene | smoothed |
|---|---|---|---|---|---|
| the | 32 | 0.595 | −0.204 | 0.596 | 1.564 |
| gradient | 15 | 1.352 | 1.032 | 1.337 | 2.288 |
| transformer | 6 | 2.269 | 2.089 | 2.206 | 3.115 |
| bayes | 2 | 3.367 | 3.118 | 3.161 | 3.962 |
Read down the rare end first. For bayes, in two of fifty-eight documents, the four variants sit
between 3.12 and 3.96, close enough that the choice barely matters. For transformer, in six,
they spread a little wider but still cluster. The variants broadly agree about rare terms, which is
reassuring, because rare terms are where most of the ranking signal lives.
Now read the common end. For `the`, in thirty-two of fifty-eight documents, the variants are all over
the place: textbook gives +0.595, classic BM25 gives -0.204, lucene gives +0.596, smoothed gives +1.564. The same
word, four weights ranging from -0.204 to +1.564, one of them negative. The disagreement between IDF variants is
concentrated almost entirely on common terms, exactly the terms a good ranker mostly wants to ignore.
That is a strange and useful fact: the formulas fight over the words that matter least, which is part
of why you can get away with the wrong one for a long time before noticing.
To see that the negative case is not a fluke, I listed every term whose classic-BM25 IDF comes out
negative on this corpus. There were eleven, all stop words, all with df above roughly N/2 = 29:
`and` and `the` at -0.204; `to`, `in`, `of`, and `for` at -0.136; and so on. The classic form
is doing something deliberate, pushing the weight of near-ubiquitous words below zero, and a consumer
that was not expecting negative weights would be quietly poisoned by exactly these terms.
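A sketch of that listing, under stated assumptions. The document frequencies for `the`, `gradient`, `transformer`, and `bayes` come from the table; `and` = 32 and the df = 31 group are back-solved from the quoted weights (-0.204 and -0.136), so treat them as inferred rather than read from the corpus. The dictionary name `doc_freq` is hypothetical.

```python
import math

N = 58
# df values: table rows are as reported; "and" and the df = 31 group are
# inferred from the quoted weights, not read from the corpus (assumption).
doc_freq = {"the": 32, "and": 32, "to": 31, "in": 31, "of": 31, "for": 31,
            "gradient": 15, "transformer": 6, "bayes": 2}

def classic(df, n):
    return math.log((n - df + 0.5) / (df + 0.5))

# Keep only the terms whose classic IDF is below zero, most negative first.
negative = sorted(
    ((term, classic(df, N)) for term, df in doc_freq.items() if classic(df, N) < 0),
    key=lambda pair: pair[1],
)
for term, weight in negative:
    print(f"{term:12s} df={doc_freq[term]:2d} idf={weight:+.3f}")

# The sign flips exactly where df crosses N/2: the numerator N - df + 0.5
# falls below the denominator df + 0.5 only when df > N/2.
threshold_ok = all((classic(df, N) < 0) == (df > N / 2) for df in range(1, N + 1))
```

The last check is the useful one: the negative region is a property of the formula and of N, so it can be verified without any corpus at all.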
The agreement contract, measured
Here is the part I got wrong, which is the part worth reporting. I assumed that if the builder used one variant and the consumer used another, the resulting weights would differ enough to scramble rankings, and that this was the whole danger. I decided to measure it instead of assert it. The clean way to ask “how different are two weightings of the same query” is to treat each query’s per-term IDF values as a vector and take the cosine between the builder’s vector and the consumer’s vector. Cosine near one means the same direction, so any per-query normalization would wash the difference out and the mismatch would be nearly harmless. Cosine well below one, or negative, means the relative weighting across the query’s terms is genuinely wrong, not just globally rescaled.
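A minimal sketch of why cosine is the right instrument for this question, using made-up weights rather than corpus values. A global rescale leaves the cosine at one; changing the sign of a single component does not.

```python
import math

def cosine(a, b):
    """Cosine between two weight vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

weights = [0.6, 1.3, 2.2]               # made-up per-term weights for one query
rescaled = [3.0 * w for w in weights]   # same direction, different scale
flipped_first = [-0.2, 1.3, 2.2]        # first term's sign changed

same_direction = cosine(weights, rescaled)   # one, up to float rounding
bent = cosine(weights, flipped_first)        # strictly below one

print(same_direction, bent)
```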
First I compared the builder using the Lucene form against a consumer assuming the textbook form, on a handful of queries:
| query | lucene vs textbook |
|---|---|
| the gradient descent | 1.0000 |
| naive bayes the | 0.9999 |
| transformer embedding attention | 1.0000 |
| the is of | 1.0000 |
Cosine one, to within 0.0001, everywhere. My prediction was wrong. Lucene and textbook are numerically almost identical
in the region where both are positive. Look back at the table: `the` is 0.596 versus 0.595 and
`gradient` is 1.337 versus 1.352, so swapping one for the other rotates almost nothing. That particular
mismatch is close to harmless after normalization. The +1 and the +0.5 that distinguish them
barely move the numbers.
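One way to see why this pairing is so close, sketched under the formulas as written on this page: the Lucene expression collapses algebraically to log((N + 1) / (df + 0.5)), which is the textbook ratio with the numerator nudged by one and the denominator by a half. The function name `lucene_collapsed` is mine.

```python
import math

def textbook(df, n):
    return math.log(n / df)

def lucene(df, n):
    return math.log(1 + (n - df + 0.5) / (df + 0.5))

def lucene_collapsed(df, n):
    # 1 + (N - df + 0.5) / (df + 0.5)
    #   = (df + 0.5 + N - df + 0.5) / (df + 0.5)
    #   = (N + 1) / (df + 0.5)
    return math.log((n + 1) / (df + 0.5))

N = 58
for df in (2, 6, 15, 32):  # the df values from the table
    assert math.isclose(lucene(df, N), lucene_collapsed(df, N))
    print(df, round(textbook(df, N), 3), round(lucene(df, N), 3))
```

The half added to df matters most when df is tiny, which is why the gap between the two is widest for `bayes` and almost invisible for `the`.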
Then I compared the Lucene builder against a consumer assuming the classic form, the one that goes negative:
| query | lucene vs classic | what happens |
|---|---|---|
| transformer embedding attention | 0.9999 | no stop words, variants agree |
| naive bayes the | 0.9795 | one common term mixed in |
| the gradient descent | 0.9326 | common + rare, weighting bends |
| the is of | −0.9090 | all common, vector nearly reverses |
There it is. The pure-content query, `transformer embedding attention`, with no stop words, stays at
cosine 0.9999, because the variants agree about rare terms. But the moment a query mixes a common term
with rare ones, as in `the gradient descent`, the cosine drops to 0.93. And the all-stopword query,
`the is of`, comes back at cosine -0.909, a weight vector pointing almost exactly the wrong way. The
mismatch does not merely rescale these queries. On the common terms it flips the sign, so the consumer
weights the stop words in the opposite direction from what the builder intended, and on a query made
mostly of common words that is the difference between up and down.
So the contract is sharper than “use the same formula.” The danger is specifically a mismatch where one variant changes the sign of common-term weights and the other does not. Two non-negative variants that differ only in their smoothing constants, like Lucene and textbook here, are nearly interchangeable after normalization. A non-negative variant paired with a sign-changing one is a silent catastrophe that hides on content-heavy queries and surfaces on common-word ones. I would not have predicted which mismatch was which without computing it, and the lesson generalizes: when a system has a builder writing an artifact and a consumer reading it, the formula on both ends is a contract to verify, and “verify” means check the edge behavior, not just the happy path.
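A sketch of what checking the edge behavior could look like, assuming the builder and consumer formulas are both available as plain functions. The helper `sign_disagreements` is hypothetical; it simply walks every possible df and reports where the two variants land on opposite sides of zero.

```python
import math

def classic(df, n):
    return math.log((n - df + 0.5) / (df + 0.5))

def lucene(df, n):
    return math.log(1 + (n - df + 0.5) / (df + 0.5))

def smoothed(df, n):
    return math.log(n / (df + 1)) + 1

def sign_disagreements(builder, consumer, n):
    """Return every df in 0..n where the two variants give weights of opposite sign."""
    return [df for df in range(0, n + 1)
            if builder(df, n) * consumer(df, n) < 0]

N = 58
print(sign_disagreements(lucene, smoothed, N))  # two variants that stay positive
print(sign_disagreements(lucene, classic, N))   # positive variant vs one that goes negative
```

On N = 58 the first call comes back empty and the second returns every df from 30 through 58, which is the region the cosine measurement stumbled into. A probe like this only checks signs; it says nothing about how far two same-signed variants drift apart.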
These cosines are real output from a small pure-Python script over the wiki, not production numbers, and the analyzer was crude. But the shape, agreement on rare terms and sign-divergence on common ones, is a property of the formulas, not of this corpus.
Choosing a variant
Given all that, the fork is which variant to standardize on, and the answer is mostly forced by the consumer.
Match the engine. If the serving layer is Lucene or anything descended from it, build the table with the Lucene non-negative form, because that is what the scorer’s own BM25 implementation uses internally and the artifact has to agree with it. The choice is not really yours; it is dictated by the contract.
Non-negative unless you mean it. A formula that can go negative is occasionally what you want, when you genuinely want ubiquitous terms to penalize rather than merely not-help. But most retrieval stacks assume non-negative weights in their skip-list and upper-bound logic, so a negative IDF can break WAND-style optimizations that rely on a term’s maximum contribution being a non-negative bound. Pick a sign-changing variant only with eyes open.
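A toy illustration of how a negative weight undermines an upper bound. This is not a WAND implementation: it assumes binary term matching and a score that is just the sum of matched IDF weights, and the names `score` and `claimed_upper_bound` are mine. The df values are the ones from the table.

```python
import math

def classic(df, n):
    return math.log((n - df + 0.5) / (df + 0.5))

N = 58
idf = {"the": classic(32, N), "gradient": classic(15, N)}  # df values from the table

def score(doc_terms, query):
    """Toy score: sum the IDF of each query term the document contains."""
    return sum(idf[t] for t in query if t in doc_terms)

query = ["the", "gradient"]

# The pruning assumption: no document can beat the sum of per-term weights.
claimed_upper_bound = sum(idf[t] for t in query)

# A document that matches only the rare term skips the negative contribution.
actual = score({"gradient"}, query)

print(claimed_upper_bound, actual)
bound_holds = actual <= claimed_upper_bound
```

With the classic form the document matching only `gradient` outscores the supposed bound, so any pruning that trusted the bound could skip a document it should have kept.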
Smoothing for the zero-df edge. Whatever the main form, it needs a defined answer for a query term
the corpus has never seen, either a +1 in the denominator or an explicit fallback weight supplied at
lookup time. This connects to the artifact’s design: the table can carry a sentinel row for the
unknown-term case, or the consumer can apply a default. Either way the unseen term must resolve to
something, and that something is part of the contract too. The shape of that artifact, term to weight
plus whatever traceability columns it carries, is covered on the IDF artifact page.
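A minimal sketch of the sentinel-row option, assuming the table is a plain term-to-weight mapping. The `<unk>` key and the `lookup` helper are hypothetical; the real artifact may carry more columns and a different convention for the unknown term.

```python
import math

N = 58
UNKNOWN = "<unk>"  # hypothetical sentinel key; any reserved token would do

def lucene(df, n):
    return math.log(1 + (n - df + 0.5) / (df + 0.5))

# Builder side: term -> weight, plus one sentinel row for unseen terms (df = 0).
doc_freq = {"the": 32, "gradient": 15, "transformer": 6, "bayes": 2}
idf_table = {term: lucene(df, N) for term, df in doc_freq.items()}
idf_table[UNKNOWN] = lucene(0, N)

# Consumer side: unseen terms resolve to the sentinel row instead of raising.
def lookup(term):
    return idf_table.get(term, idf_table[UNKNOWN])

print(lookup("bayes"), lookup("zeitgeist"))
```

Putting the fallback in the table rather than in consumer code means the builder's formula decides the unseen-term weight, which keeps that decision on one side of the contract.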
The deeper point is the one the measurement made concrete. The IDF math has been settled for fifty years. What is not settled, and what actually breaks systems, is making two pieces of software agree on which settled formula they are both using. The variant is easy. The agreement is the work.
The code
The four variants are four one-liners, and the whole danger lives in the difference between them at the common-term edge. The cosine test that distinguished the harmless mismatch from the catastrophic one is the only non-trivial part, and it is only a few lines.

    import math

    def textbook(df, N): return math.log(N / df)                          # never negative for 0 < df <= N; undefined at df = 0
    def classic(df, N):  return math.log((N - df + 0.5) / (df + 0.5))     # CAN go negative when df > N/2
    def lucene(df, N):   return math.log(1 + (N - df + 0.5) / (df + 0.5)) # +1 keeps it non-negative
    def smoothed(df, N): return math.log(N / (df + 1)) + 1                # sklearn-style

    def cosine(a, b):  # how much two weightings of the same query agree in DIRECTION
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    # builder writes one variant, consumer assumes another -> compare the weight vectors
    N = 58
    # "the" is from the table; "is" and "of" are back-solved from the quoted figures (assumption)
    df = {"the": 32, "is": 30, "of": 31}
    query = ["the", "is", "of"]
    v_builder = [lucene(df[t], N) for t in query]
    v_consumer = [classic(df[t], N) for t in query]
    print(cosine(v_builder, v_consumer))  # -> -0.909 : the all-stopword query nearly reverses

The full script lists every term whose classic IDF goes negative, runs the cosine over many queries for both the close and the dangerous pairing, and produces the figures; it sits in the experiment workbench beside this page.
Up to the Lexical map. The formula this varies is derived in TF-IDF, and it sits inside BM25. The non-negativity it protects is what WAND relies on. The artifact where builder and consumer actually meet is the IDF artifact, and the tokenizer that has to match on both ends, the other half of the agreement problem, is tokenization and analysis.