TF-IDF
The inverted index tells me which documents contain a query word. It does not tell me which of them to put first, and ranking is most of what makes a search engine feel good or bad to use, so the second question turns out to be the harder and more interesting one. This page is me working through it from the most naive possible answer down to the weighting scheme that lexical search actually settled on, and then checking the story against a real corpus to see where the clean version was lying to me.
The naive answer is the one I would have written without thinking: rank by how many query words a document matches, most matches first. It is the obvious thing, it is where the Lexical map starts, and what I wanted to understand was the exact sequence of ways it falls apart, because it turns out each repair is a piece of the function that ended up running everywhere. I will build it up on a tiny corpus where I can do every number by hand, and then run the same idea over fifty-eight real documents to see whether the hand-waving survives contact with messy text. (It mostly does. One part of it does not, and that part was the most useful thing I learned writing this.)
The naive rank, and the first crack
Take the three-document corpus the inverted index leaf used, so the toy world carries over:
D1: the cat sat on the mat
D2: the dog sat on the log
D3: the cat chased the dogSomeone searches cat. Two documents contain it, D1 and D3, and with a single query word the match-count ranking is a tie: both score 1. Fine. Now they search the cat. Every document contains the, two contain cat, so D1 and D3 score 2 (they have both words) and D2 scores 1 (it has the). D1 and D3 rank above D2. Also fine, almost by accident.
The crack shows the moment a common word is the only thing a document shares with the query. Imagine a fourth document that is one long paragraph using the twenty times and never mentioning a cat. Against the query the cat it matches one query word, the, same as D2. Against a count that rewards repetition it could even beat D1. A document about nothing the user asked for climbs the ranking on the strength of the most worthless word in the language. The naive count has no idea that the is worthless. It treats every shared word as worth exactly one.
So the arm needs two things the count does not have. It needs to care how much a document uses a word, not just whether it does. And it needs to care how much that word is worth, because the and cat are not the same evidence. Those are the two halves, and the field built them separately before it multiplied them together.
Half one: term frequency, and why more is more
The first half is the easy intuition. If a document says cat five times and another says it once,
the first one is probably more about cats. Within-document term frequency, TF, is just that
count: the number of times a term t appears in a document d, written tf(t, d). It is the
“aboutness” signal, and it is already sitting in the posting
once you store frequencies, so the scorer never has to reopen the document to get it.
On the toy corpus, term frequencies are small:
the cat sat on mat dog log chased
D1 2 1 1 1 1 0 0 0
D2 2 0 1 1 0 1 1 0
D3 2 0 0 0 0 1 0 1Read a column to see a term’s spread, a row to see what a document is made of. the has tf = 2
in every document; cat has tf = 1 in D1 and D3 and 0 in D2.
Ranking by TF alone is the count idea with volume added, and it fails in the same place, harder. The query cat still does fine, because cat is a discriminating word and TF only sharpens it. But weight a multi-word query by TF and the common words dominate by sheer frequency. the has the highest TF in the corpus and contributes nothing about relevance. TF measures how much a document talks about a term; it has no opinion on whether anyone should care that it does. That opinion is the second half.
Half two: the rarity signal, and a 1972 idea
The second half is the one that took the field a real insight to find, and it is the heart of the day job, so it gets the slow walk.
The question is how much a matched term is worth as evidence. The answer the field converged on is a rarity argument: a term is worth listening to in proportion to how few documents contain it. A term that appears in every document, like the, partitions the corpus not at all, so matching it tells you nothing about which document is the one. A term that appears in two documents out of three cuts the corpus almost in half when you match it. A term that appears in exactly one document points at that document and no other. Rarity is discriminating power.
The count of documents a term appears in is its document frequency, df(t). Note what it is
and is not: it counts documents, not occurrences. the appears twice in D1, but D1 contributes 1 to
df(the), not 2. The term is deduplicated within a document before counting, because the question
df answers is “in how many documents does this term show up at all,” not “how many times in total.”
On the toy corpus, with N = 3 documents:
term df in which docs
the 3 D1, D2, D3
cat 2 D1, D3
sat 2 D1, D2
on 2 D1, D2
dog 2 D2, D3
mat 1 D1
log 1 D2
chased 1 D3df(the) = 3 is the whole corpus. df(mat) = 1 is a term that names one document. The weight has
to run the other way from df: high df, low weight; low df, high weight. So you want an inverse
function of document frequency, and the cleanest one is to divide the corpus size by the document
frequency, N / df(t), and take the log. That is inverse document frequency, IDF:
idf(t) = log( N / df(t) )What each symbol is doing, in plain language. N is how many documents exist. df(t) is how many
of them contain the term. N / df(t) is “one over the fraction of the corpus that has this term,”
so it is large for rare terms and shrinks toward 1 as a term spreads everywhere. The log is there to
keep the reward for rarity from exploding: the jump from a term in half the corpus to a term in a
tenth of it should matter, but not ten times as much, because evidence does not compound that
fast. A term in every single document gets N / df = 1, and log(1) = 0, so it contributes exactly
nothing. That is the property the naive count was missing, stated as a number: the is worth zero.
The lineage here is one I find worth knowing because it predates almost everything else in the modern stack. Karen Spärck Jones, working in Cambridge, set this out in a 1972 paper on what she called term specificity, arguing that a term’s weight in retrieval should be a function of its inverse document frequency. As I understand it she was reasoning about the statistics of language and relevance rather than implementing a search engine, and the striking thing is that the formula has barely changed in the fifty years since. The idea that the worth of a word is its rarity was sitting there in a 1972 paper before the field had really settled its name, and the production system I work on is still computing essentially the same quantity.
Run IDF on the toy corpus. Logs are natural log here; the base only scales everything by a constant and does not change the ranking.
idf(the) = log(3/3) = log(1) = 0.00
idf(cat) = log(3/2) = log(1.5) = 0.41
idf(sat) = log(3/2) = 0.41
idf(dog) = log(3/2) = 0.41
idf(mat) = log(3/1) = log(3) = 1.10
idf(chased) = log(3/1) = 1.10The ordering is already the one intuition wanted. the comes out worthless. Words in two of three documents are worth a moderate 0.41. Words that name a single document, mat and chased, are worth the most at 1.10. Nobody told the formula that the is a stop word. It worked it out from the document frequencies alone, which is the part I still find a little magical: a pure count of where words appear reconstructs the intuition that some words carry information and some do not.
I want to flag one thing about that idf(the) = 0.00 before it misleads me later. It came out
exactly zero only because the happens to be in literally every document of this three-document
toy, so df = N and log(N/df) = log(1) = 0. That is a property of the toy, not of the word. On a
real corpus the shows up almost everywhere but not quite everywhere, so its IDF lands low but not
zero, and that gap turns out to matter. I will come back to it with real numbers below.
The product, and why it had to be a product
Each half is incomplete on its own. TF knows how much a document is about a term and not whether the term matters. IDF knows whether a term matters and not how much any given document is about it. The move, generally credited to Gerard Salton and the SMART system at Cornell in the 1960s and 70s, is to multiply them:
score(t, d) = tf(t, d) * idf(t)and a document’s score for a multi-word query is the sum of this over the query’s terms. Multiplying is the right combiner, not adding, and the toy corpus shows why. Score the query the cat against each document, term by term.
tf*idf for "the" tf*idf for "cat" total
D1 the:2*0.00 = 0.00 cat:1*0.41 = 0.41 0.41
D2 the:2*0.00 = 0.00 cat:0*0.41 = 0.00 0.00
D3 the:2*0.00 = 0.00 cat:1*0.41 = 0.41 0.41D1 and D3 score 0.41, D2 scores 0. the contributes nothing to any of them, because its IDF is
zero and zero times any term frequency is still zero. The whole score is carried by cat, which is
exactly the word that should carry it. Go back to the imagined fourth document, the one that says
the twenty times and never says cat. Under the naive count it threatened to win. Under TF-IDF it
scores 20 * 0.00 + 0 * 0.41 = 0. The product is what makes a high term frequency on a worthless
word stay worthless: IDF can veto TF down to nothing. Addition could not do that. If the combiner
were tf + idf, the twenty occurrences of the would pile up regardless of its zero IDF, and the
junk document would still climb. Multiplication lets the rarity signal gate the frequency signal
instead of merely competing with it.
That gating is the entire reason TF-IDF beats the count. A document is rewarded for using a term a lot and for that term being rare, and the reward collapses to nothing the instant either of those fails. Common words cannot win on volume, and rare words cannot win on a single passing mention.
The definition that decides every weight: what is one document
Here is the part the textbook treatment skips and the production reality cannot. Every number above
rests on N and df, and both of those rest on a definition: what counts as one document. On the
toy corpus that question is invisible because each document is one short line. In a real corpus it is
the first real decision, and getting it wrong corrupts every IDF weight in the table at once.
The trap appears when source content is chunked. Long documents do not get indexed whole; they get split into smaller passages so retrieval can return the relevant piece rather than a whole manual. So one logical document becomes many chunk rows, each carrying a slice of the text and a pointer back to the document it came from. Now ask the df question again. Does a term that appears in eight chunks of one long document count as df 8, or df 1?
It has to be df 1, and seeing why is the whole point. Document frequency is supposed to measure how many distinct things a term appears in, as a proxy for how discriminating it is. If a single long document gets sliced into eight chunks, a term that runs through all eight should not look eight times more common than a term confined to one short document. Counting df over chunks would punish terms for the accident of living in long documents and would make IDF a measure of chunking, not of rarity. So df is counted over logical documents: group the chunks by their document id, treat all the chunks of one document as one document, and count a term once if it appears in any of them. The deduplication that turned the’s two occurrences in D1 into a df contribution of 1 is the same deduplication, one level up. Dedup within a chunk, then within a document.
This is the kind of thing that does not show up until something downstream looks subtly wrong. The IDF table can be computed flawlessly by the formula and still be wrong because the unit of counting was the chunk and not the document. The math was never the hard part. The definition underneath the math was. I wanted to know how badly chunk-counting actually distorts the weights rather than just asserting that it does, so I measured it, and the size of the distortion surprised me. That is in the next section.
Running it on a real corpus
The toy corpus is honest about the mechanics and dishonest about the mess. Three documents, all one line long, every term either everywhere or in one place. I did not trust that my intuitions would survive real text, so I built the simplest possible version of all of this in a hundred lines of pure Python and pointed it at a corpus I actually had lying around: the fifty-eight markdown documents in the Machine Learning section of this same wiki. Real documents, wildly different lengths, the ordinary chaos of written prose. The analyzer was deliberately crude, just lowercasing and splitting on non-letters, because I wanted to see what the raw statistics do before any of the tokenization cleverness gets involved. Here is what came back, and where it agreed with the toy and where it did not.
The bottom of the IDF table looked exactly as promised, and the top, restricted to terms in at least two documents so I was not just reading typos, did too:
| lowest IDF | df | idf | highest IDF | df | idf | |
|---|---|---|---|---|---|---|
| the | 32/58 | 0.595 | bayes | 2/58 | 3.367 | |
| and | 32/58 | 0.595 | transformer | 6/58 | 2.269 | |
| of | 31/58 | 0.626 | embedding | 7/58 | 2.115 | |
| to | 31/58 | 0.626 | regularization | 9/58 | 1.863 | |
| in | 31/58 | 0.626 | ||||
| is | 30/58 | 0.659 |
The shape is right. Function words sink, technical terms float. But the bottom row of that first
block is the thing I flagged earlier and it is worth stopping on. On the toy corpus the scored
exactly 0.00. Here it scores 0.595, because it appears in thirty-two of fifty-eight documents,
not all fifty-eight. That is not zero. A document can still earn ranking mass from matching the on
this corpus, just less than from matching a real word. The clean toy claim that “IDF drives stop
words to zero” is only true in the limit where a word is in literally every document, which real
corpora never quite reach. This is the first concrete reason production systems do not rely on plain
log(N/df) alone to neutralize stop words, and instead either remove them outright in the
analyzer or use an IDF
variant with different edge behavior. I had read that and
half-believed it. Watching the refuse to go to zero on real text is what made me believe it.
Then I checked the claim at the center of the page, that count-ranking and TF-IDF actually disagree. I ran the query the gradient descent both ways. By raw match count, the top results are a five-way tie of long documents that all happen to contain all three words, led by one where the alone appears one hundred and eighty-four times. That document leads not because it is about gradient descent but because it is long and full of the most common word in English, which is the exact failure the whole page is about, sitting there in real data. By TF-IDF the ranking reorders: the documents whose mass comes from gradient and descent rise, the page literally titled gradient descent climbs into the top handful where the count ranking had buried it in the tie, and the contributes almost nothing despite its enormous frequency. Same corpus, same query, materially different answer, in the direction the toy predicted. It is one thing to argue that on three sentences and another to watch it rerank real pages.
The part that genuinely surprised me was the chunking experiment. I split each of the fifty-eight documents into roughly one-hundred-and-twenty-token passages, which turned fifty-eight documents into three hundred and seventy-seven chunks, and recomputed IDF counting document frequency over chunks instead of over documents. The weights did not just shift uniformly, which would have been harmless. They stretched apart unevenly:
| term | IDF over documents | IDF over chunks | direction |
|---|---|---|---|
| the | 0.595 | 0.109 | looks much commoner |
| gradient | 1.352 | 1.602 | looks rarer |
| transformer | 2.269 | 3.099 | looks much rarer |
| regularization | 1.863 | 2.797 | looks much rarer |
Counting over chunks makes the look far commoner than it is, because it recurs in nearly every chunk, so its already-low IDF collapses toward nothing. At the same time it makes a genuinely rare term like transformer look even rarer, because it clusters into a few chunks of a few documents and is absent from the hundreds of others. The rarity signal does not survive the change of unit intact; it warps, compressing the common end and exaggerating the rare end. When I ranked the whole vocabulary by IDF under each counting scheme and looked at which terms moved the most, ordinary content words shifted by hundreds of positions purely as an artifact of how I had sliced the documents. Nothing about the documents changed. Only the definition of “a document” changed, and the weights moved out from under me. That is the abstract warning from the previous section turned into a number I can point at, and it is why “what counts as one document” is the first thing I would now check on any IDF table I did not build myself.
The experiment is not the product and it is not shipped, but the numbers in this section are real output from that run, not invented. The value of running it was not confirmation. It was the two places the clean story bent: stop words do not actually reach zero, and the chunking distortion is large and uneven rather than small and uniform.
The forks
Every choice so far had an alternative, and the alternatives are worth naming because each is a place a real system can sit differently.
Raw TF or dampened TF. The product above used raw term frequency, and raw TF grows without limit: a document that says a term ten times scores ten times higher on that term. That is already suspicious, because the tenth occurrence of refund in a support article tells you almost nothing the first did not. A common patch is to dampen TF before multiplying, often with a log, so the curve rises fast at first and then flattens. This is the right instinct, and it is the instinct BM25 turns into a proper tunable saturation curve. TF-IDF with raw TF is the version that motivates the fix; it is not the version you ship.
No length normalization. Nothing in TF-IDF accounts for document length, and length quietly inflates scores: a long document accumulates higher term frequencies just by being long, so it racks up TF-IDF mass without being any more relevant. The toy corpus hides this because all three documents are about the same length. On real text it is the other thing BM25 was built to fix, alongside saturation. Raw TF-IDF has no defense against a long document drowning a short, precise one.
Which IDF formula. The log(N / df) written here is the clean teaching form, and it is not the
form most production engines use. It can behave badly at the edges, and there are smoothed and
non-negative variants that handle very common and very rare terms more gracefully. The choice among
them is not cosmetic: when one component computes IDF and another consumes it, the two formulas have
to be the same formula, or the weights are looked up against numbers that were never computed. That
contract gets its own leaf, IDF variants.
Why absolute IDF scale matters less than it looks. This one confused me until I worked it
through, so it is worth the few lines. Some serving paths normalize the per-term IDF weights of a
single query to unit length before scoring, dividing each weight by the square root of the sum of
their squares. The consequence is that multiplying every IDF in the table by a constant changes
nothing about the result. Take a two-term query with weights (2, 1). Normalized, that is
(2, 1) / sqrt(4 + 1) = (0.894, 0.447). Now double both weights to (4, 2), as a different IDF
variant or a different log base might. Normalized, that is (4, 2) / sqrt(16 + 4) = (0.894, 0.447),
the identical vector. The normalization divides the common factor right back out. So when people say
the choice of IDF variant is “only about relative weights, not absolute scale,” this is the mechanism
they mean: any per-query rescaling that the variants share washes out, and only the shape of the
weighting across a query’s terms survives. That reassured me that picking a variant is mostly a
question of edge behavior, not of getting some global constant right.
Why the clean formula grows smoothing terms. The log(N / df) I have used throughout has no
+1 or +0.5 anywhere, and every production formula does. The reasons are concrete, not decorative.
A +1 inside the denominator, log(N / (df + 1)), keeps the expression defined for a term the table
has never seen, where df = 0 would otherwise divide by zero. A +0.5 on both sides is a smoothing
convention borrowed from probability estimates, nudging the ratio away from its extremes so a term in
exactly one document and a term in zero documents do not produce wildly unstable weights. And an outer
1 +, as in log(1 + N/df), keeps the whole thing from going negative, which the bare form can do
once a term is common enough that N/df drops below one. The toy formula is the idea; these constants
are what it takes to run the idea on a corpus where terms appear zero times, every time, and
everything in between. The exact combination a given engine uses, and the requirement that the
builder and the consumer use the same combination, is the subject of the next leaf, IDF
variants.
TF-IDF, then, is best understood as the first version that worked, not the version that wins. It got two things right that the count got wrong, rarity-weighting and rewarding within-document use, and it left two things for its successor: the unbounded TF curve and the blindness to length. That is the exact shape of the move from TF-IDF to BM25.
The code
The whole experiment is one short pure-Python file, no nltk and no sklearn, so every number above is traceable by hand. The core is twenty lines: tokenize, count document frequency with per-document dedup, and define IDF.
import math, re
from collections import Counter
from pathlib import Path
TOKEN_RE = re.compile(r"[a-z][a-z0-9_]+") # crude analyzer: see the caveats above
def tokenize(text):
return TOKEN_RE.findall(text.lower())
# one logical document per file; strip YAML frontmatter first
docs = {p.name: strip_frontmatter(p.read_text()) for p in corpus_dir.rglob("*.md")}
N = len(docs)
df = Counter()
doc_tokens = {}
for name, raw in docs.items():
tf = Counter(tokenize(raw))
doc_tokens[name] = tf
for term in tf: # dedup within a document
df[term] += 1 # ...then count once per document
def idf(term):
return math.log(N / df[term]) if df[term] else 0.0
def score_tfidf(query, name): # the product, summed over query terms
tf = doc_tokens[name]
return sum(tf[q] * idf(q) for q in query)The chunking distortion is the same code with one line changed, counting df over 120-token slices
instead of over whole documents, and the count-vs-TF-IDF comparison is just ranking the corpus by
score_count and by score_tfidf and reading off where they disagree. The full script, including the
chunk experiment and the rank-movement analysis that produced the figures, lives alongside the page in
the experiment workbench.
Up to the Lexical map. Back to where the term frequencies live, the inverted index. Forward to the function that fixes raw TF and length, BM25, and to the formula-agreement contract, IDF variants. The table this all produces, term mapped to weight, is built offline as the IDF artifact.