From Lexical Retrieval

TF-IDF

The inverted index tells me which documents contain a query word. It does not tell me which of them to put first, and ranking is most of what makes a search engine feel good or bad to use, so the second question turns out to be the harder and more interesting one. This page is me working through it from the most naive possible answer down to the weighting scheme that lexical search actually settled on, and then checking the story against a real corpus to see where the clean version was lying to me.

The naive answer is the one I would have written without thinking: rank by how many query words a document matches, most matches first. It is the obvious thing, it is where the Lexical map starts, and what I wanted to understand was the exact sequence of ways it falls apart, because it turns out each repair is a piece of the function that ended up running everywhere. I will build it up on a tiny corpus where I can do every number by hand, and then run the same idea over fifty-eight real documents to see whether the hand-waving survives contact with messy text. (It mostly does. One part of it does not, and that part was the most useful thing I learned writing this.)

The naive rank, and the first crack

Take the three-document corpus the inverted index leaf used, so the toy world carries over:

txt

D1: the cat sat on the mat
D2: the dog sat on the log
D3: the cat chased the dog

Someone searches cat. Two documents contain it, D1 and D3, and with a single query word the match-count ranking is a tie: both score 1. Fine. Now they search the cat. Every document contains the, two contain cat, so D1 and D3 score 2 (they have both words) and D2 scores 1 (it has the). D1 and D3 rank above D2. Also fine, almost by accident.

The crack shows the moment a common word is the only thing a document shares with the query. Imagine a fourth document that is one long paragraph using the twenty times and never mentioning a cat. Against the query the cat it matches one query word, the, same as D2. Against a count that rewards repetition it could even beat D1. A document about nothing the user asked for climbs the ranking on the strength of the most worthless word in the language. The naive count has no idea that the is worthless. It treats every shared word as worth exactly one.

So the arm needs two things the count does not have. It needs to care how much a document uses a word, not just whether it does. And it needs to care how much that word is worth, because the and cat are not the same evidence. Those are the two halves, and the field built them separately before it multiplied them together.

Half one: term frequency, and why more is more

The first half is the easy intuition. If a document says cat five times and another says it once, the first one is probably more about cats. Within-document term frequency, TF, is just that count: the number of times a term t appears in a document d, written tf(t, d). It is the “aboutness” signal, and it is already sitting in the posting once you store frequencies, so the scorer never has to reopen the document to get it.

On the toy corpus, term frequencies are small:

txt

        the   cat   sat   on   mat   dog   log   chased
D1       2     1     1    1     1    0     0      0
D2       2     0     1    1     0    1     1      0
D3       2     0     0    0     0    1     0      1

Read a column to see a term’s spread, a row to see what a document is made of. the has tf = 2 in every document; cat has tf = 1 in D1 and D3 and 0 in D2.

Ranking by TF alone is the count idea with volume added, and it fails in the same place, harder. The query cat still does fine, because cat is a discriminating word and TF only sharpens it. But weight a multi-word query by TF and the common words dominate by sheer frequency. the has the highest TF in the corpus and contributes nothing about relevance. TF measures how much a document talks about a term; it has no opinion on whether anyone should care that it does. That opinion is the second half.

Half two: the rarity signal, and a 1972 idea

The second half is the one that took the field a real insight to find, and it is the heart of the day job, so it gets the slow walk.

The question is how much a matched term is worth as evidence. The answer the field converged on is a rarity argument: a term is worth listening to in proportion to how few documents contain it. A term that appears in every document, like the, partitions the corpus not at all, so matching it tells you nothing about which document is the one. A term that appears in two documents out of three cuts the corpus almost in half when you match it. A term that appears in exactly one document points at that document and no other. Rarity is discriminating power.

The count of documents a term appears in is its document frequency, df(t). Note what it is and is not: it counts documents, not occurrences. the appears twice in D1, but D1 contributes 1 to df(the), not 2. The term is deduplicated within a document before counting, because the question df answers is “in how many documents does this term show up at all,” not “how many times in total.” On the toy corpus, with N = 3 documents:

txt

term     df     in which docs
the       3     D1, D2, D3
cat       2     D1, D3
sat       2     D1, D2
on        2     D1, D2
dog       2     D2, D3
mat       1     D1
log       1     D2
chased    1     D3

df(the) = 3 is the whole corpus. df(mat) = 1 is a term that names one document. The weight has to run the other way from df: high df, low weight; low df, high weight. So you want an inverse function of document frequency, and the cleanest one is to divide the corpus size by the document frequency, N / df(t), and take the log. That is inverse document frequency, IDF:

txt

idf(t) = log( N / df(t) )

What each symbol is doing, in plain language. N is how many documents exist. df(t) is how many of them contain the term. N / df(t) is “one over the fraction of the corpus that has this term,” so it is large for rare terms and shrinks toward 1 as a term spreads everywhere. The log is there to keep the reward for rarity from exploding: the jump from a term in half the corpus to a term in a tenth of it should matter, but not ten times as much, because evidence does not compound that fast. A term in every single document gets N / df = 1, and log(1) = 0, so it contributes exactly nothing. That is the property the naive count was missing, stated as a number: the is worth zero.

The lineage here is one I find worth knowing because it predates almost everything else in the modern stack. Karen Spärck Jones, working in Cambridge, set this out in a 1972 paper on what she called term specificity, arguing that a term’s weight in retrieval should be a function of its inverse document frequency. As I understand it she was reasoning about the statistics of language and relevance rather than implementing a search engine, and the striking thing is that the formula has barely changed in the fifty years since. The idea that the worth of a word is its rarity was sitting there in a 1972 paper before the field had really settled its name, and the production system I work on is still computing essentially the same quantity.

Run IDF on the toy corpus. Logs are natural log here; the base only scales everything by a constant and does not change the ranking.

txt

idf(the)    = log(3/3) = log(1)   = 0.00
idf(cat)    = log(3/2) = log(1.5) = 0.41
idf(sat)    = log(3/2)            = 0.41
idf(dog)    = log(3/2)            = 0.41
idf(mat)    = log(3/1) = log(3)   = 1.10
idf(chased) = log(3/1)            = 1.10

The ordering is already the one intuition wanted. the comes out worthless. Words in two of three documents are worth a moderate 0.41. Words that name a single document, mat and chased, are worth the most at 1.10. Nobody told the formula that the is a stop word. It worked it out from the document frequencies alone, which is the part I still find a little magical: a pure count of where words appear reconstructs the intuition that some words carry information and some do not.

I want to flag one thing about that idf(the) = 0.00 before it misleads me later. It came out exactly zero only because the happens to be in literally every document of this three-document toy, so df = N and log(N/df) = log(1) = 0. That is a property of the toy, not of the word. On a real corpus the shows up almost everywhere but not quite everywhere, so its IDF lands low but not zero, and that gap turns out to matter. I will come back to it with real numbers below.

The product, and why it had to be a product

Each half is incomplete on its own. TF knows how much a document is about a term and not whether the term matters. IDF knows whether a term matters and not how much any given document is about it. The move, generally credited to Gerard Salton and the SMART system at Cornell in the 1960s and 70s, is to multiply them:

txt

score(t, d) = tf(t, d) * idf(t)

and a document’s score for a multi-word query is the sum of this over the query’s terms. Multiplying is the right combiner, not adding, and the toy corpus shows why. Score the query the cat against each document, term by term.

txt

                tf*idf for "the"        tf*idf for "cat"        total
D1   the:2*0.00 = 0.00      cat:1*0.41 = 0.41                   0.41
D2   the:2*0.00 = 0.00      cat:0*0.41 = 0.00                   0.00
D3   the:2*0.00 = 0.00      cat:1*0.41 = 0.41                   0.41

D1 and D3 score 0.41, D2 scores 0. the contributes nothing to any of them, because its IDF is zero and zero times any term frequency is still zero. The whole score is carried by cat, which is exactly the word that should carry it. Go back to the imagined fourth document, the one that says the twenty times and never says cat. Under the naive count it threatened to win. Under TF-IDF it scores 20 * 0.00 + 0 * 0.41 = 0. The product is what makes a high term frequency on a worthless word stay worthless: IDF can veto TF down to nothing. Addition could not do that. If the combiner were tf + idf, the twenty occurrences of the would pile up regardless of its zero IDF, and the junk document would still climb. Multiplication lets the rarity signal gate the frequency signal instead of merely competing with it.

That gating is the entire reason TF-IDF beats the count. A document is rewarded for using a term a lot and for that term being rare, and the reward collapses to nothing the instant either of those fails. Common words cannot win on volume, and rare words cannot win on a single passing mention.

The definition that decides every weight: what is one document

Here is the part the textbook treatment skips and the production reality cannot. Every number above rests on N and df, and both of those rest on a definition: what counts as one document. On the toy corpus that question is invisible because each document is one short line. In a real corpus it is the first real decision, and getting it wrong corrupts every IDF weight in the table at once.

The trap appears when source content is chunked. Long documents do not get indexed whole; they get split into smaller passages so retrieval can return the relevant piece rather than a whole manual. So one logical document becomes many chunk rows, each carrying a slice of the text and a pointer back to the document it came from. Now ask the df question again. Does a term that appears in eight chunks of one long document count as df 8, or df 1?

It has to be df 1, and seeing why is the whole point. Document frequency is supposed to measure how many distinct things a term appears in, as a proxy for how discriminating it is. If a single long document gets sliced into eight chunks, a term that runs through all eight should not look eight times more common than a term confined to one short document. Counting df over chunks would punish terms for the accident of living in long documents and would make IDF a measure of chunking, not of rarity. So df is counted over logical documents: group the chunks by their document id, treat all the chunks of one document as one document, and count a term once if it appears in any of them. The deduplication that turned the’s two occurrences in D1 into a df contribution of 1 is the same deduplication, one level up. Dedup within a chunk, then within a document.

This is the kind of thing that does not show up until something downstream looks subtly wrong. The IDF table can be computed flawlessly by the formula and still be wrong because the unit of counting was the chunk and not the document. The math was never the hard part. The definition underneath the math was. I wanted to know how badly chunk-counting actually distorts the weights rather than just asserting that it does, so I measured it, and the size of the distortion surprised me. That is in the next section.

Running it on a real corpus

The toy corpus is honest about the mechanics and dishonest about the mess. Three documents, all one line long, every term either everywhere or in one place. I did not trust that my intuitions would survive real text, so I built the simplest possible version of all of this in a hundred lines of pure Python and pointed it at a corpus I actually had lying around: the fifty-eight markdown documents in the Machine Learning section of this same wiki. Real documents, wildly different lengths, the ordinary chaos of written prose. The analyzer was deliberately crude, just lowercasing and splitting on non-letters, because I wanted to see what the raw statistics do before any of the tokenization cleverness gets involved. Here is what came back, and where it agreed with the toy and where it did not.

The bottom of the IDF table looked exactly as promised, and the top, restricted to terms in at least two documents so I was not just reading typos, did too:

lowest IDF	df	idf	highest IDF	df	idf
the	32/58	0.595	bayes	2/58	3.367
and	32/58	0.595	transformer	6/58	2.269
of	31/58	0.626	embedding	7/58	2.115
to	31/58	0.626	regularization	9/58	1.863
in	31/58	0.626
is	30/58	0.659

The shape is right. Function words sink, technical terms float. But the bottom row of that first block is the thing I flagged earlier and it is worth stopping on. On the toy corpus the scored exactly 0.00. Here it scores 0.595, because it appears in thirty-two of fifty-eight documents, not all fifty-eight. That is not zero. A document can still earn ranking mass from matching the on this corpus, just less than from matching a real word. The clean toy claim that “IDF drives stop words to zero” is only true in the limit where a word is in literally every document, which real corpora never quite reach. This is the first concrete reason production systems do not rely on plain log(N/df) alone to neutralize stop words, and instead either remove them outright in the analyzer or use an IDF variant with different edge behavior. I had read that and half-believed it. Watching the refuse to go to zero on real text is what made me believe it.

Then I checked the claim at the center of the page, that count-ranking and TF-IDF actually disagree. I ran the query the gradient descent both ways. By raw match count, the top results are a five-way tie of long documents that all happen to contain all three words, led by one where the alone appears one hundred and eighty-four times. That document leads not because it is about gradient descent but because it is long and full of the most common word in English, which is the exact failure the whole page is about, sitting there in real data. By TF-IDF the ranking reorders: the documents whose mass comes from gradient and descent rise, the page literally titled gradient descent climbs into the top handful where the count ranking had buried it in the tie, and the contributes almost nothing despite its enormous frequency. Same corpus, same query, materially different answer, in the direction the toy predicted. It is one thing to argue that on three sentences and another to watch it rerank real pages.

The part that genuinely surprised me was the chunking experiment. I split each of the fifty-eight documents into roughly one-hundred-and-twenty-token passages, which turned fifty-eight documents into three hundred and seventy-seven chunks, and recomputed IDF counting document frequency over chunks instead of over documents. The weights did not just shift uniformly, which would have been harmless. They stretched apart unevenly:

term	IDF over documents	IDF over chunks	direction
the	0.595	0.109	looks much commoner
gradient	1.352	1.602	looks rarer
transformer	2.269	3.099	looks much rarer
regularization	1.863	2.797	looks much rarer

Counting over chunks makes the look far commoner than it is, because it recurs in nearly every chunk, so its already-low IDF collapses toward nothing. At the same time it makes a genuinely rare term like transformer look even rarer, because it clusters into a few chunks of a few documents and is absent from the hundreds of others. The rarity signal does not survive the change of unit intact; it warps, compressing the common end and exaggerating the rare end. When I ranked the whole vocabulary by IDF under each counting scheme and looked at which terms moved the most, ordinary content words shifted by hundreds of positions purely as an artifact of how I had sliced the documents. Nothing about the documents changed. Only the definition of “a document” changed, and the weights moved out from under me. That is the abstract warning from the previous section turned into a number I can point at, and it is why “what counts as one document” is the first thing I would now check on any IDF table I did not build myself.

The experiment is not the product and it is not shipped, but the numbers in this section are real output from that run, not invented. The value of running it was not confirmation. It was the two places the clean story bent: stop words do not actually reach zero, and the chunking distortion is large and uneven rather than small and uniform.

The forks

Every choice so far had an alternative, and the alternatives are worth naming because each is a place a real system can sit differently.

Raw TF or dampened TF. The product above used raw term frequency, and raw TF grows without limit: a document that says a term ten times scores ten times higher on that term. That is already suspicious, because the tenth occurrence of refund in a support article tells you almost nothing the first did not. A common patch is to dampen TF before multiplying, often with a log, so the curve rises fast at first and then flattens. This is the right instinct, and it is the instinct BM25 turns into a proper tunable saturation curve. TF-IDF with raw TF is the version that motivates the fix; it is not the version you ship.

No length normalization. Nothing in TF-IDF accounts for document length, and length quietly inflates scores: a long document accumulates higher term frequencies just by being long, so it racks up TF-IDF mass without being any more relevant. The toy corpus hides this because all three documents are about the same length. On real text it is the other thing BM25 was built to fix, alongside saturation. Raw TF-IDF has no defense against a long document drowning a short, precise one.

Which IDF formula. The log(N / df) written here is the clean teaching form, and it is not the form most production engines use. It can behave badly at the edges, and there are smoothed and non-negative variants that handle very common and very rare terms more gracefully. The choice among them is not cosmetic: when one component computes IDF and another consumes it, the two formulas have to be the same formula, or the weights are looked up against numbers that were never computed. That contract gets its own leaf, IDF variants.

Why absolute IDF scale matters less than it looks. This one confused me until I worked it through, so it is worth the few lines. Some serving paths normalize the per-term IDF weights of a single query to unit length before scoring, dividing each weight by the square root of the sum of their squares. The consequence is that multiplying every IDF in the table by a constant changes nothing about the result. Take a two-term query with weights (2, 1). Normalized, that is (2, 1) / sqrt(4 + 1) = (0.894, 0.447). Now double both weights to (4, 2), as a different IDF variant or a different log base might. Normalized, that is (4, 2) / sqrt(16 + 4) = (0.894, 0.447), the identical vector. The normalization divides the common factor right back out. So when people say the choice of IDF variant is “only about relative weights, not absolute scale,” this is the mechanism they mean: any per-query rescaling that the variants share washes out, and only the shape of the weighting across a query’s terms survives. That reassured me that picking a variant is mostly a question of edge behavior, not of getting some global constant right.

Why the clean formula grows smoothing terms. The log(N / df) I have used throughout has no +1 or +0.5 anywhere, and every production formula does. The reasons are concrete, not decorative. A +1 inside the denominator, log(N / (df + 1)), keeps the expression defined for a term the table has never seen, where df = 0 would otherwise divide by zero. A +0.5 on both sides is a smoothing convention borrowed from probability estimates, nudging the ratio away from its extremes so a term in exactly one document and a term in zero documents do not produce wildly unstable weights. And an outer 1 +, as in log(1 + N/df), keeps the whole thing from going negative, which the bare form can do once a term is common enough that N/df drops below one. The toy formula is the idea; these constants are what it takes to run the idea on a corpus where terms appear zero times, every time, and everything in between. The exact combination a given engine uses, and the requirement that the builder and the consumer use the same combination, is the subject of the next leaf, IDF variants.

TF-IDF, then, is best understood as the first version that worked, not the version that wins. It got two things right that the count got wrong, rarity-weighting and rewarding within-document use, and it left two things for its successor: the unbounded TF curve and the blindness to length. That is the exact shape of the move from TF-IDF to BM25.

The code

The whole experiment is one short pure-Python file, no nltk and no sklearn, so every number above is traceable by hand. The core is twenty lines: tokenize, count document frequency with per-document dedup, and define IDF.

python

import math, re
from collections import Counter
from pathlib import Path

TOKEN_RE = re.compile(r"[a-z][a-z0-9_]+")        # crude analyzer: see the caveats above

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

# one logical document per file; strip YAML frontmatter first
docs = {p.name: strip_frontmatter(p.read_text()) for p in corpus_dir.rglob("*.md")}
N = len(docs)

df = Counter()
doc_tokens = {}
for name, raw in docs.items():
    tf = Counter(tokenize(raw))
    doc_tokens[name] = tf
    for term in tf:                              # dedup within a document
        df[term] += 1                            # ...then count once per document

def idf(term):
    return math.log(N / df[term]) if df[term] else 0.0

def score_tfidf(query, name):                    # the product, summed over query terms
    tf = doc_tokens[name]
    return sum(tf[q] * idf(q) for q in query)

The chunking distortion is the same code with one line changed, counting df over 120-token slices instead of over whole documents, and the count-vs-TF-IDF comparison is just ranking the corpus by score_count and by score_tfidf and reading off where they disagree. The full script, including the chunk experiment and the rank-movement analysis that produced the figures, lives alongside the page in the experiment workbench.

Up to the Lexical map. Back to where the term frequencies live, the inverted index. Forward to the function that fixes raw TF and length, BM25, and to the formula-agreement contract, IDF variants. The table this all produces, term mapped to weight, is built offline as the IDF artifact.