From Lexical Retrieval

Tokenization and Analysis

Everything in this territory so far has quietly assumed I know what the words in a document are. The inverted index maps words to documents. TF-IDF counts how often a word appears and in how many documents. BM25 saturates a word’s frequency. Every one of them takes “a word” as a given. It is not a given. It is the output of a process, the process has a surprising number of decisions in it, and getting the process even slightly inconsistent corrupts everything built on top of it without raising a single error. This is the deepest leaf in the territory because it is the one the day job is actually built around, and the one most write-ups wave past with the phrase “tokenize the text.”

I want to do two things here. Build the analyzer up stage by stage so the decisions are visible. Then show, with a real measurement, why the consistency requirement is the central correctness risk of a lexical system and not a footnote.

Splitting on spaces is not tokenization

The naive idea, the one I would reach for first, is that tokenizing is splitting on whitespace. Take the text, cut it at the spaces, those are your words. It is wrong in more ways than it looks, and each way is a stage the real pipeline adds back.

Consider the single document fragment The user’s refunds were running late. Split on whitespace and you get The, user's, refunds, were, running, late. Now a user searches refund. None of those tokens is refund. Not refunds, which is plural. The answer is sitting in the document and the match fails on grammar. Worse, The with a capital T is a different string from the, so the same word in two casings becomes two entries in the index with two separate document-frequency counts, and the IDF statistics for both are wrong. Whitespace splitting does not produce words. It produces surface forms, and surface forms are not what you want to match on.

A real analyzer is a pipeline that turns surface forms into normalized tokens. A representative ordering:

txt

raw text
  -> tokenize        split into candidate tokens (smarter than spaces:
                     handles punctuation, contractions, some scripts)
  -> lowercase       "The" and "the" become one token
  -> ASCII-fold      "café" and "cafe" become one token
  -> strip possessive  "user's" -> "user"
  -> remove stopwords  drop "the", "were", function words that carry no signal
  -> stem            "refunds", "refunding", "refunded" -> "refund"
normalized tokens

Run the fragment through it and The user’s refunds were running late becomes roughly user, refund, run, late. Now the query refund matches, because the query goes through the same pipeline and also becomes refund. The whole point of the pipeline is to erase the differences between surface forms that a human would call the same word, so that matching and counting happen over meanings-of-words rather than spellings-of-words.

Watching the pipeline collapse the vocabulary

Each stage merges distinct surface forms into one token, and the effect on the index is concrete: the vocabulary shrinks, and each remaining token’s IDF bucket gets more documents in it and therefore a more stable rarity estimate. I did not want to assert the shrinkage, so I ran three analyzers over the real corpus, each adding one stage to the last: a naive splitter, then folding plus stopword removal, then a light stemmer on top.

analyzer	stages	vocabulary	cut
A0	naive split	5702	—
A1	+ lowercase / fold / stopword	4508	−21%
A2	+ stemming	3495	−22% more

Folding and stopword removal cut the vocabulary by a fifth; stemming cut another fifth on top. Nearly forty percent of the naive “vocabulary” was surface-form duplication, the same words in different casings and inflections counted as different terms. Every one of those duplicates was splitting a term’s document frequency across multiple keys and corrupting its IDF. The pipeline is not cosmetic cleanup; it is what makes the corpus statistics mean what they are supposed to mean.

The value of the stemming stage in particular is easiest to feel by looking at what it unifies. A few of the stem groups from the run:

txt

smooth     <- smooth, smoothed, smoothing, smoothly, smooths
represent  <- represent, represented, representing, represents
block      <- block, blocked, blocking, blocks
prompt     <- prompt, prompted, prompting, prompts
map        <- map, mapped, mapping, maps

Five spellings of smooth collapse into one token, so a query for any of them finds documents using any of the others, and the document frequency of the concept is counted once rather than smeared across five entries. That is the gain. There is also a cost, and the same run handed me a clean example of it: my light stemmer turned compress, compressed, compressing into the stem compres, mangling the word because the heuristic for undoubling consonants fired wrongly. Real stemmers, the Porter algorithm and its descendants, are much more careful than my hundred-line version, but they are still heuristic and they still occasionally overstem or understem. Stemming trades precision for recall, and the trade is usually worth it, but it is a trade, which is the third thing the naive splitter never had to think about.

The break that matters: three analyzers, one vocabulary

Here is the load-bearing claim of the entire territory, and the reason this leaf is the deepest. The analyzer does not run once. It runs in at least three places, and they must all produce identical tokens or the system is silently wrong.

It runs when the corpus statistics are computed, because IDF is a count over tokens and you cannot count tokens you have not produced. It runs when each document is indexed into the inverted index, because the postings are keyed by token. And it runs at query time, because the incoming query has to be turned into the same tokens to look anything up. Three runs of the analyzer: the statistics build, the index build, and the query path.

If any two of them disagree by even one rule, the token the query produces is not the token the index stored, so the lookup misses, or the IDF weight is fetched from the wrong bucket, and the result is a wrong score with no error anywhere. Nothing crashes. The query returns something, just the wrong something, ranked by weights computed against a vocabulary the query is not speaking. This is the worst kind of bug: invisible, plausible, and everywhere at once.

I wanted to know how much damage one missing rule actually does, so I measured it. I took the real corpus and asked: if the index was built with the full analyzer including stemming, but the query path used the analyzer without stemming, what fraction of query tokens would be looked up under the wrong key? The answer:

	tokens	share
total query tokens (A1)	31407	100%
tokens whose stemmed form differs (would misroute)	8080	25.7%

A quarter of all query terms. One stage of disagreement, stemming present on one side and absent on the other, and more than a quarter of every query is silently looked up against weights that do not belong to it. Not a quarter of rare edge cases. A quarter of all tokens, because inflected words are ordinary. The concrete picture, from the same run:

input	index token (full analyzer)	query token (no stemming)	match?
Running	run	running	no — silent miss
Refunds	refund	refunds	no
Embeddings	embedding	embeddings	no
refund’s	refund	refund	yes

Running indexed as run will never be found by a query that produces running, even though they are the same word, because one side did the stemming and the other did not. The user types a word that is in the corpus and gets nothing, and no log line records that anything went wrong. That is the failure the consistency requirement exists to prevent, and twenty-five percent is how big it is when the analyzers differ by a single stage. These are real numbers from a small script on the wiki, with a deliberately partial stemmer, so the exact percentage is corpus- and stemmer-specific, but the order of magnitude, a large fraction rather than a rounding error, is the durable lesson.

Per-field analysis, because not every field is the same

One refinement that took me a moment to absorb: the analyzer is not chosen once for the whole system. It is chosen per field by the index schema. A title field, a body field, and an exact-match identifier field can each use a different analyzer, because they want different behavior. A body field wants full stemming and stopword removal for recall. An identifier or SKU field wants a keyword analyzer that does almost nothing, because you must match the exact string and stemming an error code would be destructive. There are also analyzer families that keep n-grams, indexing adjacent word pairs as single tokens to capture phrases, with their own rules like dropping n-grams that are mostly stopwords.

The practical consequence is that “which analyzer does this field use” is a question you look up in the schema, not one you decide freely, and the agreement requirement applies per field. The statistics, the index, and the query for a given field must all use that field’s analyzer. Mixing a field’s query analysis with another field’s index analysis is the same silent corruption as before, scoped to a field.

The hard part: parity across runtimes

The consistency requirement is easy to state and easy to satisfy when everything runs in one process with one analyzer object. It becomes genuinely hard when the analyzer lives in one runtime and the statistics pipeline lives in another. The canonical analyzer in many search stacks is JVM code, the Lucene analyzer chain, because the engine that does the indexing and serving is Lucene-based. The data pipeline that computes corpus statistics is frequently written in Python, for all the usual data-engineering reasons. So the exact tokens are defined by a piece of Java, and they have to be reproduced by a piece of Python, faithfully enough that the IDF table the Python job builds is keyed by byte-identical tokens to the ones the Java engine will produce at index and query time.

That reproduction is the central design decision hiding behind the bland phrase “wire in the analyzer,” and it has a few real options, each with a cost. You can port the analyzer pipeline into the second runtime, reimplementing the tokenizer, the stemmer, the stopword list, and the folding rules in Python, which is the most fragile path because every rule is a chance to diverge, exactly the divergence the previous section measured at twenty-five percent. You can call out to the JVM from the pipeline, running the real Lucene analyzer over a bridge so the tokens are guaranteed identical, trading correctness-by-construction for the operational weight of running a JVM inside a data job. You can use a Lucene binding or an embedded analyzer library that wraps the real code. Or you can move the statistics computation into the JVM world entirely so there is only ever one analyzer. Which one is right depends on the stack, but the decision is never “just tokenize the text.” It is “guarantee that two runtimes produce the same tokens,” and the measurement is the reason that guarantee is worth paying for.

There is a verification angle here too, and it is the one I would actually build first: whatever the reproduction strategy, you want a parity test that runs a sample of real text through both analyzers and asserts the token streams are identical, so a divergence shows up as a failing test rather than as a quarter of production queries quietly returning the wrong thing. The agreement is the contract; the parity test is how you keep the contract from rotting.

The forks, gathered

How much to normalize. More stages, lowercasing through stemming, raise recall by collapsing more surface forms, at the cost of precision, because aggressive stemming merges words that are not really the same. The right amount is per field and per corpus, and it is measurable against a relevance set rather than guessable.

Stem or not, and which stemmer. Light stemmers understem and leave variants unmerged; aggressive ones overstem and merge unrelated words, like my compres mistake scaled up. The choice is a recall-precision dial, and an unstemmed analyzer is a legitimate setting for fields where exact form matters.

Where the canonical analyzer lives, and how the other runtime matches it. The deepest fork, the one above. Port, bridge, bind, or consolidate. There is no free option; there is only the one whose cost you would rather pay, chosen with the parity test as your safety net.

The thing I keep coming back to is that the math in this territory was the easy part. TF-IDF and BM25 and the IDF variants are settled. The corpus statistics are arithmetic. What is hard, and what actually breaks lexical search in production, is making the tokens agree, across three runs of the analyzer and sometimes across two runtimes, so that every weight is computed and looked up against the same vocabulary. The analyzer is where the words come from, and if the words disagree, nothing built on them can be right.

The whole correctness argument is this one picture, the three places the analyzer runs and the single vocabulary they must all produce:

txt

                          THE SAME ANALYZER, three times

   corpus ──▶ [analyzer] ──▶ IDF statistics  ┐
                                             │
   corpus ──▶ [analyzer] ──▶ inverted index  ├──▶ one shared vocabulary
                                             │      (or every weight is wrong)
   query  ──▶ [analyzer] ──▶ query tokens ───┘

   disagree by ONE rule  ──▶  ~25% of query tokens looked up under the wrong key,
                              silently, no error anywhere

The code

The analyzer is a pipeline of small string transforms; the danger is entirely in whether two copies of it apply the same transforms. Here is the staged version that produced the vocabulary-collapse and misroute numbers, deliberately partial so the stemmer’s own imperfections (the compres mistake) stay visible.

python

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", ...}

def split_naive(text):
    return re.findall(r"[A-Za-z][A-Za-z]+", text)        # A0: keeps case, keeps everything

def fold_and_stop(text):                                 # A1: lowercase, strip 's, drop stopwords
    out = []
    for w in split_naive(text):
        w = w.lower()
        if w.endswith("'s"): w = w[:-2]
        if w in STOPWORDS:   continue
        out.append(w)
    return out

def analyze(text):                                       # A2: A1 + a light stemmer
    return [light_stem(w) for w in fold_and_stop(text)]

# the silent corruption: index uses analyze(), query path uses fold_and_stop()
query_tokens = fold_and_stop(corpus_text)
misrouted = sum(1 for w in query_tokens if light_stem(w) != w)
print(misrouted / len(query_tokens))                     # -> 0.257

The full script, the light stemmer, the vocabulary counts per stage, and the stem-group analysis that showed both stemming’s value and its mistakes, lives in the experiment workbench beside this page.

Up to the Lexical map. The structures that assume tokens already exist are the inverted index, TF-IDF, and BM25. The other half of the agreement problem, two formulas that must match, is IDF variants. The batch job that runs this analyzer to build the statistics, and the runtime-memory cost of doing it at scale, lives in the indexing pipeline.