From Lexical Retrieval

WAND and Skip-Based Retrieval

For most of this territory I thought of IDF as a quality signal. It tells the ranker which matched words discriminate. Then I learned it is also what lets a retriever do less work, and that reframed why getting the weights right matters operationally and not just for relevance. This leaf is about that second role: the algorithms that use IDF-derived score bounds to skip documents they can prove cannot make the cut. WAND is the canonical one, and what convinced me it was real was watching it skip nearly half the candidates on a query while returning the identical top-k.

The setup is the part I want to be honest about first, because the naive picture of retrieval is wasteful in a way that is easy to miss.

The waste in scoring everything

A query word’s postings list is every document containing that word. The straightforward way to find the top-k results is to gather every document that contains any query term, score each one with BM25, sort, and return the top k. It is correct and it is what I would write first. The waste is that most of those documents have no chance of finishing in the top k, and you score them anyway.

Picture the query the gradient. the matches almost the entire corpus; gradient matches a handful. The straightforward approach scores every document that contains the, which is nearly everything, even though a document matching only the and not gradient scores almost nothing, because the has tiny IDF. You spend nearly all your scoring effort on documents that were never going to rank, and on a real corpus, where the common term’s postings list can be millions long, that effort is the dominant cost of the query. The realization behind WAND is that you already have enough information to skip those documents without scoring them, and the information is the IDF.

The bound, and what makes it sound

The key quantity is an upper bound on how much a single term can ever contribute to any document’s score. In BM25 a term’s contribution is its IDF times a saturating term-frequency factor, and that factor has a ceiling: no matter how many times the word appears, the saturation caps it. So the most a term t can add to any document is bounded, and the bound is proportional to idf(t). A rare term has a high bound, a common term a low one. You compute one such bound per query term, up front, before scoring anything.

Now the bounds become a proof. Suppose you are partway through a query, you have a running top-k, and the kth-best score so far is the threshold every remaining candidate must beat to enter the list. For any document still unscored, the largest score it could possibly achieve is the sum of the upper bounds of the query terms it contains. If that sum is below the threshold, the document cannot reach the top-k, no matter what its actual term frequencies turn out to be, so you can skip scoring it entirely and lose nothing. The bound is what makes the skip sound: you are not guessing the document is irrelevant, you are proving its best case loses.

This is why WAND is exact and not approximate. It returns the same top-k as scoring everything. It just refuses to evaluate documents whose best possible score, by the IDF bounds, falls short of the current threshold. The name is from the original paper by Broder and colleagues at IBM in the early 2000s, where WAND stands for Weak AND or Weighted AND, an operator that sits between a strict AND (every term must match) and a pure OR (any term) by using the score bounds to decide which documents are worth looking at.

The pivot, briefly

The mechanism that turns the bound into skipping is the pivot. Keep each query term’s postings list sorted by document id with a pointer into it. Sort the terms by the document their pointer currently sits on. Walk down them accumulating upper bounds, and the first term at which the accumulated bound exceeds the threshold is the pivot: the earliest document that could possibly clear the bar is the one the pivot term’s pointer is on. Every document before the pivot document can be skipped, because the terms whose postings start earlier do not, even combined, have enough bound to matter. So you advance the trailing pointers straight to the pivot document, skipping the runs in between, and only fully score documents at or beyond the pivot. The skip pointers in the postings lists are what make jumping ahead cheap. As the top-k fills and the threshold rises, the pivot moves later and later, and more gets skipped, so WAND naturally does less work the deeper into the query it gets.

Watching it skip

I wanted a number, not just the argument, so I built a small WAND alongside an exhaustive scorer over the real corpus and counted how many full BM25 evaluations each did for the same top-k. Every row returned an identical top-k, so the optimization is exact; the interesting column is how much it skipped.

query	exhaustive	WAND	skipped	top-k same
the gradient	32	17	15 (47%)	yes
loss function regularization	24	14	10 (42%)	yes
gradient descent	15	11	4 (27%)	yes
transformer embedding attention	10	8	2 (20%)	yes
naive bayes	5	5	0 (0%)	yes
the of is in	32	29	3 (9%)	yes

The headline is the gradient, which skips forty-seven percent of its candidates. That is exactly the case the algorithm was built for: a query mixing a common term with a rare one. Once the top-k fills with documents that actually contain gradient, the threshold is high, and the enormous run of documents that contain only the cannot clear it, because the’s upper bound is tiny. WAND proves that and skips them. The savings are real and they come straight from the IDF gap between the two terms.

The two rows that skip almost nothing are the more instructive ones, because they show where the optimization has nothing to work with. naive bayes skips zero percent: both terms are rare, their postings lists are short, and nearly every candidate genuinely contains both and is a real contender, so there is nothing to prune. And the of is in, all common terms, skips only nine percent, which surprised me until I thought it through. When every term has a similarly low upper bound, no term dominates, the accumulated bounds creep up slowly, and the threshold never gets high enough relative to the bounds to rule much out. WAND saves the most precisely when the query has IDF skew, a rare term whose high bound sets a bar the common-only documents cannot reach. A query that is all rare or all common gives it little leverage.

That is the connection I came for. WAND’s speedup is a function of how spread out the IDF values are across the query’s terms, which means accurate IDF does double duty: it makes the ranking better and it makes the bounds tight enough to skip aggressively. A wrong IDF, from a formula mismatch or a tokenization disagreement, does not just rank wrong, it loosens or breaks the bounds, and a negative IDF from the wrong variant can break the upper-bound logic outright. The same accuracy that the relevance side needs, the speed side needs too. These are small numbers on a tiny corpus and the absolute counts mean nothing at scale, but the pattern, big skips on IDF-skewed queries and near-zero on flat ones, is the durable behavior.

The forks

Block-Max WAND. Plain WAND uses one global upper bound per term, the most that term can contribute to any document. A document can be pruned only if that global bound is too low. Block-Max WAND, the standard improvement, stores a separate maximum per block of the postings list, so the bound used for skipping is the local block maximum rather than the global one. Local maxima are usually much tighter than the global one, so the bounds are smaller, the pivot moves faster, and more documents get skipped. The cost is storing the per-block maxima in the index and a more complex traversal. At scale this is generally worth it; the global-bound version is the one to understand first.

How much to retrieve before reranking. WAND produces a fast top-k whose ordering is rough, because BM25 is a cheap scorer. The standard pattern is to retrieve a generous candidate set quickly and then rerank the top of it with a slower, more precise model. WAND is what makes the first stage cheap enough that you can afford the second. How large a k to pull is a recall-versus- cost fork: too small and the reranker never sees the right document, too large and you pay for scoring and reranking documents that do not matter.

When skipping is not worth it. On short postings lists, the bookkeeping of bounds and pivots can cost more than it saves, which is exactly the naive bayes row: zero skips, pure overhead. Some systems fall back to plain scoring when the candidate set is small. The optimization earns its keep on the long, common postings lists, which is also where the waste it removes is largest.

The thing I will keep from this leaf is that the rarity signal I spent the rest of the territory getting right is not only how the engine decides what is relevant. It is also how the engine decides what it is allowed to ignore. Good IDF is both the quality of the ranking and the speed of producing it, which is a much better reason to care about the weights than “the formula says so.”

The code

The idea is small: per term, the most it can ever add to a document is bounded, and once the top-k threshold rises above the sum of a document’s reachable term bounds, you skip it. The bound and the skip are the whole trick.

python

def upper_bound(term):
    # most this term can contribute to ANY document: its idf times the saturated tf ceiling.
    # (bounded here by the max tf actually seen in its postings)
    return max(bm25(tf, doc_len[d], term) for d, tf in postings[term])

# WAND, sketched: walk docs in id order, keep a running top-k and its threshold.
# for each candidate, sum the upper bounds of the query terms it could match.
# if that best-possible score < threshold, it CANNOT enter the top-k -> skip it,
# without ever scoring it fully.
for doc in candidates_in_id_order:
    best_possible = sum(upper_bound(t) for t in query if t in doc)
    if best_possible < threshold:
        continue                       # the skip: proven it can't make the cut
    s = full_bm25_score(doc, query)    # only now do the expensive work
    push_to_topk(doc, s)
    threshold = kth_best_score()       # rises as the top-k fills -> more skipping later

A negative IDF from the wrong variant breaks upper_bound, which is why the non-negative form matters here specifically. The full implementation, the pivot logic, and the skip-rate measurement that produced the figure are in the experiment workbench beside this page.

Up to the Lexical map. The bound comes from BM25 and IDF; the non-negativity it needs is guarded by the right IDF variant; the skip pointers it rides are in the inverted index. The fast candidate set it produces feeds reranking.