Relevance Evaluation
Every dial in this system (BM25’s parameters, the ANN recall trade-off, the RRF weighting, how deep to rerank, and the migration gate itself) gets set by measurement rather than preference. That measurement is this leaf. You cannot tune what you cannot score, and you cannot say a change did not regress relevance unless “relevance” is a number. The thing I had to get clear on is that there is no single such number, and the choice of which one you use determines which changes you are even able to detect.
Graded judgments: the ground truth
A relevance metric needs something to measure against, and that something is a set of judgments: for a collection of queries, which documents are relevant and how relevant. The simplest form is binary, relevant or not, but the more useful form is graded, where each query-document pair gets a gain on a small scale, say zero for irrelevant up to three for a perfect answer. These judgments come from human raters or from logged user behavior, and they are the ground truth the metrics compare a ranking against. They are also expensive to produce and never complete, which is a real limitation worth holding: a metric is only as good as its judgments, and a document no rater judged looks irrelevant to the metric even if it is perfect.
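The incompleteness caveat is easy to make concrete. The sketch below uses made-up document IDs and a hypothetical helper, `judged_fraction`, to show how an unjudged document falls back to a gain of zero and how one might report the share of the top k that raters actually covered. The zero fallback mirrors the `rel.get(d, 0)` convention in the metric code at the end of this page; everything else is an assumption for illustration.

```python
# Illustrative judgments for one query; the document IDs are made up.
rel = {"d_refund": 3, "d_returns": 2, "d_billing": 1}

def gains(ranking, rel):
    """Look up each ranked document's grade; unjudged documents count as 0."""
    return [rel.get(doc, 0) for doc in ranking]

def judged_fraction(ranking, rel, k):
    """Share of the top-k results that have any judgment at all (hypothetical helper)."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in rel) / len(top)

# "d_new_faq" was never shown to a rater, so a metric treats it as irrelevant.
ranking = ["d_refund", "d_new_faq", "d_returns", "d_billing"]
print(gains(ranking, rel))               # [3, 0, 2, 1]
print(judged_fraction(ranking, rel, 4))  # 0.75
```

A low judged fraction is a hint that a score drop may reflect missing judgments rather than a worse ranking.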
With judgments in hand, a metric is a function that takes a ranking and the judgments and returns a score saying how good that ordering is. Two metrics dominate, and they see different things.
MRR: where is the first good answer
Mean reciprocal rank is the simplest useful metric. For one query, find the rank of the first relevant result and take its reciprocal: a relevant document at rank one scores 1, at rank two scores 1/2, at rank three 1/3, and so on. Average over all queries and you have MRR. It answers one question: how high up is the first relevant result, on average.
MRR is the right metric when the task is “find me one good answer fast,” like a question with a single correct response. Its blindness is everything after the first relevant result. Once a relevant document is at rank one, MRR is satisfied and does not care what is at ranks two through ten, so it cannot see whether the rest of the ordering is good or garbage. That blindness is not a flaw so much as a scope, but it is a scope you have to know about, because a change that improves the whole ordering while leaving the first hit in place is completely invisible to MRR.
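A small sketch of that blindness, using toy queries and document IDs that are assumptions for illustration: the two sets of rankings below agree on where the first relevant document sits for every query and differ only in the tail, so MRR comes out the same for both.

```python
def reciprocal_rank(ranking, rel):
    """1 / rank of the first document with a positive grade, or 0.0 if none."""
    for rank, doc in enumerate(ranking, start=1):
        if rel.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, judgments):
    """Average the reciprocal rank over every judged query."""
    total = sum(reciprocal_rank(rankings[q], judgments[q]) for q in judgments)
    return total / len(judgments)

# Toy judgments: query -> {doc: gain}
judgments = {"q1": {"a": 3, "b": 2, "c": 1}, "q2": {"x": 2}}

# Same first relevant hit for every query; only the tail of q1 differs.
tidy = {"q1": ["a", "b", "c", "junk"], "q2": ["noise", "x"]}
messy = {"q1": ["a", "junk", "c", "b"], "q2": ["noise", "x"]}

print(mean_reciprocal_rank(tidy, judgments))   # 0.75
print(mean_reciprocal_rank(messy, judgments))  # 0.75
```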
NDCG: how good is the whole ordering
Normalized discounted cumulative gain measures the entire ranking, weighted by how relevant each result is and how high it sits. The construction goes piece by piece. Cumulative gain sums the relevance grades of the results. Discounted cumulative gain divides each result’s grade by a logarithm of its position before summing, log2(i + 1) for the result at rank i in the code below, so a relevant document counts for more near the top and less further down, which matches how users actually read a results list. Normalized DCG divides the ranking’s DCG by the DCG of the ideal ordering, the one that puts the most relevant documents first, so the score lands between zero and one regardless of how many relevant documents a query has, which makes it comparable across queries. In practice the sum is cut off at a depth k, written NDCG@k.
NDCG sees what MRR cannot: the quality of the whole ordering, with partial credit for getting the second-best and third-best documents into good positions. That extra sensitivity is exactly why it is the more common gate for ranking changes.
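To make the construction concrete, here is a worked sketch for a single query with made-up grades. It assumes every judged document appears in the returned list, so the ideal ordering is just the same grades sorted in descending order; a fuller version would take the ideal from the judgment set instead.

```python
import math

def dcg(grades):
    """Sum of grade / log2(position + 1), with positions starting at 1."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))

# Grades of the documents in the order a system returned them (made-up values).
returned = [3, 1, 2]
ideal = sorted(returned, reverse=True)  # best possible ordering of the same grades

# Per-position contribution: the grade shrinks as the position grows.
for position, grade in enumerate(returned, start=1):
    print(position, grade, round(grade / math.log2(position + 1), 4))

ndcg = dcg(returned) / dcg(ideal)
print(round(dcg(returned), 4), round(dcg(ideal), 4), round(ndcg, 4))
```

The first relevant document is at rank one in both the returned and the ideal ordering, so reciprocal rank is 1 either way; the NDCG below 1 comes entirely from the swapped second and third positions.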
The metric is part of the gate
I built both metrics on a small graded judgment set with two systems, an old engine and a new one, to see how they would behave on the same change, and the result made the point better than any argument:
| metric | old engine | new engine | delta |
|---|---|---|---|
| mean NDCG@4 | 0.878 | 0.974 | +0.096 |
| MRR | 1.000 | 1.000 | +0.000 |
MRR said the two systems were identical. It was not wrong: both engines put a relevant document at rank one for every query, so the first-relevant-result metric genuinely could not distinguish them. NDCG said the new engine was clearly better, because it ordered the second and third relevant documents better, which NDCG weighs and MRR ignores. NDCG even surfaced one query where the new engine slightly regressed, dropping from a perfect ordering to a near-perfect one, a per-query regression MRR was structurally blind to.
The lesson is that “did this change regress relevance” has no answer until you pick a metric, and the metric you pick decides which regressions you can see. Gate a ranking change on MRR and you will ship ordering regressions it cannot detect; gate it on NDCG and you see them, but only if you have good graded judgments for NDCG to be meaningful. The choice of metric is not a measurement detail downstream of the gate; it is the gate, and choosing it wrong means the gate is open to exactly the failures the metric is blind to. For ranking quality NDCG is usually the right default, with MRR as a secondary view for the find-one-answer case, but the real discipline is knowing what your chosen metric does not see.
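One way to keep a mean from hiding a per-query regression is to report both. The sketch below is an assumption about how such a report could look, not the script behind the table above: `compare` is a hypothetical helper, and the queries, document IDs, and rankings are toy data.

```python
import math

def ndcg(ranking, rel, k):
    """NDCG@k with a log2(i + 1) discount; unjudged documents count as gain 0."""
    def dcg(grades):
        return sum(g / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))
    ideal = dcg(sorted(rel.values(), reverse=True))
    return dcg([rel.get(d, 0) for d in ranking]) / ideal if ideal else 0.0

def compare(judgments, old_runs, new_runs, k=4, tolerance=1e-9):
    """Return (mean delta, [(query, old, new)] for queries whose NDCG got worse)."""
    deltas, regressed = [], []
    for query, rel in judgments.items():
        old = ndcg(old_runs[query], rel, k)
        new = ndcg(new_runs[query], rel, k)
        deltas.append(new - old)
        if new < old - tolerance:
            regressed.append((query, old, new))
    return sum(deltas) / len(deltas), regressed

# Toy data, not the measurements from the table above.
judgments = {
    "q1": {"a": 3, "b": 2, "c": 1},
    "q2": {"x": 3, "y": 2, "z": 1},
}
old_runs = {"q1": ["c", "b", "a"], "q2": ["x", "y", "z"]}
new_runs = {"q1": ["a", "b", "c"], "q2": ["x", "z", "y"]}

mean_delta, regressed = compare(judgments, old_runs, new_runs)
print(f"mean delta: {mean_delta:+.3f}")
for query, old, new in regressed:
    print(f"regressed: {query} {old:.3f} -> {new:.3f}")
```

In this toy run the mean improves while q2 slips from a perfect ordering to a near-perfect one, which is the shape of regression a mean-only gate would let through.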
Offline metrics and online A/B
These are offline metrics: computed against fixed judgments, fast, repeatable, and what you use to tune and to gate a change before it touches a user. Their limit is the judgments, which are incomplete and may not reflect what users actually want. So offline metrics are the necessary first gate, not the final word.
The final word is online: an A/B test that serves the old ranking to some users and the new ranking to others and measures real behavior, clicks, dwell time, reformulations, task success, on live traffic. The offline metric says “this change is safe to try on users”; the A/B test says “this change actually helped users.” The two are complementary, and the right shape is to use offline metrics to filter changes down to the ones worth an A/B test, because A/B tests are slow and expensive in user exposure, and offline metrics are cheap. A change should clear the offline gate first, then prove itself online. A migration leans on both: offline NDCG and latency to decide the new engine is ready to take any traffic at all, then a ramped A/B against the old engine to confirm it on real users before committing.
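For the ramped A/B, one common way to split traffic is deterministic hashing of a user identifier. The sketch below is an assumption about how that could be done, not a description of this system: `assign_arm`, the salt string, and the user IDs are all hypothetical. The property it is meant to show is that a user who sees the new engine at a small ramp keeps seeing it as the ramp grows.

```python
import hashlib

def assign_arm(user_id: str, ramp_percent: int, salt: str = "engine-migration") -> str:
    """Deterministically place a user in the 'new' or 'old' arm for a given ramp."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "new" if bucket < ramp_percent else "old"

# Synthetic user IDs, only for illustration.
users = [f"user-{i}" for i in range(10_000)]

for ramp in (5, 25, 50):
    share = sum(assign_arm(u, ramp) == "new" for u in users) / len(users)
    print(f"ramp {ramp}% -> observed share on new engine: {share:.3f}")
```

Because the bucket depends only on the salt and the user ID, raising `ramp_percent` adds users to the new arm without moving anyone back, which keeps each user's experience consistent across the ramp.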
The thread is that relevance is not self-evident and not free. It is a number you construct from judgments, with a metric you chose for what it can see, validated against real behavior you measure carefully. Everything else in this system that calls itself “tuned” or “not regressed” is resting on this leaf being done honestly.
The code
Both metrics are short. The whole reason NDCG saw the change and MRR did not lives in the difference between “where is the first relevant result” and “how good is the whole ordering”:

    import math

    # judged relevance: query -> {doc: gain}  (0 irrelevant ... 3 perfect)
    judgments = {
        "refund policy": {"d_refund": 3, "d_returns": 2, "d_billing": 1},
        # ... (further queries elided)
    }

    def reciprocal_rank(ranking, rel):
        # MRR: 1/rank of the FIRST relevant doc
        for i, d in enumerate(ranking, start=1):
            if rel.get(d, 0) > 0:
                return 1.0 / i
        return 0.0  # no relevant doc anywhere in the ranking

    def dcg(ranking, rel, k):
        # discounted cumulative gain over top-k; unjudged docs count as gain 0
        return sum(rel.get(d, 0) / math.log2(i + 1)
                   for i, d in enumerate(ranking[:k], start=1))

    def ndcg(ranking, rel, k):
        # DCG normalized by the ideal ordering's DCG
        ideal = sorted(rel.values(), reverse=True)
        idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal[:k], start=1))
        return dcg(ranking, rel, k) / idcg if idcg else 0.0

MRR stops at the first relevant hit, so two systems that both rank a relevant doc first look identical to it. NDCG’s log2(i+1) discount weights every position, so it sees the second- and third-best docs move. The full script with the judgment set and both engines’ rankings is in the experiment workbench beside this page.
The dials this metric tunes are throughout the system: BM25, ANN recall, RRF weighting, rerank depth. The change it most directly gates is a migration.