Serving at Scale and Migration: Index

First created Jun 7, 2026. Last edited Jun 7, 2026.

Every other territory was about producing a good ranking. This one is about keeping that ranking answering live queries without going down, and about the hardest version of that problem: changing the engine underneath while it runs. After working through the retrieval and pipeline pieces, I found this territory has a different texture. The questions are less about algorithms and more about operational nerve: how you change a system real users depend on and stay able to undo it. Two things make that possible: a way to measure whether a change helped or hurt, and a way to reverse it fast when it hurt.

Two backends, one search

The setup that makes this concrete is migration. A search system that has been running for years is often served by a single engine, a self-hosted lexical engine in the Lucene family, and the plan is to move it to a newer backend, a hosted or distributed search platform. For the duration of the move, both backends exist and both can serve “search,” and the first discipline is simply not to conflate them. They have different performance characteristics and different operational behaviors, and any given index or artifact targets exactly one of them. Knowing which engine a given query, index, or IDF artifact is bound to is the baseline for not making a mess during the overlap.

The reason migrations are hard is not the new engine. It is that the old engine is working, users depend on its results being at least as good as they are today, and the new engine has to match or beat that bar before it can take traffic. The whole game is moving from old to new without users experiencing a regression, which means you need to be able to say, precisely, whether the new engine regresses. That is the first leaf.

Making “don’t regress” a number

“Don’t regress relevance” is a slogan until it is a metric, and the central move of a safe migration is turning it into one. You need a labeled set of queries with judged-relevant documents, and a metric that scores a ranking against those judgments, so that “is the new engine as good as the old one” becomes “is the new engine’s mean score at least the old one’s, minus a tolerance.”
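As a minimal sketch of that comparison, assuming per-query metric scores have already been computed for both engines over the same judged query set: the function name `passes_relevance_gate`, the default tolerance of 0.01, and the sample scores are all illustrative assumptions, not values from a real evaluation.

```python
from statistics import mean

def passes_relevance_gate(old_scores, new_scores, tolerance=0.01):
    """True if the new engine's mean score is at least the old mean minus `tolerance`.

    `old_scores` and `new_scores` hold one metric value per judged query,
    in the same query order. The tolerance default is an assumption.
    """
    if not old_scores or len(old_scores) != len(new_scores):
        raise ValueError("need one score per query for both engines")
    return mean(new_scores) >= mean(old_scores) - tolerance

# Hypothetical per-query scores for three judged queries.
old = [0.90, 0.80, 1.00]
new = [0.95, 0.78, 1.00]
print(passes_relevance_gate(old, new))
```

The tolerance is what keeps the gate from failing on noise; how large it should be depends on how many judged queries back the mean.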

I built the standard metrics on a small judged set to see how they behave, and the instructive part was that the metric you choose decides what you can even see. Mean reciprocal rank, which only cares about where the first relevant result lands, said the old and new engines were identical, because both put a relevant document at rank one for every query. NDCG, which scores the whole ordering weighted by graded relevance, showed the new engine clearly better, because it ordered the second and third relevant documents better, and it even surfaced one query where the new engine slightly regressed, a regression MRR was blind to. Same two systems, two metrics, two different verdicts about whether anything changed.

```txt
mean NDCG@4:  old=0.878  new=0.974  delta=+0.096
MRR:          old=1.000  new=1.000  delta=+0.000
```
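The blind spot is easy to reproduce on a toy case. The sketch below is not the judged set that produced the numbers above. It uses one hypothetical query with graded judgments 3, 2, 1, 0, and it assumes the linear-gain form of DCG (gain divided by log2 of rank plus one); the exponential-gain variant would show the same effect.

```python
import math

def reciprocal_rank(gains):
    """1/rank of the first result with nonzero gain, or 0.0 if there is none."""
    for rank, gain in enumerate(gains, start=1):
        if gain > 0:
            return 1.0 / rank
    return 0.0

def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_gains, judged_gains, k):
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ordering."""
    ideal = dcg(sorted(judged_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for the four documents relevant to one query.
judged = [3, 2, 1, 0]

# Gains in the order each engine returned the documents (illustrative only).
old_ranking = [3, 0, 1, 2]   # best document first, the rest poorly ordered
new_ranking = [3, 2, 1, 0]   # best document first, the rest ideally ordered

print("RR   old/new:", reciprocal_rank(old_ranking), reciprocal_rank(new_ranking))
print("NDCG old/new:", ndcg_at_k(old_ranking, judged, 4), ndcg_at_k(new_ranking, judged, 4))
```

Both rankings put a relevant document at rank one, so reciprocal rank cannot tell them apart, while NDCG penalizes the old ranking for burying the grade-2 document at rank four.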

The lesson is that the metric is the migration gate, and the choice of metric is part of the gate. Without a relevance metric, “is the new engine good enough” is answered on vibes, and you are flipping traffic blind. With one, the regression bar is explicit and checkable before any user is affected. How these metrics work, what each one can and cannot see, and how offline judgments relate to online A/B tests is the leaf relevance evaluation.

Reversing a change fast

The second discipline is that no matter how carefully you measured, you have to assume a change can still go wrong in production, and you build the ability to undo it before you need it. A bad index gets promoted, a new engine misbehaves under real load, a downstream dependency fails. The operational safety valves, the killswitch that pulls a bad change out of the traffic path and the failover that routes around a broken backend, are what let those failures be recoverable instead of outages.
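A rough sketch of those two valves, assuming each backend can be modeled as a plain callable that takes a query and returns results. `SearchRouter`, `killswitch_on`, and the two stand-in engines are hypothetical names for illustration, not any particular platform's API.

```python
class SearchRouter:
    """Routes queries to a primary backend, with a killswitch and failover.

    Backends are modeled as callables taking a query string; that shape
    is an assumption of this sketch.
    """

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback
        self.killswitch_on = False  # flipped by an operator, not by code

    def search(self, query):
        # Killswitch: pull the primary out of the traffic path entirely.
        if self.killswitch_on:
            return self.fallback(query)
        # Failover: route around the primary when it errors.
        # A real system would catch narrower errors (timeouts, connection
        # failures) rather than every Exception.
        try:
            return self.primary(query)
        except Exception:
            return self.fallback(query)

def new_engine(query):
    raise TimeoutError("new backend unavailable")  # simulate a broken backend

def old_engine(query):
    return [f"old-result-for:{query}"]

router = SearchRouter(primary=new_engine, fallback=old_engine)
results = router.search("laptop")
print(results)
```

The difference between the two paths is who triggers them: failover reacts to an error on a single request, while the killswitch is a deliberate decision that holds until someone turns it off.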

This connects directly back to the staging-to-serving handoff: because builds are immutable and promotion is an atomic pointer flip, rolling back a bad index is flipping the pointer to the previous build, not rebuilding anything. The same property that makes the handoff safe makes the rollback fast. And during a migration, the ability to send traffic to the old engine or the new one, and to flip that allocation quickly, is the migration-scale version of the same valve: you ramp traffic onto the new engine gradually, watch the relevance and latency metrics, and if they go bad you flip back to the old engine that is still running. How traffic gets ramped and flipped, and what the killswitch and failover paths actually guard against, is the leaf migration.
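Both valves can be sketched in a few lines, under stated assumptions: builds are identified by opaque string IDs, the serving pointer lives in one process, and traffic is bucketed by hashing a request key. `ServingPointer` and `pick_engine` are hypothetical names; a production version would keep the pointer in shared storage and update it atomically there.

```python
import hashlib

class ServingPointer:
    """Tracks which immutable build is live, with a one-level undo."""

    def __init__(self, initial_build):
        self.current = initial_build
        self.previous = None

    def promote(self, build_id):
        # Promotion only moves the pointer; the old build stays on disk.
        self.previous, self.current = self.current, build_id

    def rollback(self):
        # Rolling back is another pointer move, not a rebuild.
        if self.previous is None:
            raise RuntimeError("no previous build to roll back to")
        self.current, self.previous = self.previous, None

def pick_engine(request_key, new_engine_percent):
    """Deterministically send roughly `new_engine_percent` of keys to the new engine."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "new" if bucket < new_engine_percent else "old"

pointer = ServingPointer("build-001")
pointer.promote("build-002")
pointer.rollback()
print(pointer.current)

# Ramp: raise the percentage in steps; set it back to 0 to flip everyone to the old engine.
print(pick_engine("user-42", 0), pick_engine("user-42", 100))
```

Hashing the key rather than picking at random keeps a given user on the same engine for the whole ramp step, which makes regressions easier to attribute.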

Latency is the other regression

One thing worth holding alongside relevance: a migration can preserve relevance and still regress on latency, and users feel latency directly. The new engine has to match the old one on speed under real load, not just on ranking quality, so the migration gate is really two metrics, relevance and latency, and a win on one does not excuse a loss on the other. This is where the WAND-style skipping and the ANN recall-cost dial stop being abstractions and become the levers you actually pull to keep the new engine fast enough to ship.
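A sketch of the two-metric gate, assuming latency is compared at the 95th percentile using the nearest-rank method and that the new engine is allowed 10% slack. Both of those choices, the function names, and the latency samples are illustrative assumptions; only the two NDCG values come from the comparison earlier in this page.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def passes_migration_gate(old_ndcg, new_ndcg, old_latency_ms, new_latency_ms,
                          ndcg_tolerance=0.01, latency_slack=1.10, pct=95):
    """Both checks must pass; a relevance win does not excuse a latency loss."""
    relevance_ok = new_ndcg >= old_ndcg - ndcg_tolerance
    latency_ok = (percentile(new_latency_ms, pct)
                  <= percentile(old_latency_ms, pct) * latency_slack)
    return relevance_ok and latency_ok

# Hypothetical latency samples in milliseconds.
old_latency = [10, 12, 11, 50]
slow_new_latency = [9, 10, 11, 80]
ok_new_latency = [9, 10, 11, 52]

print(passes_migration_gate(0.878, 0.974, old_latency, slow_new_latency))
print(passes_migration_gate(0.878, 0.974, old_latency, ok_new_latency))
```

The first call fails despite the better NDCG because the tail latency is worse than the allowed slack, which is the point of gating on both.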

The leaves

Relevance evaluation

Turning ranking quality into a number. Graded judgments, NDCG and MRR and what each can and cannot see, why the metric choice is part of the gate, and how offline metrics relate to online A/B testing.

Migration

Moving from one backend to another without users noticing. Two engines coexisting, gradual traffic ramp with metrics as the guardrail, and the killswitch and failover paths that make a bad cutover reversible.

Up to the whole map. The build it promotes and rolls back comes from the indexing pipeline. The relevance it must not regress is produced by the lexical, semantic, hybrid, and reranking stages it serves.

Index

Relevance Evaluation. Turning ranking quality into a number you can check. Graded relevance judgments, MRR and NDCG and exactly what each one can and cannot see, why the choice of metric is itself part of the regression gate, and how offline metrics relate to online A/B tests.
Migration. Moving a live search system from one backend to another without users noticing. Two engines coexisting, why you ramp traffic gradually instead of cutting over, the relevance and latency metrics that act as guardrails on the ramp, and the killswitch and failover paths that make a bad cutover reversible rather than an outage.