Model Evaluation and Experiment Design

Rigorous evaluation separates useful models from overfit artifacts. This article covers metrics, splitting strategies, statistical testing, calibration, conformal prediction, and experiment design for both offline evaluation and online A/B testing.


Classification Metrics

Beyond Accuracy

Accuracy is misleading when classes are imbalanced. A classifier that always predicts the majority class achieves high accuracy on a 99/1 split while being useless.

Confusion matrix decomposes predictions into four categories:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

Precision: $P = \frac{\text{TP}}{\text{TP} + \text{FP}}$ — of predictions labeled positive, what fraction are correct?

Recall (Sensitivity): $R = \frac{\text{TP}}{\text{TP} + \text{FN}}$ — of actual positives, what fraction are detected?

F1 Score: $F_1 = 2 \cdot \frac{P \cdot R}{P + R} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}$ — harmonic mean of precision and recall.

The precision-recall tradeoff. Adjusting the classification threshold trades precision for recall. A lower threshold catches more positives (higher recall) but includes more false positives (lower precision). The optimal threshold depends on the relative cost of false positives vs false negatives.
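The tradeoff is easy to see numerically. A minimal sketch (function and variable names are illustrative) that computes all three metrics at a given threshold:

```python
import numpy as np

def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F1 at a single decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])
p_hi, r_hi, _ = precision_recall_f1(y_true, scores, 0.75)  # strict threshold
p_lo, r_lo, _ = precision_recall_f1(y_true, scores, 0.35)  # loose threshold
```

On this toy data, lowering the threshold from 0.75 to 0.35 raises recall (2/3 → 1.0) at the cost of precision (1.0 → 0.75).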

ROC and PR Curves

ROC curve plots True Positive Rate vs False Positive Rate across all thresholds. AUC-ROC summarizes discrimination ability: 0.5 = random, 1.0 = perfect. AUC-ROC is threshold-independent and prevalence-independent.

Precision-Recall curve plots Precision vs Recall. AUC-PR is more informative than AUC-ROC for imbalanced datasets because it does not credit true negatives. When the positive class is rare (fraud detection, disease screening), a model can achieve high AUC-ROC by correctly classifying the abundant negatives while still performing poorly on the positive class.
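AUC-ROC equals the probability that a randomly drawn positive outscores a randomly drawn negative, which gives a direct (if quadratic-time) way to compute it; a sketch:

```python
import numpy as np

def auc_roc(y_true, scores):
    """AUC-ROC as P(score of a positive > score of a negative); ties count 1/2."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # positive outscores negative
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
s = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])
auc = auc_roc(y, s)  # one positive (0.4) is outscored by one negative (0.7)
```

Production code should use a rank-based $O(n \log n)$ implementation, but the pairwise form makes the probabilistic interpretation concrete.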


Regression Metrics

| Metric | Formula | Properties |
|---|---|---|
| MAE | $\frac{1}{N}\sum \lvert y_i - \hat{y}_i \rvert$ | Robust to outliers; same units as target |
| MSE | $\frac{1}{N}\sum(y_i - \hat{y}_i)^2$ | Penalizes large errors, differentiable |
| RMSE | $\sqrt{\text{MSE}}$ | Same units as target |
| MAPE | $\frac{1}{N}\sum \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert \times 100\%$ | Scale-independent; undefined when $y_i = 0$ |
| $R^2$ | $1 - \text{SS}_{\text{res}}/\text{SS}_{\text{tot}}$ | Fraction of variance explained |
| Spearman $\rho$ | Rank correlation | Measures ranking quality, not absolute accuracy |

Choosing metrics. The metric should reflect the downstream use case. For the tracker cost model, MAE measures per-request accuracy while weekly aggregation error measures the user-facing metric. These can diverge: a model with higher MAE but unbiased symmetric errors can produce better aggregates than a lower-MAE model with systematic bias.
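The standard formulas are a few lines of NumPy; a sketch (the function name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R^2 from per-example errors."""
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    mse = float(np.mean(err ** 2))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {"mae": mae, "mse": mse, "rmse": mse ** 0.5,
            "r2": 1 - ss_res / ss_tot}

m = regression_metrics(np.array([3.0, 5.0, 7.0, 9.0]),
                       np.array([2.5, 5.0, 7.5, 9.0]))
```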


Cross-Validation

K-Fold Cross-Validation

Partition data into $K$ folds. For each fold $k$:

  1. Train on all folds except $k$
  2. Evaluate on fold $k$
  3. Report mean and standard deviation across folds

$$\text{CV}_K = \frac{1}{K}\sum_{k=1}^K \mathcal{L}(\hat{f}_{-k}, \mathcal{D}_k)$$

Bias-variance of CV. Leave-one-out ($K = N$) has low bias but high variance (training sets overlap by $N-2$ examples). $K = 5$ or $K = 10$ provides a good tradeoff.

Stratified K-Fold

Ensures each fold preserves the class distribution of the full dataset. Essential for imbalanced classification.

Group K-Fold

Ensures all examples from the same group (user, session, domain) are in the same fold. Prevents data leakage when examples within a group are correlated. The tracker cost model uses row-level splits because domain identity is a known feature at inference, but domain-level splits would be appropriate if domain generalization were the goal.
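In practice a library routine such as scikit-learn's `GroupKFold` handles this; the idea can be sketched as a greedy assignment of whole groups to folds (all names below are illustrative):

```python
from collections import Counter

def assign_groups_to_folds(groups, k):
    """Place each group in the currently smallest fold so that all rows
    of a group share a fold; largest groups are placed first."""
    sizes = Counter(groups)
    folds, loads = [[] for _ in range(k)], [0] * k
    for g, s in sizes.most_common():
        i = loads.index(min(loads))  # fold with the fewest rows so far
        folds[i].append(g)
        loads[i] += s
    return folds

rows = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"] * 2
folds = assign_groups_to_folds(rows, 2)
```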

Time Series Cross-Validation

For temporal data, only train on past data to predict future data. Expanding window: fold $k$ trains on $[1, t_k]$ and evaluates on $[t_k + 1, t_{k+1}]$. This simulates the actual deployment scenario where the model is always predicting on data from the future relative to its training.
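A minimal sketch of the expanding-window split, assuming rows are already sorted by time:

```python
def expanding_window_splits(n, n_folds, min_train):
    """Yield (train, test) index ranges: fold k trains on [0, t_k) and
    evaluates on [t_k, t_{k+1}), always predicting strictly forward."""
    test_size = (n - min_train) // n_folds
    for k in range(n_folds):
        t_k = min_train + k * test_size
        yield range(0, t_k), range(t_k, t_k + test_size)

splits = list(expanding_window_splits(n=10, n_folds=3, min_train=4))
```

Each successive fold trains on a strictly larger prefix, so no test index ever precedes a training index.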


Statistical Significance Testing

Bootstrap Confidence Intervals

Resample the test set with replacement $B$ times, compute the metric on each resample, and take the 2.5th and 97.5th percentiles:

$$\text{CI}_{95\%} = [\hat{\theta}_{(0.025)}, \hat{\theta}_{(0.975)}]$$

This is the approach used in the tracker cost model: 1,000 bootstrap resamples produce a 95% CI of [3,314, 3,627] for the XGBoost Tweedie MAE, which does not overlap with the path LUT’s CI of [3,623, 3,984], confirming statistical significance.
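A percentile-bootstrap sketch for the mean of per-example absolute errors (i.e., MAE); names and the synthetic error data are illustrative:

```python
import numpy as np

def bootstrap_ci(per_example_errors, B=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-example errors."""
    rng = np.random.default_rng(seed)
    errs = np.asarray(per_example_errors)
    n = errs.size
    # Resample indices with replacement B times; recompute the metric each time.
    stats = np.array([errs[rng.integers(0, n, n)].mean() for _ in range(B)])
    return float(np.quantile(stats, alpha / 2)), float(np.quantile(stats, 1 - alpha / 2))

errors = np.abs(np.random.default_rng(1).normal(0.0, 1.0, size=500))
lo, hi = bootstrap_ci(errors)
```

Comparing two models, the same machinery applied to the *difference* in per-example losses yields a CI on the improvement itself.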

Paired Tests

When comparing two models on the same test set, use a paired test to account for per-example correlation:

Paired t-test: Tests whether the mean difference in per-example loss is significantly different from zero. Assumes normally distributed differences.

Wilcoxon signed-rank test: Non-parametric alternative. Tests whether the median difference is zero. More robust to non-normal distributions.

McNemar’s test: For classification. Tests whether two models disagree on the same examples asymmetrically (i.e., one model’s errors are not a subset of the other’s).
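The paired t statistic is simple enough to compute by hand; a sketch (in practice `scipy.stats.ttest_rel` returns the statistic and p-value directly):

```python
import math

def paired_t_statistic(loss_a, loss_b):
    """t statistic and degrees of freedom for the paired t-test on
    per-example loss differences; compare |t| to the t distribution
    with the returned df to get a p-value."""
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

t, df = paired_t_statistic([1.0, 1.2, 0.9, 1.3], [0.8, 0.9, 0.7, 1.0])
```

Because the differences here are consistently positive with little spread, the statistic is large even with only four examples.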


Calibration

A model is calibrated if its predicted probabilities match empirical frequencies:

$$P(Y = 1 \mid \hat{p} = p) = p \quad \forall p \in [0, 1]$$

Reliability diagram. Bin predictions by predicted probability, plot predicted vs observed frequency in each bin. A perfectly calibrated model follows the diagonal.

Expected Calibration Error (ECE):

$$\text{ECE} = \sum_{b=1}^B \frac{n_b}{N} \lvert \text{acc}(b) - \text{conf}(b) \rvert$$

where $\text{acc}(b)$ is the accuracy in bin $b$ and $\text{conf}(b)$ is the mean predicted confidence.
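A sketch of ECE for the binary case with equal-width bins, where `p_hat` is the predicted $P(y=1)$ (the multiclass variant bins the max-class confidence instead):

```python
import numpy as np

def expected_calibration_error(y_true, p_hat, n_bins=10):
    """ECE with equal-width probability bins (binary case)."""
    bin_ids = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            acc = y_true[mask].mean()    # empirical frequency in bin b
            conf = p_hat[mask].mean()    # mean predicted confidence
            ece += mask.mean() * abs(acc - conf)
    return float(ece)

# A model predicting 0.2 on events that occur 20% of the time is calibrated.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
p = np.array([0.2] * 5 + [0.8] * 5)
ece = expected_calibration_error(y, p)
```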

Calibration methods:

  • Platt scaling: Fit a logistic regression on the logits to produce calibrated probabilities. Post-hoc, requires a held-out calibration set.
  • Temperature scaling: Divide logits by a learned temperature $T$ before softmax. A single scalar parameter, less prone to overfitting.
  • Isotonic regression: Non-parametric. Fit a monotone function from predicted to calibrated probabilities.
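A sketch of temperature scaling for the binary case, fitting $T$ by grid search on a calibration set; production implementations usually minimize NLL with a gradient optimizer, and all names here are illustrative. The demo uses soft targets generated at a known temperature so the fit is exact:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_temperature(logits, targets, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature T minimizing NLL of sigmoid(logits / T)."""
    def nll(T):
        p = np.clip(sigmoid(logits / T), 1e-12, 1 - 1e-12)
        return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return min(grid, key=nll)

# Logits twice as large as they should be: the fitted T recovers ~2.
logits = np.array([-4.0, -2.0, -1.0, 1.0, 2.0, 4.0])
T = fit_temperature(logits, sigmoid(logits / 2.0))
```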

Conformal Prediction

Conformal prediction provides finite-sample, distribution-free prediction sets with guaranteed coverage:

$$P(Y_{\text{new}} \in C(\mathbf{x}_{\text{new}})) \geq 1 - \alpha$$

This holds regardless of the true distribution $P(X, Y)$, requiring only exchangeability (weaker than i.i.d.).

Split Conformal Prediction

  1. Split data into training, calibration, and test sets
  2. Train model $\hat{f}$ on training set
  3. Compute conformity scores on calibration set: $s_i = \lvert y_i - \hat{f}(x_i) \rvert$ (for regression)
  4. Find the $(1-\alpha)$-quantile of calibration scores: $\hat{q}$
  5. For new inputs: $C(x) = [\hat{f}(x) - \hat{q}, \hat{f}(x) + \hat{q}]$
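The steps above can be sketched as follows, assuming any fitted regressor with a `.predict` method (the constant model and all names are illustrative; `method="higher"` requires NumPy ≥ 1.22):

```python
import numpy as np

class ConstantModel:
    """Stand-in for any fitted regressor with a .predict method."""
    def predict(self, X):
        return np.zeros(len(X))

def split_conformal_interval(model, X_cal, y_cal, X_new, alpha=0.1):
    """Conformity scores on the calibration set, the finite-sample-corrected
    quantile, and symmetric prediction intervals for new inputs."""
    scores = np.abs(y_cal - model.predict(X_cal))
    n = scores.size
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(scores, level, method="higher")
    preds = model.predict(X_new)
    return preds - q_hat, preds + q_hat

y_cal = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0, -0.5, 1.5, -3.0])
lo, hi = split_conformal_interval(ConstantModel(), np.zeros((9, 1)), y_cal,
                                  np.zeros((2, 1)))
```

The $\lceil (n+1)(1-\alpha) \rceil / n$ quantile level (rather than a plain $1-\alpha$) is what makes the coverage guarantee hold at finite sample sizes.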

Coverage guarantee. If calibration and test data are exchangeable, the prediction set contains the true value with probability at least $1 - \alpha$. No assumptions on the model quality or data distribution.

Adaptive intervals. The basic method produces constant-width intervals. Conformalized quantile regression (CQR) uses quantile regression to produce intervals that are wider where the model is uncertain and narrower where it is confident:

$$s_i = \max(\hat{q}_{\alpha/2}(x_i) - y_i, \; y_i - \hat{q}_{1-\alpha/2}(x_i))$$

The tracker cost model’s quantile regression achieves 85.6% coverage at the 80% nominal level, slightly conservative, which is the desirable direction for user-facing estimates.


A/B Testing

Design

An A/B test randomly assigns users to control (existing system) and treatment (new model) groups, measures a metric, and tests whether the difference is statistically significant.

Sample size calculation. Required sample size for detecting effect size $\delta$ with significance $\alpha$ and power $1 - \beta$:

$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \cdot 2\sigma^2}{\delta^2}$$

For a 1% lift in a metric with standard deviation 0.5 at $\alpha = 0.05$, $\beta = 0.2$: $n \approx 39{,}200$ per group.
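The formula maps directly to the standard library; a sketch reproducing the example above (the function name is illustrative):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Per-group n for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = NormalDist().inv_cdf(power)            # about 0.84
    return ceil((z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta ** 2)

n = sample_size_per_group(delta=0.01, sigma=0.5)
```

Exact z-values give a slightly larger n than the rounded $(1.96 + 0.84)^2 \approx 7.84$ used for the ≈39,200 figure above.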

Randomization unit. Randomize at the user level (not page view or session) to avoid within-user contamination. Use consistent hashing (hash of user ID) for deterministic assignment.

Analysis

Two-sample t-test for the difference in means between control and treatment. Report the p-value, confidence interval for the effect, and the observed lift.

Multiple testing correction. When testing multiple metrics simultaneously, apply Bonferroni ($\alpha' = \alpha/m$) or Benjamini-Hochberg (FDR control). Without correction, testing 20 metrics at $\alpha = 0.05$ yields at least one spurious significant result with probability $1 - 0.95^{20} \approx 64\%$.
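The Benjamini-Hochberg step-up procedure is short enough to sketch directly:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected under Benjamini-Hochberg FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank    # largest rank passing its stepped threshold
    return sorted(order[:k_max])

rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.50])
```

Unlike Bonferroni's single threshold $\alpha/m$, each sorted p-value is compared against its own threshold $\frac{k}{m}\alpha$, which preserves more power when many effects are real.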

Guardrail metrics. Define metrics that must not degrade (latency, error rate, revenue). Even if the primary metric improves, reject the treatment if guardrail metrics significantly worsen.

Pitfalls

  • Peeking. Checking results before the predetermined sample size inflates the false positive rate. Use sequential testing (group sequential boundaries or always-valid p-values) if early stopping is needed.
  • Simpson’s paradox. An effect can reverse when data is aggregated across subgroups. Segment analysis by key dimensions (platform, country, user cohort).
  • Novelty/primacy effects. Users may behave differently simply because the experience changed. Wait for stabilization before drawing conclusions.

Offline-Online Correlation

A model that improves offline metrics does not necessarily improve online metrics. Gaps arise from:

  1. Proxy mismatch. The offline metric (e.g., AUC) may not correlate with the business metric (e.g., revenue).
  2. Distribution shift. Training/test data may not match production traffic patterns.
  3. Feedback loops. The model’s predictions change user behavior, which changes the data distribution.
  4. Serving effects. Latency, caching, and error handling differ between offline evaluation and production.

Best practice. Establish offline-online correlation empirically: run multiple A/B tests and measure the relationship between offline metric improvements and online metric improvements. This calibration is model-family-specific and must be re-established when the modeling approach changes.


Summary

| Component | Key Concept |
|---|---|
| Metrics | Choose based on downstream use case; ranking vs absolute vs aggregate |
| Cross-validation | K-fold for i.i.d., group-fold for correlated data, time-series for temporal |
| Statistical testing | Bootstrap CIs, paired tests; always report confidence intervals |
| Calibration | Predicted probabilities should match empirical frequencies |
| Conformal prediction | Distribution-free coverage guarantees for prediction sets |
| A/B testing | Randomized control, pre-determined sample size, multiple testing correction |
| Offline-online gap | Establish empirical correlation; offline improvements don’t guarantee online gains |

Evaluation methodology determines whether we ship useful models or overfit artifacts. The tracker cost model paper exemplifies several of these practices: bootstrap confidence intervals for significance testing, temporal holdouts for generalization assessment, and aggregation accuracy as the metric that reflects the user-facing product.