Model Evaluation and Experiment Design
Rigorous evaluation separates useful models from overfit artifacts. This article covers metrics, splitting strategies, statistical testing, calibration, conformal prediction, and experiment design for both offline evaluation and online A/B testing.
Classification Metrics
Beyond Accuracy
Accuracy is misleading when classes are imbalanced. A classifier that always predicts the majority class achieves high accuracy on a 99/1 split while being useless.
Confusion matrix decomposes predictions into four categories:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Precision: $\text{Precision} = \frac{TP}{TP + FP}$ — of predictions labeled positive, what fraction are correct?
Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$ — of actual positives, what fraction are detected?
F1 Score: $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ — harmonic mean of precision and recall.
The precision-recall tradeoff. Adjusting the classification threshold trades precision for recall. A lower threshold catches more positives (higher recall) but includes more false positives (lower precision). The optimal threshold depends on the relative cost of false positives vs false negatives.
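The tradeoff can be seen directly by sweeping a threshold over toy scores (illustrative data, not from any real classifier):

```python
# Sketch: precision and recall at two thresholds on toy scores.
def precision_recall(y_true, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

# Lowering the threshold raises recall but admits more false positives.
for t in (0.65, 0.35):
    p, r = precision_recall(y_true, scores, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```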
ROC and PR Curves
ROC curve plots True Positive Rate vs False Positive Rate across all thresholds. AUC-ROC summarizes discrimination ability: 0.5 = random, 1.0 = perfect. AUC-ROC is threshold-independent and prevalence-independent.
Precision-Recall curve plots Precision vs Recall. AUC-PR is more informative than AUC-ROC for imbalanced datasets because it does not credit true negatives. When the positive class is rare (fraud detection, disease screening), a model can achieve high AUC-ROC by correctly classifying the abundant negatives while still performing poorly on the positive class.
Regression Metrics
| Metric | Formula | Properties |
|---|---|---|
| MAE | $\frac{1}{N}\sum_i \lvert y_i - \hat{y}_i \rvert$ | Robust to outliers, same units as target |
| MSE | $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ | Penalizes large errors, differentiable |
| RMSE | $\sqrt{\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2}$ | Same units as target |
| MAPE | $\frac{100}{N}\sum_i \lvert \frac{y_i - \hat{y}_i}{y_i} \rvert$ | Scale-free; undefined when $y_i = 0$ |
| $R^2$ | $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | Fraction of variance explained |
| Spearman $\rho$ | Rank correlation of $y$ and $\hat{y}$ | Measures ranking quality, not absolute accuracy |
Choosing metrics. The metric should reflect the downstream use case. For the tracker cost model, MAE measures per-request accuracy while weekly aggregation error measures the user-facing metric. These can diverge: a model with higher MAE but unbiased symmetric errors can produce better aggregates than a lower-MAE model with systematic bias.
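The divergence is easy to demonstrate with constructed numbers (toy values, not the paper's data): a symmetric-error model with twice the MAE still aggregates perfectly, while a biased model carries its bias straight into the total.

```python
# Sketch: higher-MAE-but-unbiased vs lower-MAE-but-biased, on toy costs.
actual = [100.0] * 10

unbiased = [90.0, 110.0] * 5   # symmetric errors: MAE = 10, aggregate error = 0
biased = [105.0] * 10          # systematic +5 bias: MAE = 5, aggregate error = 5%

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def aggregate_error(y, yhat):
    # Relative error of the summed prediction -- the "weekly total" view.
    return abs(sum(yhat) - sum(y)) / sum(y)

print(mae(actual, unbiased), aggregate_error(actual, unbiased))  # 10.0 0.0
print(mae(actual, biased), aggregate_error(actual, biased))      # 5.0 0.05
```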
Cross-Validation
K-Fold Cross-Validation
Partition data into $K$ folds. For each fold $k$:
- Train on all folds except $k$
- Evaluate on fold $k$
- Report mean and standard deviation across folds
Bias-variance of CV. Leave-one-out ($K = N$) has low bias but high variance (any two training sets overlap in $N - 2$ examples, so the fold estimates are highly correlated). $K = 5$ or $K = 10$ provides a good tradeoff.
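The fold construction can be sketched in a few lines (in practice sklearn's `KFold` does this; here is a dependency-free version):

```python
# Sketch: generate shuffled K-fold train/test index splits.
import random

def kfold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin into k folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train, test in kfold_indices(10, 5):
    print(len(train), len(test))  # each example is held out exactly once
```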
Stratified K-Fold
Ensures each fold preserves the class distribution of the full dataset. Essential for imbalanced classification.
Group K-Fold
Ensures all examples from the same group (user, session, domain) are in the same fold. Prevents data leakage when examples within a group are correlated. The tracker cost model uses row-level splits because domain identity is a known feature at inference, but domain-level splits would be appropriate if domain generalization were the goal.
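A group-aware split can be sketched with greedy size balancing (sklearn's `GroupKFold` is the production tool; the domain names below are hypothetical):

```python
# Sketch: assign whole groups to folds so correlated rows never straddle
# the train/test boundary.
from collections import defaultdict

def group_kfold(groups, k):
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    folds = [[] for _ in range(k)]
    # Largest groups first, each into the currently smallest fold.
    for _, rows in sorted(by_group.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(rows)
    return folds

groups = ["a.com", "a.com", "b.com", "c.com", "c.com", "c.com", "d.com"]
folds = group_kfold(groups, 2)
```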
Time Series Cross-Validation
For temporal data, only train on past data to predict future data. Expanding window: fold $k$ trains on $[t_0, t_k]$ and evaluates on $(t_k, t_{k+1}]$. This simulates the actual deployment scenario where the model is always predicting on data from the future relative to its training.
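An expanding-window splitter over time-ordered indices (0 = oldest) can be sketched as:

```python
# Sketch: expanding-window time-series splits; every training index
# precedes every evaluation index.
def expanding_window_splits(n, n_folds):
    chunk = n // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * chunk))
        test = list(range(k * chunk, min((k + 1) * chunk, n)))
        yield train, test

for train, test in expanding_window_splits(10, 4):
    print(f"train [0, {max(train)}] -> eval [{min(test)}, {max(test)}]")
```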
Statistical Significance Testing
Bootstrap Confidence Intervals
Resample the test set with replacement $B$ times, compute the metric on each resample, and take the 2.5th and 97.5th percentiles:

$$\text{CI}_{95\%} = \left[\hat{\theta}^*_{(0.025)},\; \hat{\theta}^*_{(0.975)}\right]$$

where $\hat{\theta}^*_{(q)}$ is the $q$-quantile of the bootstrap statistics.
This is the approach used in the tracker cost model: 1,000 bootstrap resamples produce a 95% CI of [3,314, 3,627] for the XGBoost Tweedie MAE, which does not overlap with the path LUT’s CI of [3,623, 3,984], confirming statistical significance.
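A minimal percentile-bootstrap sketch (toy absolute errors, not the paper's data):

```python
# Sketch: percentile bootstrap for a 95% CI on a metric.
import random

def bootstrap_ci(values, metric, n_boot=1000, seed=0):
    rng = random.Random(seed)
    stats = sorted(
        metric([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    # 2.5th and 97.5th percentiles of the bootstrap distribution
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(1)
abs_errors = [abs(rng.gauss(0, 1)) for _ in range(200)]
lo, hi = bootstrap_ci(abs_errors, mean)
print(f"95% CI for mean absolute error: [{lo:.3f}, {hi:.3f}]")
```

To compare two models, the same resample indices should be applied to both error vectors so the CIs are computed on paired data.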
Paired Tests
When comparing two models on the same test set, use a paired test to account for per-example correlation:
Paired t-test: Tests whether the mean difference in per-example loss is significantly different from zero. Assumes normally distributed differences.
Wilcoxon signed-rank test: Non-parametric alternative. Tests whether the median difference is zero. More robust to non-normal distributions.
McNemar’s test: For comparing two classifiers on the same test set. Tests whether the discordant examples (those one model gets right and the other gets wrong) split asymmetrically between the two models.
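In practice `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` are the usual tools; a dependency-free sketch of the paired t-statistic on hypothetical per-example losses:

```python
# Sketch: paired t-statistic on per-example loss differences.
import math

def paired_t(losses_a, losses_b):
    d = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1  # t-statistic and degrees of freedom

# Toy per-example losses for two hypothetical models on the same test set.
losses_a = [0.50, 0.42, 0.61, 0.55, 0.48, 0.53, 0.44, 0.58]
losses_b = [0.45, 0.40, 0.55, 0.50, 0.47, 0.49, 0.41, 0.52]
t, df = paired_t(losses_a, losses_b)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

Model B is uniformly slightly better; the pairing makes that small but consistent difference highly significant, where an unpaired test would drown it in between-example variance.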
Calibration
A model is calibrated if its predicted probabilities match empirical frequencies: $P(Y = 1 \mid \hat{p}(X) = p) = p$ for all $p \in [0, 1]$.
Reliability diagram. Bin predictions by predicted probability, plot predicted vs observed frequency in each bin. A perfectly calibrated model follows the diagonal.
Expected Calibration Error (ECE):

$$\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|$$

where $\text{acc}(b)$ is the accuracy in bin $b$, $\text{conf}(b)$ is the mean predicted confidence in bin $b$, and $n_b$ is the number of examples in bin $b$.
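A pure-Python sketch of binned ECE, treating the predicted positive-class probability as the confidence (toy inputs):

```python
# Sketch: Expected Calibration Error with equal-width probability bins.
def ece(probs, labels, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total, n = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted confidence
        acc = sum(y for _, y in b) / len(b)    # empirical frequency
        total += len(b) / n * abs(acc - conf)  # weighted bin gap
    return total

probs = [0.95, 0.85, 0.65, 0.35, 0.25, 0.15]
labels = [1, 1, 0, 0, 0, 1]
print(f"ECE = {ece(probs, labels):.3f}")
```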
Calibration methods:
- Platt scaling: Fit a logistic regression on the logits to produce calibrated probabilities. Post-hoc, requires a held-out calibration set.
- Temperature scaling: Divide logits by a learned temperature before softmax. A single scalar parameter, less prone to overfitting.
- Isotonic regression: Non-parametric. Fit a monotone function from predicted to calibrated probabilities.
Conformal Prediction
Conformal prediction provides finite-sample, distribution-free prediction sets with guaranteed coverage:

$$P\left(Y_{n+1} \in \hat{C}(X_{n+1})\right) \geq 1 - \alpha$$

This holds regardless of the true distribution $P_{XY}$, requiring only exchangeability (weaker than i.i.d.).
Split Conformal Prediction
- Split data into training, calibration, and test sets
- Train model on training set
- Compute conformity scores on the calibration set: $s_i = |y_i - \hat{f}(x_i)|$ (for regression)
- Find the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the calibration scores: $\hat{q}$
- For new inputs: $\hat{C}(x) = [\hat{f}(x) - \hat{q},\; \hat{f}(x) + \hat{q}]$
Coverage guarantee. If calibration and test data are exchangeable, the prediction set contains the true value with probability at least . No assumptions on the model quality or data distribution.
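The steps above can be sketched around any point predictor; the predictor here is a hypothetical stand-in ($y \approx 2x$ plus noise), with absolute residuals as conformity scores:

```python
# Sketch: split conformal prediction intervals for regression.
import math
import random

def conformal_interval(predict, x_cal, y_cal, x_new, alpha=0.2):
    scores = sorted(abs(y - predict(x)) for x, y in zip(x_cal, y_cal))
    n = len(scores)
    # ceil((n+1)(1-alpha))/n quantile of the calibration scores
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    q_hat = scores[rank - 1]
    pred = predict(x_new)
    return pred - q_hat, pred + q_hat

rng = random.Random(0)
x_cal = [rng.uniform(0, 10) for _ in range(100)]
y_cal = [2 * x + rng.gauss(0, 1) for x in x_cal]
lo, hi = conformal_interval(lambda x: 2 * x, x_cal, y_cal, x_new=5.0)
print(f"80% prediction interval at x=5: [{lo:.2f}, {hi:.2f}]")
```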
Adaptive intervals. The basic method produces constant-width intervals. Conformalized quantile regression (CQR) uses quantile regression to produce intervals that are wider where the model is uncertain and narrower where it is confident:

$$\hat{C}(x) = \left[\hat{q}_{\alpha/2}(x) - \hat{E},\; \hat{q}_{1-\alpha/2}(x) + \hat{E}\right]$$

where $\hat{q}_{\alpha/2}$ and $\hat{q}_{1-\alpha/2}$ are the fitted lower and upper quantile regressors and $\hat{E}$ is the conformal correction computed on the calibration set.
The tracker cost model’s quantile regression achieves 85.6% coverage at the 80% nominal level, slightly conservative, which is the desirable direction for user-facing estimates.
A/B Testing
Design
An A/B test randomly assigns users to control (existing system) and treatment (new model) groups, measures a metric, and tests whether the difference is statistically significant.
Sample size calculation. Required sample size per group for detecting effect size $\delta$ with significance level $\alpha$ and power $1 - \beta$:

$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\, \sigma^2}{\delta^2}$$

For a 1% absolute lift ($\delta = 0.01$) in a metric with standard deviation $\sigma = 0.5$ at $\alpha = 0.05$, $1 - \beta = 0.8$: $n \approx 39{,}200$ per group.
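The arithmetic can be checked with stdlib normal quantiles:

```python
# Sketch: two-sample t-test sample size per group, using the normal
# approximation formula above.
from math import ceil
from statistics import NormalDist

def sample_size(delta, sigma, alpha=0.05, power=0.8):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_b = NormalDist().inv_cdf(power)          # ~0.84
    return ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

print(sample_size(delta=0.01, sigma=0.5))  # ~39,200 per group
```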
Randomization unit. Randomize at the user level (not page view or session) to avoid within-user contamination. Use consistent hashing (hash of user ID) for deterministic assignment.
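Deterministic hash-based assignment can be sketched as follows (the experiment salt is a hypothetical name; salting lets successive experiments re-randomize the same users):

```python
# Sketch: consistent, user-level treatment assignment via a salted hash.
import hashlib

def assign(user_id, experiment_salt="cost-model-v2", treatment_fraction=0.5):
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1]
    return "treatment" if bucket < treatment_fraction else "control"

print(assign("user-123"))  # same arm every time for the same user and salt
```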
Analysis
Two-sample t-test for the difference in means between control and treatment. Report the p-value, confidence interval for the effect, and the observed lift.
Multiple testing correction. When testing multiple metrics simultaneously, apply Bonferroni (test each at $\alpha / m$ for $m$ metrics) or Benjamini-Hochberg (FDR control). Without correction, testing 20 metrics at $\alpha = 0.05$ makes at least one spurious significant result likely ($\approx 64\%$ probability under the null).
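A sketch of the Benjamini-Hochberg step-up procedure on toy p-values:

```python
# Sketch: Benjamini-Hochberg FDR control. Returns the indices of the
# rejected hypotheses.
def benjamini_hochberg(p_values, fdr=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * fdr ...
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            k = rank
    # ... and reject all hypotheses with rank <= k (step-up).
    return {order[r] for r in range(k)}

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals))
```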
Guardrail metrics. Define metrics that must not degrade (latency, error rate, revenue). Even if the primary metric improves, reject the treatment if guardrail metrics significantly worsen.
Pitfalls
- Peeking. Checking results before the predetermined sample size inflates the false positive rate. Use sequential testing (group sequential boundaries or always-valid p-values) if early stopping is needed.
- Simpson’s paradox. An effect can reverse when data is aggregated across subgroups. Segment analysis by key dimensions (platform, country, user cohort).
- Novelty/primacy effects. Users may behave differently simply because the experience changed. Wait for stabilization before drawing conclusions.
Offline-Online Correlation
A model that improves offline metrics does not necessarily improve online metrics. Gaps arise from:
- Proxy mismatch. The offline metric (e.g., AUC) may not correlate with the business metric (e.g., revenue).
- Distribution shift. Training/test data may not match production traffic patterns.
- Feedback loops. The model’s predictions change user behavior, which changes the data distribution.
- Serving effects. Latency, caching, and error handling differ between offline evaluation and production.
Best practice. Establish offline-online correlation empirically: run multiple A/B tests and measure the relationship between offline metric improvements and online metric improvements. This calibration is model-family-specific and must be re-established when the modeling approach changes.
Summary
| Component | Key Concept |
|---|---|
| Metrics | Choose based on downstream use case; ranking vs absolute vs aggregate |
| Cross-validation | K-fold for i.i.d., group-fold for correlated data, time-series for temporal |
| Statistical testing | Bootstrap CIs, paired tests; always report confidence intervals |
| Calibration | Predicted probabilities should match empirical frequencies |
| Conformal prediction | Distribution-free coverage guarantees for prediction sets |
| A/B testing | Randomized control, pre-determined sample size, multiple testing correction |
| Offline-online gap | Establish empirical correlation; offline improvements don’t guarantee online gains |
Evaluation methodology determines whether we ship useful models or overfit artifacts. The tracker cost model paper exemplifies several of these practices: bootstrap confidence intervals for significance testing, temporal holdouts for generalization assessment, and aggregation accuracy as the metric that reflects the user-facing product.