Model Evaluation and Experiment Design
Rigorous evaluation separates useful models from overfit artifacts. This article covers metrics, splitting strategies, statistical testing, calibration, conformal prediction, and experiment design for both offline evaluation and online A/B testing.
Classification Metrics
Beyond Accuracy
Accuracy is misleading when classes are imbalanced. A classifier that always predicts the majority class achieves high accuracy on a 99/1 split while being useless.
Confusion matrix decomposes predictions into four categories:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Precision: $\text{Precision} = \frac{TP}{TP + FP}$ — of predictions labeled positive, what fraction are correct?
Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$ — of actual positives, what fraction are detected?
F1 Score: $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ — harmonic mean of precision and recall.
The precision-recall tradeoff. Adjusting the classification threshold trades precision for recall. A lower threshold catches more positives (higher recall) but includes more false positives (lower precision). The optimal threshold depends on the relative cost of false positives vs false negatives.
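The tradeoff can be seen directly by sweeping a threshold over toy scores (illustrative data, not from any real classifier):

```python
# Sketch: precision and recall at two thresholds on toy scores.
def precision_recall(y_true, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

# Lowering the threshold raises recall but admits more false positives.
for t in (0.65, 0.35):
    p, r = precision_recall(y_true, scores, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```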
ROC and PR Curves
ROC curve plots True Positive Rate vs False Positive Rate across all thresholds. AUC-ROC summarizes discrimination ability: 0.5 = random, 1.0 = perfect. AUC-ROC is threshold-independent and prevalence-independent.
Precision-Recall curve plots Precision vs Recall. AUC-PR is more informative than AUC-ROC for imbalanced datasets because it does not credit true negatives. When the positive class is rare (fraud detection, disease screening), a model can achieve high AUC-ROC by correctly classifying the abundant negatives while still performing poorly on the positive class.
Regression Metrics
| Metric | Formula | Properties |
|---|---|---|
| MAE | $\frac{1}{N}\sum_i \lvert y_i - \hat{y}_i \rvert$ | Robust to outliers, same units as target |
| MSE | $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ | Penalizes large errors, differentiable |
| RMSE | $\sqrt{\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2}$ | Same units as target |
| MAPE | $\frac{100}{N}\sum_i \lvert \frac{y_i - \hat{y}_i}{y_i} \rvert$ | Scale-free; undefined when $y_i = 0$ |
| $R^2$ | $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | Fraction of variance explained |
| Spearman $\rho$ | Rank correlation of $y$ and $\hat{y}$ | Measures ranking quality, not absolute accuracy |
Choosing metrics. The metric should reflect the downstream use case. For the tracker cost model, MAE measures per-request accuracy while weekly aggregation error measures the user-facing metric. These can diverge: a model with higher MAE but unbiased symmetric errors can produce better aggregates than a lower-MAE model with systematic bias.
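The divergence is easy to demonstrate with constructed numbers (toy values, not the paper's data): a symmetric-error model with twice the MAE still aggregates perfectly, while a biased model carries its bias straight into the total.

```python
# Sketch: higher-MAE-but-unbiased vs lower-MAE-but-biased, on toy costs.
actual = [100.0] * 10

unbiased = [90.0, 110.0] * 5   # symmetric errors: MAE = 10, aggregate error = 0
biased = [105.0] * 10          # systematic +5 bias: MAE = 5, aggregate error = 5%

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def aggregate_error(y, yhat):
    # Relative error of the summed prediction -- the "weekly total" view.
    return abs(sum(yhat) - sum(y)) / sum(y)

print(mae(actual, unbiased), aggregate_error(actual, unbiased))  # 10.0 0.0
print(mae(actual, biased), aggregate_error(actual, biased))      # 5.0 0.05
```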
Cross-Validation
K-Fold Cross-Validation
Partition data into $K$ folds. For each fold $k$:
- Train on all folds except $k$
- Evaluate on fold $k$
- Report mean and standard deviation across folds
Bias-variance of CV. Leave-one-out ($K = N$) has low bias but high variance (any two training sets overlap in $N - 2$ examples, so the fold estimates are highly correlated). $K = 5$ or $K = 10$ provides a good tradeoff.
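The fold construction can be sketched in a few lines (in practice sklearn's `KFold` does this; here is a dependency-free version):

```python
# Sketch: generate shuffled K-fold train/test index splits.
import random

def kfold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin into k folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train, test in kfold_indices(10, 5):
    print(len(train), len(test))  # each example is held out exactly once
```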
Stratified K-Fold
Ensures each fold preserves the class distribution of the full dataset. Essential for imbalanced classification.
Group K-Fold
Ensures all examples from the same group (user, session, domain) are in the same fold. Prevents data leakage when examples within a group are correlated. The tracker cost model uses row-level splits because domain identity is a known feature at inference, but domain-level splits would be appropriate if domain generalization were the goal.
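A group-aware split can be sketched with greedy size balancing (sklearn's `GroupKFold` is the production tool; the domain names below are hypothetical):

```python
# Sketch: assign whole groups to folds so correlated rows never straddle
# the train/test boundary.
from collections import defaultdict

def group_kfold(groups, k):
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    folds = [[] for _ in range(k)]
    # Largest groups first, each into the currently smallest fold.
    for _, rows in sorted(by_group.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(rows)
    return folds

groups = ["a.com", "a.com", "b.com", "c.com", "c.com", "c.com", "d.com"]
folds = group_kfold(groups, 2)
```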
Time Series Cross-Validation
For temporal data, only train on past data to predict future data. Expanding window: fold $k$ trains on $[t_0, t_k]$ and evaluates on $(t_k, t_{k+1}]$. This simulates the actual deployment scenario where the model is always predicting on data from the future relative to its training.
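An expanding-window splitter over time-ordered indices (0 = oldest) can be sketched as:

```python
# Sketch: expanding-window time-series splits; every training index
# precedes every evaluation index.
def expanding_window_splits(n, n_folds):
    chunk = n // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * chunk))
        test = list(range(k * chunk, min((k + 1) * chunk, n)))
        yield train, test

for train, test in expanding_window_splits(10, 4):
    print(f"train [0, {max(train)}] -> eval [{min(test)}, {max(test)}]")
```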
Statistical Significance Testing
Bootstrap Confidence Intervals
Resample the test set with replacement $B$ times, compute the metric on each resample, and take the 2.5th and 97.5th percentiles:

$$\text{CI}_{95\%} = \left[\hat{\theta}^*_{(0.025)},\; \hat{\theta}^*_{(0.975)}\right]$$

where $\hat{\theta}^*_{(q)}$ is the $q$-quantile of the bootstrap statistics.
This is the approach used in the tracker cost model: 1,000 bootstrap resamples produce a 95% CI of [3,314, 3,627] for the XGBoost Tweedie MAE, which does not overlap with the path LUT’s CI of [3,623, 3,984], confirming statistical significance.
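A minimal percentile-bootstrap sketch (toy absolute errors, not the paper's data):

```python
# Sketch: percentile bootstrap for a 95% CI on a metric.
import random

def bootstrap_ci(values, metric, n_boot=1000, seed=0):
    rng = random.Random(seed)
    stats = sorted(
        metric([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    # 2.5th and 97.5th percentiles of the bootstrap distribution
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(1)
abs_errors = [abs(rng.gauss(0, 1)) for _ in range(200)]
lo, hi = bootstrap_ci(abs_errors, mean)
print(f"95% CI for mean absolute error: [{lo:.3f}, {hi:.3f}]")
```

To compare two models, the same resample indices should be applied to both error vectors so the CIs are computed on paired data.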
Paired Tests
When comparing two models on the same test set, use a paired test to account for per-example correlation:
Paired t-test: Tests whether the mean difference in per-example loss is significantly different from zero. Assumes normally distributed differences.
Wilcoxon signed-rank test: Non-parametric alternative. Tests whether the median difference is zero. More robust to non-normal distributions.
McNemar’s test: For comparing two classifiers on the same test set. Tests whether the discordant examples (those one model gets right and the other gets wrong) split asymmetrically between the two models.
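In practice `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` are the usual tools; a dependency-free sketch of the paired t-statistic on hypothetical per-example losses:

```python
# Sketch: paired t-statistic on per-example loss differences.
import math

def paired_t(losses_a, losses_b):
    d = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1  # t-statistic and degrees of freedom

# Toy per-example losses for two hypothetical models on the same test set.
losses_a = [0.50, 0.42, 0.61, 0.55, 0.48, 0.53, 0.44, 0.58]
losses_b = [0.45, 0.40, 0.55, 0.50, 0.47, 0.49, 0.41, 0.52]
t, df = paired_t(losses_a, losses_b)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

Model B is uniformly slightly better; the pairing makes that small but consistent difference highly significant, where an unpaired test would drown it in between-example variance.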
Calibration
A model is calibrated if its predicted probabilities match empirical frequencies: $P(Y = 1 \mid \hat{p}(X) = p) = p$ for all $p \in [0, 1]$.
Reliability diagram. Bin predictions by predicted probability, plot predicted vs observed frequency in each bin. A perfectly calibrated model follows the diagonal.
Expected Calibration Error (ECE):

$$\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|$$

where $\text{acc}(b)$ is the accuracy in bin $b$, $\text{conf}(b)$ is the mean predicted confidence in bin $b$, and $n_b$ is the number of examples in bin $b$.
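A pure-Python sketch of binned ECE, treating the predicted positive-class probability as the confidence (toy inputs):

```python
# Sketch: Expected Calibration Error with equal-width probability bins.
def ece(probs, labels, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total, n = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted confidence
        acc = sum(y for _, y in b) / len(b)    # empirical frequency
        total += len(b) / n * abs(acc - conf)  # weighted bin gap
    return total

probs = [0.95, 0.85, 0.65, 0.35, 0.25, 0.15]
labels = [1, 1, 0, 0, 0, 1]
print(f"ECE = {ece(probs, labels):.3f}")
```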
Calibration methods:
- Platt scaling: Fit a logistic regression on the logits to produce calibrated probabilities. Post-hoc, requires a held-out calibration set.
- Temperature scaling: Divide logits by a learned temperature before softmax. A single scalar parameter, less prone to overfitting.
- Isotonic regression: Non-parametric. Fit a monotone function from predicted to calibrated probabilities.
Conformal Prediction
Conformal prediction provides finite-sample, distribution-free prediction sets with guaranteed coverage:

$$P\left(Y_{n+1} \in \hat{C}(X_{n+1})\right) \geq 1 - \alpha$$

This holds regardless of the true distribution $P_{XY}$, requiring only exchangeability (weaker than i.i.d.).
Split Conformal Prediction
- Split data into training, calibration, and test sets
- Train model on training set
- Compute conformity scores on the calibration set: $s_i = |y_i - \hat{f}(x_i)|$ (for regression)
- Find the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the calibration scores: $\hat{q}$
- For new inputs: $\hat{C}(x) = [\hat{f}(x) - \hat{q},\; \hat{f}(x) + \hat{q}]$
Coverage guarantee. If calibration and test data are exchangeable, the prediction set contains the true value with probability at least . No assumptions on the model quality or data distribution.
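The steps above can be sketched around any point predictor; the predictor here is a hypothetical stand-in ($y \approx 2x$ plus noise), with absolute residuals as conformity scores:

```python
# Sketch: split conformal prediction intervals for regression.
import math
import random

def conformal_interval(predict, x_cal, y_cal, x_new, alpha=0.2):
    scores = sorted(abs(y - predict(x)) for x, y in zip(x_cal, y_cal))
    n = len(scores)
    # ceil((n+1)(1-alpha))/n quantile of the calibration scores
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    q_hat = scores[rank - 1]
    pred = predict(x_new)
    return pred - q_hat, pred + q_hat

rng = random.Random(0)
x_cal = [rng.uniform(0, 10) for _ in range(100)]
y_cal = [2 * x + rng.gauss(0, 1) for x in x_cal]
lo, hi = conformal_interval(lambda x: 2 * x, x_cal, y_cal, x_new=5.0)
print(f"80% prediction interval at x=5: [{lo:.2f}, {hi:.2f}]")
```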
Adaptive intervals. The basic method produces constant-width intervals. Conformalized quantile regression (CQR) uses quantile regression to produce intervals that are wider where the model is uncertain and narrower where it is confident:

$$\hat{C}(x) = \left[\hat{q}_{\alpha/2}(x) - \hat{E},\; \hat{q}_{1-\alpha/2}(x) + \hat{E}\right]$$

where $\hat{q}_{\alpha/2}$ and $\hat{q}_{1-\alpha/2}$ are the fitted lower and upper quantile regressors and $\hat{E}$ is the conformal correction computed on the calibration set.
The tracker cost model’s quantile regression achieves 85.6% coverage at the 80% nominal level, slightly conservative, which is the desirable direction for user-facing estimates.
A/B Testing
Design
An A/B test randomly assigns users to control (existing system) and treatment (new model) groups, measures a metric, and tests whether the difference is statistically significant.
Sample size calculation. Required sample size per group for detecting effect size $\delta$ with significance level $\alpha$ and power $1 - \beta$:

$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\, \sigma^2}{\delta^2}$$

For a 1% absolute lift ($\delta = 0.01$) in a metric with standard deviation $\sigma = 0.5$ at $\alpha = 0.05$, $1 - \beta = 0.8$: $n \approx 39{,}200$ per group.
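The arithmetic can be checked with stdlib normal quantiles:

```python
# Sketch: two-sample t-test sample size per group, using the normal
# approximation formula above.
from math import ceil
from statistics import NormalDist

def sample_size(delta, sigma, alpha=0.05, power=0.8):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_b = NormalDist().inv_cdf(power)          # ~0.84
    return ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

print(sample_size(delta=0.01, sigma=0.5))  # ~39,200 per group
```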
Randomization unit. Randomize at the user level (not page view or session) to avoid within-user contamination. Use consistent hashing (hash of user ID) for deterministic assignment.
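Deterministic hash-based assignment can be sketched as follows (the experiment salt is a hypothetical name; salting lets successive experiments re-randomize the same users):

```python
# Sketch: consistent, user-level treatment assignment via a salted hash.
import hashlib

def assign(user_id, experiment_salt="cost-model-v2", treatment_fraction=0.5):
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1]
    return "treatment" if bucket < treatment_fraction else "control"

print(assign("user-123"))  # same arm every time for the same user and salt
```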
Analysis
Two-sample t-test for the difference in means between control and treatment. Report the p-value, confidence interval for the effect, and the observed lift.
Multiple testing correction. When testing multiple metrics simultaneously, apply Bonferroni (test each at $\alpha / m$ for $m$ metrics) or Benjamini-Hochberg (FDR control). Without correction, testing 20 metrics at $\alpha = 0.05$ makes at least one spurious significant result likely ($\approx 64\%$ probability under the null).
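A sketch of the Benjamini-Hochberg step-up procedure on toy p-values:

```python
# Sketch: Benjamini-Hochberg FDR control. Returns the indices of the
# rejected hypotheses.
def benjamini_hochberg(p_values, fdr=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * fdr ...
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            k = rank
    # ... and reject all hypotheses with rank <= k (step-up).
    return {order[r] for r in range(k)}

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals))
```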
Guardrail metrics. Define metrics that must not degrade (latency, error rate, revenue). Even if the primary metric improves, reject the treatment if guardrail metrics significantly worsen.
Pitfalls
- Peeking. Checking results before the predetermined sample size inflates the false positive rate. Use sequential testing (group sequential boundaries or always-valid p-values) if early stopping is needed.
- Simpson’s paradox. An effect can reverse when data is aggregated across subgroups. Segment analysis by key dimensions (platform, country, user cohort).
- Novelty/primacy effects. Users may behave differently simply because the experience changed. Wait for stabilization before drawing conclusions.
Offline-Online Correlation
A model that improves offline metrics does not necessarily improve online metrics. Gaps arise from:
- Proxy mismatch. The offline metric (e.g., AUC) may not correlate with the business metric (e.g., revenue).
- Distribution shift. Training/test data may not match production traffic patterns.
- Feedback loops. The model’s predictions change user behavior, which changes the data distribution.
- Serving effects. Latency, caching, and error handling differ between offline evaluation and production.
Best practice. Establish offline-online correlation empirically: run multiple A/B tests and measure the relationship between offline metric improvements and online metric improvements. This calibration is model-family-specific and must be re-established when the modeling approach changes.
Summary
| Component | Key Concept |
|---|---|
| Metrics | Choose based on downstream use case; ranking vs absolute vs aggregate |
| Cross-validation | K-fold for i.i.d., group-fold for correlated data, time-series for temporal |
| Statistical testing | Bootstrap CIs, paired tests; always report confidence intervals |
| Calibration | Predicted probabilities should match empirical frequencies |
| Conformal prediction | Distribution-free coverage guarantees for prediction sets |
| A/B testing | Randomized control, pre-determined sample size, multiple testing correction |
| Offline-online gap | Establish empirical correlation; offline improvements don’t guarantee online gains |
Evaluation methodology determines whether we ship useful models or overfit artifacts. The tracker cost model paper exemplifies several of these practices: bootstrap confidence intervals for significance testing, temporal holdouts for generalization assessment, and aggregation accuracy as the metric that reflects the user-facing product.