Evaluation and Error Analysis
The headline metric was 0.825. The honest metric was 0.734. Knowing the difference is the entire evaluation story.
The Inflated Metric
Version 1 of the main_thread_cost model trained on all 4,592 domains, including 2,407 with main_thread_cost = 0.0. Those zeros aren’t real measurements; they’re the absence of Lighthouse data. The model learned that many domains have zero CPU cost, and it predicted low for them. On the full test set, that produced a Spearman rho of 0.825.
But 340 of those test domains had fake 0.0 labels. The model was being rewarded for “correctly” predicting low scores on domains where the ground truth was made up. Strip those out and evaluate only on domains with real Lighthouse CPU data, and the honest metric was 0.734.
Version 2 fixed this by training main_thread_cost only on the 2,185 domains with real Lighthouse data. The honest metric improved to 0.751. The headline number went down (no more easy points from fake zeros), but the model got meaningfully better at the thing that actually matters: ranking domains by their real CPU cost.
| Version | Training data | Full test rho | Honest test rho (real CPU data only) |
|---|---|---|---|
| v1 | All 4,592 domains (including fake 0s) | 0.825 | 0.734 |
| v2 | 2,185 domains (real CPU data only) | — | 0.751 |
The lesson: always compute your metric on the subset where the labels are real. A headline number that includes fake labels will mislead you.
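The honest-subset evaluation can be sketched in a few lines. The column names here (`has_lighthouse_cpu`, `pred`) are illustrative placeholders, not the project's actual schema:

```python
import pandas as pd
from scipy.stats import spearmanr

def honest_spearman(df: pd.DataFrame, pred_col: str = "pred",
                    label_col: str = "main_thread_cost",
                    has_data_col: str = "has_lighthouse_cpu") -> float:
    """Spearman rho computed only on rows whose labels are real measurements.

    Rows where the label is 0.0 merely because Lighthouse never measured the
    domain are excluded, so the model earns no credit for "predicting" fake zeros.
    """
    real = df[df[has_data_col]]
    rho, _ = spearmanr(real[label_col], real[pred_col])
    return rho
```

Computing both numbers side by side on every evaluation run makes the inflation visible immediately instead of hiding it in the headline.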
Baselines
Every metric needs context. I compared XGBoost against three baselines, all using the same no-Lighthouse feature set (the features available at inference time). All metrics computed on the honest test set: only domains with real CPU data.
| Model | main_thread_cost rho | network_cost rho |
|---|---|---|
| Mean prediction | 0.0 | 0.0 |
| Transfer size only (p50_transfer_bytes percentile rank) | 0.264 | 0.911 |
| Ridge regression | 0.713 | 0.992 |
| XGBoost | 0.751 | 0.999 |
A few things stand out.
Transfer size alone is a terrible CPU predictor. A rho of 0.264 means transfer size ranks domains by CPU cost only weakly better than chance. A tiny 258-byte script (bat.bing.com) can trigger 105ms of scripting time; a 50KB image costs zero CPU. Transfer size tells you almost nothing about execution cost.
Transfer size alone is an excellent network predictor. Rho of 0.911 for a single feature. Network cost is essentially a function of how many bytes you move.
Ridge regression is a strong baseline for CPU cost. Rho of 0.713 is respectable. XGBoost beats it by ~0.04, which sounds small but represents meaningfully better ranking at the top of the distribution (the expensive trackers that matter most for the UI). The gap comes from feature interactions that ridge can’t capture: “large scripts from high-script-ratio domains” is a different signal than either feature alone.
XGBoost’s advantage is real but modest. On network cost, the difference between ridge (0.992) and XGBoost (0.999) is negligible. On CPU cost, the 0.038-point gap is meaningful. If ridge had matched XGBoost, I’d ship the simpler model. It didn’t.
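The transfer-size baseline is worth spelling out. Because Spearman is rank-based, any strictly monotone transform of the feature (including a percentile rank) yields exactly the same rho, so this baseline is literally "order domains by bytes moved." A sketch, with the transfer-size values assumed to be a plain array:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def percentile_rank(x: np.ndarray) -> np.ndarray:
    """Map raw values to their percentile rank in [0, 1]."""
    return (rankdata(x) - 1) / (len(x) - 1)
```

This is also why the mean-prediction baseline lands at rho = 0: a constant prediction carries no ranking information at all.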
Error Analysis
The aggregate metrics tell you the model works. The error analysis tells you where and why it fails.
Under-predictions
The most instructive error is platform.loyaltylion.com:
| | Actual | Predicted | Error |
|---|---|---|---|
| main_thread_cost | 0.931 | 0.675 | +0.256 |
This domain has a 2-byte median transfer size and a 0% script request ratio. It looks like a tracking pixel from every feature the model can see. But Lighthouse measured 1,241ms of scripting time. How?
The domain is part of a redirect chain. The initial request is tiny, but it triggers cross-domain script loading that Lighthouse attributes to the originating domain. The model sees the request-level features (tiny, no scripts) and predicts low cost. The actual CPU cost is hidden behind a redirect that the request metadata doesn’t capture.
This is a fundamental feature limitation, not a model bug. No amount of hyperparameter tuning will fix it. The information the model needs (what happens after the redirect) isn’t in the feature set. Fixing this requires new features: redirect chain depth, target domain script behavior, or cross-domain attribution data.
Under-predictions like this are the real failures. The model says “this tracker is cheap” when it’s actually expensive. If Firefox shows users “we prevented 50ms of CPU cost” when the true cost was 300ms, that’s a misleading undercount.
Over-predictions
The largest over-predictions are on domains without CPU data: domains where the label is 0.0 because Lighthouse didn't measure them, not because they're truly zero-cost.
Consider servers3.adriver.ru: the model predicts a moderate main_thread_cost. The label is 0.0. That looks like an error. But this domain serves 80% scripts with 16KB median transfer size. It looks like a real ad script. The model might be right and the label might be wrong.
Over-predictions on no-CPU-data domains are predictions scored against bad labels. They inflate the error metrics without representing actual model failures. This is another reason the honest metric (evaluated only on real-data domains) is more trustworthy than the full-set metric.
Error asymmetry
This asymmetry is the most interesting finding from the error analysis:
| Error type | What it means | Is it a real failure? |
|---|---|---|
| Under-prediction | Model says “cheap,” reality says “expensive” | Yes. Feature limitation or model weakness. |
| Over-prediction on real-data domain | Model says “expensive,” reality says “less expensive” | Yes. Model overestimates. |
| Over-prediction on no-data domain | Model says “expensive,” label says 0.0 | Probably not. Label is fake. Model may be right. |
Under-predictions are strictly worse than over-predictions for the downstream use case. If Firefox tells users it blocked an expensive tracker and the tracker was actually cheap, that’s a mild overstatement. If Firefox tells users it blocked a cheap tracker and it was actually expensive, that’s an undercount of the protection Firefox provided. The error asymmetry maps to an asymmetric cost function in the product.
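One way to encode that asymmetry in training (purely illustrative; the model described here was not necessarily trained this way) is a squared error that up-weights under-predictions. The `under_weight` knob is a hypothetical parameter; a loss like this could also be wired into XGBoost as a custom objective via its gradient/hessian interface:

```python
import numpy as np

def asymmetric_squared_error(y_true: np.ndarray, y_pred: np.ndarray,
                             under_weight: float = 3.0) -> float:
    """Squared error that penalizes under-predictions (pred < true) more heavily.

    under_weight > 1 encodes the product asymmetry: calling an expensive
    tracker cheap is worse than calling a cheap tracker expensive.
    """
    resid = y_true - y_pred               # positive residual = under-prediction
    w = np.where(resid > 0, under_weight, 1.0)
    return float(np.mean(w * resid ** 2))
```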
Absolute Error
Absolute error on the honest test set (domains with real Lighthouse CPU data):
| Metric | main_thread_cost | network_cost |
|---|---|---|
| MAE | 0.062 | 0.007 |
| RMSE | 0.182 | 0.010 |
For CPU cost, the model is off by about 6 percentage points on average. A domain with true main_thread_cost = 0.70 would typically be predicted somewhere in the range [0.64, 0.76]. That’s accurate enough for the downstream use: correctly bucketing domains into “high,” “moderate,” and “low” CPU cost tiers, which is what the privacy metrics card needs.
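The tier bucketing could look like the following sketch. The 0.33/0.66 cut points are hypothetical placeholders, not the product's actual thresholds:

```python
def cost_tier(pred: float, high: float = 0.66, moderate: float = 0.33) -> str:
    """Bucket a predicted cost in [0, 1] into the tiers a metrics card would show."""
    if pred >= high:
        return "high"
    if pred >= moderate:
        return "moderate"
    return "low"
```

With a typical error of ~0.06, most domains land in the right tier; only predictions near a boundary are at risk of flipping.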
The gap between MAE (0.062) and RMSE (0.182) reveals that most predictions are close, but a few are far off. RMSE penalizes large errors quadratically, so it’s dominated by the worst cases (like platform.loyaltylion.com). The typical prediction is better than the worst-case number suggests.
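The MAE/RMSE gap is easy to reproduce on toy numbers (illustrative errors, not the model's actual residuals): nine small errors plus one loyaltylion-sized outlier already pushes RMSE to more than double MAE.

```python
import numpy as np

def mae(errors: np.ndarray) -> float:
    """Mean absolute error: the typical miss."""
    return float(np.mean(np.abs(errors)))

def rmse(errors: np.ndarray) -> float:
    """Root mean squared error: dominated by the worst misses."""
    return float(np.sqrt(np.mean(errors ** 2)))

# Nine routine errors of 0.02 and one 0.5 outlier.
errors = np.array([0.02] * 9 + [0.5])
```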
For network cost, both MAE and RMSE are negligible. The model essentially reproduces the target perfectly.