
Evaluation and Error Analysis

The headline metric was 0.825. The honest metric was 0.734. Knowing the difference is the entire evaluation story.


The Inflated Metric

Version 1 of the main_thread_cost model trained on all 4,592 domains, including 2,407 with main_thread_cost = 0.0. Those zeros aren’t real measurements; they’re the absence of Lighthouse data. The model learned that many domains have zero CPU cost, and it predicted low for them. On the full test set, that produced a Spearman rho of 0.825.

But 340 of those test domains had fake 0.0 labels. The model was being rewarded for “correctly” predicting low scores on domains where the ground truth was made up. Strip those out and evaluate only on domains with real Lighthouse CPU data, and the honest metric was 0.734.

Version 2 fixed this by training main_thread_cost only on the 2,185 domains with real Lighthouse data. The honest metric improved to 0.751. The headline number went down (no more easy points from fake zeros), but the model got meaningfully better at the thing that actually matters: ranking domains by their real CPU cost.

| Version | Training data | Full test rho | Honest test rho (real CPU data only) |
|---|---|---|---|
| v1 | All 4,592 domains (including fake 0s) | 0.825 | 0.734 |
| v2 | 2,185 domains (real CPU data only) | | 0.751 |

The lesson: always compute your metric on the subset where the labels are real. A headline number that includes fake labels will mislead you.
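The fix is mechanical: carry a mask recording which labels are real measurements and score against the masked subset. A minimal sketch with made-up numbers (the arrays and the `has_cpu_data` mask are illustrative, not the real 4,592-domain dataset):

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative data: y_true mixes real labels with fake 0.0s, and
# has_cpu_data marks which domains have actual Lighthouse measurements.
y_true = np.array([0.0, 0.0, 0.1, 0.4, 0.7, 0.9])
has_cpu_data = np.array([False, False, True, True, True, True])
y_pred = np.array([0.05, 0.02, 0.3, 0.2, 0.8, 0.6])

# Headline metric: every domain, fake zeros included. The model gets
# easy credit for predicting low on the two fake-label rows.
rho_full, _ = spearmanr(y_true, y_pred)

# Honest metric: only domains where the label is a real measurement.
rho_honest, _ = spearmanr(y_true[has_cpu_data], y_pred[has_cpu_data])
```

Even on this toy data, the full-set rho comes out higher than the honest rho for exactly the reason described above: the fake-zero rows are free points.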

Baselines

Every metric needs context. I compared XGBoost against three baselines, all using the same no-Lighthouse feature set (the features available at inference time). All metrics computed on the honest test set: only domains with real CPU data.

| Model | main_thread_cost rho | network_cost rho |
|---|---|---|
| Mean prediction | 0.0 | 0.0 |
| Transfer size only (p50_transfer_bytes percentile rank) | 0.264 | 0.911 |
| Ridge regression | 0.713 | 0.992 |
| XGBoost | 0.751 | 0.999 |

A few things stand out.

Transfer size alone is a terrible CPU predictor. A rho of 0.264 means transfer size ranks domains by CPU cost only slightly better than random. A tiny 258-byte script (bat.bing.com) can trigger 105ms of scripting time, while a 50KB image costs essentially zero CPU. Transfer size tells you almost nothing about execution cost.

Transfer size alone is an excellent network predictor. Rho of 0.911 for a single feature. Network cost is essentially a function of how many bytes you move.
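The transfer-size baseline in the table is nothing more than a percentile rank of median transfer bytes. A sketch with invented numbers (the byte counts and CPU scores below are illustrative, echoing the bat.bing.com pattern of tiny-but-expensive scripts):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Illustrative domains: a tiny tracking script can be CPU-expensive
# and a large image is CPU-free, so byte rank and CPU rank disagree.
transfer_bytes = np.array([258, 51_200, 4_096, 120_000, 900])
cpu_cost       = np.array([0.85, 0.05, 0.30, 0.40, 0.10])

# Single-feature baseline: rank-normalize transfer size to [0, 1] and
# use the percentile directly as the predicted cost.
baseline_pred = (rankdata(transfer_bytes) - 1) / (len(transfer_bytes) - 1)

rho, _ = spearmanr(cpu_cost, baseline_pred)
```

With numbers like these the rho is low (even negative here), which is the failure mode the real 0.264 reflects; swap `cpu_cost` for bytes-proportional network cost and the same baseline ranks almost perfectly.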

Ridge regression is a strong baseline for CPU cost. Rho of 0.713 is respectable. XGBoost beats it by ~0.04, which sounds small but represents meaningfully better ranking at the top of the distribution (the expensive trackers that matter most for the UI). The gap comes from feature interactions that ridge can’t capture: “large scripts from high-script-ratio domains” is a different signal than either feature alone.

XGBoost’s advantage is real but modest. On network cost, the difference between ridge (0.992) and XGBoost (0.999) is negligible. On CPU cost, the 0.038-point gap is meaningful. If ridge had matched XGBoost, I’d ship the simpler model. It didn’t.

Error Analysis

The aggregate metrics tell you the model works. The error analysis tells you where and why it fails.

Under-predictions

The most instructive error is platform.loyaltylion.com:

| | Actual | Predicted | Error |
|---|---|---|---|
| main_thread_cost | 0.931 | 0.675 | +0.256 |

This domain has a 2-byte median transfer size and a 0% script request ratio. From every feature the model can see, it looks like a tracking pixel. But Lighthouse measured 1,241ms of scripting time. How?

The domain is part of a redirect chain. The initial request is tiny, but it triggers cross-domain script loading that Lighthouse attributes to the originating domain. The model sees the request-level features (tiny, no scripts) and predicts low cost. The actual CPU cost is hidden behind a redirect that the request metadata doesn’t capture.

This is a fundamental feature limitation, not a model bug. No amount of hyperparameter tuning will fix it. The information the model needs (what happens after the redirect) isn’t in the feature set. Fixing this requires new features: redirect chain depth, target domain script behavior, or cross-domain attribution data.

Under-predictions like this are the real failures. The model says “this tracker is cheap” when it’s actually expensive. If Firefox shows users “we prevented 50ms of CPU cost” when the true cost was 300ms, that’s a misleading undercount.

Over-predictions

The largest over-predictions are on domains without CPU data, domains where the label is 0.0 because Lighthouse didn’t measure them, not because they’re truly zero-cost.

Consider servers3.adriver.ru: the model predicts a moderate main_thread_cost. The label is 0.0. That looks like an error. But this domain serves 80% scripts with 16KB median transfer size. It looks like a real ad script. The model might be right and the label might be wrong.

Over-predictions on no-CPU-data domains are predictions scored against bad labels. They inflate the error metrics without representing actual model failures. This is another reason the honest metric (evaluated only on real-data domains) is more trustworthy than the full-set metric.

Error asymmetry

The most interesting finding from the error analysis is an asymmetry in what the errors mean:

| Error type | What it means | Is it a real failure? |
|---|---|---|
| Under-prediction | Model says “cheap,” reality says “expensive” | Yes. Feature limitation or model weakness. |
| Over-prediction on real-data domain | Model says “expensive,” reality says “less expensive” | Yes. Model overestimates. |
| Over-prediction on no-data domain | Model says “expensive,” label says 0.0 | Probably not. Label is fake. Model may be right. |

Under-predictions are strictly worse than over-predictions for the downstream use case. If Firefox tells users it blocked an expensive tracker and the tracker was actually cheap, that’s a mild overstatement. If Firefox tells users it blocked a cheap tracker and it was actually expensive, that’s an undercount of the protection Firefox provided. The error asymmetry maps to an asymmetric cost function in the product.
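If this asymmetry ever needs to show up in a training objective or a monitoring metric, the natural tool is a pinball-style loss that weights under-predictions more heavily. A hypothetical sketch (the 3x weight is invented for illustration, not a shipped value):

```python
import numpy as np

def asymmetric_loss(y_true, y_pred, under_weight=3.0):
    """Mean absolute error with under-predictions weighted more heavily.

    err > 0 means the model under-predicted ("cheap" when it was
    actually expensive), which is the worse failure mode here.
    """
    err = y_true - y_pred
    return float(np.mean(np.where(err > 0, under_weight * err, -err)))
```

With the default weight, under-predicting by 0.3 costs three times as much as over-predicting by the same amount.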

Absolute Error

The mean absolute error on the honest test set (domains with real Lighthouse CPU data):

| Metric | main_thread_cost | network_cost |
|---|---|---|
| MAE | 0.062 | 0.007 |
| RMSE | 0.182 | 0.010 |

For CPU cost, the model is off by about 6 percentage points on average. A domain with true main_thread_cost = 0.70 would typically be predicted somewhere in the range [0.64, 0.76]. That’s accurate enough for the downstream use: correctly bucketing domains into “high,” “moderate,” and “low” CPU cost tiers, which is what the privacy metrics card needs.

The gap between MAE (0.062) and RMSE (0.182) reveals that most predictions are close, but a few are far off. RMSE penalizes large errors quadratically, so it’s dominated by the worst cases (like platform.loyaltylion.com). The typical prediction is better than the worst-case number suggests.
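The MAE/RMSE split is easy to reproduce: a handful of small residuals plus one loyaltylion-sized outlier yields a modest MAE but an RMSE several times larger. A sketch with invented residuals (not the real error distribution):

```python
import numpy as np

# Illustrative residuals: mostly small errors plus one large outlier,
# mimicking the shape of the honest-test-set errors described above.
errors = np.array([0.02, -0.03, 0.01, -0.02, 0.04, -0.01, 0.45])

mae = np.mean(np.abs(errors))           # robust to the outlier
rmse = np.sqrt(np.mean(errors ** 2))    # dominated by the 0.45 miss
```

Here the RMSE is roughly double the MAE, the same signature as the 0.182-vs-0.062 gap in the table: a few large misses, many small ones.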

For network cost, both MAE and RMSE are negligible. The model essentially reproduces the target perfectly.