Performance Cost Estimation for Tracker Domains via Multi-Target Regression and Uncertainty Quantification
| Document | Internal Research |
| Team | Privacy Engineering, Mozilla Firefox |
| Author | James Han (Privacy Engineering Intern) |
| Advisors | Manuel Bucher (Software Engineer, Privacy), Tim Huang (Sr. Staff Engineer, Privacy) |
| Period | January to March 2026 |
| arXiv | IPR |
| Status | In Progress |
Abstract
Firefox’s Enhanced Tracking Protection blocks tracker domains using the Disconnect list, but the privacy dashboard treats every blocked tracker equally. A 258-byte tracking pixel gets the same weight as a 170KB JavaScript SDK that burns 305ms of CPU. This paper develops a regression pipeline that scores each tracker domain on two independent axes: main-thread CPU cost and network bandwidth cost.
I constructed ground-truth labels from 348 million HTTP requests and 16 million Lighthouse audits in the June 2024 HTTP Archive crawl, covering 4,592 tracker domains. A planned third target, render-blocking delay, was dropped after data exploration revealed that trackers almost never block rendering (the top 20 render-blocking domains were all first-party CSS and CDNs), a consequence of ad networks optimizing for async loading to preserve their own viewability metrics.
Two independent XGBoost models predict the scores from 19 request-level features. Network cost is essentially solved (Spearman rho 0.999); the real challenge is CPU cost, where request metadata is an indirect proxy for JavaScript execution time. The CPU model achieves rho 0.751, with the single most predictive feature being the size of the largest script a domain serves. Disconnect privacy categories (Advertising, Analytics, Social) contribute almost nothing. The privacy taxonomy and the performance taxonomy are nearly orthogonal.
The pipeline includes three uncertainty quantification techniques. Quantile regression produces per-domain prediction intervals. Semi-supervised self-training, applied to 2,407 unlabeled domains, yielded a clean null result: zero improvement across five rounds, confirming the labeled set was already representative. Conformal prediction calibrates the intervals with distribution-free coverage guarantees and reveals that the quantile model is slightly overconfident on CPU predictions (1.3x correction) but overly cautious on network predictions (3x narrower intervals needed).
The final artifact is not a shipped model but a 125KB static JSON lookup table: 4,592 entries with point estimates, prediction intervals, data provenance flags, and confidence indicators. Firefox loads it into a Map and does O(1) lookups at block time, enabling the privacy dashboard to show users specific, per-axis estimates of the performance cost Firefox prevented, rather than a flat count of domains blocked.
Sections
- Problem Formulation and Target Design
- Data Collection and Ground Truth Construction
- Feature Engineering and Selection
- Model Architecture and Training
- Evaluation and Error Analysis
- Uncertainty Quantification
- Deployment and Delivery
Problem Formulation and Target Design
Firefox knows which domains are trackers. It doesn’t know which ones are expensive, or how they’re expensive.
The Problem
Firefox’s Enhanced Tracking Protection blocks trackers using the Disconnect list1, which categorizes domains by privacy function: Advertising, Analytics, Social, Fingerprinting, Cryptomining. These categories tell you what a tracker does to your privacy, but they say nothing about what it costs your page load. A 1x1 tracking pixel and a 170KB ad SDK are both “Advertising.”
That distinction matters. The privacy metrics card on about:protections shows users how many trackers Firefox blocked, broken down by category. But “47 trackers blocked this week” is just a flat number.2 It treats a tracking pixel from bat.bing.com (258 bytes) the same as a tag manager from googletagmanager.com (92-183KB of JavaScript, 305ms mean scripting time), and those are not remotely the same thing. If I could score each tracker by its likely performance cost, the card could show something far more specific: “Firefox saved you 890ms of CPU time and 1.2MB of bandwidth this week.”
The goal was to build a regression model that scores tracker domains by performance cost, enabling Firefox to surface specific, actionable information on the privacy dashboard.
The Progression: Classification to Single Score to Multi-Target
I went through three design iterations before landing on the final architecture.
First attempt: classification. Bucket trackers into Low / Medium / High cost tiers. This is the simplest approach, but where do you draw the boundaries? 45ms and 55ms of blocking time aren’t meaningfully different categories; the thresholds are arbitrary, and changing them means relabeling and retraining.
Second attempt: single regression score. One float in [0.0, 1.0] representing overall performance cost. This was better: continuous, no arbitrary buckets, flexible downstream thresholds. But it was lossy. Consider these two domains:
| Domain | CPU behavior | Network behavior | Composite score |
|---|---|---|---|
| cdn.cookielaw.org | 79ms scripting | 84th percentile transfer | 0.81 |
| connect.facebook.net | 270ms scripting | 83rd percentile transfer | 0.74 |
With a single score, these look similar. But they’re expensive in qualitatively different ways. The consent script loads in the critical path. The Facebook SDK loads async but burns CPU in the background. A single number collapses all of that.
Final design: two-target regression. Two independent [0.0, 1.0] scores per domain. This preserves how something is costly:
- The UI can say “blocked a CPU-heavy tracker” vs “blocked a bandwidth-heavy tracker,” with different messages for different costs.
- Downstream consumers can weight dimensions differently. A mobile UI might emphasize network cost; a desktop UI might emphasize CPU.
- You can always derive a single aggregate later: max(cpu, network) or a weighted sum. You can’t go the other direction.
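That asymmetry can be shown in a minimal sketch. The weights and example scores below are illustrative, not the shipped formula:

```python
# Hypothetical sketch: collapsing the two per-axis scores into one aggregate.
# Weights and example scores are illustrative, not the shipped formula.
def aggregate_score(cpu: float, network: float, w_cpu: float = 0.5) -> float:
    """Weighted-sum aggregate; max(cpu, network) is the other obvious choice."""
    return w_cpu * cpu + (1.0 - w_cpu) * network

# A GTM-like profile, heavy on both axes:
cpu, network = 0.92, 0.90
weighted = aggregate_score(cpu, network)  # ~0.91
worst = max(cpu, network)                 # 0.92
# Either aggregate is derivable from (cpu, network); neither aggregate
# lets you recover the original two-axis profile.
```

Both collapses are one-liners over the two-score representation, which is exactly why the two-target design loses nothing.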
The render_delay Post-Mortem
I actually designed three targets, not two. The original plan included render_delay to capture sync resources in <head> that block first paint. The hypothesis was that 5-10% of tracker domains would be render-blocking, and that this would be an independent axis worth modeling.
Then I actually queried the data. I pulled the render-blocking-insight audit from Lighthouse across the full 2024-06-01 HTTP Archive crawl. The top 20 render-blocking domains:
| Rank | Domain | Occurrences | What it is |
|---|---|---|---|
| 1 | fonts.googleapis.com | 3,497 | First-party font CDN |
| 2 | static.flashscore.com | 2,330 | First-party static assets |
| 3 | cdnjs.cloudflare.com | 1,206 | Public CDN |
| 4 | cdn.jsdelivr.net | 737 | Public CDN |
| … | … | … | First-party CSS and CDNs |
| 17 | cdn.cookielaw.org | 321 | Consent manager |
All first-party CSS and CDN resources. The only tracker in the top 20 was cdn.cookielaw.org, a consent manager. Not ad scripts, not analytics, not social widgets: a consent banner.
This killed the target, but the reason it died is the interesting part. Trackers almost never block rendering because ad networks and analytics vendors learned long ago that blocking rendering kills their own metrics.3 If your ad script delays first paint, the user might bounce before the ad loads, which means zero viewability and zero revenue. The incentive structure practically guarantees that tracker scripts load asynchronously.
Render-blocking is a property of how sites include resources, not of the resources themselves. A site could make Google Tag Manager render-blocking by putting it in a sync <script> tag in <head>, but almost nobody does because it would destroy their Core Web Vitals.4 The render-blocking audit catches the rare exceptions, not a systematic pattern.
Decision: Dropped render_delay. Two targets capture the meaningful variation in tracker performance cost.
The Two Targets
| Target | What it captures | Data source | User-facing meaning |
|---|---|---|---|
| main_thread_cost | CPU time: scripting, parse/compile | Lighthouse bootup-time audit | “used your CPU in the background” |
| network_cost | Bandwidth and transfer overhead | HTTP Archive request data | “used your bandwidth” |
main_thread_cost is built from Lighthouse’s bootup-time audit5, which breaks down per-script CPU time into scripting execution and parse/compile overhead. It captures the cost that directly impacts Total Blocking Time and Interaction to Next Paint, the main thread work that makes pages feel sluggish. The score is a weighted percentile rank:
main_thread_cost = 0.50 * prank(total_cpu_ms)
                 + 0.35 * prank(scripting_ms)
                 + 0.15 * prank(parse_compile_ms)

network_cost is built from HTTP Archive request-level data: median transfer size, total bytes per page from the domain, and request count. It captures the bandwidth competition with first-party resources. The score:
network_cost = 0.50 * prank(p50_transfer_bytes)
             + 0.30 * prank(bytes_per_page)
             + 0.20 * prank(requests_per_page)

These weights are design decisions, not learned parameters. I validated them by checking that known domains produced score profiles matching expert intuition. I haven’t done a formal sensitivity analysis on the weights, so take them as reasonable defaults rather than optimal values.
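A runnable sketch of this target construction, with invented CPU numbers for four hypothetical domains. The prank() helper is a minimal pure-Python percentile rank; the real pipeline computes ranks with pandas' rank(pct=True, method="min"):

```python
# Minimal percentile-rank sketch of the main_thread_cost formula above.
# The four domains' CPU numbers are invented for illustration.
def prank(values):
    """Min-method percentile rank scaled to (0, 1]: rank position / n."""
    ordered = sorted(values)
    n = len(values)
    return [(ordered.index(v) + 1) / n for v in values]

total_cpu     = [12.0, 305.4, 79.0, 1241.0]   # mean total CPU ms
scripting     = [8.0, 270.0, 60.0, 1100.0]    # mean scripting ms
parse_compile = [4.0, 35.4, 19.0, 141.0]      # mean parse/compile ms

main_thread_cost = [
    0.50 * t + 0.35 * s + 0.15 * p
    for t, s, p in zip(prank(total_cpu), prank(scripting), prank(parse_compile))
]
# The heaviest domain (1241ms total CPU) ranks 1.0 on every component,
# so its weighted score is 1.0; the lightest lands at the bottom.
```

Because the weights sum to 1.0 and each prank is in (0, 1], the composite stays in (0, 1] by construction.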
Target Independence
This is the key evidence that two targets was the right call. If CPU and network cost were correlated, one score would suffice. They aren’t.

Every dot is a tracker domain with real Lighthouse data, colored by Disconnect category. The scatter is dispersed, not diagonal.
- Top-left (high CPU, low network): platform.loyaltylion.com, which burns 1,241ms of CPU from a 2-byte request. The “invisible CPU” archetype.
- Top-right (high CPU, high network): googletagmanager.com, heavy on both axes. The worst-case tracker.
- Bottom-right (low CPU, high network): CDN-served ad creatives. Big files but little execution.
- Bottom-left (low CPU, low network): lightweight beacons and pixels.
Advertising domains (red) span the entire space. There is no single “ad tracker” performance profile. Some ads are pixels, some are heavyweight SDKs. This is exactly why Disconnect category alone can’t predict performance cost, and why it barely registers in SHAP feature importance analysis.
Why Regression, Not Classification
Even with two axes, the question of continuous scores vs discrete buckets comes up. The case for regression:
- No arbitrary boundaries. Where do you draw the line between “moderate” and “heavy” CPU cost? 45ms and 55ms aren’t meaningfully different categories.
- Flexible downstream use. The privacy metrics card can threshold at 0.5 for “high impact” today and 0.7 tomorrow without retraining. Each axis can use different thresholds.
- Richer signal. A main_thread_cost of 0.92 vs 0.71 tells you something. “High” vs “high” doesn’t.
Why Per-Domain Scoring
The same tracker domain (e.g., www.google-analytics.com) can serve both a heavyweight analytics script and a lightweight pixel endpoint. I had to decide what granularity to score at.
I chose domain level for three reasons:
- Firefox’s storage is domain-granular. TrackingDBService records blocked trackers by domain. The privacy metrics card displays per-domain counts. Domain-level scores slot directly into the existing data flow.
- At block time, Firefox knows the domain but hasn’t fetched the resource. Per-request features (transfer size, content type) aren’t available because the request was blocked. Domain-level features can be precomputed and shipped as a lookup table.
- Worst-case scoring is the right framing. Each dimension reflects the upper bound of that cost type the user is protected from. A domain that serves heavyweight scripts on 5% of pages gets a high CPU score based on its worst-case behavior. This is the right signal for “Firefox prevented X.”
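The block-time flow this enables can be sketched as follows. Table entries mirror score examples from this paper; the loading code and field names are hypothetical (Firefox's actual consumer is JavaScript, not Python):

```python
# Hypothetical sketch of the block-time lookup: scores are precomputed per
# domain and shipped as a static table, so only the domain name is needed
# when a request is blocked. Entries mirror examples from this paper.
SCORE_TABLE = {
    "googletagmanager.com": {"main_thread_cost": 0.92, "network_cost": 0.90},
    "bat.bing.com": {"main_thread_cost": 0.52, "network_cost": 0.42},
}

def score_for_blocked_domain(domain):
    """O(1) lookup; per-request features don't exist because the request was blocked."""
    return SCORE_TABLE.get(domain)

print(score_for_blocked_domain("bat.bing.com"))
```

Nothing about the blocked request itself is consulted, which is the whole point of precomputing at domain granularity.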
1. The Disconnect tracking protection list is an open-source classification of tracker domains maintained by Disconnect. Mozilla’s Enhanced Tracking Protection consumes this list to categorize blocked domains. The list is available as services.json and is updated regularly. ↩
2. Ghostery’s “Tracker Tax” study measured the top 500 websites and found pages loaded in ~8.6 seconds with trackers blocked vs. ~19.3 seconds with them present, with each additional tracker adding roughly 2.5% to page load time. For a more granular analysis, see Pourghassemi et al., “adPerf: Characterizing the Performance of Third-party Ads” (ACM SIGMETRICS, 2021), which used browser activity tracing to attribute specific CPU costs to individual ad scripts in Chromium. The HTTP Archive Web Almanac 2022 Third Parties chapter provides a broader census of third-party prevalence and blocking impact across millions of sites. ↩
3. Google’s own Publisher Tag best practices explicitly recommend async loading, and the Google Publisher Tag only supports asynchronous rendering. The IAB’s LEAN Principles (Light, Encrypted, AdChoices-supported, Non-invasive) similarly call for lightweight, non-blocking ad delivery. The economic logic is straightforward: render-blocking ads delay first paint, increasing bounce rates before the ad even loads, which tanks viewability metrics that determine ad revenue. ↩
4. Google’s Core Web Vitals (LCP, INP, CLS) became Search ranking signals in 2021, giving publishers a direct SEO incentive to minimize render-blocking third-party scripts. The thresholds and methodology are documented in Sullivan, McQuade, and Walton, “Defining the Core Web Vitals metrics thresholds” (web.dev, 2020). ↩
5. Lighthouse’s bootup-time audit measures per-script CPU time broken into scripting execution, parse/compile, and other categories. It only reports scripts exceeding a minimum CPU threshold, which is why 52% of tracker domains in our dataset have no CPU data; they didn’t cross the reporting floor. ↩
Data Collection and Ground Truth Construction
There’s no labeled dataset of “tracker domain to performance cost.” So I built one from 348 million HTTP requests and Lighthouse audits on BigQuery.
The Problem
Nobody has published a dataset that maps tracker domains to their performance impact on web pages. The Disconnect list categorizes domains by privacy function. The third-party-web project6 tracks script execution time by entity. But neither produces what I actually needed: per-domain performance cost scores covering CPU and network overhead independently.
I had to construct the ground truth labels from raw web crawl data.
Data Sources
HTTP Archive7 crawls millions of web pages monthly and stores the results in BigQuery. Four data sources feed the pipeline:
| Source | What it provides | Scale |
|---|---|---|
| httparchive.crawl.requests | Per-request timing, transfer size, resource type, waterfall position | 348M rows, 374GB (2024-06-01 mobile crawl) |
| httparchive.crawl.pages | Per-page Lighthouse audit results as nested JSON | ~16M pages |
| httparchive.almanac.third_parties | Entity-to-domain mapping from the third-party-web project | ~2,700 entities |
| Disconnect services.json | Privacy categories for tracker domains | 4,387 domains |
I used the 2024-06-01 crawl (most recent with stable Lighthouse v12 data) and the mobile client, which is more representative of constrained devices and covers ~60% of HTTP Archive’s crawl.
Lighthouse Audit Schema
The first surprise came before I wrote any aggregation logic. My design doc, based on older HTTP Archive documentation and examples, assumed three Lighthouse audit names:
| Design doc assumed | Actual audit name (2024 crawl) |
|---|---|
| third-party-summary | third-parties-insight |
| render-blocking-resources | render-blocking-insight |
| bootup-time | bootup-time (unchanged) |
Lighthouse renamed two of three audits at some point between the documentation I was referencing and the 2024 crawl. The only way to discover this was to explore the JSON structure interactively on BigQuery, extracting keys from the lighthouse column and checking what actually existed.
This is the kind of thing that burns hours if you don’t verify early. I was writing queries against audit names that no longer existed in the data.
Entity-Level Attribution
The second surprise was worse. The third-parties-insight audit doesn’t use domain names. It uses entity names from the third-party-web project:
{
"entity": "Google Tag Manager",
"mainThreadTime": 305.4,
"transferSize": 98234,
"blockingTime": 142.1
}The key is "Google Tag Manager", not "www.googletagmanager.com". To map these back to domain-level scores, I needed the entity-to-domain mapping from httparchive.almanac.third_parties. This meant the third-party-summary audit couldn’t directly join to the request data on domain name; it needed an intermediate join through entity names.
For the bootup-time audit, the data uses URLs directly ($.url), so domain extraction is straightforward with NET.HOST(). This asymmetry between audits added complexity but was manageable once I understood the schema.
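The two-hop join can be sketched in Python with illustrative data (the real join runs in BigQuery against httparchive.almanac.third_parties):

```python
# Sketch of the two-hop entity join described above. third-parties-insight
# rows are keyed by entity name, so domain-level attribution goes through
# the third-party-web entity -> domains mapping. All data here is illustrative.
entity_rows = [
    {"entity": "Google Tag Manager", "mainThreadTime": 305.4,
     "transferSize": 98234, "blockingTime": 142.1},
]
entity_to_domains = {
    "Google Tag Manager": ["www.googletagmanager.com"],
}

domain_metrics = {}
for row in entity_rows:
    # hop 1: entity name -> its domains; hop 2: attach metrics per domain
    for domain in entity_to_domains.get(row["entity"], []):
        domain_metrics[domain] = {
            "mainThreadTime": row["mainThreadTime"],
            "blockingTime": row["blockingTime"],
        }

print(domain_metrics)
```

By contrast, the bootup-time audit already carries URLs, so it joins to the request data in a single hop.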
The Extraction Pipeline
Stage 1: Per-Request Feature Extraction
The first query pulls raw request-level signals for every HTTP request to a known tracker domain:
SELECT
NET.HOST(req.url) AS tracker_domain,
req.page,
req.type AS resource_type,
CAST(JSON_VALUE(req.payload, '$._bytesIn') AS INT64) AS transfer_bytes,
CAST(JSON_VALUE(req.payload, '$._load_ms') AS INT64) AS load_ms,
req.index AS waterfall_index,
JSON_VALUE(req.payload, '$._priority') AS chrome_priority,
JSON_VALUE(req.payload, '$._initiator_type') AS initiator_type,
FROM `httparchive.crawl.requests` req
WHERE req.date = '2024-06-01'
AND req.client = 'mobile'
AND req.is_root_page = TRUE
AND NET.HOST(req.url) IN (
SELECT domain FROM `httparchive.almanac.third_parties`
WHERE category IN ('ad', 'analytics', 'social',
'tag-manager', 'consent-provider')
  )

This scanned 374GB and returned 348 million rows, covering every request to a third-party tracker domain across the entire mobile crawl. Cost: ~$1.87 on BigQuery.
Stage 2: Lighthouse CPU Extraction
The second query pulls per-script CPU breakdown from the bootup-time audit:
SELECT
NET.HOST(JSON_VALUE(item, '$.url')) AS tracker_domain,
COUNT(DISTINCT page) AS pages_with_data,
AVG(CAST(JSON_VALUE(item, '$.scripting') AS FLOAT64))
AS mean_scripting_ms,
AVG(CAST(JSON_VALUE(item, '$.scriptParseCompile') AS FLOAT64))
AS mean_parse_compile_ms,
AVG(CAST(JSON_VALUE(item, '$.total') AS FLOAT64))
AS mean_total_cpu_ms,
FROM `httparchive.crawl.pages`,
UNNEST(JSON_QUERY_ARRAY(lighthouse,
'$.audits.bootup-time.details.items')) AS item
WHERE date = '2024-06-01'
AND client = 'mobile'
AND is_root_page = TRUE
GROUP BY tracker_domain

Lighthouse only reports scripts that exceed a CPU threshold in the bootup-time audit. Domains that don’t appear here didn’t cross the reporting threshold, meaning “absent” is closer to “negligible CPU” than “unmeasured.” This distinction becomes critical later.
Stage 3: Domain-Level Aggregation
The aggregation collapsed 348 million request rows into 4,592 tracker domains (filtered to pages_seen_on >= 10 for statistical stability). This step runs in Python rather than SQL because it requires the Disconnect list fuzzy matching and percentile rank computation.
Disconnect List Fuzzy Matching
The initial naive join between HTTP Archive domains and the Disconnect list was disappointing:
Exact domain match: 92 out of 4,592 domains. The problem: the Disconnect list uses base domains (doubleclick.net), but HTTP Archive records full subdomains (stats.g.doubleclick.net, td.doubleclick.net, pagead2.googlesyndication.com).
The fix was suffix matching: try progressively shorter domain suffixes until a match is found.
def match_disconnect_domain(tracker_domain, disconnect_domains):
parts = tracker_domain.split(".")
for i in range(len(parts) - 1):
candidate = ".".join(parts[i:])
if candidate in disconnect_domains:
return candidate
    return None

stats.g.doubleclick.net tries stats.g.doubleclick.net (miss), then g.doubleclick.net (miss), then doubleclick.net (hit). After this fix: 3,686 matches (80% of domains). The remaining 906 domains are in HTTP Archive’s third-party categories but not on the Disconnect list.
Coverage Analysis
Not every domain has every type of data. This split defines the ML problem:

| Segment | Count | Meaning |
|---|---|---|
| Domains with Lighthouse CPU data | 2,185 (48%) | Real training labels for main_thread_cost |
| Domains without CPU data | 2,407 (52%) | Model predicts their scores |
| Matched to Disconnect list | 3,686 (80%) | Have privacy category features |
| Not on Disconnect list | 906 (20%) | In HTTP Archive’s tracker categories but not Disconnect |
Every domain has network data (transfer size comes from the request table), so network_cost has real labels for all 4,592 domains. But main_thread_cost only has ground truth for the 2,185 where Lighthouse reported CPU time. The model’s job is to predict CPU cost for the other 2,407 using only request-level features.
Missing Data and Label Noise
This was the most consequential data decision in the project.
The mistake: In my first training run, I filled missing CPU data with 0.0 and trained on all 4,592 domains. The 2,407 domains without CPU data got main_thread_cost = 0.0. The model reported Spearman rho = 0.825, which looked great.
The problem: 0.0 means “Lighthouse didn’t report CPU time,” not “zero CPU cost.” The model was learning fake labels: those 2,407 domains were treated as ground truth when they were just assumptions. The model learned that small transfer size correlates with zero CPU, because that’s what the fake labels said. But small transfer doesn’t mean zero CPU; bat.bing.com serves 258-byte stubs that trigger 105ms of scripting.
The evidence: When I evaluated only on the 328 test domains with real CPU data, rho dropped from 0.825 to 0.734. The inflated metric was coming from the model “correctly” predicting low scores for domains that had fake 0.0 labels.
The fix: Train main_thread_cost only on the 2,185 domains with real Lighthouse CPU data. Use the model to predict scores for the 2,407 without data.
| Metric | Before (fake 0s) | After (real data only) |
|---|---|---|
| Training set | 4,592 domains | 2,185 domains |
| Spearman rho (honest eval) | 0.734 | 0.767 |
| RMSE | 0.173 | 0.091 |
| MAE on real-data domains | 0.173 | 0.062 |
RMSE nearly halved, MAE improved 3x, and rho on real data went up, not down, despite training on half the data. The fake labels were actively hurting the model.8
Percentile Rank Construction
Percentile ranks convert raw milliseconds and bytes into [0.0, 1.0] scores that are robust to outliers and interpretable (“worse than X% of trackers”).
But here the fake-0 problem showed up again in the target construction. When I computed percentile ranks across all 4,592 domains, the 2,407 with “zero” CPU all tied at rank 0.0. The remaining 2,185 real domains got compressed into the [0.52, 1.0] range. The model was training on targets bunched in the upper half of [0, 1], wasting half the output space.
The fix: compute ranks only within the 2,185 domains with real CPU data.
has_cpu = df["mean_scripting_ms"].notna()
df.loc[has_cpu, "prank_scripting"] = (
df.loc[has_cpu, "mean_scripting_ms"]
.rank(pct=True, method="min")
)

This gives the training set a full [0, 1] range. Domains without CPU data get NaN for main_thread_cost; they’re the prediction targets, not the training labels.
For network_cost, all 4,592 domains have real network data, so ranks are computed over the full set with no issues.
Target Distributions
main_thread_cost (left): The 2,185 domains with Lighthouse CPU data are spread roughly uniformly across [0, 1]. This is what percentile ranking gives you: a flat distribution by construction. The 2,407 without CPU data are not shown; they have no score yet.
network_cost (right): All 4,592 domains have real network data. Distribution is roughly uniform with a slight right skew. Every tracker transfers something, so there’s no spike at zero.
The two distributions have very different shapes in raw space (before percentile ranking). CPU time is extremely right-skewed with a long tail of heavy SDKs. Network cost is more evenly distributed. Percentile ranking normalizes both to [0, 1], which is what the model trains on.
Score Validation
Before training any model, I checked the score profiles against domains where I had strong priors:
| Domain | main_thread_cost | network_cost | Expected |
|---|---|---|---|
| googletagmanager.com | 0.92 | 0.90 | Heavy on both axes |
| connect.facebook.net | 0.64 | 0.83 | FB SDK: heavy scripts, large transfer |
| google-analytics.com | 0.53 | 0.46 | Moderate CPU (higher than I expected) |
| bat.bing.com | 0.52 | 0.42 | Moderate CPU despite 258-byte transfer |
| cdn.cookielaw.org | 0.79 | 0.84 | Consent: heavy script, big transfer |
The key validation is that the two dimensions produce different profiles for qualitatively different trackers. GTM is heavy on both. Analytics is moderate-moderate. Bing’s pixel is moderate CPU but low network. The profiles match reality, which means the labeling pipeline is working.
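A minimal version of this sanity check, with expected profiles hard-coded from the table above (the tolerance is an invented knob, and the check itself is a sketch rather than pipeline code):

```python
# Sketch of the pre-training sanity check: compare labeled score profiles
# for well-understood domains against prior expectations. Tolerance is an
# invented knob; expected values come from this paper's validation table.
EXPECTED_PROFILES = {
    "googletagmanager.com": (0.92, 0.90),  # heavy on both axes
    "connect.facebook.net": (0.64, 0.83),  # heavy scripts, large transfer
    "bat.bing.com": (0.52, 0.42),          # moderate CPU, low network
}

def validate_scores(scores, tol=0.05):
    """Return the domains whose (cpu, network) profile deviates beyond tol."""
    failures = []
    for domain, (exp_cpu, exp_net) in EXPECTED_PROFILES.items():
        got_cpu, got_net = scores[domain]
        if abs(got_cpu - exp_cpu) > tol or abs(got_net - exp_net) > tol:
            failures.append(domain)
    return failures

print(validate_scores(dict(EXPECTED_PROFILES)))  # []
```

A non-empty return would mean the labeling pipeline produced a profile that contradicts strong priors, which is worth investigating before any model training.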
6. Patrick Hulce’s third-party-web project maps ~2,700 entity names to their associated domains, categorized by function (ad, analytics, social, etc.). This mapping is consumed by Chrome Lighthouse and DevTools to attribute performance costs to named entities rather than raw domains. ↩
7. HTTP Archive is a community-run project (started by Steve Souders) that crawls millions of URLs monthly using WebPageTest on Chrome, storing HAR files and Lighthouse audits in Google BigQuery. The annual Web Almanac is the canonical publication built on this data. ↩
8. For a thorough treatment of how label noise degrades model performance, see Frenay and Verleysen, “Classification in the Presence of Label Noise: A Survey” (IEEE TNNLS, 2014). Systematic label noise (like replacing missing values with a constant) is particularly damaging because it teaches the model a spurious pattern rather than adding random variance. ↩
Feature Engineering and Selection
Nineteen features that describe how a tracker domain behaves on the web, and which ones actually matter.
The Feature Vector
Each tracker domain gets a fixed-length feature vector aggregated from HTTP Archive request data. The features fall into five groups, designed to capture different aspects of a domain’s behavior without requiring Lighthouse CPU measurements (which aren’t available for all domains).
Size features (5)
| # | Feature | Source |
|---|---|---|
| 1 | p50_transfer_bytes | Median transfer size across all requests |
| 2 | p90_transfer_bytes | 90th percentile transfer size |
| 3 | mean_transfer_bytes | Mean transfer size |
| 4 | max_script_bytes | Largest single script from this domain |
| 5 | bytes_per_page | Total bytes from this domain per page (median across pages) |
max_script_bytes turned out to be the single most important feature for predicting CPU cost (SHAP value 0.056, nearly 2x the next feature). That makes sense: the biggest script a domain serves is the best available proxy for how much computation it triggers.
Type composition (4)
| # | Feature | Source |
|---|---|---|
| 6 | script_request_ratio | Fraction of requests that are type=script |
| 7 | image_request_ratio | Fraction that are type=image |
| 8 | requests_per_page | Median requests from this domain per page |
| 9 | distinct_resource_types | Number of unique resource types served |
script_request_ratio is the third most important CPU feature (SHAP 0.019). A domain that’s 95% scripts (like googletagmanager.com) behaves very differently from one that’s 95% images (tracking pixels). But it’s not a perfect signal. bat.bing.com has 0% script ratio in request data yet burns 105ms of CPU, because Lighthouse attributes CPU from redirect chains and cross-domain scripts.
Timing / execution (4)
| # | Feature | Source |
|---|---|---|
| 10 | mean_lh_main_thread_ms | Lighthouse third-parties-insight mainThreadTime |
| 11 | mean_scripting_ms | Lighthouse bootup-time scripting time |
| 12 | mean_parse_compile_ms | Lighthouse bootup-time scriptParseCompile time |
| 13 | p50_load_ms | Median total request time |
Features 10-12 are Lighthouse-derived. They’re available for the 2,185 domains with CPU data and NaN for the rest. XGBoost handles NaN natively via learned default split directions; no imputation needed.
These features are used during training but aren’t required at inference time. The generalization experiment (training without features 10-12) shows the model can still achieve rho=0.751 from request features alone.
Render-blocking (3)
| # | Feature | Source |
|---|---|---|
| 14 | render_block_rate | Fraction of pages where domain appears in render-blocking-insight |
| 15 | mean_render_block_wasted_ms | Average wasted render-blocking ms |
| 16 | mean_render_block_bytes | Average render-blocking resource size |
These turned out to be nearly useless. Data exploration showed that almost no tracker domains appear in the render-blocking audit (see Problem Formulation and Target Design). The features are kept for completeness but contribute almost zero to predictions.
Prevalence / context (3)
| # | Feature | Source |
|---|---|---|
| 17 | pages_seen_on | Number of pages this domain appears on |
| 18 | p10_waterfall_index | 10th percentile waterfall position (how early it loads) |
| 19 | disconnect_category | One-hot encoded into 6 binary features |
disconnect_category is one-hot encoded into: Advertising, Analytics, Social, Content, FingerprintingInvasive, Cryptomining. This adds 6 columns, giving 25 total input features.
The surprise: Disconnect categories contribute almost nothing to predictions. SHAP values for all six category features are near zero. A domain’s privacy category (Advertising vs Analytics vs Social) simply doesn’t predict its performance cost. An “Advertising” domain can be a pixel (near-zero cost) or a 170KB SDK (high cost). The model learns from behavioral features (script size, transfer size), not categorical labels.
Preprocessing
Log transform: All byte and millisecond features are log-transformed before training. These distributions are heavily right-skewed: max_script_bytes ranges from 0 to 27MB, and mean_scripting_ms ranges from 0 to 7,300ms. Log1p compresses the range while preserving ordering.
Negative values: One edge case: p50_load_ms had a -1 value in the data (likely a measurement artifact). Clipped to 0 before the log transform to avoid -inf.
Missing values: Left as NaN. XGBoost learns default split directions for missing values, which is more principled than imputation. The Lighthouse features (10-12) are NaN for ~52% of domains.
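These preprocessing steps can be sketched as a small helper (the function name is mine; the point is that negatives are clipped before log1p and missing values pass through as NaN for XGBoost's native handling):

```python
import math

# Sketch of the preprocessing described above: clip negatives (the -1
# p50_load_ms artifact) to 0 before log1p, and leave missing values as NaN
# for XGBoost's default-direction handling. Helper name is invented.
def preprocess_skewed(value):
    if value is None:
        return float("nan")              # missing stays missing, no imputation
    return math.log1p(max(value, 0.0))   # clip artifact, then compress range

print(preprocess_skewed(-1))             # 0.0, not -inf
print(round(preprocess_skewed(27_000_000), 1))  # 17.1 (the 27MB max_script_bytes tail)
```

log1p keeps ordering intact while pulling a 0-to-27MB range into roughly 0-to-17, which is far friendlier to tree split thresholds.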
The Disconnect List Join
This was harder than expected. The Disconnect list uses base domains (doubleclick.net), but HTTP Archive records full subdomains (stats.g.doubleclick.net, td.doubleclick.net, googleads.g.doubleclick.net).
Initial naive join (exact match on domain): 92 matches out of 4,592. Terrible.
Fix: suffix matching. For each tracker domain, try progressively shorter suffixes until a match is found:
def match_disconnect_domain(tracker_domain, disconnect_domains):
parts = tracker_domain.split(".")
for i in range(len(parts) - 1):
candidate = ".".join(parts[i:])
if candidate in disconnect_domains:
return candidate
    return None

After the fix: 3,686 matches (80% of domains). The remaining 20% are domains in HTTP Archive’s third-party categories but not on Disconnect’s list, reflecting different definitions of “tracker.”
The Percentile Rank Fix
The target scores are percentile ranks (0.0 = cheapest, 1.0 = most expensive). Originally I computed ranks across all 4,592 domains, including the 2,407 with no CPU data that were filled with 0.
This compressed the training data for main_thread_cost into [0.5, 1.0]; the bottom half of the range was all zeros. The model had half the target resolution to work with.
Fix: compute percentile ranks only within the 2,185 domains with real CPU data. The cheapest measured domain gets ~0.0, the most expensive gets ~1.0. Full [0, 1] range for training. Domains without CPU data get NaN for main_thread_cost (they’re predicted by the model, not used for training).
Model Architecture and Training
The biggest script a domain serves is the single best proxy for CPU cost when you don’t have Lighthouse data. Everything else is noise.
Why XGBoost
The task is tabular regression: 19 numeric features in, one float out per target. The dataset is small (~2K-4.5K domains depending on the target). The models need interpretability for the downstream consumer (Firefox engineers deciding whether to trust the scores). I compared four options.
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Ridge regression | Simple, fast, tiny model | Can’t capture feature interactions | Baseline only |
| Random forest | Handles nonlinearity; robust | Large model size; no native missing value handling | Inferior to XGBoost here |
| Small neural net | Could learn feature embeddings | Overkill for 19 features and ~2K samples; poor interpretability; larger model | Wrong tool |
| XGBoost | Best on small tabular data; native missing values; interpretable via SHAP; small ONNX export | No transfer learning | Selected |
The critical factor was feature interactions.[9] A linear model can learn “high transfer size = high cost,” but it can’t learn “high script ratio + large size = high cost, but high script ratio + small size = low cost.” That interaction is real: a domain that serves many small scripts (like a retargeting pixel chain) behaves very differently from one that serves a few large scripts (like an ad SDK). XGBoost captures this with tree splits, and it does so efficiently on small datasets where neural nets would overfit.
Ridge regression turned out to be a strong baseline (rho=0.713 for CPU cost), which I’ll return to in the evaluation article. But XGBoost beat it by a meaningful margin on the target that matters most.
Two Independent Models
I train two independent XGBRegressor models, one per target axis. Not a single multi-output model,[10] for several reasons:

- Per-target tuning matters. `main_thread_cost` trains on 2,185 domains (only those with real Lighthouse CPU data); `network_cost` trains on all 4,592 (every domain has real network data). Different dataset sizes, different distributions, different optimal hyperparameters.
- Independent failure modes. If one model turns out to be unreliable, I can drop it without retraining the other. In practice, `network_cost` turned out to be trivially solved while `main_thread_cost` required real effort.
- Per-target SHAP analysis is cleaner when each model is self-contained. The features that drive CPU cost are completely different from the features that drive network cost.
Training Data Selection
This was the single most impactful decision in the entire pipeline.
The dataset has 4,592 tracker domains. Of those, only 2,185 have real Lighthouse CPU data (they appear in the bootup-time or third-party-summary audits). The remaining 2,407 have main_thread_cost = 0.0, not because they truly cost zero CPU, but because Lighthouse didn’t measure them. They load after Lighthouse’s observation window, or they don’t trigger audits, or they only serve images.
Version 1 of the model trained on all 4,592 domains, including those fake zeros. It achieved a headline Spearman rho of 0.825 on the full test set. Looked great. But 340 of the test domains had those fake 0.0 labels, and the model “correctly” predicted low for them, inflating the metric.
Version 2 trained main_thread_cost only on the 2,185 domains with real Lighthouse data. The honest metric, evaluated only on domains where the label is real, improved from 0.734 to 0.751. The headline number went down, but the actual model got better. I cover this in detail in the evaluation article.
network_cost has no such problem. Every domain has real network data (transfer size, request count, bytes per page), so it trains on all 4,592 domains.
Hyperparameter Search with Optuna
Each target gets its own Optuna[11] study with 5-fold cross-validation, optimizing for Spearman rank correlation. Not RMSE, not R-squared. Spearman rho, because the downstream use case cares about ranking (which trackers are most expensive on each axis), not about predicting exact scores.
```python
import numpy as np
import optuna
import xgboost as xgb
from scipy.stats import spearmanr
from sklearn.model_selection import KFold

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 30, 200),
        'max_depth': trial.suggest_int('max_depth', 3, 7),
        'learning_rate': trial.suggest_float(
            'learning_rate', 0.01, 0.3, log=True
        ),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float(
            'colsample_bytree', 0.5, 1.0
        ),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10, log=True),
    }
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kf.split(X_train):
        model = xgb.XGBRegressor(
            **params,
            objective='reg:squarederror',
            tree_method='hist',
            random_state=42,
        )
        model.fit(
            X_train.iloc[train_idx], y_train.iloc[train_idx],
            eval_set=[(X_train.iloc[val_idx], y_train.iloc[val_idx])],
            verbose=False,
        )
        preds = model.predict(X_train.iloc[val_idx])
        rho, _ = spearmanr(y_train.iloc[val_idx], preds)
        scores.append(rho)
    return np.mean(scores)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```

The search space is deliberately wide on the regularization parameters (`reg_alpha`, `reg_lambda`, `gamma`) because small datasets overfit quickly. Optuna’s tree-structured Parzen estimator navigates this efficiently. The training loss uses `reg:squarederror` for stable gradient boosting, but the tuning objective is Spearman rho, the metric I actually care about.
Results
| Target | Training Domains | Spearman rho | RMSE | MAE |
|---|---|---|---|---|
| main_thread_cost | 2,185 (real CPU data only) | 0.751 | 0.182 | 0.140 |
| network_cost | 4,592 (all domains) | 0.999 | 0.010 | 0.007 |
The asymmetry is striking. Network cost is essentially a solved problem: the model reconstructs the target almost perfectly from transfer size features. CPU cost is the real modeling challenge, because the features available at inference time (request metadata, no Lighthouse) are indirect proxies for the thing you’re trying to predict (JavaScript execution time).
A rho of 0.751 means the predicted ranking agrees strongly, but far from perfectly, with the true CPU cost ranking. I think that’s good enough for the downstream use case: correctly identifying which blocked trackers were expensive and which were cheap. It won’t perfectly distinguish the 65th-percentile domain from the 70th, but it will correctly separate a heavy tag manager from a lightweight pixel. I should note that 2,185 training samples with 19 features is on the thin side, and I’m relying on Optuna’s cross-validation to catch overfitting rather than doing a proper learning-curve analysis. Something to revisit.
SHAP Analysis
SHAP[12] (SHapley Additive exPlanations) decomposes each prediction into per-feature contributions. The summary plots reveal what each model actually learned.
main_thread_cost
The top features by mean absolute SHAP value:
| Rank | Feature | Mean \|SHAP\| |
|---|---|---|
| 1 | max_script_bytes | 0.056 |
| 2 | bytes_per_page | 0.030 |
| 3 | script_request_ratio | 0.019 |
Everything else is at or near zero. The Disconnect categories (Advertising, Analytics, Social, etc.) barely register. The model learned one dominant pattern: the size of the biggest script a domain serves is the best available proxy for its CPU cost, when you don’t have Lighthouse timing data.
This makes intuitive sense. A large JavaScript file needs to be parsed, compiled, and executed. A domain serving a 170KB script is almost certainly running more expensive logic than one serving 2KB. bytes_per_page adds a cumulative signal (total load from this domain), and script_request_ratio adds a type signal (is this domain mostly scripts or mostly images?).
What surprised me is how little the Disconnect categories contribute. “Advertising” vs. “Analytics” doesn’t predict CPU cost once you control for script size. The privacy taxonomy and the performance taxonomy are nearly orthogonal.
network_cost
The network cost SHAP is even more concentrated:
| Rank | Feature | Mean \|SHAP\| |
|---|---|---|
| 1 | p50_transfer_bytes | 0.096 |
| 2 | bytes_per_page | 0.084 |
| 3 | everything else | ~0.000 |
Two features explain essentially all of the model’s predictions. p50_transfer_bytes (median transfer size per request) and bytes_per_page (total bytes from this domain per page) together are network cost. This is why the model achieves rho=0.999: the target variable network_cost is a weighted percentile rank of these same transfer size signals. The model is recovering a near-deterministic relationship.
Feature-Target Relationships

The scatter plots show why the two models have such different accuracy. The network cost features (transfer size, bytes per page) have a clear monotonic relationship with the target; you can almost draw a line through them. The CPU cost features (max script bytes, script ratio) have a much noisier relationship. Large scripts tend to be CPU-expensive, but some large scripts are mostly data (JSON payloads, configuration objects) while some small scripts trigger expensive event listeners, cookie operations, and beacon chains.
Prediction Accuracy
The 45-degree line is the ideal. Points above are over-predictions; points below are under-predictions. The model tracks well through the middle of the distribution but struggles at the extremes. Domains with very high actual CPU cost (above 0.8) are systematically under-predicted. The model doesn’t have strong enough signal from request metadata to identify the most expensive trackers. I examine these failure cases in the evaluation article.
9. Chen and Guestrin, “XGBoost: A Scalable Tree Boosting System” (KDD, 2016). For systematic evidence that tree-based models outperform deep learning on medium-sized tabular data, see Grinsztajn, Oyallon, and Varoquaux, “Why do tree-based models still outperform deep learning on typical tabular data?” (NeurIPS, 2022), which benchmarks across 45 datasets and finds trees dominant below ~10K samples. ↩
10. Spyromitros-Xioufis, Tsoumakas, Groves, and Vlahavas, “Multi-target regression via input space expansion” (Machine Learning, 2016) systematically compare independent single-target models against methods that exploit inter-target correlations. They find independent models are competitive when targets have low correlation, which is exactly the case here, since CPU and network cost are largely independent. ↩
11. Akiba, Sano, Yanase, Ohta, and Koyama, “Optuna: A Next-generation Hyperparameter Optimization Framework” (KDD, 2019). Optuna’s Tree-structured Parzen Estimator (TPE) navigates high-dimensional hyperparameter spaces more efficiently than grid or random search by modeling the conditional probability of good vs. bad configurations. ↩
12. Lundberg and Lee, “A Unified Approach to Interpreting Model Predictions” (NeurIPS, 2017). SHAP connects game-theoretic Shapley values to local model explanations, providing consistent and theoretically grounded feature attributions. The TreeSHAP variant runs in polynomial time on tree ensembles. ↩
Evaluation and Error Analysis
The headline metric was 0.825. The honest metric was 0.734. Knowing the difference is the entire evaluation story.
The Inflated Metric
Version 1 of the main_thread_cost model trained on all 4,592 domains, including 2,407 with main_thread_cost = 0.0. Those zeros aren’t real measurements; they’re the absence of Lighthouse data. The model learned that many domains have zero CPU cost, and it predicted low for them. On the full test set, that produced a Spearman rho of 0.825.
But 340 of those test domains had fake 0.0 labels. The model was being rewarded for “correctly” predicting low scores on domains where the ground truth was made up. Strip those out and evaluate only on domains with real Lighthouse CPU data, and the honest metric was 0.734.
Version 2 fixed this by training main_thread_cost only on the 2,185 domains with real Lighthouse data. The honest metric improved to 0.751. The headline number went down (no more easy points from fake zeros), but the model got meaningfully better at the thing that actually matters: ranking domains by their real CPU cost.
| Version | Training data | Full test rho | Honest test rho (real CPU data only) |
|---|---|---|---|
| v1 | All 4,592 domains (including fake 0s) | 0.825 | 0.734 |
| v2 | 2,185 domains (real CPU data only) | — | 0.751 |
The lesson: always compute your metric on the subset where the labels are real. A headline number that includes fake labels will mislead you.
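The honest-metric idea fits in a few lines. In this sketch, `y_true` uses NaN where a domain has no real Lighthouse label (all values illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

y_true = np.array([0.9, np.nan, 0.2, 0.7, np.nan])  # NaN = no real CPU label
y_pred = np.array([0.8, 0.05, 0.3, 0.6, 0.10])

real = ~np.isnan(y_true)             # evaluate only where labels are real
rho, _ = spearmanr(y_true[real], y_pred[real])
```

The full-set metric would score the NaN rows against made-up zeros; masking them out is what turns the headline number into the honest one.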
Baselines
Every metric needs context. I compared XGBoost against three baselines, all using the same no-Lighthouse feature set (the features available at inference time). All metrics computed on the honest test set: only domains with real CPU data.
| Model | main_thread_cost rho | network_cost rho |
|---|---|---|
| Mean prediction | 0.0 | 0.0 |
| Transfer size only (p50_transfer_bytes percentile rank) | 0.264 | 0.911 |
| Ridge regression | 0.713 | 0.992 |
| XGBoost | 0.751 | 0.999 |
A few things stand out.
Transfer size alone is a terrible CPU predictor. A rho of 0.264 means transfer size ranks CPU cost only slightly better than chance. A tiny 258-byte script (bat.bing.com) can trigger 105ms of scripting time, while a 50KB image costs zero CPU. Transfer size tells you almost nothing about execution cost.
Transfer size alone is an excellent network predictor. Rho of 0.911 for a single feature. Network cost is essentially a function of how many bytes you move.
Ridge regression is a strong baseline for CPU cost. Rho of 0.713 is respectable. XGBoost beats it by ~0.04, which sounds small but represents meaningfully better ranking at the top of the distribution (the expensive trackers that matter most for the UI). The gap comes from feature interactions that ridge can’t capture: “large scripts from high-script-ratio domains” is a different signal than either feature alone.
XGBoost’s advantage is real but modest. On network cost, the difference between ridge (0.992) and XGBoost (0.999) is negligible. On CPU cost, the 0.038-point gap is meaningful. If ridge had matched XGBoost, I’d have shipped the simpler model. It didn’t.
Error Analysis
The aggregate metrics tell you the model works. The error analysis tells you where and why it fails.
Under-predictions
The most instructive error is platform.loyaltylion.com:
| | Actual | Predicted | Error |
|---|---|---|---|
| main_thread_cost | 0.931 | 0.675 | +0.256 |
This domain has a 2-byte median transfer size and a 0% script request ratio. It looks like a tracking pixel from every feature the model can see. But Lighthouse measured 1,241ms of scripting time. How?
The domain is part of a redirect chain. The initial request is tiny, but it triggers cross-domain script loading that Lighthouse attributes to the originating domain. The model sees the request-level features (tiny, no scripts) and predicts low cost. The actual CPU cost is hidden behind a redirect that the request metadata doesn’t capture.
This is a fundamental feature limitation, not a model bug. No amount of hyperparameter tuning will fix it. The information the model needs (what happens after the redirect) isn’t in the feature set. Fixing this requires new features: redirect chain depth, target domain script behavior, or cross-domain attribution data.
Under-predictions like this are the real failures. The model says “this tracker is cheap” when it’s actually expensive. If Firefox shows users “we prevented 50ms of CPU cost” when the true cost was 300ms, that’s a misleading undercount.
Over-predictions
The largest over-predictions are on domains without CPU data, domains where the label is 0.0 because Lighthouse didn’t measure them, not because they’re truly zero-cost.
Consider servers3.adriver.ru: the model predicts a moderate main_thread_cost. The label is 0.0. That looks like an error. But this domain serves 80% scripts with 16KB median transfer size. It looks like a real ad script. The model might be right and the label might be wrong.
Over-predictions on no-CPU-data domains are predictions scored against bad labels. They inflate the error metrics without representing actual model failures. This is another reason the honest metric (evaluated only on real-data domains) is more trustworthy than the full-set metric.
Error asymmetry
This asymmetry is the most interesting finding from the error analysis:
| Error type | What it means | Is it a real failure? |
|---|---|---|
| Under-prediction | Model says “cheap,” reality says “expensive” | Yes. Feature limitation or model weakness. |
| Over-prediction on real-data domain | Model says “expensive,” reality says “less expensive” | Yes. Model overestimates. |
| Over-prediction on no-data domain | Model says “expensive,” label says 0.0 | Probably not. Label is fake. Model may be right. |
Under-predictions are strictly worse than over-predictions for the downstream use case. If Firefox tells users it blocked an expensive tracker and the tracker was actually cheap, that’s a mild overstatement. If Firefox tells users it blocked a cheap tracker and it was actually expensive, that’s an undercount of the protection Firefox provided. The error asymmetry maps to an asymmetric cost function in the product.
Absolute Error
The mean absolute error on the honest test set (domains with real Lighthouse CPU data):
| Metric | main_thread_cost | network_cost |
|---|---|---|
| MAE | 0.062 | 0.007 |
| RMSE | 0.182 | 0.010 |
For CPU cost, the model is off by about 6 percentage points on average. A domain with true main_thread_cost = 0.70 would typically be predicted somewhere in the range [0.64, 0.76]. That’s accurate enough for the downstream use: correctly bucketing domains into “high,” “moderate,” and “low” CPU cost tiers, which is what the privacy metrics card needs.
The gap between MAE (0.062) and RMSE (0.182) reveals that most predictions are close, but a few are far off. RMSE penalizes large errors quadratically, so it’s dominated by the worst cases (like platform.loyaltylion.com). The typical prediction is better than the worst-case number suggests.
For network cost, both MAE and RMSE are negligible. The model essentially reproduces the target perfectly.
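The MAE-RMSE gap is easy to reproduce on a toy error vector where most errors are small and one is large (values illustrative, not the model's actual errors):

```python
import numpy as np

errors = np.array([0.02, 0.03, 0.05, 0.04, 0.50])  # one outlier error

mae = np.abs(errors).mean()            # reflects the typical case
rmse = np.sqrt((errors ** 2).mean())   # dominated by the 0.50 outlier
```

Here MAE is about 0.13 while RMSE is roughly 0.23: the single outlier nearly doubles the quadratic metric, which is exactly the platform.loyaltylion.com effect in miniature.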
Uncertainty Quantification
The point estimates were decent. But a single number per domain hides something important: how much does the model trust its own prediction?
I applied three techniques on top of the base XGBoost models, each building on the previous one. Quantile regression produces prediction intervals. Self-training tries to exploit the unlabeled data. Conformal prediction audits everything. The most interesting results were the ones I didn’t expect.
Quantile Regression
Instead of predicting one score per domain, predict three: the 10th percentile, median, and 90th percentile. If the interval is narrow, the model is confident. If it’s wide, the model is saying “I’m not sure about this one.”
XGBoost supports this natively.[13] Set `objective='reg:quantileerror'` and pass the quantile level via `quantile_alpha`. No architecture changes, no custom loss functions. I trained six models total: 2 targets x 3 quantiles.
```python
def train_quantile_model(X_train, y_train, quantile, base_params):
    model = xgb.XGBRegressor(
        **base_params,
        objective="reg:quantileerror",
        quantile_alpha=quantile,
        tree_method="hist",
        random_state=42,
    )
    model.fit(X_train, y_train)
    return model

for target in ["main_thread_cost", "network_cost"]:
    for q in [0.1, 0.5, 0.9]:
        model = train_quantile_model(X_train, y_train, q, params)
```

The p10-p90 interval should contain the true value about 80% of the time. That’s the target coverage.
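Empirical coverage is then just the fraction of true values that land inside their predicted interval; a sketch with illustrative arrays:

```python
import numpy as np

# Hypothetical p10/p90 predictions and true scores for four test domains.
p10 = np.array([0.10, 0.30, 0.50, 0.20])
p90 = np.array([0.60, 0.80, 0.90, 0.40])
y_true = np.array([0.50, 0.90, 0.70, 0.30])

inside = (y_true >= p10) & (y_true <= p90)
coverage = inside.mean()          # fraction covered; target is 0.80
mean_width = (p90 - p10).mean()   # average interval width
```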
Results
| Target | Coverage (p10-p90) | Mean width | Median rho |
|---|---|---|---|
| main_thread_cost | 0.680 | 0.413 | 0.745 |
| network_cost | 0.804 | 0.115 | 0.999 |
Two very different stories. Network cost intervals are tight (mean width 0.115) and well-calibrated (coverage 0.804, right on the 0.80 target). The model knows network cost and knows that it knows. Main thread cost intervals are wider (0.413) and under-covering (0.680 vs 0.80 target). The model is slightly overconfident about CPU predictions; it thinks its intervals are wide enough, but they’re not.
That 0.68 coverage is interesting. It means 32% of true CPU scores fall outside the predicted p10-p90 range. The noise in CPU prediction (redirect chains, cross-domain script attribution, dynamic loading) creates surprises that the model can’t flag in advance.
Per-domain interval analysis

The most confident predictions (narrowest intervals) were VK social media domains. Consistent heavy scripts, predictable behavior. The model sees large max_script_bytes, high script_request_ratio, and knows exactly what to predict. Narrow blue bars, actual values landing right inside.
The least confident predictions (widest intervals) were Instagram CDN domains. The same CDN serves lightweight thumbnails on one page and heavyweight video players on another. The model can’t tell which pattern applies, so it hedges. Intervals spanning 0.3-0.4 of the full score range.
Interval width as uncertainty proxy
Does the model actually know when it’s uncertain? The width-error correlation for main_thread_cost is rho=0.280, positive but modest. Wider intervals do tend to correspond to larger prediction errors, but it’s not a strong signal. The model has some self-awareness of its uncertainty, not total self-awareness.
For network cost, the question barely applies. Errors are so small (max 0.039 across all test domains) that there’s nothing meaningful to correlate with interval width.
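The width-error correlation above is a rank correlation between per-domain interval widths and absolute errors; a sketch with made-up arrays (not the actual test-set values):

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative per-domain interval widths and absolute prediction errors.
widths = np.array([0.10, 0.45, 0.30, 0.50, 0.20])
abs_errors = np.array([0.02, 0.20, 0.08, 0.25, 0.05])

# Positive rho means wider intervals tend to coincide with larger errors.
rho, _ = spearmanr(widths, abs_errors)
```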
Semi-Supervised Self-Training
The setup looked textbook.[14] 1,857 labeled domains (with Lighthouse CPU data) and 2,407 unlabeled domains. A classic semi-supervised opportunity. The idea: train on labeled data, predict on unlabeled data, use the quantile intervals from above to select confident predictions (interval width < 0.30), add those as pseudo-labels, and repeat for 5 rounds.
```python
import numpy as np
import pandas as pd

def self_train(X_train, y_train, X_unlabeled, params,
               n_rounds=5, max_width=0.30):
    for round_num in range(n_rounds):
        model = train_model(X_train, y_train, params)
        # Quantile intervals from the current round provide the confidence signal
        q_models = train_quantile_pair(X_train, y_train, params)
        pred_lo = q_models["lo"].predict(X_unlabeled)
        pred_hi = q_models["hi"].predict(X_unlabeled)
        pred_mid = model.predict(X_unlabeled)
        widths = pred_hi - pred_lo
        # Only add predictions the model is confident about
        confident_mask = widths < max_width
        X_train = pd.concat([X_train, X_unlabeled[confident_mask]])
        y_train = np.concatenate([y_train, pred_mid[confident_mask]])
        X_unlabeled = X_unlabeled[~confident_mask]
    return model
```

Results
| Round | Training size | Spearman rho | Added | Remaining unlabeled |
|---|---|---|---|---|
| Baseline | 1,857 | 0.751 | — | 2,407 |
| 1 | 1,867 | 0.751 | 10 | 2,397 |
| 2 | 1,884 | 0.747 | 17 | 2,380 |
| 3 | 1,902 | 0.745 | 18 | 2,362 |
| 4 | 1,949 | 0.749 | 47 | 2,315 |
| 5 | 2,228 | 0.751 | 279 | 2,036 |
Total pseudo-labels added: 371. Improvement in Spearman rho: exactly +0.000. Self-training had no effect.
Not a small effect. Not a marginal improvement. Zero. The metric didn’t move at all across five rounds and 371 additional training examples.
Analysis of null result
Self-training helps when unlabeled data comes from a different region of feature space than the labeled data, when the model would encounter new patterns it hasn’t seen. That’s not what happened here.
The unlabeled domains are unlabeled because they’re simple. Lighthouse didn’t report CPU time for them because they didn’t cross the reporting threshold. They’re tracking pixels, tiny beacons, cookie-sync endpoints, the boring tail of the distribution. The labeled set of 2,185 domains already covers the full spectrum, from near-zero to 7,000ms of CPU time. It contains pixels and heavyweight SDKs and everything in between.
Adding pseudo-labeled versions of “more boring domains” doesn’t teach the model anything new. The model already knows what boring looks like. It needs help with the hard cases, domains where request features don’t clearly indicate CPU cost, and those are exactly the domains where self-training can’t help, because the model’s predictions on them aren’t confident enough to use as labels.
This is a clean negative result. It’s more informative than not trying, because it confirms the labeled set is representative. The 1,857 domains with real Lighthouse data are sufficient to learn the mapping from request features to CPU cost. The gap between 0.751 rho and perfect prediction isn’t a data problem; it’s a feature problem. Request metadata fundamentally can’t capture redirect chains and cross-domain script attribution.
Conformal Prediction
Quantile regression learns intervals from the data. Conformal prediction[15] provides a statistical guarantee: the interval will contain the true value at least 100(1-alpha)% of the time, regardless of the data distribution. No distributional assumptions required.
The procedure is simple:
- Split the labeled data three ways: proper-train (60%), calibration (20%), test (20%).
- Train the model on the proper-train set.
- Compute residuals on the calibration set: how far off is each prediction?
- Take the (1-alpha) quantile of those residuals. This is the conformal quantile.
- Build intervals on the test set: prediction plus-or-minus the conformal quantile.
```python
# Train on the proper training set
model.fit(X_train, y_train)

# Compute calibration residuals
cal_preds = model.predict(X_cal)
cal_residuals = np.abs(y_cal - cal_preds)

# Conformal quantile: the ceil((1-alpha)(n+1))/n quantile of the residuals
n_cal = len(cal_residuals)
q_level = np.ceil((1 - alpha) * (n_cal + 1)) / n_cal
conformal_q = np.quantile(cal_residuals, min(q_level, 1.0))

# Intervals: prediction +/- conformal quantile, clipped to the [0, 1] score range
test_preds = model.predict(X_test)
conf_lo = np.clip(test_preds - conformal_q, 0, 1)
conf_hi = np.clip(test_preds + conformal_q, 0, 1)
```

The key difference from quantile regression: conformal intervals have the same width for every domain. The conformal quantile is a single number derived from the calibration set. This is both the strength (guaranteed coverage) and the weakness (no per-domain adaptation).
Results: main_thread_cost
| Method | Coverage | Mean width |
|---|---|---|
| Conformal (alpha=0.10) | 0.883 | 0.541 |
| Quantile (p10-p90) | 0.709 | 0.417 |
Conformal intervals are 1.3x wider than quantile intervals. The conformal quantile is 0.284, meaning predictions are within plus-or-minus 0.28 of the true value with 90% confidence. That’s a meaningful uncertainty band on a [0, 1] scale.
The 1.3x ratio tells a calibration story: the quantile model is reasonably calibrated but slightly overconfident. Its intervals cover 70.9% instead of the target 80%. Conformal widens them by 30% to hit the coverage guarantee. If the ratio were 3x, I’d be worried. At 1.3x, it’s a mild correction; the quantile model isn’t lying about uncertainty, it’s just a little optimistic.
Results: network_cost
| Method | Coverage | Mean width |
|---|---|---|
| Conformal (alpha=0.10) | 0.888 | 0.037 |
| Quantile (p10-p90) | 0.824 | 0.116 |
Completely different story. Conformal intervals are 3x narrower than quantile intervals. The conformal quantile is just 0.019; predictions are within plus-or-minus 0.02 of truth. Essentially perfect.
The quantile model is being overly cautious on network cost. It learned to produce wider intervals than necessary, probably because the quantile loss function incentivizes covering the noisy tail. Conformal prediction, which doesn’t learn intervals but computes them from residuals, reveals that the actual prediction errors are tiny. The quantile model is overproducing uncertainty for a problem that’s already solved.
Calibration analysis
This is the most valuable insight from conformal prediction. The technique’s real power isn’t just the coverage guarantee; it’s what the comparison between conformal and quantile widths tells you about calibration.
| Target | Conformal vs quantile | Interpretation |
|---|---|---|
| main_thread_cost | Conformal 1.3x wider | Quantile slightly overconfident |
| network_cost | Conformal 3x narrower | Quantile overly cautious |
Different targets, different calibration failures. CPU predictions need wider intervals than the quantile model provides. Network predictions need narrower intervals. Without conformal as a reference, you’d only know the quantile coverages (0.68 and 0.80). You wouldn’t know whether the fix is “widen by 10%” or “widen by 300%.”
In the final lookup table, conformal prediction drives the confidence flag. If a domain’s conformal interval exceeds a threshold, it’s marked “uncertain.” 89% of domains pass; 11% don’t. The uncertain domains are honest about the model’s limitations.
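A sketch of how the flag can fall out of the clipped conformal intervals. The 0.284 half-width is the CPU conformal quantile reported above; the 0.50 threshold is hypothetical, not the pipeline's actual cutoff:

```python
import numpy as np

preds = np.array([0.05, 0.50, 0.95])  # illustrative point estimates
conformal_q = 0.284                   # CPU conformal quantile from above

# Clipping to the [0, 1] score range makes realized widths vary per domain.
lo = np.clip(preds - conformal_q, 0, 1)
hi = np.clip(preds + conformal_q, 0, 1)

WIDTH_THRESHOLD = 0.50                # hypothetical cutoff for the flag
confident = (hi - lo) <= WIDTH_THRESHOLD
```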
Integration of Techniques
The three techniques form a pipeline, each informed by the previous one:

1. Quantile regression produces per-domain prediction intervals. The model learns that VK domains are easy (narrow intervals) and Instagram CDN domains are hard (wide intervals).
2. Self-training uses those intervals as a confidence criterion to select pseudo-labels from the unlabeled pool. It selects 371 domains with interval width below 0.30, but adding them doesn’t improve the model. Clean negative result: the labeled set was already representative.
3. Conformal prediction validates the quantile intervals. It reveals that CPU intervals are slightly too narrow (1.3x correction) and network intervals are much too wide (3x overcorrection). It provides the final confidence flag for the lookup table.
Each technique added something to the final output:
| Technique | Contribution to lookup table |
|---|---|
| Quantile regression | Per-domain cpu_lo, cpu_hi, network_lo, network_hi fields |
| Self-training | Confirmed labeled set is sufficient (no pseudo-labels needed) |
| Conformal prediction | The confident boolean flag per domain |
13. Quantile regression was introduced by Koenker and Bassett, “Regression Quantiles” (Econometrica, 1978). The asymmetric pinball loss function penalizes over- and under-predictions differently depending on the target quantile. XGBoost added native support via the `reg:quantileerror` objective in version 2.0 (2023); see the XGBoost quantile regression tutorial. ↩
14. The foundational self-training method goes back to Yarowsky, “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods” (ACL, 1995). The core idea (train on labeled data, predict on unlabeled data, add confident predictions as pseudo-labels, repeat) has been applied widely in NLP and computer vision. It tends to help most when unlabeled data comes from underrepresented regions of feature space, which was not the case here. ↩
15. The theoretical foundations are in Vovk, Gammerman, and Shafer, Algorithmic Learning in a Random World (Springer, 2005). For the specific combination of conformal prediction with quantile regression, see Romano, Patterson, and Candes, “Conformalized Quantile Regression” (NeurIPS, 2019), which provides prediction intervals with guaranteed marginal coverage under no distributional assumptions beyond exchangeability. ↩
Deployment and Delivery
The entire ML pipeline (BigQuery extraction, feature engineering, XGBoost training, quantile regression, conformal prediction) produces one thing: a 125KB JSON file.
Output Artifact
Not an ONNX model running in the browser. Not a TensorFlow Lite inference engine. Not a model versioning system with A/B testing. A static JSON file. 4,592 entries, 960KB raw, 125KB gzipped. Firefox loads it into a Map and does key lookups. That’s the product.
This is a deliberate design decision, not a limitation.
Offline vs Runtime Inference
The obvious approach for “ML in the browser” is shipping a trained model and running inference at request time. ONNX Runtime for Web exists. XGBoost models convert to ONNX cleanly. The three models total maybe 70-100KB gzipped. So why not?
The model has nothing new to learn at runtime. The input features (transfer size, script ratio, waterfall position) come from HTTP Archive crawl data, not from the user’s browsing session. When Firefox blocks a request to googletagmanager.com, it doesn’t know the request’s transfer size or content type, because the request was blocked. All the model’s features are pre-computable from public crawl data. There’s no new information at block time that would change the prediction.
The Disconnect list is finite and known. Firefox blocks domains on the Disconnect list. The list has ~4,600 tracker domains. The lookup table covers all of them. There’s no “new domain at runtime” problem; when a new domain gets added to the Disconnect list, it gets scored in the next pipeline run.
A lookup table is simpler, faster, and more debuggable. A Map lookup is O(1). There’s no inference latency, no ONNX runtime dependency, no model loading on startup. If a score looks wrong, you look it up in the JSON file. If you want to change a score, you edit the JSON. The entire system is transparent.
The ML is the pipeline, not the product. The models exist to generalize from the 2,185 domains with Lighthouse CPU data to the 2,407 without. Once that generalization is done and the scores are written to JSON, the models have served their purpose. Shipping them to users would add complexity for zero benefit.
Lookup Table Structure
4,592 domains. Each entry has point estimates, prediction intervals, data provenance, and a confidence flag:
{
"www.googletagmanager.com": {
"cpu": 0.828, "cpu_lo": 0.67, "cpu_hi": 0.95,
"network": 0.901, "network_lo": 0.87, "network_hi": 0.94,
"source": "measured", "confident": true
},
"bat.bing.com": {
"cpu": 0.524, "cpu_lo": 0.28, "cpu_hi": 0.69,
"network": 0.422, "network_lo": 0.39, "network_hi": 0.45,
"source": "measured", "confident": true
}
}
Composition
| Category | Count | Percentage | Meaning |
|---|---|---|---|
| Total domains | 4,592 | 100% | Every tracker on the Disconnect list with sufficient HTTP Archive data |
| Measured (source: “measured”) | 2,185 | 48% | CPU score computed directly from Lighthouse data |
| Predicted (source: “predicted”) | 2,407 | 52% | CPU score predicted by the XGBoost model |
| Confident | 4,110 | 89% | Conformal interval narrow enough to trust |
| Uncertain | 482 | 11% | Wide interval, prediction less reliable |
The 89%/11% confident/uncertain split comes from conformal prediction. A domain is marked “confident” if it has real Lighthouse CPU data, or if the width of the model’s conformal interval for it falls below the threshold. The 482 uncertain domains are ones where request features don’t clearly indicate CPU cost; the model is honest about not knowing.
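As a minimal sketch of that rule (not the pipeline’s actual code, and with an illustrative threshold value, not the real one):

```javascript
// Hypothetical sketch: a domain is "confident" if it has measured
// Lighthouse data, or if its conformal CPU interval is narrow enough.
// WIDTH_THRESHOLD is an illustrative value, not the pipeline's actual one.
const WIDTH_THRESHOLD = 0.35;

function isConfident(entry) {
  if (entry.source === "measured") return true;
  return entry.cpu_hi - entry.cpu_lo <= WIDTH_THRESHOLD;
}
```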
Representative Entries
The scores pass the intuition check. Domains I have strong priors about land where they should.
| Domain | CPU | Network | Story |
|---|---|---|---|
| www.googletagmanager.com | 0.83 | 0.90 | Heaviest tracker on both axes. 305ms scripting, 92-183KB bundles. |
| cdn.cookielaw.org | 0.79 | 0.84 | Consent manager. Heavy despite not render-blocking. |
| connect.facebook.net | 0.64 | 0.83 | Facebook SDK. 100% scripts, large transfers. |
| www.google-analytics.com | 0.53 | 0.46 | Not lightweight. ~130ms scripting from session management and GA4. |
| bat.bing.com | 0.52 | 0.42 | 258-byte transfer, 105ms CPU. Transfer size lies. |
The bat.bing.com entry is the one I’d highlight in a conversation about this project. It looks harmless in every request-level metric: tiny transfer, single request, no scripts in the resource type field. But Lighthouse measured 105ms of scripting time. The model learned this pattern from the labeled data and can now flag similar deceptive domains.
cdn.cookielaw.org is another good example. It’s the consent manager that was the only tracker in the render-blocking audit’s top 20. I dropped the render_delay target because of how sparse it was, but the CPU and network scores still capture that this domain is expensive. The performance cost didn’t disappear; it just shows up on the CPU axis instead of a dedicated render-delay axis.
Firefox Integration
The integration is a thin module that loads the JSON into a Map and exposes a lookup API. No parsing logic, no inference, no dependencies beyond the JSON file itself.
// TrackerRiskScorer.sys.mjs
const SCORES_URL =
"chrome://browser/content/tracker_risk_scores.json";
let scoresMap = null;
let loadPromise = null;
async function ensureLoaded() {
  if (scoresMap) return;
  // Cache the in-flight promise so concurrent callers share one fetch
  // instead of each kicking off their own.
  if (!loadPromise) {
    loadPromise = fetch(SCORES_URL)
      .then(response => response.json())
      .then(data => {
        scoresMap = new Map(Object.entries(data));
      });
  }
  await loadPromise;
}
export async function getTrackerScore(domain) {
await ensureLoaded();
// Try exact match first, then strip subdomains
let entry = scoresMap.get(domain);
if (!entry) {
const parts = domain.split(".");
for (let i = 1; i < parts.length - 1; i++) {
const suffix = parts.slice(i).join(".");
entry = scoresMap.get(suffix);
if (entry) break;
}
}
if (!entry) return null;
return {
cpu: entry.cpu,
cpuInterval: [entry.cpu_lo, entry.cpu_hi],
network: entry.network,
networkInterval: [entry.network_lo, entry.network_hi],
source: entry.source,
confident: entry.confident,
};
}
export async function summarizeBlockedTrackers(domains) {
await ensureLoaded();
let totalCpu = 0;
let totalNetwork = 0;
let scored = 0;
for (const domain of domains) {
const score = await getTrackerScore(domain);
if (score) {
totalCpu += score.cpu;
totalNetwork += score.network;
scored++;
}
}
return { totalCpu, totalNetwork, scored, total: domains.length };
}

The summarizeBlockedTrackers function is what the privacy metrics card would call. Pass it the list of domains Firefox blocked this week, get back aggregate CPU and network cost. The card could display something like: “Firefox blocked 47 trackers this week, preventing an estimated 890ms of background CPU usage and 2.3MB of network transfers.”
The suffix-stripping in getTrackerScore handles the same subdomain matching problem from the original Disconnect list join. HTTP Archive records stats.g.doubleclick.net; the lookup table might have the entry under doubleclick.net or stats.g.doubleclick.net. Try the exact match first, then progressively strip subdomains.
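The candidate order can be illustrated in isolation (a standalone sketch mirroring the loop in getTrackerScore, separate from the module itself):

```javascript
// Standalone sketch of the subdomain-stripping lookup order.
// Given a host, try the exact name first, then progressively shorter
// suffixes, stopping before the bare TLD.
function lookupCandidates(domain) {
  const parts = domain.split(".");
  const candidates = [domain];
  for (let i = 1; i < parts.length - 1; i++) {
    candidates.push(parts.slice(i).join("."));
  }
  return candidates;
}
```

The first candidate that exists in the Map wins, so a specific entry like stats.g.doubleclick.net always shadows a broader doubleclick.net entry.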
Privacy Metrics Card Integration
The current about:protections page shows a count: “47 trackers blocked this week.” With the lookup table, the card could show per-axis summaries:
What the user sees (hypothetically):
- “Firefox blocked 47 trackers this week”
- “Prevented ~890ms of background CPU usage”
- “Saved ~2.3MB of network bandwidth”
I’m still not sure how much to trust these aggregate numbers. The per-domain error is ~6 percentile points on average, but when you sum across 47 trackers the errors could compound or cancel out. Probably needs some simulation work before these go into production UI copy.
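One cheap sanity check short of full simulation: sum the per-domain interval endpoints alongside the point estimates to get a naive worst-case band. This is a sketch only; it treats the per-domain intervals as if they bound the sum, which marginal conformal coverage does not strictly guarantee, and it over-states the band whenever errors cancel.

```javascript
// Sketch: aggregate CPU point estimates plus a naive worst-case band
// by summing the per-domain interval endpoints.
function aggregateWithBounds(entries) {
  let point = 0;
  let lo = 0;
  let hi = 0;
  for (const e of entries) {
    point += e.cpu;
    lo += e.cpu_lo;
    hi += e.cpu_hi;
  }
  return { point, lo, hi };
}
```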
What happens behind the scenes:
- TrackingDBService provides the list of blocked domains for the time period.
- For each domain, look up the score in the Map.
- Convert the [0, 1] scores back to approximate real units using reference values (e.g., a CPU score of 0.83 maps to roughly 305ms based on the percentile rank calibration).
- Sum across all blocked domains per axis.
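The score-to-units conversion could look like a linear interpolation over a small calibration table. The anchor points below are illustrative, loosely based on the examples in this document (a CPU score of 0.83 mapping to ~305ms), not the pipeline’s real percentile calibration:

```javascript
// Sketch of score-to-milliseconds conversion via linear interpolation
// over a calibration table. Anchor values are illustrative, not the
// pipeline's actual percentile-rank calibration.
const CPU_CALIBRATION = [
  [0.0, 0],
  [0.5, 100],
  [0.83, 305],
  [1.0, 600],
];

function cpuScoreToMs(score) {
  for (let i = 1; i < CPU_CALIBRATION.length; i++) {
    const [s0, ms0] = CPU_CALIBRATION[i - 1];
    const [s1, ms1] = CPU_CALIBRATION[i];
    if (score <= s1) {
      // Linear interpolation between the two surrounding anchors.
      return ms0 + ((score - s0) / (s1 - s0)) * (ms1 - ms0);
    }
  }
  return CPU_CALIBRATION[CPU_CALIBRATION.length - 1][1];
}
```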
The per-axis breakdown lets the card emphasize different things on different devices. On mobile (where bandwidth matters more), highlight the network savings. On desktop (where CPU matters more), highlight the CPU savings. The two-axis design makes this possible without any changes to the scoring pipeline.
The confidence flag could drive the messaging precision. If most blocked trackers are “confident,” show exact numbers. If many are “uncertain,” soften the language: “approximately” or “at least.”
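A sketch of that copy-softening logic, with an arbitrary 80% cutoff chosen purely for illustration:

```javascript
// Sketch: soften UI copy when many blocked trackers carry the
// "uncertain" flag. The 0.8 cutoff is an arbitrary illustration.
function messagePrefix(entries) {
  const confident = entries.filter(e => e.confident).length;
  return confident / entries.length >= 0.8 ? "" : "approximately ";
}
```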
Maintenance and Update Cycle
The lookup table should be regenerated periodically as web behavior changes. The pipeline reruns on fresh HTTP Archive data:
- Monthly: Pull new crawl data from BigQuery. HTTP Archive publishes monthly crawls.
- On Disconnect list updates: When new domains are added to the Disconnect list, run the pipeline to score them.
- Ship via Remote Settings: Firefox’s existing Remote Settings infrastructure delivers the updated JSON. No binary update needed.
The pipeline is fully automated: BigQuery queries, feature extraction, model prediction, lookup table generation. The only manual step is triggering it.
Between updates, the lookup table is static. A domain’s score doesn’t change based on the user’s browsing. This is fine; tracker behavior is stable over weeks. googletagmanager.com doesn’t suddenly become lightweight. The monthly refresh catches gradual shifts.