
Discussion and Future Work

I started this project assuming trackers were uniformly expensive and that a model would be the hard part. I was wrong on both counts.


Key Findings

1. Render-blocking trackers are a myth (almost)

I designed a third target variable, render_delay, to capture sync scripts that block first paint. Then I queried the data. The render-blocking audit was dominated by fonts.googleapis.com, cdnjs.cloudflare.com, and cdn.jsdelivr.net: CDNs and first-party resources. The only tracker in the top 20 was cdn.cookielaw.org, a consent manager.

Trackers load async. Not because they’re well-behaved, but because ad networks and analytics vendors learned long ago that blocking rendering kills their own viewability and engagement metrics. The incentives aligned. Render-blocking is a property of how sites include resources, not of the resources themselves. I killed the target. Two axes, CPU and network, capture the meaningful variation.

2. Transfer size is a terrible proxy for CPU cost

bat.bing.com serves 258-byte stubs. They look harmless by every request-level metric, yet Lighthouse measured 105ms of scripting time. The tiny script triggers event listeners, cookie operations, and beacon logic that burns CPU far out of proportion to its transfer size.

This is the central tension of the project. Request metadata, the stuff you can observe without running Lighthouse, tells you how big something is, not how expensive it is to execute. max_script_bytes is the #1 SHAP feature for CPU prediction, but it’s a noisy proxy. The gap between “big” and “expensive” is where the model struggles and where the rho=0.751 ceiling comes from.

3. The Disconnect taxonomy is orthogonal to performance

SHAP feature importance shows Disconnect categories (Advertising, Analytics, Social, Fingerprinting, Cryptomining) barely register as predictors for either target. Advertising domains span the entire 2D space from lightweight pixels to heavyweight SDKs. “Analytics” includes both Google Analytics (~130ms scripting) and invisible beacons with zero measurable cost.

The privacy taxonomy and the performance taxonomy are independent dimensions. A tracker’s category tells you what it does to your privacy. It tells you almost nothing about what it costs your page load.

4. Half of tracker domains have no measurable CPU cost

2,407 out of 4,592 domains don’t appear in Lighthouse’s bootup-time audit. The distribution is bimodal: either a tracker uses meaningful CPU or it uses effectively none. There’s no smooth gradient; it’s a step function with a long tail on the right.

This isn’t a data quality issue. Lighthouse only reports scripts above a CPU threshold. Absence from the bootup-time audit means “below the threshold,” which for practical purposes means “negligible.” The labeled/unlabeled split in my dataset is a measurement artifact, and the bimodal distribution is real.

5. Network cost is trivially solved

Spearman rho 0.999. Transfer size is network cost. The p50_transfer_bytes feature alone gets you to 0.911. Adding bytes_per_page and requests_per_page pushes it to 0.999. The XGBoost model for network cost is essentially f(transfer_size, bytes_per_page), and everything else contributes nothing.

This isn’t a modeling achievement; it’s a statement about the target variable. Network cost, defined as the percentile rank of transfer overhead, is directly observable from the same data used to construct the features. The model doesn’t generalize here; it memorizes a near-identity mapping. But that’s fine. The network cost scores are accurate, and that’s what matters for the lookup table.
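The near-identity can be demonstrated on synthetic data: when the target is defined as the percentile rank of transfer overhead, the size feature that defines it recovers it almost exactly. Everything below is illustrative (the feature name mirrors the pipeline, the values are made up), not the real pipeline:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(0)
n = 4592
# Heavy-tailed transfer sizes, standing in for the real feature.
p50_transfer_bytes = rng.lognormal(mean=8, sigma=2, size=n)

# Target: percentile rank of transfer overhead, with a little noise
# standing in for aggregation effects (bytes_per_page, requests_per_page).
network_cost = rankdata(p50_transfer_bytes + rng.normal(0, 1, n)) / n

rho, _ = spearmanr(p50_transfer_bytes, network_cost)
print(f"single-feature Spearman rho: {rho:.3f}")  # very close to 1.0
```

The model has almost nothing left to do beyond reproducing the ranking implied by its dominant feature.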

6. Main thread cost is the hard problem

Spearman rho 0.751. Meaningful signal from max_script_bytes, bytes_per_page, and script_request_ratio, but roughly a quarter of the rank signal is still missing. The model gets the broad strokes right (it knows GTM is more expensive than a tracking pixel), but it struggles in the 0.4-0.8 range, where “moderately expensive” and “very expensive” are hard to distinguish from request metadata alone.

The irreducible error comes from three sources: redirect chains that transfer CPU cost across domains, cross-domain script attribution where Lighthouse credits CPU to a different domain than the one that initiated the script, and dynamic loading patterns that are invisible in static request metadata. To do better, you’d need Lighthouse data, which is the whole reason the model exists.

7. Self-training failed cleanly

1,857 labeled domains, 2,407 unlabeled. Five rounds of self-training with quantile-confidence selection added 371 pseudo-labels and moved the needle by exactly +0.000 Spearman rho. The most perfect null result I’ve ever seen.

The explanation is elegant: the unlabeled domains are unlabeled because they are boring. Lighthouse didn’t report CPU for them because they didn’t cross the threshold. They’re pixels and beacons, the easy tail of the distribution. The labeled set already covers the full spectrum from near-zero to 7,000ms CPU. Adding more easy examples doesn’t teach the model anything about hard cases.

This is better than not trying. It confirms the labeled set is representative and that the model’s ceiling is a feature limitation, not a data limitation.
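For reference, the selection mechanism looks roughly like this. It is a toy sketch on synthetic data, with sklearn's quantile GradientBoostingRegressor standing in for the XGBoost quantile models; the confidence threshold and round count are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X_lab = rng.normal(size=(200, 3)); y_lab = X_lab[:, 0] + rng.normal(0, 0.1, 200)
X_unlab = rng.normal(size=(100, 3))

for round_ in range(5):
    # Fit lower/upper quantile models plus a point model on the labeled pool.
    lo = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X_lab, y_lab)
    hi = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X_lab, y_lab)
    point = GradientBoostingRegressor().fit(X_lab, y_lab)

    width = hi.predict(X_unlab) - lo.predict(X_unlab)  # narrow = confident
    confident = width <= np.quantile(width, 0.25)      # keep narrowest quartile
    if not confident.any():
        break
    # Promote confident predictions to pseudo-labels, move them to the pool.
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, point.predict(X_unlab[confident])])
    X_unlab = X_unlab[~confident]

print(len(y_lab), "training rows after self-training")
```

In the real pipeline the pseudo-labels were confidently selected and still changed nothing, because the confident unlabeled domains sit in the easy region the labeled set already covers.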

8. Conformal prediction is a calibration diagnostic

Conformal prediction’s headline feature is coverage guarantees. But the most useful thing it told me was how well-calibrated the quantile regression intervals were, and the answer differed per target. For CPU, conformal intervals were 1.3x wider than the quantile intervals: a mild correction, meaning the quantile model was slightly overconfident. For network, conformal intervals were 3x narrower, meaning the quantile model was being overly cautious on a problem that’s already solved.

Without conformal as a reference, I’d only know the raw coverages (0.68 for CPU, 0.80 for network). I wouldn’t know the direction or magnitude of the calibration error.
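The diagnostic use of conformal prediction can be sketched with split-conformal CQR: the sign and size of the correction q_hat give the direction and magnitude of miscalibration. Positive q_hat widens the intervals (the quantile model was overconfident); negative q_hat narrows them (overly cautious). Synthetic intervals below; this is not the project's actual code:

```python
import numpy as np

def cqr_correction(y, lo, hi, alpha=0.1):
    """Split-conformal correction for precomputed quantile intervals (CQR)."""
    scores = np.maximum(lo - y, y - hi)  # positive where y falls outside
    n = len(y)
    q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)
    return q_hat  # apply as [lo - q_hat, hi + q_hat]

rng = np.random.default_rng(2)
y = rng.normal(size=2000)
# An overconfident interval: [-1, 1] covers ~68% of N(0,1), not the 90% target.
q_hat = cqr_correction(y, lo=np.full(2000, -1.0), hi=np.full(2000, 1.0))
print(f"correction: {q_hat:+.2f}")  # positive: intervals must widen
```

Running the same correction on overly wide intervals yields a negative q_hat, which is exactly the network-cost situation described above.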


Challenges and Iterations

Premature target design

The original design had three targets: main_thread_cost, render_delay, network_cost. I wrote the SQL queries, designed the feature vector, and planned the evaluation strategy, all for three axes. Then the data showed render-blocking is almost exclusively a first-party problem. The target survived about two hours of data exploration before I killed it.

The lesson isn’t “do more research before designing.” The lesson is “design should be cheap enough to throw away.” The three-target architecture was easy to reduce to two because I built independent models per target. Dropping render_delay required deleting one model from the loop, not restructuring the pipeline.

Label noise from missing Lighthouse data

Domains without Lighthouse CPU data were initially scored 0.0 in the ground truth. But 0.0 means “Lighthouse didn’t report,” not “zero CPU cost.” The first model version trained on all 4,592 domains with these fake zeros, achieving rho=0.825 overall. Looked great. Then I evaluated only on domains with real CPU data: rho=0.734. The inflated metric was hiding the problem.

Worse, the model was learning “small transfer = score 0” from the bad labels. Its worst “over-predictions” were all domains without CPU data where the model predicted high scores. But the model was probably right; these domains likely do have CPU cost. The ground truth was wrong, not the model.

The fix was straightforward: train main_thread_cost only on the 2,185 domains with real Lighthouse data. Use the model to predict the rest. RMSE nearly halved.
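As a pattern, the fix is a mask, a fit, and a fill: keep missing labels as NaN, train only where the label is real, then predict the gaps. A minimal sketch with sklearn standing in for XGBoost; the column names are illustrative, not the project's actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "max_script_bytes": rng.lognormal(10, 1, 500),
    "main_thread_cost": np.nan,  # NaN = Lighthouse didn't report, NOT 0.0
})
# Pretend the first 300 domains have real Lighthouse-derived labels.
df.loc[:299, "main_thread_cost"] = df.loc[:299, "max_script_bytes"].rank(pct=True)

labeled = df["main_thread_cost"].notna()
model = GradientBoostingRegressor().fit(
    df.loc[labeled, ["max_script_bytes"]], df.loc[labeled, "main_thread_cost"])
# Fill only the unlabeled rows with model predictions; real labels stay untouched.
df.loc[~labeled, "main_thread_cost"] = model.predict(
    df.loc[~labeled, ["max_script_bytes"]])
```

The key move is representing "not reported" as NaN rather than 0.0, so the training mask falls out of the data instead of being baked into fake labels.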

Undocumented schema changes in Lighthouse audits

The design doc assumed the audit names third-party-summary and render-blocking-resources, based on older HTTP Archive documentation. The actual audit names in the 2024 crawl were third-parties-insight and render-blocking-insight. The third-parties-insight audit also uses entity names (“Google Tag Manager”), not domains, requiring an additional mapping step through the third-party-web entity list.

I discovered this by running exploratory queries against the actual data instead of trusting the docs. Small annoyance, but it would have been a show-stopper if I’d built the entire pipeline from documentation and only tested at the end.

Domain matching required suffix-based join

Naive exact-match join between HTTP Archive domains and the Disconnect list: 92 matches. The Disconnect list has doubleclick.net; HTTP Archive has stats.g.doubleclick.net, td.doubleclick.net, securepubads.g.doubleclick.net. Suffix matching (progressively stripping subdomains until a match is found) brought coverage to 3,686 domains (80%).

The remaining 20% are domains in HTTP Archive’s third-party tracker taxonomy that don’t appear on the Disconnect list. They still get scored (the model uses behavioral features, not Disconnect category), but they don’t have a privacy category label.


Design Decisions

Four decisions shaped the project more than any modeling choice.

Train on real data only, not fake zeros. The single biggest improvement in model quality came from removing 2,407 domains with fabricated 0.0 CPU labels and training on the 1,857 with real Lighthouse data. This was a data decision, not a modeling decision. RMSE halved. The “honest” Spearman rho (evaluated on real data only) went from 0.734 to 0.767.

Score at domain level, not request level. Firefox’s storage is domain-granular. TrackingDBService records blocked trackers by domain. The privacy metrics card displays per-domain counts. Domain-level scoring slots directly into the existing data flow. And at block time, Firefox knows the domain but never fetches the resource (the request was blocked), so per-request features don’t exist.

Offline pipeline, not runtime model. The model runs on my machine. Firefox loads a static JSON file. No ONNX, no inference latency, no model versioning in production. The Disconnect list is finite and known. The features are pre-computable from public crawl data. There’s nothing to learn at runtime. The ML is the pipeline, not the product.
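To make the delivery mechanism concrete, one entry of such a lookup table might look like the following. The field names and values are my illustration of the shape (scores, conformal-adjusted intervals, confidence flag), not the project's actual schema:

```python
import json

# Hypothetical lookup-table entry; all fields and values are illustrative.
entry = {
    "doubleclick.net": {
        "main_thread_cost": 0.92,
        "main_thread_interval": [0.81, 0.98],  # conformal-adjusted bounds
        "network_cost": 0.95,
        "confidence": "high",
        "disconnect_category": "Advertising",
    }
}
print(json.dumps(entry, indent=2))
```

The browser's only job is a dictionary lookup keyed by domain; every expensive step happened offline.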

Two targets, not three. Data killed the third target (render_delay). Instead of forcing a sparse, unlearnable target into the model, I dropped it. Two targets capture the meaningful variation in tracker performance cost. The decision was driven by the data, not by a desire for simplicity.


Discussion

If I had to explain this project in a conversation, here’s the arc.

Design progression

The target variable evolved through four stages, each informed by data:

  1. Classification (heavy/medium/light). Too lossy. Where do you draw the boundaries?
  2. Single regression score. Better, but collapses qualitatively different costs. A consent banner and an analytics SDK get similar scores for different reasons.
  3. Three targets (CPU, render-delay, network). The right decomposition in theory.
  4. Two targets (CPU, network). Data showed render-blocking is a first-party problem. Dropped it.

Each step was a decision, not a retreat. The final two-target design reflects what the data actually supports, not what I hoped it would support.

Generalization

The core technical question is: can request features predict Lighthouse CPU measurements? The answer is rho=0.751. Decent but not perfect. The model gets the broad ranking right: it knows GTM is expensive and tracking pixels are cheap. It struggles with the loyaltylion pattern: domains that look tiny in request data but burn CPU through invisible mechanisms.

The gap between 0.751 and 1.0 is the fundamental limit of predicting execution cost from transfer metadata. Redirect chains, cross-domain script attribution, and dynamic loading are invisible to the features I have. To do better, you’d need Lighthouse data, which is exactly what the model is approximating for the 2,407 domains that don’t have it.

Engineering judgment

“I chose not to ship a model in the browser.” This is a line I’d use in an interview because it demonstrates engineering judgment. The instinct is to ship the model (ONNX, runtime inference, the works). But when the input space is finite (the Disconnect list), the features are pre-computable (HTTP Archive), and there’s nothing to learn at runtime, a lookup table is the right answer. The ML is infrastructure, not product surface area.

“I dropped the third target when data showed it was unlearnable.” Demonstrates willingness to let data kill a design assumption. The three-target architecture was elegant. The data said render-blocking is a first-party problem. I killed it.

“Self-training didn’t help, and that’s a finding.” Demonstrates understanding of when techniques should work, why they didn’t, and what the null result means. The labeled set was already representative. The model’s ceiling is a feature limitation, not a data limitation.

Uncertainty quantification pipeline

Quantile regression, self-training, and conformal prediction aren’t three independent experiments. They form a chain:

  1. Quantile regression gives per-domain intervals that capture model uncertainty.
  2. Self-training uses those intervals to select confident pseudo-labels from the unlabeled pool. Null result: the labeled set was sufficient.
  3. Conformal validates the quantile intervals. Reveals calibration is off in different directions for different targets.

The narrative is: I built a pipeline of increasingly sophisticated uncertainty quantification, where each technique informed the next. The final lookup table has per-domain intervals and confidence flags because of this chain.


The hardest part of an ML project is the data, not the model. More than half the total project time went into BigQuery exploration, ground truth construction, and label debugging. The fake-0 problem, the Lighthouse audit name changes, the Disconnect suffix matching, each was a full investigation that shaped the final pipeline. XGBoost training was an afternoon. The data work was weeks.

Design decisions compound. Choosing per-domain scoring (instead of per-request) meant the lookup table worked as a delivery mechanism. Choosing independent models per target meant I could drop render_delay without restructuring. Choosing regression (instead of classification) meant the downstream UI could threshold at any level. Each early decision created or foreclosed options downstream. The ones I got right weren’t the clever ones; they were the ones that preserved flexibility.

Negative results require the same rigor as positive ones. Self-training’s +0.000 improvement needed the same evaluation pipeline as the main model. The render_delay target’s sparsity needed the same BigQuery exploration as the other targets. The temptation is to move quickly past things that didn’t work. But the null results (self-training doesn’t help, render-blocking isn’t a tracker problem, Disconnect categories don’t predict performance) are some of the most interesting findings.

Know when to stop. The model achieves rho=0.751 for CPU and 0.999 for network. CPU could theoretically improve with better features: Lighthouse data at inference time, or features derived from script AST analysis. But the marginal effort for the next 0.05 rho would be enormous, and 0.751 is probably good enough for a lookup table that distinguishes “expensive” from “cheap.” The 125KB JSON file covers all 4,592 Disconnect domains with calibrated uncertainty. Whether the uncertainty calibration is tight enough for user-facing messaging is the main thing I’d want to validate next.