Firefox ONNX Inference
Deployment Architecture
The cost estimation pipeline is decoupled from the blocking decision. Enhanced Tracking Protection blocks requests based on Disconnect list membership, a deterministic check that completes before the network request is issued. Cost prediction runs asynchronously after the block, accumulates estimates in the background, and updates the weekly tally without affecting browsing performance. The user never waits on inference.
The deployment artifact is an XGBoost ensemble (500 trees) exported to ONNX format. The ONNX file is approximately 500KB. Single-prediction inference completes in microseconds, well within budget for a background task that does not gate any user-visible operation.
Two static artifacts ship alongside the ONNX model: the TF-IDF vocabulary and SVD projection matrix (~2MB), which transform URL path tokens into the 50-dimensional dense features the model expects, and the target encoding lookup table (~2,000 entries), which maps each tracker domain to its median transfer statistics from training data. Both are precomputed from HTTP Archive crawl data and loaded at browser startup.
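The shape of the TF-IDF-plus-SVD transform can be sketched in a few lines. Everything here is a toy stand-in: the real vocabulary, IDF weights, and projection matrix are precomputed from HTTP Archive crawls and project to 50 dimensions, not the 2 used below, and `path_to_dense` is a hypothetical helper name.

```python
import re

# Toy stand-ins for the shipped artifacts (hypothetical values, for shape only).
VOCAB = {"gtag": 0, "js": 1, "collect": 2, "pixel": 3}    # token -> column index
IDF = [1.2, 0.8, 1.5, 2.0]                                # per-token IDF weight
SVD = [[0.6, -0.1], [0.2, 0.4], [-0.3, 0.7], [0.1, 0.1]]  # |vocab| x k (k=50 in production)

def path_to_dense(path: str) -> list[float]:
    """Tokenize a URL path, build a sparse TF-IDF vector over the fixed
    vocabulary, then project through the SVD matrix to k dense features."""
    tokens = [t for t in re.split(r"[/._\-]+", path.lower()) if t]
    tfidf = [0.0] * len(VOCAB)
    for t in tokens:
        if t in VOCAB:  # out-of-vocabulary tokens are simply dropped
            tfidf[VOCAB[t]] += IDF[VOCAB[t]]
    # Dense projection: tfidf (1 x |vocab|) @ SVD (|vocab| x k)
    return [sum(tfidf[i] * SVD[i][j] for i in range(len(tfidf)))
            for j in range(len(SVD[0]))]
```

Because both tables are frozen at training time, the transform is deterministic and needs no network or model access at block time.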
Model updates are delivered through Remote Settings, the same infrastructure Firefox uses to distribute the Disconnect list itself. The planned cadence is monthly retraining on fresh HTTP Archive crawls, which addresses the temporal drift documented in the previous articles (30% MAE degradation over three months without retraining).
The deployment is scheduled for Firefox release in May 2026, reaching approximately 250 million daily active users. The about:protections dashboard and new tab widget will surface aggregate predictions as a privacy metrics card: “Firefox saved you approximately 2.3MB of bandwidth this week.”
Feature Availability at Block Time
The model requires only features that are available before the server responds. Firefox’s network stack exposes these through well-defined interfaces at the point where Enhanced Tracking Protection intercepts the request:
| Firefox interface | Data available | Model feature mapping |
|---|---|---|
| nsIChannel::GetURI() | Full URL including path and query parameters | All URL features: TF-IDF/SVD path tokens, file_extension, path_depth, url_length, num_query_params |
| nsILoadInfo::GetExternalContentPolicyType() | Resource classification: TYPE_SCRIPT, TYPE_IMAGE, TYPE_XMLHTTPREQUEST, etc. | resource_type |
| nsILoadInfo::GetLoadingPrincipal() | The principal that initiated the request | initiator_type |
| nsIHttpChannel::GetRequestMethod() | HTTP method (GET, POST) | http_method |
| nsIHttpChannel::GetProtocolVersion() | Protocol version (HTTP/2, HTTP/3) | http_version |
| nsIClassifiedChannel | Tracker category (advertising, analytics, social, fingerprinting) | Classification metadata |
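The scalar URL columns in the table above can be sketched with a small extractor. This is illustrative only: the shipped feature code runs inside the network stack, and `url_features` is a hypothetical helper whose exact tokenization rules may differ from the real extractor.

```python
from urllib.parse import urlsplit, parse_qs
from pathlib import PurePosixPath

def url_features(url: str) -> dict:
    """Extract the request-side scalar URL features named in the table."""
    parts = urlsplit(url)
    return {
        "url_length": len(url),
        # Number of non-empty path segments, e.g. /sdk/v2/x.js -> 3
        "path_depth": len([p for p in parts.path.split("/") if p]),
        # Final extension without the dot; empty string when none exists
        "file_extension": PurePosixPath(parts.path).suffix.lstrip("."),
        # keep_blank_values counts params like "cb=" that carry no value
        "num_query_params": len(parse_qs(parts.query, keep_blank_values=True)),
    }
```

All four values are computable from nsIChannel::GetURI() alone, before any bytes come back from the server.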
The domain-level target encoding features (domain_median_bytes, domain_type_median) are not derived from the request itself. At block time, Firefox looks up the request’s domain in the precomputed lookup table and appends the result to the feature vector. For domains absent from the table, the model receives global median values as a fallback.
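The lookup-with-fallback step is simple enough to sketch directly. The table slice and the median values below are hypothetical, as is the `target_encode` helper name; the shipped table holds roughly 2,000 entries computed from HTTP Archive training data.

```python
# Hypothetical slice of the precomputed target-encoding table.
DOMAIN_TABLE = {
    "googletagmanager.com": {"domain_median_bytes": 31204.0, "domain_type_median": 28571.0},
    "doubleclick.net":      {"domain_median_bytes": 2048.0,  "domain_type_median": 1536.0},
}

# Global medians over the whole training set (hypothetical values).
GLOBAL_MEDIANS = {"domain_median_bytes": 4210.0, "domain_type_median": 3950.0}

def target_encode(domain: str) -> dict:
    """Look up a blocked request's domain; unseen domains fall back to
    global medians so the feature vector is always fully populated."""
    return DOMAIN_TABLE.get(domain, GLOBAL_MEDIANS)
```

The fallback guarantees the model never sees a missing value for these two features, at the cost of degraded accuracy on out-of-table domains (see Limitations).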
Why Domain-Level Estimation Does Not Require Machine Learning
A natural first approach to cost estimation is to assign each tracker domain a static score derived from its aggregate HTTP Archive statistics. This formulation is appealing because it avoids the complexity of per-request inference entirely: a lookup table indexed by domain (or domain-plus-resource-type) could be shipped as a few kilobytes of JSON.
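For concreteness, the entire domain-level formulation fits in a handful of lines. The keys, byte values, and `static_cost` helper below are hypothetical; the point is that the whole "model" is one dictionary lookup.

```python
import json

# A static score per (domain, resource_type) pair, shippable as JSON.
# Values are hypothetical aggregate medians in bytes.
STATIC_SCORES = json.loads("""{
  "doubleclick.net|script": 45000,
  "doubleclick.net|image": 800,
  "scorecardresearch.com|image": 43
}""")

def static_cost(domain: str, resource_type: str, default: int = 0) -> int:
    """Domain-level estimate: one number per key, no per-request features."""
    return STATIC_SCORES.get(f"{domain}|{resource_type}", default)
```

No inference runtime, no feature extraction, no model file: this is the baseline any learned approach has to beat.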
We built this pipeline. For 4,592 tracker domains, we computed 19 aggregate features (median transfer size, mean script bytes, request count distributions by resource type, etc.) and trained XGBoost to predict domain-level cost. The results were striking in a way that argued against further investment: Spearman correlation reached 0.751 for CPU cost and 0.999 for network cost. The network prediction is essentially a lookup table with negligible benefit from learned nonlinearities.
The reason is a data availability asymmetry. Of 4,592 domains in the dataset, only 2,185 (48%) have Lighthouse CPU timing data. The remaining 2,407 domains fall below approximately 50ms of main-thread impact, placing them below Lighthouse’s reporting threshold. Kolmogorov-Smirnov tests confirm that these two populations are disjoint on every feature (p < 0.001 for all 19 features). The median max_script_bytes for domains with timing data is 20,338 bytes; for domains without, it is 585 bytes. These are not overlapping distributions with a missing-at-random mechanism. They are structurally different populations.
We attempted semi-supervised self-training to recover: train on the labeled subset, generate pseudo-labels for the unlabeled subset, and iterate. Self-training produced zero improvement. The feature-space disjointness prevented the initial model from generating confident pseudo-labels on the unlabeled population, so no information transferred across the boundary.
This is a negative finding, but it carries a constructive implication. It establishes a diagnostic criterion for when ML adds value over a lookup table. The domain-level formulation fails the test: cost labels are available (or trivially imputable) for both populations, and a lookup table captures the mapping with near-perfect fidelity. The per-request formulation passes the test: blocked requests have no response-side labels by definition, and URL-level variance within a single domain is too large for any fixed lookup to capture. The gap between googletagmanager.com/gtag/js (93KB) and googletagmanager.com/collect (0 bytes) is not a domain-level phenomenon.
Why Tweedie Outperforms Squared Error
The transfer size distribution has two defining properties: 39.5% of requests are exact zeros (tracking pixels, beacons, cookie-sync endpoints that return empty responses), and the remaining 60.5% span five orders of magnitude from tens of bytes to megabytes. Standard squared-error loss treats a 100-byte prediction error on a beacon identically to a 100-byte error on a 170KB JavaScript SDK. Both contribute the same quantity to the loss gradient.
Tweedie loss with variance power p > 1 corrects this misalignment. The variance function V(mu) = mu^p makes the penalty for a given relative error scale as y^(2-p), down-weighting predictions near zero and up-weighting predictions at high magnitudes. The model allocates capacity where it matters for the downstream aggregate metric.
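A minimal sketch of the unit Tweedie deviance for 1 < p < 2 makes both properties concrete: it is finite at y = 0, and it is homogeneous of degree 2 - p, so the penalty for a fixed relative error scales as y^(2-p). (The shipped model uses XGBoost's Tweedie objective, which is gradient-equivalent to this deviance up to constants; `tweedie_deviance` is an illustrative helper, not the production code.)

```python
def tweedie_deviance(y: float, mu: float, p: float = 1.5) -> float:
    """Unit Tweedie deviance for 1 < p < 2 (compound Poisson-gamma).
    Finite at y == 0, which lets the loss absorb the 39.5% of blocked
    requests that transfer exactly zero bytes."""
    return 2.0 * (y ** (2 - p) / ((1 - p) * (2 - p))
                  - y * mu ** (1 - p) / (1 - p)
                  + mu ** (2 - p) / (2 - p))
```

With p = 1.5, a 10% error on a 100KB script is penalized sqrt(1000) ≈ 31.6 times more than a 10% error on a 100-byte beacon; under squared error the same comparison gives a factor of one million, and under a loss blind to magnitude it gives a factor of one. Tweedie sits between those extremes, which is exactly the alignment the weekly aggregate needs.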
This explains both the 23% per-request MAE gap (3,466 bytes vs 4,527 bytes for squared error on identical features and architecture) and the even larger advantage in weekly aggregation accuracy. Beacons that squared error over-optimizes for contribute negligibly to weekly bandwidth totals. Scripts and SDK bundles that Tweedie prioritizes dominate the aggregate. A model that is slightly less accurate on zero-byte beacons but substantially more accurate on 50-100KB scripts produces better weekly estimates, which is the quantity the user actually sees.
The predicted-versus-actual scatter (Figure 11 in the paper) confirms this: the model tracks the diagonal across five orders of magnitude, with the tightest clustering in the 10KB-100KB range where the highest-impact tracker resources concentrate.
Limitations
Temporal stability. The model degrades approximately 30% in MAE over three months as tracker SDK versions change, new tracking endpoints appear, and existing ones shift their payload profiles. Quarterly retraining on fresh HTTP Archive crawls is the minimum recommended cadence. The temporal evaluation in Section 7.3 of the paper characterizes this degradation in detail.
Timing metric validity. Transfer size is server-determined and browser-independent: the same request returns the same number of bytes regardless of the client’s hardware or network. Timing metrics (download duration, CPU execution time) are not. HTTP Archive collects timing data using a fixed test environment (Moto G4, simulated 3G), meaning timing predictions are standardized-condition estimates, not personalized predictions for the user’s device. We recommend deploying the transfer size model only. Timing predictions could be calibrated to device classes in future work, but the current model does not support this.
Training-deployment population shift. The model is trained on HTTP Archive’s third-party entity classification, but Firefox blocks against the Disconnect list. The overlap between these two taxonomies is substantial but not complete. Domains that appear on the Disconnect list but not in the HTTP Archive training set receive global median fallback values for their target-encoded features, degrading prediction quality for those requests.
Server-side variation. A/B tests, geo-targeting, and SDK versioning cause the same URL to return different payloads across requests. This is irreducible noise from the model’s perspective. However, aggregation across many requests per week mitigates the impact: server-side variation that is uncorrelated across requests cancels in the sum, and the weekly aggregate estimate remains accurate even when individual predictions are noisy.
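The cancellation argument can be checked with a small simulation under stated assumptions: a hypothetical week of blocked requests with a skewed size mix (roughly 40% zero-byte beacons), and zero-mean noise standing in for uncorrelated server-side variation. The size mix and noise scale are illustrative, not measured values.

```python
import random

random.seed(7)

# Hypothetical week of blocked requests: ~40% zero-byte beacons plus a
# skewed mix of small pixels and large scripts (sizes in bytes).
true_sizes = [random.choice([0, 0, 200, 5000, 80000]) for _ in range(5000)]

# Per-request predictions with large zero-mean noise simulating server-side
# variation (A/B tests, geo-targeting, SDK versioning); clamped at zero.
predictions = [max(0.0, y + random.gauss(0, 0.5 * max(y, 200)))
               for y in true_sizes]

weekly_true = sum(true_sizes)
weekly_pred = sum(predictions)
aggregate_rel_error = abs(weekly_pred - weekly_true) / weekly_true
per_request_rel_mae = sum(abs(p - y)
                          for p, y in zip(predictions, true_sizes)) / weekly_true
```

Even with per-request errors on the order of tens of percent, the weekly sum lands within a few percent of the truth, because the uncorrelated noise terms cancel while the signal accumulates.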