Data Exploration and Visualization

The Prediction Gap

Firefox’s Enhanced Tracking Protection blocks third-party tracker requests before they complete. At block time, the browser observes a feature vector $\mathbf{x} \in \mathcal{X}_{\text{pre}}$: the request URL, resource type, initiator, priority, and other metadata available from the network stack. The response quantities --- transfer size, download duration, server timing --- are structurally unobservable. They never arrive. The product goal is to predict transfer size $y \in \mathbb{R}_{\geq 0}$ from $\mathbf{x}$ alone, so the privacy dashboard can report concrete bandwidth savings rather than a flat count of blocked requests.

This is not a missing-data problem. The response is not censored, dropped, or partially observed --- it is prevented by design. It is not counterfactual estimation: we are not asking what would have happened under a different treatment assignment. And it is not standard domain adaptation: the training and deployment distributions differ in $P(X)$ but share $P(Y \mid X)$, since transfer size is server-determined. The same URL returns the same payload whether Chrome or Firefox issues the request. Training on HTTP Archive data (collected by Chrome) and deploying in Firefox constitutes covariate shift, not concept drift.

The formal task:

$$\hat{y} = f(\mathbf{x}), \qquad \mathbf{x} \in \mathcal{X}_{\text{pre}}, \quad y \in \mathbb{R}_{\geq 0}$$

where ff is learned from completed requests in the HTTP Archive and applied to requests that Firefox blocks before completion.

Why Per-Domain Scoring Fails

CPU cost vs network cost scatter plot showing target independence

The naive approach assigns each tracker domain a static cost score based on aggregated HTTP Archive data. This has a structural limitation: the same domain serves resources of vastly different sizes. googletagmanager.com/gtag/js returns a 93KB script bundle; googletagmanager.com/collect returns a 0-byte beacon. A domain-level median cannot distinguish between these.

Within-domain transfer size variance

The within-domain coefficient of variation for transfer_size quantifies this heterogeneity: median 0.94, 75th percentile 3.0, 90th percentile 29.0. Most tracker domains exhibit order-of-magnitude variation in transfer size across their request populations. By contrast, TTFB and load_time show low within-domain variance (CV 0.38), indicating these quantities are network-dominated and not URL-predictable. Transfer size is the right target for a URL-conditioned model.
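The within-domain coefficient of variation can be computed directly with a groupby. A minimal sketch, using a tiny hypothetical frame (the domain names and values are illustrative, not the article's data):

```python
import pandas as pd

# Hypothetical per-request frame; column names are illustrative.
df = pd.DataFrame({
    "domain": ["gtm.example"] * 4 + ["cdn.example"] * 4,
    "transfer_size": [0, 40, 95_000, 120, 50_000, 51_000, 49_500, 50_200],
})

# Coefficient of variation (std / mean) of transfer size within each domain.
cv = df.groupby("domain")["transfer_size"].agg(
    lambda s: s.std(ddof=0) / s.mean()
)

print(cv.round(2))
```

A domain that mixes empty beacons with script bundles (`gtm.example` here) shows CV well above 1, while a domain serving a single payload size sits near 0 --- the heterogeneity that makes a per-domain score inadequate.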

Lookup Table Ceiling

Before introducing a model, we establish the accuracy ceiling of progressively more granular lookup tables on the test set.

LUT accuracy by granularity vs model
| LUT granularity | Entries | MAE (bytes) | Reduction vs. global |
| --- | --- | --- | --- |
| Global median | 1 | 13,661 | --- |
| Domain median | ~3,700 | 7,878 | -42.3% |
| Domain + resource type | ~5,100 | 6,597 | -51.7% |
| Domain + URL path | ~227,000 | 3,797 | -72.2% |

Domain identity alone buys 42% error reduction. Adding resource type yields another 9 percentage points. Adding the exact URL path pushes error down 72%, but the table grows to 227K entries --- and 75% of unique test paths are unseen in training, even within the same crawl month. A path-level table is infeasible in production and immediately stale as tracker SDKs update their URL structures. A model must generalize from URL structure rather than memorize specific paths.
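The domain-median baseline is a few lines of pandas. A sketch with toy frames (column names and values are stand-ins for the actual train/test tables):

```python
import pandas as pd

# Toy stand-ins for the train and test tables.
train = pd.DataFrame({
    "domain": ["a.com", "a.com", "b.com", "b.com"],
    "transfer_size": [0, 100, 5000, 7000],
})
test = pd.DataFrame({
    "domain": ["a.com", "b.com", "c.com"],   # c.com is unseen in training
    "transfer_size": [80, 6000, 300],
})

global_median = train["transfer_size"].median()
domain_median = train.groupby("domain")["transfer_size"].median()

# Unseen domains fall back to the global median.
pred = test["domain"].map(domain_median).fillna(global_median)
mae = (test["transfer_size"] - pred).abs().mean()
print(f"MAE: {mae:.0f} bytes")
```

The `fillna` fallback is exactly where the path-level table breaks down: with 75% of test paths unseen, most predictions would degrade to the coarser fallback.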

Data

Transfer size distribution and breakdown by resource type

Training data is drawn from the HTTP Archive June 2024 crawl (mobile client). The source table httparchive.crawl.requests contains 348 million tracker requests across 5,186 third-party domains in five Disconnect list categories: advertising, analytics, social, tag-manager, and consent-provider.

We extract a 1% deterministic sample via hash bucketing:

MOD(ABS(FARM_FINGERPRINT(page || url)), 100) = 0

This yields 3,490,824 requests across 3,723 domains. Deterministic sampling ensures reproducibility and avoids selection bias --- any request whose (page, URL) pair hashes to bucket 0 is included, regardless of domain or category.
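The same hash-bucketing idea can be sketched outside BigQuery. Here MD5 stands in for `FARM_FINGERPRINT` (an assumption for illustration; the bucketing principle, not the hash function, is the point):

```python
import hashlib

def in_sample(page: str, url: str, buckets: int = 100) -> bool:
    """Deterministic 1% sample: include a request iff its (page, url)
    key hashes into bucket 0. MD5 is a stand-in for BigQuery's
    FARM_FINGERPRINT."""
    key = (page + url).encode("utf-8")
    digest = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return digest % buckets == 0

# The same key always lands in the same bucket, so the sample is
# reproducible across runs.
assert in_sample("https://example.com/", "https://t.example/px") == \
       in_sample("https://example.com/", "https://t.example/px")
```

Because inclusion depends only on the hash of the key, re-running the extraction yields the identical sample, and no domain or category is over- or under-represented by the sampling mechanism itself.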

Target distribution. The transfer size distribution is bimodal: 39.5% of requests are exact zeros (beacons returning empty responses), with a secondary mode around 90KB (JavaScript bundles). The median is 43 bytes, the mean 13,607 bytes, and the maximum 8.6MB. This zero-inflation and extreme right skew motivate the Tweedie loss explored in Article 2.
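The shape described above --- a spike at zero plus a heavy right tail --- is easy to reproduce synthetically, which makes the mean/median gap concrete. A sketch with illustrative parameters (not fit to the HTTP Archive data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic zero-inflated target: ~40% exact zeros, the rest lognormal.
is_zero = rng.random(n) < 0.395
sizes = np.where(is_zero, 0.0, rng.lognormal(mean=7.0, sigma=2.5, size=n))

print(f"zeros:  {np.mean(sizes == 0):.1%}")
print(f"median: {np.median(sizes):.0f}")
print(f"mean:   {np.mean(sizes):.0f}")
```

With a 40% zero mass and a heavy tail, the mean lands orders of magnitude above the median --- the regime that squared-error losses handle poorly and Tweedie-type losses are designed for.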

Train/val/test split. 70/15/15 by random row sampling. Row-level splitting matches the deployment scenario: Firefox will have HTTP Archive statistics for all Disconnect list domains at inference time, so domain identity is a known feature at prediction time. There is no need for domain-level holdout.

Feature Engineering

Eighty features organized into five groups, all observable at Firefox’s block time.

Domain identity (2 features)

Target-encoded statistics from training data: the domain’s median transfer_bytes, and the (domain, resource_type) pair’s median transfer_bytes. These replace the domain name with a meaningful numeric signal --- integer encoding is inappropriate because tree models would treat domain codes as ordinal. Target encoding is recomputed per cross-validation fold to prevent leakage.
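Per-fold target encoding can be sketched as follows; the function and column names are illustrative, not the article's actual implementation:

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, col="domain", target="transfer_size", n_splits=5):
    """Leakage-free target encoding: each row's encoding is the domain
    median computed from out-of-fold rows only."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(df):
        fold_medians = df.iloc[fit_idx].groupby(col)[target].median()
        encoded.iloc[enc_idx] = (
            df.iloc[enc_idx][col].map(fold_medians).to_numpy()
        )
    # Domains absent from a fold's fit portion fall back to the global median.
    return encoded.fillna(df[target].median())
```

At inference time no fold structure is needed: the encoding is the median over the full training set, since the target is never observed for blocked requests anyway.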

URL structure (12 features)

Path depth (number of / segments), total URL length, query parameter count, file extension (grouped: js, gif, png, jpg, html, php, json, css, other, none), path length, and query string length. These capture coarse structural properties of the URL without memorizing specific paths.
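The structural features above reduce to a small parsing function. A sketch using only the standard library (the feature names and the extension-grouping rule are illustrative):

```python
from urllib.parse import urlparse, parse_qs

KNOWN_EXTENSIONS = {"js", "gif", "png", "jpg", "html", "php", "json", "css"}

def url_structure_features(url: str) -> dict:
    """Coarse structural URL features of the kind described above."""
    parts = urlparse(url)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    if "." in last_segment:
        ext = last_segment.rsplit(".", 1)[-1].lower()
        ext = ext if ext in KNOWN_EXTENSIONS else "other"
    else:
        ext = "none"
    return {
        "url_length": len(url),
        "path_length": len(path),
        "query_length": len(parts.query),
        "path_depth": path.count("/"),
        "num_query_params": len(parse_qs(parts.query)),
        "extension": ext,
    }

print(url_structure_features("https://www.googletagmanager.com/gtag/js?id=G-XXXX"))
```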

URL content via TF-IDF + SVD (50 features)

The URL path is tokenized on delimiters (/, ., ?, &, =, -, _) and camelCase boundaries. We fit a TF-IDF vectorizer with sublinear term frequency, unigram + bigram vocabulary (50K terms, minimum document frequency 5), then reduce to 50 dimensions via truncated SVD, explaining 54.5% of total variance.

The leading SVD components are semantically interpretable. Component 1 separates /collect endpoints (beacons, near-zero transfer) from script-serving paths. Component 2 captures /gtag/js and analytics.js patterns (large JavaScript bundles). This decomposition gives the model a continuous representation of URL semantics without requiring exact path matching.
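The tokenize-then-vectorize-then-reduce pipeline looks roughly like this in scikit-learn, with dimensions scaled down for the sketch (the article uses a 50K-term vocabulary and 50 SVD components; the tokenizer regex is an approximation of the delimiter/camelCase rule):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

def tokenize_path(path: str) -> list[str]:
    # Split on URL delimiters, then on camelCase boundaries.
    rough = re.split(r"[/.?&=\-_]+", path)
    tokens = []
    for t in rough:
        tokens.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", t))
    return [t.lower() for t in tokens if t]

paths = ["/gtag/js", "/collect", "/analytics.js", "/gtm.js", "/pixel/track",
         "/collect?v=2", "/beacon/collect", "/gtag/js?id=1"]

vec = TfidfVectorizer(tokenizer=tokenize_path, token_pattern=None,
                      lowercase=False, sublinear_tf=True,
                      ngram_range=(1, 2), min_df=1)
svd = TruncatedSVD(n_components=2, random_state=0)
pipeline = make_pipeline(vec, svd)
X = pipeline.fit_transform(paths)
print(X.shape)  # one dense URL embedding per path
```

The SVD step turns a huge sparse term matrix into a small dense embedding, which is what lets a tree model exploit URL semantics without a column per vocabulary term.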

Request metadata (10 features)

Resource type (one-hot: script, image, other, html, text, css, video, xml, font), initiator type (one-hot: script, parser, other, preflight), Chrome priority (ordinal 0-4), HTTP method, HTTP version, waterfall index, and HTTPS flag.

Regex pattern indicators (6 features)

Binary indicators from regular expressions applied to the URL path, encoding known tracker URL conventions:

| Feature | Pattern | Signal |
| --- | --- | --- |
| path_has_js | `\.js\|sdk\|gtm\|gtag` | JavaScript bundles |
| path_has_collect | `collect\|beacon\|pixel\|track` | Beacons (zero-byte) |
| path_has_image | `\.gif\|\.png\|pixel\|1x1` | Tracking pixels |
| path_has_sync | `sync\|match\|cookie` | Cookie-syncing |
| path_has_ad | `/ad/\|adserver\|pagead` | Ad serving |
| path_has_api | `api` | API endpoints |

These act as coarse priors that the tree model can refine. They are redundant with the TF-IDF representation by design: the regex features provide explicit signal for known patterns, while TF-IDF captures the long tail.
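Applying the table's patterns is a straightforward compile-once-scan-many loop. A sketch using the patterns exactly as listed above (the function name is illustrative):

```python
import re

# The regex indicator patterns from the table, compiled once.
PATTERNS = {
    "path_has_js": r"\.js|sdk|gtm|gtag",
    "path_has_collect": r"collect|beacon|pixel|track",
    "path_has_image": r"\.gif|\.png|pixel|1x1",
    "path_has_sync": r"sync|match|cookie",
    "path_has_ad": r"/ad/|adserver|pagead",
    "path_has_api": r"api",
}
COMPILED = {name: re.compile(p) for name, p in PATTERNS.items()}

def regex_features(path: str) -> dict:
    """Binary indicators over the URL path."""
    return {name: int(rx.search(path) is not None)
            for name, rx in COMPILED.items()}

print(regex_features("/gtag/js"))
```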

Target. Raw transfer_bytes with no log transform. Predictions are clipped to be non-negative ($\hat{y} \geq 0$).


The next article covers model selection, Tweedie loss, and evaluation, where these features are combined with ten model architectures and assessed on both per-request accuracy and the aggregation accuracy that matters for the product.