Hand-Crafted Features and Tree Models

Methods

We train an XGBoost gradient boosted tree regressor on 523,624 labeled HTTP Archive requests (June 2024). The target is transfer_bytes, which is zero-inflated (39.5% zeros) and right-skewed (median 43 bytes, mean 13,607 bytes, max 8.6MB). The feature set comprises 56 dimensions: 6 hand-crafted regex features extracted from URL structure and 50 TF-IDF SVD embedding dimensions over URL path tokens.

Hyperparameters are tuned via Optuna (40 trials, 5-fold CV, minimizing MAE on raw byte values). Final configuration: 500 trees, max depth 8, learning rate 0.05, row subsample 0.8, column subsample 0.7, min child weight 10, early stopping patience 20. The variance power p of the Tweedie loss is tuned jointly with tree parameters.
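As a rough illustration, the final configuration above can be written as a parameter dictionary. The keys are XGBoost's documented parameter names. Mapping "column subsample" to `colsample_bytree` (rather than a per-level or per-node variant) is an assumption, as is the name `FINAL_PARAMS`. The value p=1.5 is the representative Tweedie power reported in the loss ablation.

```python
# Hedged sketch: the reported final configuration as an XGBoost-style
# parameter dictionary. FINAL_PARAMS is an illustrative name.
FINAL_PARAMS = {
    "objective": "reg:tweedie",        # Tweedie regression objective
    "tweedie_variance_power": 1.5,     # representative p from the ablation
    "n_estimators": 500,               # 500 trees
    "max_depth": 8,
    "learning_rate": 0.05,
    "subsample": 0.8,                  # row subsample
    "colsample_bytree": 0.7,           # assumption: column subsample is per tree
    "min_child_weight": 10,
    "early_stopping_rounds": 20,       # early stopping patience
}
```

In recent XGBoost releases a dictionary of this shape can be unpacked into `xgboost.XGBRegressor(**FINAL_PARAMS)`; early stopping additionally needs a validation set passed to `fit`.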

We offset transfer_bytes by +1 during training, so that every training target is strictly positive, and subtract 1 from predictions.
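The offset and its inverse are simple enough to sketch directly. The function names are hypothetical, and clamping the de-offset prediction at zero is an assumption; the text above only states the +1 and -1 steps.

```python
def to_training_target(transfer_bytes: float) -> float:
    """Shift the raw byte count by +1 before fitting."""
    return transfer_bytes + 1.0


def from_model_output(prediction: float) -> float:
    """Undo the +1 shift on a model prediction.

    Clamping at zero is an assumption: a prediction below 1 would
    otherwise map to a negative byte count.
    """
    return max(prediction - 1.0, 0.0)
```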

Baseline Comparison

We evaluate five approaches on the held-out test set (Table 3). MAE is in bytes; Spearman rho measures ranking quality.

Method	MAE	Spearman rho	95% CI (MAE)
Global median	13,661	—	—
Domain LUT	7,878	—	—
Domain+type LUT	6,597	0.926	—
Path LUT (with fallback)	3,797	0.934	[3,623, 3,984]
XGBoost Tweedie	3,466	0.945	[3,314, 3,627]

All models comparison: MAE and Spearman rank correlation

Among the five approaches, the XGBoost Tweedie model achieves the lowest MAE (3,466 bytes) and the highest ranking quality (Spearman 0.945). It improves over the strongest LUT baseline (path LUT, MAE 3,797) by 8.7%, and over the domain+type LUT by 47.5%. The model improves absolute accuracy and ranking together, which is notable because these objectives often trade off.
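The percentages quoted above are relative MAE reductions against each baseline. A minimal helper (the function name is illustrative) reproduces them from the Table 3 values:

```python
def relative_improvement(baseline_mae: float, model_mae: float) -> float:
    """Percentage reduction in MAE relative to the baseline."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# Values taken from Table 3.
vs_path_lut = relative_improvement(3797, 3466)
vs_domain_type_lut = relative_improvement(6597, 3466)
```

Rounded to one decimal place, these evaluate to 8.7 and 47.5.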

Loss Function Ablation

The choice of loss function dominates all other design decisions (Table 5).

Loss function	MAE	Spearman rho
Squared error	4,527	0.738
Tweedie p=1.2	3,486	0.937
Tweedie p=1.5	3,466	0.945
Tweedie p=1.8	3,597	0.949
Huber (delta=10,000)	diverged	—

Tweedie at p=1.5 outperforms squared error by 23% in MAE and lifts Spearman from 0.738 to 0.945. Huber loss diverged entirely. The Tweedie power parameter is robust: p∈[1.2, 1.8] all outperform squared error by at least 20%. The 20-byte difference between p=1.2 (MAE 3,486) and p=1.5 (MAE 3,466) is within bootstrap variance; any p∈[1.2, 1.5] is effectively equivalent, and p=1.5 is used as a representative value.

The mechanism is straightforward. With 39.5% zeros in the target distribution, squared error allocates capacity uniformly across all training examples, spending optimization budget on near-zero beacons that contribute nothing to the user-facing aggregate. Tweedie loss weights the gradient by the predicted magnitude ( $\hat{y}^{1-p}$ ), down-weighting near-zero predictions and concentrating capacity on high-cost scripts. Loss function selection matters more than architecture.
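To make the gradient weighting concrete, the sketch below writes out the Tweedie negative log-likelihood (up to terms that do not depend on the prediction) under a log link, together with its first and second derivatives with respect to the raw score. This is the standard form for 1 < p < 2; we assume, without having checked the training code, that it matches what the boosting library computes. The function names are illustrative. With prediction $\hat{y} = e^{f}$, the gradient equals $\hat{y}^{1-p}(\hat{y} - y)$, which is the squared-error residual scaled by the $\hat{y}^{1-p}$ factor mentioned above.

```python
import math


def tweedie_loss(y: float, raw_score: float, p: float = 1.5) -> float:
    """Tweedie negative log-likelihood (prediction-dependent terms only).

    raw_score is the log of the predicted mean; valid for 1 < p < 2.
    """
    return (-y * math.exp((1.0 - p) * raw_score) / (1.0 - p)
            + math.exp((2.0 - p) * raw_score) / (2.0 - p))


def tweedie_grad_hess(y: float, raw_score: float, p: float = 1.5):
    """First and second derivative of tweedie_loss w.r.t. raw_score."""
    a = y * math.exp((1.0 - p) * raw_score)   # y * mu^(1-p)
    b = math.exp((2.0 - p) * raw_score)       # mu^(2-p)
    grad = -a + b                             # = mu^(1-p) * (mu - y)
    hess = -(1.0 - p) * a + (2.0 - p) * b
    return grad, hess
```

For a zero target at p=1.5 the gradient is $\hat{y}^{0.5}$, so it grows with the square root of the prediction rather than linearly as the squared-error residual does.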

Feature Ablation

We ablate the two feature families independently (Table 6).

Feature set	Dimensions	MAE
Regex only	6	4,251
TF-IDF SVD only	50	3,548
Both	56	3,466

TF-IDF features alone improve 16.5% over regex-only features. Adding regex features on top of TF-IDF yields a marginal 2.3% further gain. TF-IDF subsumes the hand-crafted patterns: tokens like sdk, bundle, collect, and pixel are discovered automatically. The regex features contribute only residual signal not captured by the learned vocabulary.
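A minimal sketch of how URL path tokens can be weighted by TF-IDF is shown below, using only the standard library. The tokenizer regex, the smoothed IDF formula, and the function names are assumptions made for illustration; the SVD step that reduces the vocabulary to 50 dimensions is omitted.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: tokens are lowercase alphanumeric runs in the URL path.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def path_tokens(url: str) -> list:
    """Split the path component of a URL into tokens (query string ignored)."""
    return TOKEN_RE.findall(urlsplit(url).path.lower())


def tfidf(urls: list) -> list:
    """Return one {token: weight} dict per URL using a smoothed IDF."""
    docs = [Counter(path_tokens(u)) for u in urls]
    n_docs = len(docs)
    # Document frequency: number of URLs whose path contains the token.
    doc_freq = Counter(tok for d in docs for tok in d)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in d.items()} for d in docs]
```

Tokens that appear in few paths receive higher weight than tokens shared by many paths, which is how vocabulary items such as sdk or pixel can emerge without hand-written patterns.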

Path Coverage Analysis

We stratify the test set by whether the exact URL path was observed during training (Table 4).

Subset	% of test	n	Path LUT MAE	Model MAE
Path seen	91.6%	479,775	1,448	1,346
Path unseen	8.4%	43,849	29,491	26,655

The model outperforms the path LUT on both subsets. On seen paths, the model beats pure memorization (1,346 vs 1,448), indicating it learns structure beyond lookup. On unseen paths, the LUT falls back to coarser aggregates and MAE degrades to 29,491; the model degrades to 26,655, maintaining a 9.6% advantage through generalization from URL features and domain-level statistics.
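The path LUT with fallback can be sketched as a chain of median lookups at decreasing granularity. The exact fallback order (path, then domain+type, then domain, then global median), the use of (domain, path) as the path key, and the function names are assumptions; the text above only states that the LUT falls back to coarser aggregates.

```python
from collections import defaultdict
from statistics import median


def build_lut(rows):
    """Build median lookup tables from (domain, resource_type, path, bytes) rows."""
    groups = {
        "path": defaultdict(list),
        "domain_type": defaultdict(list),
        "domain": defaultdict(list),
    }
    all_bytes = []
    for domain, rtype, path, nbytes in rows:
        groups["path"][(domain, path)].append(nbytes)
        groups["domain_type"][(domain, rtype)].append(nbytes)
        groups["domain"][domain].append(nbytes)
        all_bytes.append(nbytes)
    lut = {level: {key: median(vals) for key, vals in table.items()}
           for level, table in groups.items()}
    lut["global"] = median(all_bytes)
    return lut


def predict(lut, domain, rtype, path):
    """Return the most specific median available, falling back when unseen."""
    candidates = (
        ("path", (domain, path)),
        ("domain_type", (domain, rtype)),
        ("domain", domain),
    )
    for level, key in candidates:
        if key in lut[level]:
            return lut[level][key]
    return lut["global"]
```

An unseen path on a known domain is answered from the domain+type median, which is the coarser estimate responsible for the higher error on the unseen subset.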

Smoothed LUT comparison. Laplace-smoothed (add-k) path LUTs do not close the gap. The best variant (k=0.5) achieves MAE 4,011, which is 5.6% worse than the unsmoothed path LUT (3,797) and 15.7% worse than the model (3,466). Smoothing degrades overall accuracy because it shrinks predictions for the 91.6% of matched paths toward the global mean, increasing error on the majority to marginally reduce error on the minority. The model’s advantage on seen paths (MAE 1,346 vs. 1,686 for k=0.5) confirms it learns genuine cross-path representations rather than merely regularizing noisy low-count memorization.
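One common way to write add-k shrinkage toward a global mean is as k pseudo-observations at that mean. The sketch below assumes this form; the exact estimator used for the smoothed LUT is not specified here, and the function name is illustrative.

```python
def smoothed_estimate(values, global_mean: float, k: float = 0.5) -> float:
    """Shrink a per-path mean toward the global mean.

    Assumed form: k pseudo-observations at global_mean are added to the
    observed values, so low-count paths are pulled hardest.
    """
    n = len(values)
    return (sum(values) + k * global_mean) / (n + k)
```

With the dataset mean of 13,607 bytes and k=0.5, a path observed once at 100 bytes is pulled to roughly 4,602 bytes, which illustrates how shrinkage can hurt small, well-matched paths.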

Resource Type Analysis

Per-type MAE reduction (Table 7), comparing the domain+type LUT to the XGBoost Tweedie model:

Resource type	LUT MAE	Model MAE	Improvement
Script	12,996	3,135	+75.9%
CSS	6,222	2,084	+66.5%
HTML	2,246	1,294	+42.4%
Image	8,330	7,428	+10.8%
Other (beacon)	16	16	+4.6%
Text	292	303	-3.8%

Error by resource type on log scale, showing improvement percentages

The model’s value concentrates on high-cost, high-variance types. Scripts see a 75.9% MAE reduction; CSS sees 66.5%. For text resources, the model is slightly worse than the LUT (-3.8%) because the LUT’s constant near-zero prediction is already close to optimal. This is the correct failure mode: the types where the model underperforms contribute negligibly to the user-facing aggregate.

Feature Importance

domain_type_median ranks first by gain (8.65%), followed by rt_other (5.87%), path_has_sync (4.88%), and ext_html (4.08%). Eight of the top 15 features are TF-IDF embedding dimensions, confirming that the learned token representations carry substantial predictive signal beyond domain identity and hand-crafted patterns.

SHAP beeswarm plot showing per-feature impact on predictions across 5,000 test requests

SHAP analysis confirms the gain-based ordering. domain_type_median accounts for the majority of model output, with high values pushing predictions strongly upward. rt_other pushes predictions downward when present, correctly identifying near-zero beacons. The TF-IDF dimensions distribute signal across URL structural patterns that no single hand-crafted feature captures.

Aggregation Accuracy

Users see a weekly aggregate (“Firefox saved you 2.3MB”), not individual predictions. Per-request errors cancel under aggregation if they are approximately symmetric. We simulate this by uniformly sampling N requests from the test set, summing predictions, comparing to the true sum, and repeating 1,000 times (Table 8).

N requests	Model: median % error	Model: within 10%	LUT: median % error	LUT: within 10%
200	6.0%	67.2%	21.9%	13.6%
500	5.1%	75.5%	23.2%	4.2%

Weekly aggregate accuracy: model vs LUT at different browsing volumes

The gap widens with more requests. At N=200, the model’s median error is 6.0% versus the LUT’s 21.9%. At N=500, the model improves to 5.1% while the LUT worsens to 23.2%. The LUT’s systematic bias (always predicting the conditional median) compounds under aggregation. The model’s roughly symmetric errors cancel.
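The aggregation simulation described above can be sketched as follows. Sampling with replacement, skipping trials whose true sum is zero, and the function name are assumptions.

```python
import random
from statistics import median


def aggregate_error(y_true, y_pred, n, trials=1000, seed=0):
    """Median % error of summed predictions and % of trials within 10%."""
    rng = random.Random(seed)
    indices = range(len(y_true))
    errors = []
    for _ in range(trials):
        # Assumption: requests are drawn uniformly with replacement.
        sample = rng.choices(indices, k=n)
        true_sum = sum(y_true[i] for i in sample)
        pred_sum = sum(y_pred[i] for i in sample)
        if true_sum > 0:  # relative error is undefined for a zero total
            errors.append(abs(pred_sum - true_sum) / true_sum * 100.0)
    within_10 = 100.0 * sum(e <= 10.0 for e in errors) / len(errors)
    return median(errors), within_10
```

A predictor that is off by a constant factor keeps the same relative error at every N, whereas errors that are roughly symmetric around the truth shrink as N grows.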

Correlated Browsing

Uniform sampling assumes independent requests. Real browsing exhibits domain correlation: a user visiting a news site generates multiple requests to the same tracker domains. We simulate this by sampling 15 domains, then drawing N requests exclusively from those domains (Table 9).

N requests	Model (correlated)	Model (uniform)	LUT (correlated)	LUT (uniform)
200	12.9%	6.3%	36.4%	22.2%

Both methods degrade under correlation, and the model keeps a large advantage. Under uniform sampling, the model’s median error is 3.5x lower than the LUT’s (6.3% vs 22.2%); under correlated sampling it is 2.8x lower (12.9% vs 36.4%). The ratio narrows, but the absolute gap widens from 15.9 to 23.5 percentage points. The model’s retained advantage under domain concentration is consistent with its use of within-domain URL features, which provide discriminative signal even when the domain distribution is narrow.
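A minimal sketch of the correlated sampling scheme follows: choose 15 domains, then draw N requests only from rows belonging to them. Choosing domains uniformly, pooling their rows so that busier domains contribute more requests, and sampling with replacement are all assumptions, as is the function name.

```python
import random
from collections import defaultdict


def sample_correlated(domains, n, n_domains=15, seed=0):
    """Return n row indices drawn only from n_domains randomly chosen domains.

    domains[i] is the domain of test row i.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for i, d in enumerate(domains):
        by_domain[d].append(i)
    # Sort for reproducibility before sampling the domain subset.
    chosen = rng.sample(sorted(by_domain), min(n_domains, len(by_domain)))
    pool = [i for d in chosen for i in by_domain[d]]
    return rng.choices(pool, k=n)
```

The returned indices can then be summed and scored exactly as in the uniform simulation, so the two settings differ only in how the sample is drawn.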

Temporal Generalization

We train on June 2024 data and evaluate on September 2024 to measure drift (Table 11).

Method	June MAE	September MAE	Degradation
XGBoost Tweedie	3,466	4,517	+30.3%
Path LUT	3,797	4,966	+30.8%
Domain+type LUT	6,597	6,723	+1.9%

The model degrades 30.3% over three months, comparable to the path LUT’s 30.8%. The domain+type LUT is stable (+1.9%) but permanently worse in absolute terms. The model retains a 32.8% advantage over the domain+type LUT in September. Ranking quality is nearly unchanged (Spearman 0.945 in June, 0.942 in September).

Drift analysis: 11.8% of September domains are new; 85.2% of rows share paths with the training set. URL path churn, not domain churn, drives the degradation. Per-type analysis confirms this: the model’s script improvement drops from +75.9% in June to +50.6% in September, consistent with script bundles receiving new versioned paths over time.
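The two drift statistics above can be computed from (domain, path) pairs, as in the following sketch. Treating a path as seen only when the same (domain, path) pair occurs in training is an assumption, and the function name is illustrative.

```python
def drift_stats(train_rows, eval_rows):
    """Return (% of eval domains unseen in training, % of eval rows with a seen path).

    Each row is a (domain, path) tuple.
    """
    train_domains = {domain for domain, _ in train_rows}
    train_paths = set(train_rows)  # assumption: path identity is (domain, path)
    eval_domains = {domain for domain, _ in eval_rows}
    new_domain_pct = 100.0 * len(eval_domains - train_domains) / len(eval_domains)
    seen_row_pct = 100.0 * sum(row in train_paths for row in eval_rows) / len(eval_rows)
    return new_domain_pct, seen_row_pct
```

The first value is measured over distinct domains and the second over rows, mirroring how the two figures are stated.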

Multi-Target Results

We train separate models for three additional HTTP timing metrics and compare them with transfer_bytes (Table 10).

Target	Per-request improvement	Aggregation error
transfer_bytes	+47.5%	6.0%
download_ms	+23.7%	16.6%
load_ms	-3.1%	5.6%
ttfb_ms	-12.7%	4.0%

Content-dependent metrics (transfer_bytes, download_ms) show strong per-request improvement because URL features predict content size, which determines transfer and download duration. Network-dependent metrics (load_ms, ttfb_ms) show per-request degradation (-3.1% and -12.7%) because latency depends on server location, CDN configuration, and network conditions, none of which are observable from the URL.

However, load_ms and ttfb_ms both achieve strong aggregation accuracy (5.6% and 4.0% respectively), outperforming the LUT in aggregate despite losing per-request. The model’s symmetric errors cancel under summation even when its per-request discrimination is weak.

Calibration Analysis

We evaluate per-bin prediction accuracy by partitioning test rows into transfer size ranges and measuring the fraction of predictions within 25% of the true value.

Actual size range	n	Predicted mean	Within 25%
0–100 B	26,462	20 B	51.3%
100–500 B	3,476	212 B	72.2%
500 B–1 KB	1,117	697 B	42.5%
1–5 KB	3,250	2,459 B	49.6%
5–10 KB	1,709	7,169 B	66.1%
10–25 KB	4,044	17,146 B	59.1%
25–50 KB	1,363	34,425 B	37.3%
50–100 KB	4,777	82,147 B	86.6%
100–250 KB	576	145,427 B	74.7%
250 KB+	120	671,042 B	25.8%

The model is well-calibrated for the two dominant populations: near-zero beacons (0–100B, 51.3% within 25%, actual median 0) and large JavaScript bundles (50–100KB, 86.6%), which is the secondary mode of the transfer size distribution. The 500B–5KB range shows systematic underprediction: in the 500B–1KB bin, the predicted mean is 697 bytes against an actual mean of 2,144 (factor of 3). These mid-range responses — small JSON API responses, CSS files, tracking pixel redirects — sit in the transition between the zero-mass spike and the heavy tail, where the Tweedie loss gradient signal is weakest. Aggregation across N requests partially cancels this bias, producing 6.0% median weekly error at N=200 despite the per-request underprediction in this range.
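The per-bin "within 25%" statistic can be sketched as below. How rows with a true size of zero are scored is not stated above, so the `abs_tol` parameter (an absolute tolerance in bytes, defaulting to 0) is an assumption, as is the function name.

```python
def within_tolerance_rate(y_true, y_pred, lo, hi, rel_tol=0.25, abs_tol=0.0):
    """Percent of rows with lo <= true < hi whose prediction is within tolerance.

    A prediction counts as a hit when |pred - true| <= max(rel_tol * true, abs_tol).
    With abs_tol=0, a zero-byte row is a hit only if the prediction is exactly 0.
    """
    hits = 0
    total = 0
    for true, pred in zip(y_true, y_pred):
        if lo <= true < hi:
            total += 1
            hits += abs(pred - true) <= max(rel_tol * true, abs_tol)
    return 100.0 * hits / total if total else float("nan")
```

Calling this once per size range, with the bin edges from the table, gives one row of the "Within 25%" column.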

Limitations

Transfer size only. The model predicts transfer size but not CPU execution time. CPU cost depends on JavaScript engine internals and device hardware, which differ between Chrome (training data) and Firefox (deployment).

Feature ceiling. TF-IDF over URL tokens captures lexical patterns but not semantic content. Two URLs with different token distributions can serve identical payloads; two URLs with similar tokens can serve different payloads depending on server-side A/B tests or user state. The limitations of hand-crafted and bag-of-words features motivate the learned representation approach in Article 3.

Calibration bias in the mid-range. The model systematically underpredicts transfer size for responses in the 500B–5KB range, where predicted means are 2–3x below actuals. This affects JSON API responses and small CSS files that fall between the zero-mass spike and the heavy right tail, the region where Tweedie gradient signal is weakest. Per-request accuracy in this range should be treated as order-of-magnitude rather than precise; aggregation across requests partially cancels this bias.

Temporal decay. The 30.3% degradation over three months implies retraining at least quarterly to maintain accuracy. URL path churn in script bundles is the primary driver and is not addressable through feature engineering alone.