Hand-Crafted Features and Tree Models

Methods

We train an XGBoost gradient-boosted tree regressor on 523,624 labeled HTTP Archive requests (June 2024). The target is transfer_bytes, which is zero-inflated (39.5% zeros) and right-skewed (median 43 bytes, mean 13,607 bytes, max 8.6 MB). The feature set comprises 56 dimensions: 6 hand-crafted regex features extracted from URL structure and 50 TF-IDF SVD embedding dimensions over URL path tokens.

Hyperparameters are tuned via Optuna (40 trials, 5-fold CV, minimizing MAE on raw byte values). Final configuration: 500 trees, max depth 8, learning rate 0.05, row subsample 0.8, column subsample 0.7, min child weight 10, early stopping patience 20. The variance power p of the Tweedie loss is tuned jointly with tree parameters.

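We offset transfer_bytes by +1 during training and subtract 1 from predictions, so that every training target is strictly positive. A Tweedie distribution with 1 < p < 2 does admit exact zeros, so the offset is a precaution rather than a strict requirement of the loss.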
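As a rough illustration, the final configuration and the +1 offset could be written down as follows. This is a sketch, not the authors' training script: the parameter names are XGBoost's native ones, mapping "column subsample" to `colsample_bytree` is an assumption, and the two helper functions are hypothetical names introduced here.

```python
# Final configuration from the text, using XGBoost's native parameter names.
# Assumption: "column subsample" means colsample_bytree (it could also be a
# per-level or per-node variant).
XGB_PARAMS = {
    "objective": "reg:tweedie",
    "tweedie_variance_power": 1.5,  # p, tuned jointly with the tree parameters
    "max_depth": 8,
    "eta": 0.05,                    # learning rate
    "subsample": 0.8,               # row subsample
    "colsample_bytree": 0.7,        # column subsample (assumed mapping)
    "min_child_weight": 10,
}
NUM_BOOST_ROUND = 500               # upper bound on the number of trees
EARLY_STOPPING_ROUNDS = 20          # early stopping patience


def to_training_target(transfer_bytes):
    """Apply the +1 offset used during training (hypothetical helper)."""
    return transfer_bytes + 1.0


def to_byte_prediction(raw_prediction):
    """Undo the offset on a model prediction (hypothetical helper)."""
    return raw_prediction - 1.0
```

With the xgboost package installed, these values would typically be passed as `xgboost.train(XGB_PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND, evals=[(dvalid, "valid")], early_stopping_rounds=EARLY_STOPPING_ROUNDS)`, where `dtrain` and `dvalid` are `DMatrix` objects built from the offset targets.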

Baseline Comparison

We evaluate five approaches on the held-out test set (Table 3). MAE is in bytes; Spearman rho measures ranking quality.

| Method | MAE (bytes) | Spearman rho | 95% CI (MAE) |
|---|---|---|---|
| Global median | 13,661 | - | - |
| Domain LUT | 7,878 | - | - |
| Domain+type LUT | 6,597 | 0.926 | - |
| Path LUT (with fallback) | 3,797 | 0.934 | [3,623, 3,984] |
| XGBoost Tweedie | 3,466 | 0.945 | [3,314, 3,627] |

[Figure: Complete model leaderboard]
[Figure: All models comparison: MAE and Spearman rank correlation]

The XGBoost Tweedie model achieves the lowest MAE (3,466 bytes) and the highest ranking quality (Spearman 0.945). It improves over the strongest LUT baseline (path LUT, MAE 3,797) by 8.7%, and over the domain+type LUT by 47.5%. The model simultaneously improves both absolute accuracy and ranking, which is unusual since these objectives often trade off.

Loss Function Ablation

The choice of loss function dominates all other design decisions (Table 5).

| Loss function | MAE (bytes) | Spearman rho |
|---|---|---|
| Squared error | 4,527 | 0.738 |
| Tweedie p=1.2 | 3,486 | 0.937 |
| Tweedie p=1.5 | 3,466 | 0.945 |
| Tweedie p=1.8 | 3,597 | 0.949 |
| Huber (delta=10,000) | diverged | - |

Tweedie at p=1.5 outperforms squared error by 23% in MAE and lifts Spearman from 0.738 to 0.945. Huber loss diverged entirely. The Tweedie power parameter is robust: p ∈ [1.2, 1.8] all outperform squared error by at least 17%. The 20-byte difference between p=1.2 (MAE 3,486) and p=1.5 (MAE 3,466) is within bootstrap variance; any p ∈ [1.2, 1.5] is effectively equivalent, and p=1.5 is used as a representative value.
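The text does not say how the bootstrap intervals were computed. One common choice, assumed here, is a percentile bootstrap over per-request absolute errors; the function name and defaults below are illustrative.

```python
import random


def bootstrap_mae_ci(abs_errors, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the MAE.

    abs_errors: per-request absolute errors |y_true - y_pred|.
    Assumption: a plain percentile bootstrap, resampling requests with
    replacement; the original analysis may have used a different variant.
    """
    rng = random.Random(seed)
    n = len(abs_errors)
    maes = []
    for _ in range(n_resamples):
        sample = rng.choices(abs_errors, k=n)  # resample with replacement
        maes.append(sum(sample) / n)
    maes.sort()
    lower = maes[int((alpha / 2) * n_resamples)]
    upper = maes[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Two configurations whose MAE difference is small relative to the width of such an interval, as with p=1.2 and p=1.5, cannot be separated on this evidence.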

The mechanism is straightforward. With 39.5% zeros in the target distribution, squared error allocates capacity uniformly across all training examples, spending optimization budget on near-zero beacons that contribute nothing to the user-facing aggregate. Tweedie loss weights the gradient by the predicted magnitude ($\hat{y}^{1-p}$), down-weighting near-zero predictions and concentrating capacity on high-cost scripts. Loss function selection matters more than architecture.
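To make the $\hat{y}^{1-p}$ factor concrete, here is a sketch of the per-example gradient and Hessian of the Tweedie negative log-likelihood. It assumes a log link (the model's raw score is $\log \hat{y}$), which is how XGBoost's reg:tweedie objective is formulated; the function name is illustrative.

```python
import math


def tweedie_grad_hess(y, raw_score, p=1.5):
    """Gradient and Hessian w.r.t. the raw (log-scale) score.

    Loss per example, with y_hat = exp(raw_score):
        -y * y_hat**(1 - p) / (1 - p) + y_hat**(2 - p) / (2 - p)
    """
    a = math.exp((1.0 - p) * raw_score)  # y_hat ** (1 - p)
    b = math.exp((2.0 - p) * raw_score)  # y_hat ** (2 - p)
    grad = -y * a + b                    # equals y_hat**(1 - p) * (y_hat - y)
    hess = -y * (1.0 - p) * a + (2.0 - p) * b
    return grad, hess


# Same residual (y_hat - y = 50) at two prediction scales, p = 1.5.
for y, y_hat in [(50.0, 100.0), (9_950.0, 10_000.0)]:
    grad, _ = tweedie_grad_hess(y, math.log(y_hat))
    print(y_hat, grad)
```

The gradient factors as $\hat{y}^{1-p}(\hat{y} - y)$, so the raw residual is rescaled by a power of the prediction instead of entering the update unscaled as it does under squared error.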

Feature Ablation

We ablate the two feature families independently (Table 6).

| Feature set | Dimensions | MAE (bytes) |
|---|---|---|
| Regex only | 6 | 4,251 |
| TF-IDF SVD only | 50 | 3,548 |
| Both | 56 | 3,466 |

TF-IDF features alone improve 16.5% over regex-only features. Adding regex features on top of TF-IDF yields a marginal 2.3% further gain. TF-IDF subsumes the hand-crafted patterns: tokens like sdk, bundle, collect, and pixel are discovered automatically. The regex features contribute only residual signal not captured by the learned vocabulary.
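The way tokens such as sdk or pixel surface without hand-written patterns can be sketched in pure Python. The tokenizer regex and the smoothed IDF formula below are assumptions; the original pipeline's tokenization is not specified, and the subsequent SVD projection to 50 dimensions is omitted.

```python
import math
import re
from collections import Counter

# Assumption: tokens are lowercase alphanumeric runs in the URL path.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def path_tokens(url_path):
    """Split a URL path into lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(url_path.lower())


def tfidf_vectors(paths):
    """Return sparse TF-IDF dicts (one per path) and the IDF table."""
    docs = [Counter(path_tokens(p)) for p in paths]
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in doc)
    # Smoothed IDF: tokens found in fewer paths receive larger weights.
    idf = {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }
    vectors = [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]
    return vectors, idf


vectors, idf = tfidf_vectors(["/js/sdk.js", "/js/collect", "/pixel.gif"])
print(sorted(idf, key=idf.get))  # most common tokens first
```

In a full pipeline the sparse vectors would be stacked into a matrix and reduced with a truncated SVD before being passed to the tree model.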

Path Coverage Analysis

We stratify the test set by whether the exact URL path was observed during training (Table 4).

| Subset | % of test | n | Path LUT MAE | Model MAE |
|---|---|---|---|---|
| Path seen | 91.6% | 479,775 | 1,448 | 1,346 |
| Path unseen | 8.4% | 43,849 | 29,491 | 26,655 |

The model outperforms the path LUT on both subsets. On seen paths, the model beats pure memorization (1,346 vs 1,448), indicating it learns structure beyond lookup. On unseen paths, the LUT falls back to coarser aggregates and MAE degrades to 29,491; the model degrades to 26,655, maintaining a 9.6% advantage through generalization from URL features and domain-level statistics.
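A path LUT with fallback can be sketched as a chain of median tables keyed at decreasing specificity. The fallback order (path, then domain+type, then domain, then global median) and the row field names are assumptions made for illustration; the text only states that the LUT falls back to coarser aggregates.

```python
from collections import defaultdict
from statistics import median


def build_lut(rows, key_fn):
    """Map each key to the median transfer_bytes of its training rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row["transfer_bytes"])
    return {key: median(vals) for key, vals in groups.items()}


class FallbackLUT:
    """Median lookup that tries the most specific key first (assumed order)."""

    def __init__(self, train_rows):
        self.levels = [
            lambda r: (r["domain"], r["path"]),  # exact path
            lambda r: (r["domain"], r["type"]),  # domain + resource type
            lambda r: r["domain"],               # domain only
        ]
        self.tables = [build_lut(train_rows, fn) for fn in self.levels]
        self.global_median = median(r["transfer_bytes"] for r in train_rows)

    def predict(self, row):
        for key_fn, table in zip(self.levels, self.tables):
            key = key_fn(row)
            if key in table:
                return table[key]
        return self.global_median
```

Stratifying the test set by whether the first table matched gives the seen and unseen subsets compared above.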

Smoothed LUT comparison. Laplace-smoothed (add-k) path LUTs do not close the gap. The best variant (k=0.5) achieves MAE 4,011, which is 5.6% worse than the unsmoothed path LUT (3,797) and 15.7% worse than the model (3,466). Smoothing degrades overall accuracy because it shrinks predictions for the 91.6% of matched paths toward the global mean, increasing error on the majority to marginally reduce error on the minority. The model's advantage on seen paths (MAE 1,346 vs. 1,686 for k=0.5) confirms it learns genuine cross-path representations rather than merely regularizing noisy low-count memorization.
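The exact smoothing formula is not given. One pseudo-count form that matches the description (predictions shrink toward the global mean, more strongly for low-count paths) is sketched below as an assumption.

```python
def smoothed_path_estimate(path_values, global_mean, k=0.5):
    """Add-k style shrinkage of a path's mean toward the global mean.

    Assumed form: (sum of observations + k * global_mean) / (n + k).
    With no observations the estimate equals the global mean; with many
    observations it approaches the path's own mean.
    """
    n = len(path_values)
    return (sum(path_values) + k * global_mean) / (n + k)


print(smoothed_path_estimate([100.0], global_mean=10.0))       # one observation
print(smoothed_path_estimate([100.0] * 50, global_mean=10.0))  # fifty observations
```

Because every matched path is pulled toward the global mean, even well-observed paths lose some accuracy, which is the trade-off the paragraph above describes.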

Resource Type Analysis

Per-type MAE reduction (Table 7), comparing the domain+type LUT to the XGBoost Tweedie model:

| Resource type | LUT MAE | Model MAE | Improvement |
|---|---|---|---|
| Script | 12,996 | 3,135 | +75.9% |
| CSS | 6,222 | 2,084 | +66.5% |
| HTML | 2,246 | 1,294 | +42.4% |
| Image | 8,330 | 7,428 | +10.8% |
| Other (beacon) | 16 | 16 | +4.6% |
| Text | 292 | 303 | -3.8% |

[Figure: Error by resource type on log scale, showing improvement percentages]

The model's value concentrates on high-cost, high-variance types. Scripts see a 75.9% MAE reduction; CSS sees 66.5%. For text resources, the model is slightly worse than the LUT (-3.8%) because the LUT's constant near-zero prediction is already close to optimal. This is the correct failure mode: the types where the model underperforms contribute negligibly to the user-facing aggregate.

Feature Importance

[Figure: Feature importance for the Tweedie model, showing the top 15 features by gain]

domain_type_median dominates with 8.65% gain, followed by rt_other (5.87%), path_has_sync (4.88%), and ext_html (4.08%). Eight of the top 15 features are TF-IDF embedding dimensions, confirming that the learned token representations carry substantial predictive signal beyond domain identity and hand-crafted patterns.

[Figure: SHAP beeswarm plot showing per-feature impact on predictions across 5,000 test requests]
[Figure: Mean absolute SHAP value per feature]

SHAP analysis confirms the gain-based ordering. domain_type_median accounts for the majority of model output, with high values pushing predictions strongly upward. rt_other pushes predictions downward when present, correctly identifying near-zero beacons. The TF-IDF dimensions distribute signal across URL structural patterns that no single hand-crafted feature captures.

Aggregation Accuracy

Users see a weekly aggregate ("Firefox saved you 2.3MB"), not individual predictions. Per-request errors cancel under aggregation if they are approximately symmetric. We simulate this by uniformly sampling N requests from the test set, summing predictions, comparing to the true sum, and repeating 1,000 times (Table 8).

| N requests | Model: median % error | Model: within 10% | LUT: median % error | LUT: within 10% |
|---|---|---|---|---|
| 200 | 6.0% | 67.2% | 21.9% | 13.6% |
| 500 | 5.1% | 75.5% | 23.2% | 4.2% |

[Figure: Weekly aggregate accuracy: model vs LUT at different browsing volumes]

The gap widens with more requests. At N=200, the model's median error is 6.0% versus the LUT's 21.9%. At N=500, the model improves to 5.1% while the LUT worsens to 23.2%. The LUT's systematic bias (always predicting the conditional median) compounds under aggregation. The model's roughly symmetric errors cancel.
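The simulation described in this section (sample N requests, sum predictions, compare to the true sum, repeat 1,000 times) can be sketched as follows. Sampling without replacement and skipping trials whose true sum is zero are assumptions; the function name is illustrative.

```python
import random
from statistics import median


def simulate_aggregate_error(y_true, y_pred, n_requests=200, n_trials=1000, seed=0):
    """Return (median % error of the summed prediction, share within 10%)."""
    rng = random.Random(seed)
    indices = range(len(y_true))
    pct_errors = []
    for _ in range(n_trials):
        sample = rng.sample(indices, n_requests)  # uniform, without replacement
        true_sum = sum(y_true[i] for i in sample)
        pred_sum = sum(y_pred[i] for i in sample)
        if true_sum == 0:
            continue  # percentage error is undefined for an all-zero sample
        pct_errors.append(100.0 * abs(pred_sum - true_sum) / true_sum)
    within_10 = sum(e <= 10.0 for e in pct_errors) / len(pct_errors)
    return median(pct_errors), within_10
```

A predictor with a consistent multiplicative bias keeps that bias in the sum regardless of N, whereas zero-mean errors shrink relative to the total as N grows.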

Correlated Browsing

Uniform sampling assumes independent requests. Real browsing exhibits domain correlation: a user visiting a news site generates multiple requests to the same tracker domains. We simulate this by sampling 15 domains, then drawing N requests exclusively from those domains (Table 9).

| N requests | Model (correlated) | Model (uniform) | LUT (correlated) | LUT (uniform) |
|---|---|---|---|---|
| 200 | 12.9% | 6.3% | 36.4% | 22.2% |

Both methods degrade under correlation. In absolute terms the model's advantage grows: the gap in median error widens from 15.9 percentage points under uniform sampling (6.3% vs 22.2%) to 23.5 points under correlated sampling (12.9% vs 36.4%). In relative terms it narrows, from 3.5x better to 2.8x better. The margin the model keeps under domain concentration is consistent with its use of within-domain URL features, which provide discriminative signal even when the domain distribution is narrow.
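The correlated sampling scheme (pick 15 domains, then draw N requests only from them) can be sketched as below. Drawing requests with replacement from the pooled domains is an assumption, as is the function name.

```python
import random
from collections import defaultdict


def sample_correlated(domains, n_requests=200, n_domains=15, seed=0):
    """Return request indices drawn only from n_domains randomly chosen domains.

    domains: list giving the domain of each test request, by index.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for idx, dom in enumerate(domains):
        by_domain[dom].append(idx)
    chosen = rng.sample(sorted(by_domain), min(n_domains, len(by_domain)))
    pool = [idx for dom in chosen for idx in by_domain[dom]]
    # Assumption: requests are drawn with replacement from the pooled domains.
    return rng.choices(pool, k=n_requests)
```

Summing predictions and true values over the returned indices, and repeating across seeds, gives the correlated counterpart of the uniform aggregate-error estimate.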

Temporal Generalization

We train on June 2024 data and evaluate on September 2024 to measure drift (Table 11).

| Method | June MAE | September MAE | Degradation |
|---|---|---|---|
| XGBoost Tweedie | 3,466 | 4,517 | +30.3% |
| Path LUT | 3,797 | 4,966 | +30.8% |
| Domain+type LUT | 6,597 | 6,723 | +1.9% |

The model degrades 30.3% over three months, comparable to the path LUT's 30.8%. The domain+type LUT is stable (+1.9%) but permanently worse in absolute terms. The model retains a 32.8% advantage over the domain+type LUT in September. Ranking quality is nearly unchanged (Spearman 0.945 in June, 0.942 in September).

Drift analysis: 11.8% of September domains are new; 85.2% of rows share paths with the training set. URL path churn, not domain churn, drives the degradation. Per-type analysis confirms this: the model's script improvement drops from +75.9% in June to +50.6% in September, consistent with script bundles receiving new versioned paths over time.
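The two drift statistics can be computed with set operations. The sketch below assumes each row carries `domain` and `path` fields and that a path counts as shared only when it matches on the same domain; both are assumptions about the original analysis.

```python
def drift_stats(train_rows, later_rows):
    """Return (share of later domains that are new, share of later rows with a seen path)."""
    train_domains = {r["domain"] for r in train_rows}
    train_paths = {(r["domain"], r["path"]) for r in train_rows}
    later_domains = {r["domain"] for r in later_rows}

    # Domain churn: counted over distinct domains.
    new_domain_share = len(later_domains - train_domains) / len(later_domains)
    # Path coverage: counted over rows, so frequently requested paths weigh more.
    seen_path_share = sum(
        (r["domain"], r["path"]) in train_paths for r in later_rows
    ) / len(later_rows)
    return new_domain_share, seen_path_share
```

Note that the first figure is a fraction of distinct domains while the second is a fraction of rows, matching how the two percentages are phrased above.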

Multi-Target Results

We train separate models for four targets: transfer_bytes and three additional HTTP timing metrics (Table 10).

| Target | Per-request improvement | Aggregation error |
|---|---|---|
| transfer_bytes | +47.5% | 6.0% |
| download_ms | +23.7% | 16.6% |
| load_ms | -3.1% | 5.6% |
| ttfb_ms | -12.7% | 4.0% |

Content-dependent metrics (transfer_bytes, download_ms) show strong per-request improvement because URL features predict content size, which determines transfer and download duration. Network-dependent metrics (load_ms, ttfb_ms) show no per-request improvement; both degrade (-3.1% and -12.7%) because latency depends on server location, CDN configuration, and network conditions, none of which are observable from the URL.

However, load_ms and ttfb_ms both achieve strong aggregation accuracy (5.6% and 4.0% respectively), outperforming the LUT in aggregate despite losing per-request. The model's symmetric errors cancel under summation even when its per-request discrimination is weak.

Calibration Analysis

We evaluate per-bin prediction accuracy by partitioning test rows into transfer size ranges and measuring the fraction of predictions within 25% of the true value.

| Actual size range | n | Predicted mean | Within 25% |
|---|---|---|---|
| 0–100 B | 26,462 | 20 B | 51.3% |
| 100–500 B | 3,476 | 212 B | 72.2% |
| 500 B–1 KB | 1,117 | 697 B | 42.5% |
| 1–5 KB | 3,250 | 2,459 B | 49.6% |
| 5–10 KB | 1,709 | 7,169 B | 66.1% |
| 10–25 KB | 4,044 | 17,146 B | 59.1% |
| 25–50 KB | 1,363 | 34,425 B | 37.3% |
| 50–100 KB | 4,777 | 82,147 B | 86.6% |
| 100–250 KB | 576 | 145,427 B | 74.7% |
| 250 KB+ | 120 | 671,042 B | 25.8% |

The model is well-calibrated for the two dominant populations: near-zero beacons (0–100 B, 51.3% within 25%, actual median 0) and large JavaScript bundles (50–100 KB, 86.6%), which is the secondary mode of the transfer size distribution. The 500 B–5 KB range shows systematic underprediction: in the 500 B–1 KB bin, the predicted mean is 697 bytes against an actual mean of 2,144 (factor of 3). These mid-range responses (small JSON API responses, CSS files, tracking pixel redirects) sit in the transition between the zero-mass spike and the heavy tail, where the Tweedie loss gradient signal is weakest. Aggregation across N requests partially cancels this bias, producing 6.0% median weekly error at N=200 despite the per-request underprediction in this range.
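The per-bin calibration table can be produced with a short binning routine. The sketch assumes decimal units (1 KB = 1,000 bytes), half-open bins, and that an actual value of 0 counts as a hit only when the prediction is exactly 0; none of these details are stated in the text.

```python
import bisect

# Lower edges of the size bins in bytes (assumption: 1 KB = 1,000 bytes).
BIN_EDGES = [0, 100, 500, 1_000, 5_000, 10_000, 25_000, 50_000, 100_000, 250_000]


def calibration_by_bin(y_true, y_pred, tolerance=0.25):
    """Per-bin count, mean prediction, and share of predictions within tolerance."""
    stats = [{"n": 0, "pred_sum": 0.0, "hits": 0} for _ in BIN_EDGES]
    for actual, pred in zip(y_true, y_pred):
        b = bisect.bisect_right(BIN_EDGES, actual) - 1  # bin of the actual size
        stats[b]["n"] += 1
        stats[b]["pred_sum"] += pred
        # Relative tolerance; for actual == 0 only an exact 0 prediction hits.
        stats[b]["hits"] += abs(pred - actual) <= tolerance * actual
    return [
        {
            "lower_edge": edge,
            "n": s["n"],
            "predicted_mean": s["pred_sum"] / s["n"],
            "within": s["hits"] / s["n"],
        }
        for edge, s in zip(BIN_EDGES, stats)
        if s["n"]  # skip empty bins
    ]
```

Comparing `predicted_mean` with the actual mean of each bin is what exposes the mid-range underprediction discussed above.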

Limitations

Transfer size only. The model predicts transfer size but not CPU execution time. CPU cost depends on JavaScript engine internals and device hardware, which differ between Chrome (training data) and Firefox (deployment).

Feature ceiling. TF-IDF over URL tokens captures lexical patterns but not semantic content. Two URLs with different token distributions can serve identical payloads; two URLs with similar tokens can serve different payloads depending on server-side A/B tests or user state. The limitations of hand-crafted and bag-of-words features motivate the learned representation approach in Article 3.

Calibration bias in the mid-range. The model systematically underpredicts transfer size for responses in the 500 B–5 KB range, where predicted means are 2–3x below actuals. This affects JSON API responses and small CSS files that fall between the zero-mass spike and the heavy right tail, the region where Tweedie gradient signal is weakest. Per-request accuracy in this range should be treated as order-of-magnitude rather than precise; aggregation across requests substantially mitigates this bias.

Temporal decay. The 30.3% degradation over three months implies retraining at least quarterly to maintain accuracy. URL path churn in script bundles is the primary driver and is not addressable through feature engineering alone.

[Figure: Predicted vs actual transfer size on log scale, with residual distribution]
[Figure: Calibration analysis: predicted vs actual mean per bin, and prediction accuracy by size range]