Hand-Crafted Features and Tree Models

Methods

We train an XGBoost gradient boosted tree regressor on 523,624 labeled HTTP Archive requests (June 2024). The target is transfer_bytes, which is zero-inflated (39.5% zeros) and right-skewed (median 43 bytes, mean 13,607 bytes, max 8.6MB). The feature set comprises 56 dimensions: 6 hand-crafted regex features extracted from URL structure and 50 TF-IDF SVD embedding dimensions over URL path tokens.

Hyperparameters are tuned via Optuna (40 trials, 5-fold CV, minimizing MAE on raw byte values). Final configuration: 500 trees, max depth 8, learning rate 0.05, row subsample 0.8, column subsample 0.7, min child weight 10, early stopping patience 20. The variance power p of the Tweedie loss is tuned jointly with tree parameters.

Tweedie loss requires strictly positive targets; we offset transfer_bytes by +1 during training and subtract 1 from predictions.
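For concreteness, the tuned configuration can be written out as native-API training parameters, with the +1 offset as a pair of helpers. This is a sketch, not the production code: the parameter spellings follow XGBoost's `reg:tweedie` objective, and the helper names are ours.

```python
# Final configuration from the Optuna search, as xgboost.train parameters.
params = {
    "objective": "reg:tweedie",
    "tweedie_variance_power": 1.5,  # tuned jointly with tree parameters
    "max_depth": 8,
    "eta": 0.05,                    # learning rate
    "subsample": 0.8,               # row subsample
    "colsample_bytree": 0.7,        # column subsample
    "min_child_weight": 10,
}
NUM_BOOST_ROUND = 500
EARLY_STOPPING_ROUNDS = 20

# Tweedie requires strictly positive targets: offset by +1 for training,
# subtract 1 from predictions (clipping at zero).
def to_target(transfer_bytes: float) -> float:
    return transfer_bytes + 1.0

def from_prediction(pred: float) -> float:
    return max(pred - 1.0, 0.0)
```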

Baseline Comparison

We evaluate five approaches on the held-out test set (Table 3). MAE is in bytes; Spearman rho measures ranking quality.

| Method | MAE (bytes) | Spearman rho | 95% CI (MAE) |
|---|---|---|---|
| Global median | 13,661 | | |
| Domain LUT | 7,878 | | |
| Domain+type LUT | 6,597 | 0.926 | |
| Path LUT (with fallback) | 3,797 | 0.934 | [3,623, 3,984] |
| XGBoost Tweedie | 3,466 | 0.945 | [3,314, 3,627] |
Figure: Complete model leaderboard, comparing all models by MAE and Spearman rank correlation.

The XGBoost Tweedie model achieves the lowest MAE (3,466 bytes) and the highest ranking quality (Spearman 0.945). It improves over the strongest LUT baseline (path LUT, MAE 3,797) by 8.7%, and over the domain+type LUT by 47.5%. The model simultaneously improves both absolute accuracy and ranking, which is unusual since these objectives often trade off.

Loss Function Ablation

The choice of loss function dominates all other design decisions (Table 5).

| Loss function | MAE | Spearman rho |
|---|---|---|
| Squared error | 4,527 | 0.738 |
| Tweedie p=1.2 | 3,486 | 0.937 |
| Tweedie p=1.5 | 3,466 | 0.945 |
| Tweedie p=1.8 | 3,597 | 0.949 |
| Huber (delta=10,000) | diverged | |

Tweedie at p=1.5 outperforms squared error by 23% in MAE and lifts Spearman from 0.738 to 0.945. Huber loss diverged entirely.

The mechanism is straightforward. With 39.5% zeros in the target distribution, squared error allocates capacity uniformly across all training examples, spending optimization budget on near-zero beacons that contribute nothing to the user-facing aggregate. Tweedie loss weights the gradient by the predicted magnitude ($\hat{y}^{1-p}$), down-weighting near-zero predictions and concentrating capacity on high-cost scripts. Loss function selection matters more than architecture.
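The down-weighting can be made explicit. Under XGBoost's log link (the booster outputs $f = \log\hat{y}$), the Tweedie negative log-likelihood for $1 < p < 2$ is, up to terms constant in $f$:

```latex
\mathcal{L}(y, \hat{y}) = -\frac{y\,\hat{y}^{1-p}}{1-p} + \frac{\hat{y}^{2-p}}{2-p},
\qquad 1 < p < 2
```

Differentiating with respect to $f$ with $\hat{y} = e^{f}$:

```latex
\frac{\partial \mathcal{L}}{\partial f}
= -y\,e^{(1-p)f} + e^{(2-p)f}
= \hat{y}^{1-p}\,(\hat{y} - y)
```

Relative to the squared-error residual $(\hat{y} - y)$, each example's gradient is scaled by $\hat{y}^{1-p}$. For a zero-byte beacon ($y = 0$) the gradient reduces to $\hat{y}^{2-p}$, which vanishes as the prediction approaches zero: trees stop spending splits on beacons as soon as the prediction is approximately right, freeing capacity for high-cost scripts.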

Feature Ablation

We ablate the two feature families independently (Table 6).

| Feature set | Dimensions | MAE |
|---|---|---|
| Regex only | 6 | 4,251 |
| TF-IDF SVD only | 50 | 3,548 |
| Both | 56 | 3,466 |

TF-IDF features alone improve 16.5% over regex-only features. Adding regex features on top of TF-IDF yields a marginal 2.3% further gain. TF-IDF subsumes the hand-crafted patterns: tokens like sdk, bundle, collect, and pixel are discovered automatically. The regex features contribute only residual signal not captured by the learned vocabulary.
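The TF-IDF pipeline can be sketched with scikit-learn's `TfidfVectorizer` and `TruncatedSVD`. This is illustrative, not the exact production tokenizer; the sample URLs, the regex split, and the tiny `n_components=2` (the real pipeline uses 50, which requires a larger vocabulary) are assumptions.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def path_tokens(url_path: str) -> str:
    # Split the path on non-alphanumeric boundaries:
    # "/js/sdk.bundle.min.js" -> "js sdk bundle min js"
    return " ".join(re.findall(r"[a-z0-9]+", url_path.lower()))

paths = [
    "/js/sdk.bundle.min.js",
    "/collect",
    "/img/pixel.gif",
    "/css/site.bundle.css",
]
docs = [path_tokens(p) for p in paths]

# Sparse TF-IDF over path tokens, then a dense SVD projection.
tfidf = TfidfVectorizer(ngram_range=(1, 1), min_df=1)
X = tfidf.fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
emb = svd.fit_transform(X)  # one dense embedding row per URL
```

Tokens like `sdk`, `bundle`, `collect`, and `pixel` enter the vocabulary automatically, which is exactly how the learned features come to subsume the hand-crafted regex patterns.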

Path Coverage Analysis

We stratify the test set by whether the exact URL path was observed during training (Table 4).

| Subset | % of test | n | Path LUT MAE | Model MAE |
|---|---|---|---|---|
| Path seen | 91.6% | 479,775 | 1,448 | 1,346 |
| Path unseen | 8.4% | 43,849 | 29,491 | 26,655 |

The model outperforms the path LUT on both subsets. On seen paths, the model beats pure memorization (1,346 vs 1,448), indicating it learns structure beyond lookup. On unseen paths, the LUT falls back to coarser aggregates and MAE degrades to 29,491; the model degrades to 26,655, maintaining a 9.6% advantage through generalization from URL features and domain-level statistics.
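The fallback behavior of the path LUT baseline can be sketched as a chain of median tables, from the most specific key to the global median. The class and key names are illustrative; the fallback order (path, then domain+type, then domain, then global) matches the baselines described above.

```python
# Sketch of the path LUT with fallback, assuming precomputed median tables.
class PathLUT:
    def __init__(self, path_median, domain_type_median, domain_median, global_median):
        self.path_median = path_median
        self.domain_type_median = domain_type_median
        self.domain_median = domain_median
        self.global_median = global_median

    def predict(self, domain: str, rtype: str, path: str) -> float:
        # Fall back from the most specific key to the coarsest aggregate.
        if path in self.path_median:
            return self.path_median[path]
        if (domain, rtype) in self.domain_type_median:
            return self.domain_type_median[(domain, rtype)]
        if domain in self.domain_median:
            return self.domain_median[domain]
        return self.global_median

lut = PathLUT(
    path_median={"/js/app.js": 52_000},
    domain_type_median={("cdn.example.com", "script"): 31_000},
    domain_median={"cdn.example.com": 18_000},
    global_median=43,  # the training-set median transfer size in bytes
)
```

An unseen path on a known domain drops to the domain+type median, which is where the 29,491-byte MAE on unseen paths comes from: the coarser aggregates are poor substitutes for path identity.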

Resource Type Analysis

Per-type MAE reduction (Table 7), comparing the domain+type LUT to the XGBoost Tweedie model:

| Resource type | LUT MAE | Model MAE | Improvement |
|---|---|---|---|
| Script | 12,996 | 3,135 | +75.9% |
| CSS | 6,222 | 2,084 | +66.5% |
| HTML | 2,246 | 1,294 | +42.4% |
| Image | 8,330 | 7,428 | +10.8% |
| Other (beacon) | 16 | 16 | +4.6% |
| Text | 292 | 303 | -3.8% |
Figure: Error by resource type on log scale, showing improvement percentages.

The model’s value concentrates on high-cost, high-variance types. Scripts see a 75.9% MAE reduction; CSS sees 66.5%. For text resources, the model is slightly worse than the LUT (-3.8%) because the LUT’s constant near-zero prediction is already close to optimal. This is the correct failure mode: the types where the model underperforms contribute negligibly to the user-facing aggregate.

Feature Importance

Figure: Feature importance for the Tweedie model, showing the top 15 features by gain.

domain_type_median dominates with 8.65% gain, followed by rt_other (5.87%), path_has_sync (4.88%), and ext_html (4.08%). Eight of the top 15 features are TF-IDF embedding dimensions, confirming that the learned token representations carry substantial predictive signal beyond domain identity and hand-crafted patterns.

Figure: SHAP beeswarm plot showing per-feature impact on predictions across 5,000 test requests.
Figure: Mean absolute SHAP value per feature.

SHAP analysis confirms the gain-based ordering. domain_type_median accounts for the majority of model output, with high values pushing predictions strongly upward. rt_other pushes predictions downward when present, correctly identifying near-zero beacons. The TF-IDF dimensions distribute signal across URL structural patterns that no single hand-crafted feature captures.

Aggregation Accuracy

Users see a weekly aggregate (“Firefox saved you 2.3MB”), not individual predictions. Per-request errors cancel under aggregation if they are approximately symmetric. We simulate this by uniformly sampling N requests from the test set, summing predictions, comparing to the true sum, and repeating 1,000 times (Table 8).

| N requests | Model: median % error | Model: within 10% | LUT: median % error | LUT: within 10% |
|---|---|---|---|---|
| 200 | 6.0% | 67.2% | 21.9% | 13.6% |
| 500 | 5.0% | — | 23.2% | — |
Figure: Weekly aggregate accuracy, model vs LUT at different browsing volumes.

The gap widens with more requests. At N=200, the model’s median error is 6.0% versus the LUT’s 21.9%. At N=500, the model improves to 5.0% while the LUT worsens to 23.2%. The LUT’s systematic bias (always predicting the conditional median) compounds under aggregation. The model’s roughly symmetric errors cancel.
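The bootstrap described above is straightforward to sketch. The synthetic data here is purely illustrative (an exponential size distribution, unbiased Gaussian noise for the model, a multiplicative underestimate for the LUT); it exists only to show why symmetric errors cancel under summation while a systematic bias does not.

```python
import random

def aggregate_error(y_true, y_pred, n, trials=1000, seed=0):
    """Sample n requests uniformly, sum predictions and truths, and
    return the median absolute percentage error over repeated trials."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sample = [rng.randrange(len(y_true)) for _ in range(n)]
        true_sum = sum(y_true[i] for i in sample)
        pred_sum = sum(y_pred[i] for i in sample)
        errors.append(abs(pred_sum - true_sum) / max(true_sum, 1.0))
    errors.sort()
    return errors[len(errors) // 2]

# Synthetic illustration (not HTTP Archive data):
rng = random.Random(1)
y_true = [rng.expovariate(1 / 10_000) for _ in range(5_000)]
y_pred = [y + rng.gauss(0, 2_000) for y in y_true]  # symmetric, unbiased
y_lut = [y * 0.7 for y in y_true]                   # systematic 30% underestimate
```

With these assumptions, the unbiased predictor's aggregate error shrinks toward zero as N grows (noise scales with the square root of N while the sum scales linearly), whereas the biased predictor is pinned at its bias regardless of N.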

Correlated Browsing

Uniform sampling assumes independent requests. Real browsing exhibits domain correlation: a user visiting a news site generates multiple requests to the same tracker domains. We simulate this by sampling 15 domains, then drawing N requests exclusively from those domains (Table 9).

| N requests | Model (correlated) | Model (uniform) | LUT (correlated) | LUT (uniform) |
|---|---|---|---|---|
| 200 | 12.9% | 6.3% | 36.4% | 22.2% |

Both methods degrade under correlation. The model remains substantially more accurate, though its relative edge narrows: 2.8x better than the LUT under correlated sampling (12.9% vs 36.4%), versus 3.5x under uniform sampling (6.3% vs 22.2%). The model's robustness to domain concentration reflects its use of within-domain URL features, which provide discriminative signal even when the domain distribution is narrow.
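The correlated sampler differs from the uniform one only in its first step: pick a small set of domains, then draw all N requests from those domains' rows. A minimal sketch, with illustrative data where each row's value encodes its domain so the restriction is visible:

```python
import random
from collections import defaultdict

def correlated_sample(requests, n, n_domains=15, seed=0):
    """requests: list of (domain, value). Draw n_domains domains, then n
    requests uniformly from rows belonging to those domains only."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for dom, val in requests:
        by_domain[dom].append(val)
    chosen = rng.sample(sorted(by_domain), k=min(n_domains, len(by_domain)))
    pool = [v for dom in chosen for v in by_domain[dom]]
    return [rng.choice(pool) for _ in range(n)]

# Illustrative data: 40 domains, each domain's rows share one value.
requests = [(f"d{i % 40}.example", float(i % 40)) for i in range(400)]
sample = correlated_sample(requests, n=200)
```

Because the pool is restricted to 15 domains, a few heavy or unusual domains dominate each trial's aggregate, which is what inflates both methods' errors relative to uniform sampling.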

Temporal Generalization

We train on June 2024 data and evaluate on September 2024 to measure drift (Table 11).

| Method | June MAE | September MAE | Degradation |
|---|---|---|---|
| XGBoost Tweedie | 3,466 | 4,517 | +30.3% |
| Path LUT | 3,797 | 4,966 | +30.8% |
| Domain+type LUT | 6,597 | 6,723 | +1.9% |

The model degrades 30.3% over three months, comparable to the path LUT’s 30.8%. The domain+type LUT is stable (+1.9%) but permanently worse in absolute terms. The model retains a 32.8% advantage over the domain+type LUT in September. Ranking quality is nearly unchanged (Spearman 0.945 in June, 0.942 in September).

Drift analysis: 11.8% of September domains are new; 85.2% of rows share paths with the training set. URL path churn, not domain churn, drives the degradation. Per-type analysis confirms this: the model’s script improvement drops from +75.9% in June to +50.6% in September, consistent with script bundles receiving new versioned paths over time.

Multi-Target Results

We train separate models for four additional HTTP timing metrics (Table 10).

| Target | Per-request improvement | Aggregation error |
|---|---|---|
| transfer_bytes | +47.5% | 6.0% |
| download_ms | +23.7% | 16.6% |
| load_ms | -3.1% | 5.6% |
| ttfb_ms | -12.7% | 4.0% |

Content-dependent metrics (transfer_bytes, download_ms) show strong per-request improvement because URL features predict content size, which determines transfer and download duration. Network-dependent metrics (load_ms, ttfb_ms) show no per-request improvement or slight degradation because latency depends on server location, CDN configuration, and network conditions, none of which are observable from the URL.

However, load_ms and ttfb_ms both achieve strong aggregation accuracy (5.6% and 4.0% respectively), outperforming the LUT in aggregate despite losing per-request. The model’s symmetric errors cancel under summation even when its per-request discrimination is weak.

Limitations

Transfer size only. The model predicts transfer size but not CPU execution time. CPU cost depends on JavaScript engine internals and device hardware, which differ between Chrome (training data) and Firefox (deployment).

Feature ceiling. TF-IDF over URL tokens captures lexical patterns but not semantic content. Two URLs with different token distributions can serve identical payloads; two URLs with similar tokens can serve different payloads depending on server-side A/B tests or user state. The limitations of hand-crafted and bag-of-words features motivate the learned representation approach in Article 3.

Temporal decay. The 30.3% degradation over three months implies a retraining cadence of at most quarterly to maintain accuracy. URL path churn in script bundles is the primary driver and is not addressable through feature engineering alone.

Figure: Predicted vs actual transfer size on log scale, with residual distribution.
Figure: Calibration analysis: predicted vs actual mean per bin, and prediction accuracy by size range.