Feature Engineering and Selection
Nineteen features that describe how a tracker domain behaves on the web, and which ones actually matter.
The Feature Vector
Each tracker domain gets a fixed-length feature vector aggregated from HTTP Archive request data. The features fall into five groups, designed to capture different aspects of a domain’s behavior without requiring Lighthouse CPU measurements (which aren’t available for all domains).
Size features (5)
| # | Feature | Source |
|---|---|---|
| 1 | p50_transfer_bytes | Median transfer size across all requests |
| 2 | p90_transfer_bytes | 90th percentile transfer size |
| 3 | mean_transfer_bytes | Mean transfer size |
| 4 | max_script_bytes | Largest single script from this domain |
| 5 | bytes_per_page | Total bytes from this domain per page (median across pages) |
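To make the aggregation concrete, here is a minimal sketch of computing the size features with pandas. The request-level frame and its column names (`domain`, `page`, `transfer_bytes`, `is_script`) are hypothetical stand-ins for the actual HTTP Archive schema:

```python
import pandas as pd

# Hypothetical request-level frame: one row per HTTP Archive request.
requests = pd.DataFrame({
    "domain": ["t.example", "t.example", "t.example", "u.example"],
    "page": ["p1", "p1", "p2", "p1"],
    "transfer_bytes": [1200, 48000, 900, 300],
    "is_script": [False, True, False, False],
})

grouped = requests.groupby("domain")["transfer_bytes"]
size_features = pd.DataFrame({
    "p50_transfer_bytes": grouped.median(),
    "p90_transfer_bytes": grouped.quantile(0.9),
    "mean_transfer_bytes": grouped.mean(),
    # Largest single script served by this domain (0 if it serves none).
    "max_script_bytes": (
        requests[requests["is_script"]]
        .groupby("domain")["transfer_bytes"].max()
        .reindex(grouped.median().index, fill_value=0)
    ),
    # Median across pages of the domain's total bytes on each page.
    "bytes_per_page": (
        requests.groupby(["domain", "page"])["transfer_bytes"].sum()
        .groupby("domain").median()
    ),
})
```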
max_script_bytes turned out to be the single most important feature for predicting CPU cost (SHAP value 0.056, nearly 2x the next feature). The intuition: the biggest script a domain serves is the best proxy for how much computation it triggers.
Type composition (4)
| # | Feature | Source |
|---|---|---|
| 6 | script_request_ratio | Fraction of requests that are type=script |
| 7 | image_request_ratio | Fraction that are type=image |
| 8 | requests_per_page | Median requests from this domain per page |
| 9 | distinct_resource_types | Number of unique resource types served |
script_request_ratio is the third most important CPU feature (SHAP 0.019). A domain that’s 95% scripts (like googletagmanager.com) behaves very differently from one that’s 95% images (tracking pixels). But it’s not a perfect signal. bat.bing.com has 0% script ratio in request data yet burns 105ms of CPU, because Lighthouse attributes CPU from redirect chains and cross-domain scripts.
Timing / execution (4)
| # | Feature | Source |
|---|---|---|
| 10 | mean_lh_main_thread_ms | Lighthouse third-parties-insight mainThreadTime |
| 11 | mean_scripting_ms | Lighthouse bootup-time scripting time |
| 12 | mean_parse_compile_ms | Lighthouse bootup-time scriptParseCompile time |
| 13 | p50_load_ms | Median total request time |
Features 10-12 are Lighthouse-derived. They're available for the 2,185 domains with CPU data and are NaN for the rest. XGBoost handles NaN natively via learned default split directions, so no imputation is needed.
These features are used during training but aren’t required at inference time. The generalization experiment (training without features 10-12) shows the model can still achieve rho=0.751 from request features alone.
Render-blocking (3)
| # | Feature | Source |
|---|---|---|
| 14 | render_block_rate | Fraction of pages where domain appears in render-blocking-insight |
| 15 | mean_render_block_wasted_ms | Average wasted render-blocking ms |
| 16 | mean_render_block_bytes | Average render-blocking resource size |
These turned out to be nearly useless. Data exploration showed that almost no tracker domains appear in the render-blocking audit (see Problem and Target Variables). The features are kept for completeness but contribute almost zero to predictions.
Prevalence / context (3)
| # | Feature | Source |
|---|---|---|
| 17 | pages_seen_on | Number of pages this domain appears on |
| 18 | p10_waterfall_index | 10th percentile waterfall position (how early it loads) |
| 19 | disconnect_category | One-hot encoded into 6 binary features |
disconnect_category is one-hot encoded into: Advertising, Analytics, Social, Content, FingerprintingInvasive, Cryptomining. This adds 6 columns, giving 25 total input features.
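The encoding itself is a one-liner with pandas; pinning the column to a fixed `Categorical` guarantees all six columns exist even when a category is absent from the data. The frame below is a toy example, not the real feature table:

```python
import pandas as pd

# Toy feature frame with the categorical column to expand.
features = pd.DataFrame({
    "max_script_bytes": [170_000, 250],
    "disconnect_category": ["Advertising", "Analytics"],
})

# Fixing the category list ensures one binary column per Disconnect
# category, even for categories unseen in this particular frame.
categories = ["Advertising", "Analytics", "Social", "Content",
              "FingerprintingInvasive", "Cryptomining"]
features["disconnect_category"] = pd.Categorical(
    features["disconnect_category"], categories=categories)
encoded = pd.get_dummies(features, columns=["disconnect_category"])
```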
The surprise: Disconnect categories contribute almost nothing to predictions. SHAP values for all six category features are near zero. A domain’s privacy category (Advertising vs Analytics vs Social) doesn’t predict its performance cost. An “Advertising” domain can be a pixel (near-zero cost) or a 170KB SDK (high cost). The model learns from behavioral features (script size, transfer size), not categorical labels.
Preprocessing
Log transform: All byte and millisecond features are log-transformed before training. These distributions are heavily right-skewed: max_script_bytes ranges from 0 to 27MB, and mean_scripting_ms ranges from 0 to 7,300ms. Log1p compresses the range while preserving ordering.
Negative values: One edge case: p50_load_ms had a single -1 value in the data (likely a measurement artifact). It’s clipped to 0 before the log transform to avoid -inf.
Missing values: Left as NaN. XGBoost learns default split directions for missing values, which is more principled than imputation. The Lighthouse features (10-12) are NaN for ~52% of domains.
The Disconnect List Join
This was harder than expected. The Disconnect list uses base domains (doubleclick.net), but HTTP Archive records full subdomains (stats.g.doubleclick.net, td.doubleclick.net, googleads.g.doubleclick.net).
Initial naive join (exact match on domain): 92 matches out of 4,592. Terrible.
Fix: suffix matching. For each tracker domain, try progressively shorter suffixes until a match is found:
```python
def match_disconnect_domain(tracker_domain, disconnect_domains):
    # Walk from the full subdomain toward the base domain, stopping
    # before the bare TLD, and return the first suffix on the list.
    parts = tracker_domain.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in disconnect_domains:
            return candidate
    return None
```

After the fix: 3,686 matches (80% of domains). The remaining 20% are domains in HTTP Archive’s third-party categories but not on Disconnect’s list, reflecting different definitions of “tracker.”
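Exercised on the doubleclick.net subdomains from the text, the suffix walk resolves each to the base domain on the list (the function is restated so the snippet runs standalone):

```python
def match_disconnect_domain(tracker_domain, disconnect_domains):
    # Try progressively shorter suffixes until one is on the list.
    parts = tracker_domain.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in disconnect_domains:
            return candidate
    return None

disconnect = {"doubleclick.net"}
match_disconnect_domain("stats.g.doubleclick.net", disconnect)  # "doubleclick.net"
match_disconnect_domain("example.com", disconnect)              # None
```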
The Percentile Rank Fix
The target scores are percentile ranks (0.0 = cheapest, 1.0 = most expensive). Originally I computed ranks across all 4,592 domains, including the 2,407 with no CPU data that were filled with 0.
This compressed the training data for main_thread_cost into [0.5, 1.0]; the bottom half of the range was all zeros. The model had half the target resolution to work with.
Fix: compute percentile ranks only within the 2,185 domains with real CPU data. The cheapest measured domain gets ~0.0, the most expensive gets ~1.0. Full [0, 1] range for training. Domains without CPU data get NaN for main_thread_cost (they’re predicted by the model, not used for training).
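The fixed target computation is a one-liner with pandas, since `Series.rank(pct=True)` ranks only the non-NaN values and leaves NaN in place. (Note that pandas puts the minimum at 1/n rather than exactly 0.0, which is close enough for 2,185 domains.) The CPU values below are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative CPU measurements; NaN marks domains with no CPU data.
cpu_ms = pd.Series([12.0, np.nan, 105.0, 0.5, np.nan])

# Percentile-rank only the measured domains; unmeasured domains keep
# NaN targets and are excluded from training (predicted at inference).
main_thread_cost = cpu_ms.rank(pct=True)
```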