Feature Engineering and Selection

Nineteen features that describe how a tracker domain behaves on the web, and which ones actually matter.


The Feature Vector

Each tracker domain gets a fixed-length feature vector aggregated from HTTP Archive request data. The features fall into five groups, designed to capture different aspects of a domain’s behavior without requiring Lighthouse CPU measurements (which aren’t available for all domains).
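
For a sense of the mechanics, here is roughly what that aggregation looks like in pandas. The input columns are illustrative stand-ins for the actual HTTP Archive schema, and only a few of the features defined below are shown:

import pandas as pd

# Toy stand-in for HTTP Archive request rows (one row per request);
# real column names may differ.
requests = pd.DataFrame({
    "domain": ["a.com", "a.com", "b.com"],
    "page": ["p1", "p2", "p1"],
    "resource_type": ["script", "image", "script"],
    "transfer_bytes": [120_000, 450, 80_000],
})

features = requests.groupby("domain").agg(
    p50_transfer_bytes=("transfer_bytes", "median"),
    p90_transfer_bytes=("transfer_bytes", lambda s: s.quantile(0.9)),
    mean_transfer_bytes=("transfer_bytes", "mean"),
    script_request_ratio=("resource_type", lambda s: (s == "script").mean()),
    image_request_ratio=("resource_type", lambda s: (s == "image").mean()),
    distinct_resource_types=("resource_type", "nunique"),
    pages_seen_on=("page", "nunique"),
)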

Size features (5)

| # | Feature | Source |
|---|---------|--------|
| 1 | p50_transfer_bytes | Median transfer size across all requests |
| 2 | p90_transfer_bytes | 90th percentile transfer size |
| 3 | mean_transfer_bytes | Mean transfer size |
| 4 | max_script_bytes | Largest single script from this domain |
| 5 | bytes_per_page | Total bytes from this domain per page (median across pages) |

max_script_bytes turned out to be the single most important feature for predicting CPU cost (SHAP value 0.056, nearly 2x the next feature). The intuition: the biggest script a domain serves is the best proxy for how much computation it triggers.
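
The SHAP numbers quoted here and below read like global importances, typically computed as the mean absolute attribution per feature across the dataset. A sketch with the shap package, where model (a trained xgboost.XGBRegressor) and X (its feature matrix) are assumptions:

import shap

# model: a trained xgboost.XGBRegressor; X: the feature matrix (both assumed).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Mean |SHAP| per feature gives a global importance score.
importance = abs(shap_values).mean(axis=0)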

Type composition (4)

| # | Feature | Source |
|---|---------|--------|
| 6 | script_request_ratio | Fraction of requests that are type=script |
| 7 | image_request_ratio | Fraction that are type=image |
| 8 | requests_per_page | Median requests from this domain per page |
| 9 | distinct_resource_types | Number of unique resource types served |

script_request_ratio is the third most important CPU feature (SHAP 0.019). A domain that's 95% scripts (like googletagmanager.com) behaves very differently from one that's 95% images (tracking pixels). But it's not a perfect signal: bat.bing.com has a 0% script ratio in the request data yet burns 105ms of CPU, because Lighthouse attributes CPU from redirect chains and cross-domain scripts.

Timing / execution (4)

| # | Feature | Source |
|----|---------|--------|
| 10 | mean_lh_main_thread_ms | Lighthouse third-parties-insight mainThreadTime |
| 11 | mean_scripting_ms | Lighthouse bootup-time scripting time |
| 12 | mean_parse_compile_ms | Lighthouse bootup-time scriptParseCompile time |
| 13 | p50_load_ms | Median total request time |

Features 10-12 are Lighthouse-derived. They’re available for the 2,185 domains with CPU data and NaN for the rest. XGBoost handles NaN natively via learned default split directions; no imputation needed.

These features are used during training but aren’t required at inference time. The generalization experiment (training without features 10-12) shows the model can still achieve rho=0.751 from request features alone.
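
A sketch of that experiment, assuming train/test splits named X_train/X_test and percentile-rank targets y_train/y_test:

from scipy.stats import spearmanr
from xgboost import XGBRegressor

# The three Lighthouse-derived columns (features 10-12).
lh_cols = ["mean_lh_main_thread_ms", "mean_scripting_ms", "mean_parse_compile_ms"]

# Retrain without them and check how well the ranking survives.
model = XGBRegressor().fit(X_train.drop(columns=lh_cols), y_train)
rho, _ = spearmanr(y_test, model.predict(X_test.drop(columns=lh_cols)))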

Render-blocking (3)

| # | Feature | Source |
|----|---------|--------|
| 14 | render_block_rate | Fraction of pages where domain appears in render-blocking-insight |
| 15 | mean_render_block_wasted_ms | Average wasted render-blocking ms |
| 16 | mean_render_block_bytes | Average render-blocking resource size |

These turned out to be nearly useless. Data exploration showed that almost no tracker domains appear in the render-blocking audit (see Problem and Target Variables). The features are kept for completeness but contribute almost zero to predictions.

Prevalence / context (3)

| # | Feature | Source |
|----|---------|--------|
| 17 | pages_seen_on | Number of pages this domain appears on |
| 18 | p10_waterfall_index | 10th percentile waterfall position (how early it loads) |
| 19 | disconnect_category | One-hot encoded into 6 binary features |

disconnect_category is one-hot encoded into: Advertising, Analytics, Social, Content, FingerprintingInvasive, Cryptomining. This adds 6 columns, giving 25 total input features.
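
A sketch of the encoding in pandas, assuming the merged table is a DataFrame named features:

import pandas as pd

categories = ["Advertising", "Analytics", "Social", "Content",
              "FingerprintingInvasive", "Cryptomining"]

# Pinning the category list guarantees all six columns exist, and domains
# with no Disconnect match (NaN category) get all-zero rows.
col = features["disconnect_category"].astype(pd.CategoricalDtype(categories))
dummies = pd.get_dummies(col, prefix="cat")
features = features.drop(columns="disconnect_category").join(dummies)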

The surprise: Disconnect categories contribute almost nothing to predictions. SHAP values for all six category features are near zero. A domain's privacy category (Advertising vs Analytics vs Social) doesn't predict its performance cost. An "Advertising" domain can be a pixel (near-zero cost) or a 170KB SDK (high cost). The model learns from behavioral features (script size, transfer size), not categorical labels.

Preprocessing

Log transform: All byte and millisecond features are log-transformed before training. These distributions are heavily right-skewed: max_script_bytes ranges from 0 to 27MB, and mean_scripting_ms ranges from 0 to 7,300ms. Log1p compresses the range while preserving ordering.

Negative values: p50_load_ms had a -1 value in the data (likely a measurement artifact). It's clipped to 0 before the log transform, since log1p(-1) is -inf.
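
Both steps together, as a sketch (the column list is abbreviated for illustration):

import numpy as np

# In practice: every *_bytes and *_ms feature.
skewed = ["max_script_bytes", "bytes_per_page", "mean_scripting_ms", "p50_load_ms"]

# Clip before transforming: log1p(-1) is -inf. NaN passes through untouched.
features[skewed] = np.log1p(features[skewed].clip(lower=0))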

Missing values: Left as NaN. XGBoost learns default split directions for missing values, which is more principled than imputation. The Lighthouse features (10-12) are NaN for ~52% of domains.
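
A minimal illustration of that behavior, using toy data:

import numpy as np
from xgboost import XGBRegressor

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0.1, 0.5, 0.3, 0.9])

# No imputation step: each tree split learns a default direction
# for rows where the feature is missing.
XGBRegressor(n_estimators=10).fit(X, y)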

The Disconnect List Join

This was harder than expected. The Disconnect list uses base domains (doubleclick.net), but HTTP Archive records full subdomains (stats.g.doubleclick.net, td.doubleclick.net, googleads.g.doubleclick.net).

Initial naive join (exact match on domain): 92 matches out of 4,592. Terrible.

Fix: suffix matching. For each tracker domain, try progressively shorter suffixes until a match is found:

def match_disconnect_domain(tracker_domain, disconnect_domains):
    # Try progressively shorter suffixes, e.g. stats.g.doubleclick.net
    # -> g.doubleclick.net -> doubleclick.net.
    parts = tracker_domain.split(".")
    for i in range(len(parts) - 1):  # stop before the bare TLD
        candidate = ".".join(parts[i:])
        if candidate in disconnect_domains:
            return candidate
    return None  # not on the Disconnect list
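
For example, with a toy Disconnect set:

disconnect_domains = {"doubleclick.net", "google-analytics.com"}

match_disconnect_domain("stats.g.doubleclick.net", disconnect_domains)
# -> "doubleclick.net"
match_disconnect_domain("cdn.example.com", disconnect_domains)
# -> None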

After fix: 3,686 matches (80% of domains). The remaining 20% are domains in HTTP Archive’s third-party categories but not on Disconnect’s list, reflecting different definitions of “tracker.”

The Percentile Rank Fix

The target scores are percentile ranks (0.0 = cheapest, 1.0 = most expensive). Originally I computed ranks across all 4,592 domains, including the 2,407 with no CPU data that were filled with 0.

This compressed the real training targets for main_thread_cost into [0.5, 1.0]: the zero-filled domains occupied the bottom half of the rank range, so the model had only half the target resolution to work with.

Fix: compute percentile ranks only within the 2,185 domains with real CPU data. The cheapest measured domain gets ~0.0, the most expensive gets ~1.0. Full [0, 1] range for training. Domains without CPU data get NaN for main_thread_cost (they’re predicted by the model, not used for training).
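
In pandas terms, the corrected target computation looks roughly like this (assuming the raw CPU measure lives in mean_lh_main_thread_ms):

# Rank only the 2,185 measured domains; everything else stays NaN.
measured = features["mean_lh_main_thread_ms"].notna()
features["main_thread_cost"] = float("nan")
features.loc[measured, "main_thread_cost"] = (
    features.loc[measured, "mean_lh_main_thread_ms"].rank(pct=True)
)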