Learned URL Representations
The Feature Engineering Problem
The tree-based models from the previous article depend on six hand-crafted regex features that flag patterns like .js, collect, pixel, and sync in URL paths, alongside structural measures (path depth, URL length, query parameter count). These features encode domain knowledge about tracker URL conventions. They work, but they impose a ceiling: they can only capture patterns the engineer anticipated, and they collapse a high-dimensional URL string into a handful of binary flags and integer counts.
The question this article addresses is whether data-driven URL representations can replace or subsume these manually engineered features. We evaluate two approaches: TF-IDF embeddings reduced via truncated SVD (deployed in the final model), and a character-level CNN (experimental).
TF-IDF + SVD Embeddings
Tokenization and Vocabulary
The URL path is tokenized on standard delimiters (/, ?, &, =, ., -, _) and camelCase boundaries. All tokens are lowercased, and tokens shorter than 2 characters are discarded. From this token stream, we extract unigram and bigram features with sublinear term frequency weighting (raw counts replaced by 1 + log(tf)), which prevents high-frequency tokens like js or api from dominating the representation. The vocabulary is capped at 50,000 terms with a minimum document frequency of 5, filtering out URL-specific noise (session IDs, cache-busting hashes) that appears in fewer than 5 URLs across the corpus.
Dimensionality Reduction
The resulting sparse TF-IDF matrix is reduced to 50 dimensions via truncated SVD. These 50 components explain 54.5% of the total variance in the URL vocabulary space. The remaining variance is largely attributable to the long tail of rare tokens that carry little predictive signal for transfer size estimation.
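The reduction step is a straightforward application of scikit-learn's `TruncatedSVD`, which works directly on the sparse TF-IDF matrix without densifying it. The random matrix below is a stand-in for the real corpus; the 54.5% explained-variance figure applies only to the actual URL data.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for the real TF-IDF matrix (n_urls x vocab_size), sparse like TF-IDF output
X_tfidf = sparse_random(1000, 2000, density=0.01, random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
X_embed = svd.fit_transform(X_tfidf)   # dense (1000, 50) URL embeddings

# On the real corpus these 50 components retain 54.5% of the variance
print(X_embed.shape, svd.explained_variance_ratio_.sum())
```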
Semantic Interpretability
The leading SVD components are semantically interpretable, and the patterns they capture correspond precisely to distinctions that the hand-crafted regex features were designed to encode:
- Component 1 separates `/collect` endpoints (measurement beacons) from all other URL types. This is the single most important axis of variation in tracker URLs, distinguishing fire-and-forget pixel requests from resource-fetching requests.
- Component 2 captures `/gtag/js` and `analytics.js` patterns, isolating script bundle URLs from non-script requests.
- Components 3–5 distinguish tracking pixels, ad-server paths, and cookie synchronization endpoints from one another.
The critical observation is that these distinctions emerge automatically from the data, without any domain knowledge encoded in the feature extraction pipeline. The SVD components discover the same categorical structure that motivated the regex features, but they also capture gradations and combinations that binary flags cannot express.
Feature Ablation
Table 6 in the paper quantifies the contribution of each feature set:
| Feature Set | Dimensions | Test MAE (bytes) | vs Regex Only |
|---|---|---|---|
| Regex only | 6 | 4,251 | — |
| TF-IDF SVD only | 50 | 3,548 | +16.5% |
| Regex + TF-IDF SVD | 56 | 3,466 | +18.5% |
The TF-IDF embeddings alone reduce MAE by 16.5% relative to the regex-only baseline, nearly matching the full combined model. Adding regex features on top of TF-IDF contributes only a marginal 2.3% additional reduction (from 3,548 to 3,466). The implication is clear: the SVD components already encode the information that the regex features capture, plus additional signal that regex cannot represent. In the deployed model, we retain both feature sets because the marginal cost of six additional features is negligible, but the ablation demonstrates that the regex features are largely redundant once TF-IDF embeddings are available.
Character-Level CNN
Architecture
As an alternative to the bag-of-tokens approach, we evaluate a character-level CNN that processes the raw URL string as a sequence. The architecture is as follows:
- **Character embedding.** Each character maps to a 32-dimensional learned embedding. The vocabulary covers 95 ASCII printable characters plus padding and unknown tokens. URLs are truncated or padded to 200 characters.
- **Parallel convolutions.** Three convolutional layers with kernel sizes 3, 5, and 7 operate on the character sequence simultaneously. Kernel size 3 captures character trigrams (`.js`, `sdk`, `img`), kernel size 5 captures longer tokens (`pixel`, `track`), and kernel size 7 captures compound patterns (`collect`, `beacon/`). Each layer produces 64 feature maps.
- **Global max pooling.** Each convolutional layer's output is reduced to a single vector via max pooling across the sequence dimension, yielding a 192-dimensional URL representation regardless of input length.
- **Concatenation with tabular features.** The 192-dimensional URL embedding is concatenated with the same tabular features used by the tree models (domain target encoding, resource type, initiator type), minus the hand-crafted regex features that the CNN replaces.
- **MLP regression head.** Two fully connected layers (128 and 64 units) with ReLU, batch normalization, and dropout (0.3) produce the final transfer size prediction.
The model has 72,705 trainable parameters.
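The architecture above can be sketched in PyTorch as follows. This is a reconstruction from the description, not the authors' code: the class and variable names are ours, and `n_tabular` is a placeholder for the tabular feature width, so the exact parameter count will differ from the reported 72,705 depending on that choice.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level URL model per the description above (layer sizes from the text)."""

    def __init__(self, vocab_size=97, embed_dim=32, n_tabular=8):
        super().__init__()
        # 95 printable ASCII chars + padding + unknown = 97 embedding rows
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Parallel convolutions with kernel sizes 3, 5, 7; 64 feature maps each
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, 64, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        )
        # MLP head: 128 -> 64 -> 1 with ReLU, batch norm, dropout 0.3
        self.head = nn.Sequential(
            nn.Linear(3 * 64 + n_tabular, 128), nn.BatchNorm1d(128),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 1),
        )

    def forward(self, chars, tabular):
        # chars: (batch, 200) int ids; tabular: (batch, n_tabular) floats
        x = self.embed(chars).transpose(1, 2)                  # (batch, 32, 200)
        pooled = [conv(x).amax(dim=2) for conv in self.convs]  # 3 x (batch, 64)
        url_vec = torch.cat(pooled, dim=1)                     # (batch, 192)
        return self.head(torch.cat([url_vec, tabular], dim=1)).squeeze(1)

model = CharCNN()
out = model(torch.randint(0, 97, (4, 200)), torch.randn(4, 8))
```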
Training
Training uses AdamW with cosine annealing over 50 epochs and early stopping (patience 10). The loss function is mean squared error in log space:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\bigl(\log(1 + y_i) - \log(1 + \hat{y}_i)\bigr)^2$$
This penalizes relative rather than absolute errors, which is appropriate for the zero-inflated, right-skewed transfer size distribution. A 5KB error on a 100KB script and a 50-byte error on a 1KB beacon incur approximately equal loss.
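A minimal numpy sketch of this log-space loss makes the equal-penalty claim concrete; the 1 KB figure is taken as 1,000 bytes for the illustration.

```python
import numpy as np

def log_space_mse(y_true, y_pred):
    """MSE between log1p-transformed byte counts: penalizes relative error."""
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# A 5 KB error on a 100 KB script vs. a 50-byte error on a 1 KB beacon:
# both are ~5% relative errors, so both incur roughly the same loss.
big = log_space_mse(np.array([100_000.0]), np.array([105_000.0]))
small = log_space_mse(np.array([1_000.0]), np.array([1_050.0]))
print(big, small)  # nearly equal
```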
Results
| Model | Test MAE (bytes) | vs LUT | Spearman |
|---|---|---|---|
| LUT baseline | 6,944 | — | 0.925 |
| XGBoost Tweedie (hand-crafted) | 4,366 | +37.1% | 0.944 |
| Character-level CNN | 5,320 | +23.4% | 0.977 |
The CNN achieves the highest Spearman rank correlation of any model evaluated (0.977), exceeding both the LUT baseline and the XGBoost model with hand-crafted features. This indicates that learned character-level representations capture ordering information that neither regex features nor TF-IDF embeddings fully express. The CNN can distinguish between URL patterns that our other feature sets cannot represent, such as the difference between /sdk/v3.2/tracker.min.js (production bundle, large) and /sdk/v3.2/debug.js (debug build, smaller).
However, the CNN’s MAE of 5,320 bytes is 21.8% worse than XGBoost Tweedie’s 4,366 bytes. The model ranks requests more accurately but produces less precise absolute byte estimates.
The Ranking–Accuracy Tradeoff
Two factors explain this gap. First, data scarcity: 72,705 parameters trained on 296,572 examples yields roughly 4 examples per parameter, well below the ratio at which neural networks typically outperform tree ensembles on tabular data. The CNN’s strong ranking despite this constraint suggests the learned representations are genuinely useful. Second, loss function mismatch: log-space MSE optimizes for relative accuracy (favorable for ranking) while XGBoost’s Tweedie loss directly optimizes a magnitude-proportional penalty better suited to absolute error minimization.
Scaling Considerations
The full HTTP Archive dataset contains 348 million tracker requests. Our current 0.1% sample provides 348,909 training examples. A 10% sample (35 million requests) would increase training data by two orders of magnitude, placing the CNN firmly in the regime where neural networks are expected to match or exceed tree models on complex pattern recognition tasks.
A hybrid architecture presents a natural extension: the CNN learns URL embeddings end-to-end, and an XGBoost model with Tweedie loss performs the final regression over those embeddings plus tabular features. This would combine the CNN’s representation learning capacity with the tree model’s distributional assumptions, potentially achieving both strong ranking and strong absolute accuracy.
Broader Implications
The central finding across both approaches is that data-driven URL representations subsume hand-crafted regex features. The TF-IDF embeddings deliver a 17% MAE reduction over regex alone, and the SVD components are interpretable enough to confirm that they discover the same categorical structure that motivated the regex features in the first place. The character-level CNN pushes further, achieving the best ranking performance of any model by learning representations directly from the raw character sequence.
For URL-based prediction tasks more generally, the implication is that investing engineering effort in manual regex pattern design yields diminishing returns relative to learning representations from data. A vocabulary of 50,000 URL tokens reduced to 50 SVD dimensions outperforms six carefully designed regex features. The regex features are not wrong — they capture real structure — but they are incomplete, and the effort required to close that gap manually scales poorly compared to the statistical alternative.