Feature Engineering and Embeddings

Features determine the ceiling; model selection determines how close you get. A sophisticated model cannot compensate for missing signal, but a well-engineered feature set can make even simple models competitive. This article covers categorical encoding, text featurization, learned embeddings, dimensionality reduction, feature interactions, and selection — grounded in the tracker cost model, which uses 80 features across 5 groups to predict HTTP transfer sizes.


Why Features Matter

The tracker cost model predicts transfer_bytes for individual HTTP requests using XGBoost. Its 80 features decompose into five groups:

| Group | Count | Description |
| --- | --- | --- |
| Target-encoded domains | 2 | Conditional medians of transfer size per domain |
| TF-IDF + SVD | 50 | URL text compressed to 50 latent dimensions |
| Regex flags | 6 | Binary indicators for URL patterns |
| Categorical | 14 | Resource type, method, protocol, etc. |
| Numeric | 8 | Content length hints, URL length, query param count |

The single most important feature group is domain target encoding, contributing an 8.65% MAE reduction when added to the base feature set. This is unsurprising: the domain identifies the server, and servers behave consistently. But the key insight is that raw domain identity (3,723 unique values) is useless to the model without proper encoding. The representation determines the signal.


Categorical Encoding

One-Hot Encoding

Replace a categorical variable with $K$ binary indicator columns, one per category:

$$\mathbf{x}_{\text{one-hot}} = [0, 0, \ldots, 1, \ldots, 0] \in \{0,1\}^K$$

One-hot encoding makes no assumptions about category relationships. It is lossless for the training categories. The drawbacks: dimensionality explodes with cardinality ($K = 3{,}723$ for tracker domains), the representation is extremely sparse, and unseen categories at inference time have no representation.
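
A minimal sketch of the indicator construction (pure NumPy, illustrative function and category names), including the all-zeros fallback that an unseen category receives:

```python
import numpy as np

def one_hot(values, categories):
    """Map each value to a K-dimensional indicator vector.
    Unseen values get the all-zeros vector -- the inference-time gap noted above."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        if v in index:
            out[row, index[v]] = 1
    return out

X = one_hot(["image", "font"], ["script", "image", "xhr"])
# row 0 -> [0, 1, 0]; row 1 ("font", unseen) -> [0, 0, 0]
```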

Label / Ordinal Encoding

Map each category to an integer: $c \mapsto i \in \{0, 1, \ldots, K-1\}$. This is compact (one column) but implies an ordering that may not exist. Tree-based models can recover arbitrary category splits from label encoding, but linear models will treat the integers as ordered and continuous. Ordinal encoding is appropriate only when the categories have a natural order (e.g., education level, severity rating).

Target Encoding

Replace each category with a statistic of the target variable conditioned on that category:

$$\text{TE}(c) = \hat{\mathbb{E}}[y \mid x = c]$$

This compresses arbitrary cardinality into a single numeric feature that directly encodes the category’s relationship with the target. The conditional expectation can be replaced with any statistic — median, variance, quantiles — depending on the problem.

Regularization. Naive target encoding overfits categories with few observations. A category seen once gets $\text{TE}(c) = y_1$, which is pure noise. The standard remedy is smoothing with the global mean:

$$\text{TE}_{\text{smooth}}(c) = \frac{n_c \cdot \hat{\mathbb{E}}[y \mid x=c] + m \cdot \bar{y}}{n_c + m}$$

where $n_c$ is the count of category $c$ and $m$ is a smoothing parameter. When $n_c \gg m$, the category-specific estimate dominates; when $n_c \ll m$, the encoding falls back to the global mean.
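
The smoothing formula translates directly into a few lines of pandas. This sketch (function name and toy data are illustrative) shows a singleton category being pulled toward the global mean:

```python
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, m=10.0):
    """Blend each category's target mean with the global mean,
    weighted by the category count n_c versus the smoothing parameter m."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df = pd.DataFrame({"domain": ["a.com"] * 100 + ["b.com"],
                   "bytes": [100.0] * 100 + [9000.0]})
enc = smoothed_target_encode(df, "domain", "bytes", m=10.0)
# a.com (n=100 >> m) stays near its own mean of 100;
# b.com (n=1 << m) is shrunk heavily toward the global mean
```

The returned Series is indexed by category; mapping it onto rows is a `.map()` call.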

Preventing leakage. Target encoding must be computed on training data only. In cross-validation, this means computing encodings per fold: for each fold $k$, the encoding is derived from the out-of-fold data. Using the full training set to compute encodings before splitting introduces target leakage and inflates performance estimates.
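
One way to implement the per-fold scheme with scikit-learn's KFold (a sketch; column names and the mean statistic are illustrative):

```python
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded using statistics
    computed on the other folds, so no row ever sees its own target value."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit = df.iloc[fit_idx]
        means = fit.groupby(cat_col)[target_col].mean()
        fallback = fit[target_col].mean()  # category absent from the fit folds
        encoded.iloc[enc_idx] = (
            df.iloc[enc_idx][cat_col].map(means).fillna(fallback).values
        )
    return encoded

df = pd.DataFrame({"domain": ["a"] * 10 + ["b"] * 10,
                   "bytes": [1.0] * 10 + [5.0] * 10})
enc = oof_target_encode(df, "domain", "bytes")
# each "a" row is encoded from the other folds' "a" rows only
```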

The tracker model’s approach. The model encodes 3,723 tracker domains as two features: the median transfer_bytes per domain, and the median transfer_bytes per (domain, resource_type) pair. Using the median rather than the mean provides robustness to the heavy-tailed distribution of transfer sizes. These two features alone reduce MAE by 8.65%, making them the single most valuable feature group in the model.


Text and URL Features

Bag of Words

Represent a document as a vector of word counts, discarding order:

$$\mathbf{x}_{\text{BoW}} = [\text{count}(w_1), \text{count}(w_2), \ldots, \text{count}(w_V)]$$

where $V$ is the vocabulary size. Bag of words is simple and effective for many tasks, but it treats all terms equally regardless of discriminative power.

TF-IDF

Term frequency-inverse document frequency upweights terms that are frequent in a document but rare across the corpus:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \cdot \text{IDF}(t)$$

Term frequency with sublinear scaling dampens the effect of high-frequency terms:

$$\text{TF}(t, d) = \log(1 + \text{tf}_{t,d})$$

Inverse document frequency downweights terms that appear in many documents:

$$\text{IDF}(t) = \log\frac{N}{\text{df}_t}$$

where $N$ is the total number of documents and $\text{df}_t$ is the number of documents containing term $t$. A term appearing in every document has $\text{IDF} = 0$; a term appearing in one document has $\text{IDF} = \log N$.
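
The three formulas compose into a few lines of plain Python. This sketch operates on pre-tokenized documents and returns per-document term weights:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF per the formulas above: sublinear TF = log(1 + tf),
    IDF = log(N / df). Returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: math.log(1 + tf) * math.log(N / df[t])
             for t, tf in Counter(d).items()} for d in docs]

docs = [["collect", "v1", "pixel"], ["collect", "gtag", "js"], ["collect", "sync"]]
weights = tfidf(docs)
# "collect" appears in all 3 documents, so IDF = log(3/3) = 0 and its weight vanishes
```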

The tracker model’s TF-IDF pipeline. URLs are not natural language, so tokenization requires care. The model tokenizes on delimiters (/, ., -, _, ?, &, =) and camelCase boundaries (e.g., pageView becomes page and view). The pipeline uses unigrams and bigrams, a vocabulary capped at 50,000 terms, and a minimum document frequency of 5. This produces a sparse matrix of dimension $N \times 50{,}000$, which is then compressed via truncated SVD.
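
A sketch of the delimiter-plus-camelCase tokenization. The exact regexes the pipeline uses are not shown in the text, so these are assumptions that reproduce the described behavior:

```python
import re

DELIMS = re.compile(r"[/.\-_?&=]+")               # the delimiter set listed above
CAMEL = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")     # zero-width camelCase boundary

def tokenize_url(url):
    """Split on delimiters, then split camelCase runs, then lowercase."""
    tokens = []
    for part in DELIMS.split(url):
        if part:
            tokens.extend(w.lower() for w in CAMEL.split(part))
    return tokens

tokenize_url("pageView")        # -> ['page', 'view']
tokenize_url("/collect?v=1")    # -> ['collect', 'v', '1']
```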


Dimensionality Reduction

PCA vs Truncated SVD

PCA finds orthogonal directions of maximum variance in centered data. It requires computing the mean and subtracting it, which densifies a sparse matrix and can be prohibitively expensive.

Truncated SVD factorizes the term-document matrix directly without centering:

$$\mathbf{X} \approx \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^\top$$

where $k$ is the number of retained components, $\mathbf{U}_k$ contains the left singular vectors (document embeddings), $\boldsymbol{\Sigma}_k$ contains the singular values, and $\mathbf{V}_k$ contains the right singular vectors (term embeddings). Truncated SVD operates directly on sparse matrices, making it the standard choice for reducing TF-IDF features.
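
In scikit-learn this is a two-line operation on the sparse matrix, with no densification. The toy dimensions below stand in for the real $N \times 50{,}000$ TF-IDF matrix:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

X = sparse_random(200, 1000, density=0.01, random_state=0)  # stays sparse
svd = TruncatedSVD(n_components=20, random_state=0)
Z = svd.fit_transform(X)                 # dense (200, 20) document embeddings
evr = svd.explained_variance_ratio_.sum()  # fraction of variance retained
```

At inference, `svd.transform` is a single sparse-times-dense matrix multiplication.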

Applied to TF-IDF, this is known as latent semantic analysis (LSA). The latent dimensions capture co-occurrence patterns: terms that frequently appear together in the same documents are mapped to similar directions in the reduced space.

The tracker model reduces 50,000 TF-IDF features to 50 SVD components, explaining 54.5% of the total variance. While 54.5% may seem low, the discarded variance is largely noise from rare term combinations. The SVD components are semantically interpretable: component 1 loads heavily on /collect endpoints (measurement pixels), component 2 on /gtag/js paths (Google Tag Manager scripts). These components capture functional categories of tracker behavior that are predictive of transfer size.

t-SNE

t-distributed stochastic neighbor embedding maps high-dimensional data to 2 or 3 dimensions for visualization. It preserves local neighborhood structure by modeling pairwise similarities as conditional probabilities in both the high-dimensional and low-dimensional spaces, then minimizing the KL divergence between them. t-SNE is useful for exploring cluster structure but is non-parametric (no transform for new data), sensitive to hyperparameters (perplexity), and should not be used for feature engineering.

UMAP

Uniform Manifold Approximation and Projection is a faster alternative to t-SNE that also preserves more global structure. Unlike t-SNE, UMAP can produce a parametric mapping applicable to new data. It is increasingly used as a general-purpose dimensionality reduction method, not just for visualization.

When to reduce. Dimensionality reduction is most valuable when features are high-dimensional and sparse (TF-IDF, one-hot encodings), when you need to visualize high-dimensional data, or when denoising is beneficial. It is less useful when features are already low-dimensional and dense, or when the model (e.g., tree ensembles) can handle high dimensionality natively.


Learned Embeddings

Word2Vec

Word2Vec learns dense vector representations of words from co-occurrence patterns in a corpus.

CBOW (Continuous Bag of Words) predicts a target word from its context window:

$$\hat{w}_t = \arg\max_{w} P(w \mid w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k})$$

Skip-gram inverts this: predict context words from the target word:

$$\max \sum_{t=1}^{T} \sum_{-k \leq j \leq k,\; j \neq 0} \log P(w_{t+j} \mid w_t)$$

The learned embeddings capture semantic relationships. The canonical example: $\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})$. Skip-gram with negative sampling (SGNS) is equivalent to implicitly factorizing a shifted PMI matrix, connecting Word2Vec to classical matrix factorization methods.

Embedding Layers in Neural Networks

An embedding layer is a lookup table: each category $c \in \{1, \ldots, K\}$ is mapped to a trainable vector $\mathbf{e}_c \in \mathbb{R}^d$. This is mathematically equivalent to one-hot encoding followed by a linear layer (matrix multiplication with the embedding matrix $\mathbf{E} \in \mathbb{R}^{K \times d}$), but the lookup implementation avoids materializing the sparse one-hot vector.

Embedding layers are trained end-to-end with the task objective. The dimensionality $d$ is a hyperparameter — common heuristics include $d \approx \min(50, K/2)$ or $d \approx K^{1/4}$.
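
The lookup/one-hot equivalence is easy to verify numerically (NumPy sketch with toy dimensions):

```python
import numpy as np

K, d = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(K, d))        # embedding matrix: one row per category

c = 2                              # some category index
lookup = E[c]                      # embedding-layer view: a table lookup

one_hot = np.zeros(K)
one_hot[c] = 1
matmul = one_hot @ E               # one-hot vector through a linear layer

assert np.allclose(lookup, matmul)  # identical, but no sparse vector materialized
```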

Character-Level Embeddings

For domains like URLs where the vocabulary is open-ended and subword structure is informative, character-level embeddings bypass tokenization entirely. Each character in the input string is mapped to a $d$-dimensional vector, and a CNN or RNN processes the sequence.

The tracker model’s CNN baseline uses 32-dimensional character embeddings over raw URL strings. Characters like /, ., ?, and digits develop distinct embedding patterns that reflect URL structure. The CNN applies 1D convolutions over these character embeddings to capture local patterns (path segments, query parameters) before pooling to a fixed-length representation.

Pretrained Embeddings

GloVe (Global Vectors) trains on global word-word co-occurrence statistics rather than local context windows. The objective is:

$$\min \sum_{i,j} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the co-occurrence count and $f$ is a weighting function that caps the influence of very frequent pairs.

fastText extends Word2Vec by representing each word as a bag of character n-grams, enabling embeddings for out-of-vocabulary words by summing their n-gram vectors.

Sentence and document embeddings. BERT-based bi-encoders (e.g., Sentence-BERT) produce fixed-length embeddings for variable-length text by encoding each text independently and comparing via cosine similarity. These are the foundation of modern RAG systems: documents are embedded offline, queries are embedded at retrieval time, and nearest-neighbor search in embedding space identifies relevant passages.


Feature Interaction and Engineering

Polynomial Features

For a feature vector $\mathbf{x} = [x_1, x_2]$, degree-2 polynomial expansion produces:

$$\phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$$

This allows linear models to capture nonlinear relationships. The number of features grows as $\binom{d + p}{p}$ for degree $p$ and $d$ original features, which becomes intractable for large $d$.
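
The count formula can be checked by enumerating the monomials directly. For $d = 2$, $p = 2$ this recovers the six-term expansion shown above:

```python
from math import comb
from itertools import combinations_with_replacement

def poly_feature_count(d, p):
    """Number of terms in a degree-p expansion of d features,
    bias term included: C(d + p, p)."""
    return comb(d + p, p)

# Enumerate all monomials of degree 0..2 over features {x1, x2}:
# (), (x1,), (x2,), (x1,x1), (x1,x2), (x2,x2)  ->  6 terms
terms = [c for deg in range(3)
         for c in combinations_with_replacement(range(2), deg)]
```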

Feature Crosses

Explicitly create features from combinations of categorical variables. For categories $A$ and $B$, the cross $A \times B$ creates a new category for each $(a, b)$ pair. This is common in large-scale linear models (e.g., ad click prediction) where interaction effects are critical but the model cannot learn them implicitly.
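
In pandas, a cross is just string concatenation of the two columns; the result is a new categorical column ready for any of the encoders above (column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"resource_type": ["script", "image", "script"],
                   "method": ["GET", "GET", "POST"]})

# Cross A x B: one new category per (a, b) pair
df["type_x_method"] = df["resource_type"] + "_x_" + df["method"]
# -> ['script_x_GET', 'image_x_GET', 'script_x_POST']
```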

Domain-Specific Features

Engineered features that encode domain knowledge often provide signal that generic methods miss. Regular expressions, heuristic rules, and structural parsing can extract targeted information from raw inputs.

The tracker model uses 6 binary regex features that flag URL patterns associated with specific tracker behaviors:

| Feature | Pattern | Intent |
| --- | --- | --- |
| is_js | .js extension | JavaScript payload |
| is_collect | /collect path | Measurement pixel |
| is_image | Image extensions | Tracking pixels |
| is_sync | /sync path | Cookie syncing |
| is_ad | Ad-related paths | Ad serving |
| is_api | /api path | API calls |

These features encode the intuition that tracker function (pixel vs script vs API call) is predictive of transfer size. However, ablation analysis reveals that the TF-IDF features largely subsume these regex features — the TF-IDF vocabulary already contains the same tokens that the regex patterns match. The regex features provide marginal improvement (< 0.5% MAE reduction) beyond TF-IDF, suggesting that hand-engineered features are most valuable when the model lacks access to the raw text from which they were derived.
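
A sketch of how such flags are computed. The model's exact regexes are not shown in the text, so the patterns below are illustrative stand-ins for a subset of the six flags:

```python
import re

# Illustrative patterns -- assumptions, not the model's actual regexes
FLAG_PATTERNS = {
    "is_js":      re.compile(r"\.js(\?|$)"),
    "is_collect": re.compile(r"/collect(/|\?|$)"),
    "is_image":   re.compile(r"\.(gif|png|jpe?g|webp)(\?|$)"),
    "is_sync":    re.compile(r"/sync(/|\?|$)"),
    "is_api":     re.compile(r"/api(/|\?|$)"),
}

def regex_flags(url):
    """Binary indicator features from URL patterns."""
    return {name: int(bool(p.search(url))) for name, p in FLAG_PATTERNS.items()}

flags = regex_flags("https://t.example.com/collect?v=2")
# is_collect fires; the other flags stay 0
```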


Feature Selection

Filter Methods

Evaluate features independently of the model using statistical tests.

Mutual information measures the dependency between a feature $X_j$ and the target $Y$:

$$I(X_j; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

High mutual information indicates that knowing the feature reduces uncertainty about the target. MI captures nonlinear dependencies but requires density estimation for continuous variables.

Chi-squared test measures the dependence between categorical features and categorical targets. It compares observed co-occurrence counts to expected counts under independence.

Wrapper Methods

Use the model itself to evaluate feature subsets.

Recursive feature elimination (RFE) trains the model, ranks features by importance, removes the least important, and repeats. RFE is expensive ($O(d)$ training rounds for $d$ features) but accounts for feature interactions.

Embedded Methods

Feature selection is built into the training process.

L1 regularization (Lasso) adds $\lambda \|\mathbf{w}\|_1$ to the loss. The L1 penalty drives small coefficients to exactly zero, producing a sparse model. Features with zero coefficients are effectively removed.

Tree-based importance. Gradient boosted trees compute feature importance from the total gain (reduction in loss) contributed by splits on each feature. This is fast and accounts for nonlinear effects but is biased toward high-cardinality features and features with many possible split points.

Permutation Importance vs SHAP

Permutation importance measures the increase in prediction error when a feature’s values are randomly shuffled, breaking its relationship with the target:

$$\text{PI}_j = \mathcal{L}(\hat{f}, \mathbf{X}_{\pi_j}, \mathbf{y}) - \mathcal{L}(\hat{f}, \mathbf{X}, \mathbf{y})$$

This is model-agnostic and computed on held-out data, avoiding the biases of tree-based importance. However, it can underestimate the importance of correlated features (shuffling one leaves the correlated partner intact).
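
The formula maps directly onto a model-agnostic loop. A sketch: any model exposing a `predict` method works, and averaging over several shuffles reduces the variance of the estimate:

```python
import numpy as np

def permutation_importance(model, X, y, loss, n_repeats=5, seed=0):
    """PI_j: increase in held-out loss when column j is shuffled,
    averaged over n_repeats independent shuffles."""
    rng = np.random.default_rng(seed)
    base = loss(y, model.predict(X))
    pi = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break feature j's link with the target
            pi[j] += loss(y, model.predict(Xp)) - base
    return pi / n_repeats

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))
```

scikit-learn ships an equivalent routine as `sklearn.inspection.permutation_importance`.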

SHAP (SHapley Additive exPlanations) computes each feature’s contribution to individual predictions using Shapley values from cooperative game theory:

$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f(S \cup \{j\}) - f(S) \right]$$

SHAP provides both global importance (mean absolute SHAP values) and local explanations (per-prediction attributions). TreeSHAP computes exact Shapley values for tree ensembles in polynomial time.

Feature Ablation

The most direct method: train the model with and without feature groups and measure the impact. The tracker cost model’s ablation study reveals the contribution of each group:

| Configuration | MAE | Relative to Full |
| --- | --- | --- |
| Full model (80 features) | 3,466 | baseline |
| TF-IDF + SVD only (50 features) | 3,548 | +2.4% |
| Without target encoding | 3,766 | +8.65% |
| Without TF-IDF | 4,102 | +18.4% |
| Numeric + categorical only | 5,231 | +50.9% |

The key finding: TF-IDF alone (MAE 3,548) nearly matches the full 80-feature model (MAE 3,466). The 50 SVD components capture most of the predictive signal, with domain target encoding providing the largest marginal gain among the remaining features. This suggests that for this problem, URL text is the primary information source — which makes sense, since the URL encodes the endpoint, parameters, and often the tracker’s identity.


Practical Guidelines

Start simple, add complexity only when validated. Baseline with raw features and a simple model. Add engineered features one group at a time, measuring each group’s marginal contribution via ablation. This prevents accumulating features that add complexity without improving performance.

Target encoding is powerful but leakage-prone. Always compute target statistics on out-of-fold data during cross-validation. Use smoothing to regularize low-count categories. Monitor for suspiciously high cross-validation scores, which may indicate leakage.

TF-IDF + SVD is a strong baseline before neural embeddings. For tabular models operating on text-derived features, TF-IDF with truncated SVD is fast, interpretable, and often competitive with learned embeddings. The tracker model’s TF-IDF pipeline (50 SVD components) captures 54.5% of variance and provides the largest single feature group contribution. Consider neural embeddings only when this baseline is insufficient or when end-to-end training is feasible.

Feature importance should inform iteration. After training, examine feature importance to identify (1) which features carry signal and which are dead weight, (2) whether engineered features provide value beyond raw inputs, and (3) where to invest further engineering effort. In the tracker model, importance analysis revealed that regex features were redundant with TF-IDF — effort better spent elsewhere.

Match encoding to model. One-hot encoding suits linear models and neural networks. Label encoding suits tree-based models. Target encoding suits any model but requires careful regularization. Embedding layers require neural network training. Choose the encoding that aligns with your model’s inductive biases.

Consider the inference-time cost. Features that require expensive computation at inference time (large TF-IDF vocabularies, pretrained transformer embeddings) may be impractical in latency-sensitive applications. The tracker model’s 50-dimensional SVD projection is fast at inference because the SVD transform is a single matrix multiplication, while the TF-IDF vectorization requires only a vocabulary lookup and sparse dot product.