Gradient Boosting

Gradient boosting constructs an additive model by performing gradient descent in function space. Where neural networks optimize a fixed-architecture model by updating parameters, gradient boosting optimizes by adding new functions — each one a small tree that corrects the errors of the ensemble so far. This reframing, from parameter space to function space, is the central insight.


Additive Modeling and Functional Gradient Descent

We want to find a function F^* that minimizes the expected loss:

F^* = \arg\min_F \mathbb{E}_{x,y}[L(y, F(\mathbf{x}))]

In practice we minimize the empirical risk over n training examples:

F^* = \arg\min_F \sum_{i=1}^{n} L(y_i, F(\mathbf{x}_i))

Gradient boosting approximates F^* as a sum of base learners:

F_M(\mathbf{x}) = F_0(\mathbf{x}) + \sum_{m=1}^{M} \eta \, \gamma_m \, h_m(\mathbf{x})

where F_0 is a constant initialization, each h_m is a regression tree, \gamma_m is a step size found by line search, and \eta \in (0, 1] is the shrinkage (learning rate).

The key insight is that at each iteration, we want to add the function h that most reduces the loss. If we could take a functional gradient, the steepest descent direction at the current model F_{m-1} is:

-\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} \bigg|_{F = F_{m-1}}

This is a vector of n values — one per training example — representing the direction in “prediction space” that most decreases the loss. Since we cannot store an arbitrary function, we fit a tree h_m to these negative gradients. The tree acts as a parametric approximation of the functional gradient, one that generalizes to unseen points.

This is why gradient boosting is sometimes called “stage-wise additive modeling via gradient descent in function space” (Friedman, 2001). Each tree is not fitting the original targets — it is fitting the direction of steepest improvement.


The Algorithm

Initialization. Set F_0(\mathbf{x}) = \arg\min_c \sum_{i=1}^n L(y_i, c). For squared error this is the mean \bar{y}. For Tweedie deviance it is the log of the mean.

For m = 1, 2, \ldots, M:

  1. Compute pseudo-residuals. For each training example i:

r_{im} = -\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} \bigg|_{F = F_{m-1}}

  2. Fit a regression tree h_m to the pseudo-residuals \{(\mathbf{x}_i, r_{im})\}_{i=1}^n, yielding terminal regions \{R_{jm}\}_{j=1}^J.

  3. Compute optimal leaf values via line search within each terminal region:

\gamma_{jm} = \arg\min_\gamma \sum_{\mathbf{x}_i \in R_{jm}} L(y_i, F_{m-1}(\mathbf{x}_i) + \gamma)

  4. Update the model:

F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \sum_{j=1}^{J} \gamma_{jm} \, \mathbf{1}(\mathbf{x} \in R_{jm})

Output F_M(\mathbf{x}).

Squared Error Walkthrough

For L(y, F) = \frac{1}{2}(y - F)^2, the pseudo-residual is:

r_{im} = -\frac{\partial}{\partial F}\frac{1}{2}(y_i - F)^2 \bigg|_{F = F_{m-1}(\mathbf{x}_i)} = y_i - F_{m-1}(\mathbf{x}_i)

These are literal residuals. The tree in each round fits the current errors, and the leaf values \gamma_{jm} are simply the mean residual within each leaf. This is the special case that makes gradient boosting intuitive — but the framework generalizes to any differentiable loss.
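Because the squared-error case needs nothing beyond residual fitting, the full algorithm fits in a few lines. A minimal sketch using shallow scikit-learn trees as base learners (synthetic data; all hyperparameters here are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(400)

eta, M = 0.1, 100
F = np.full_like(y, y.mean())          # F_0: the mean minimizes squared error
trees = []
for m in range(M):
    r = y - F                          # pseudo-residuals are literal residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F += eta * tree.predict(X)         # leaf values are mean residuals per leaf
    trees.append(tree)

mse_before = np.mean((y - y.mean()) ** 2)
mse_after = np.mean((y - F) ** 2)
```

Prediction on new points mirrors the update rule: F_0 plus \eta times the sum of the trees' outputs.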


Loss Functions in Depth

The choice of loss function is the single most impactful decision in a gradient boosting pipeline. In my experience deploying XGBoost at scale — serving predictions to 250M Firefox users — the loss function mattered more than architecture, hyperparameter tuning, or feature engineering. A 23% gap in MAE separated the best and worst loss functions on the same model architecture.

Squared Error (L2)

L(y, F) = \frac{1}{2}(y - F)^2, \qquad r_i = y_i - F(\mathbf{x}_i)

Simple and well-understood. The pseudo-residuals are literal residuals. The problem: it treats a 100KB error on a 50KB script the same as a 100KB error on a 500KB script. For data spanning multiple orders of magnitude, this is a poor inductive bias — the model chases large absolute errors, which are almost always from the heavy tail.

Huber Loss

L_\delta(y, F) = \begin{cases} \frac{1}{2}(y - F)^2 & \text{if } |y - F| \le \delta \\ \delta |y - F| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}

Quadratic near zero, linear in the tails. The transition point \delta controls robustness to outliers. The pseudo-residual is:

r_i = \begin{cases} y_i - F(\mathbf{x}_i) & \text{if } |y_i - F(\mathbf{x}_i)| \le \delta \\ \delta \cdot \text{sign}(y_i - F(\mathbf{x}_i)) & \text{otherwise} \end{cases}

This clips the gradient for large residuals, preventing outliers from dominating the tree fits. In practice, set \delta to the \alpha-quantile (e.g., 90th percentile) of the absolute residuals.
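A small numpy sketch of the Huber pseudo-residual with the adaptive quantile rule for \delta (the function name and example values are mine):

```python
import numpy as np

def huber_pseudo_residuals(y, F, alpha=0.9):
    # delta = alpha-quantile of current absolute residuals (adaptive per round)
    r = y - F
    delta = np.quantile(np.abs(r), alpha)
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r)), delta

y = np.array([1.0, 2.0, 3.0, 100.0])   # one gross outlier
F = np.zeros(4)
pr, delta = huber_pseudo_residuals(y, F, alpha=0.75)
# The outlier's contribution is clipped to delta instead of 100
```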

Tweedie Loss

This is the loss function that matters most for zero-inflated, right-skewed data — exactly the distribution seen in web resource transfer sizes.

The Tweedie family has variance function \text{Var}(Y|\mu) = \phi \mu^p where p is the power parameter. For 1 < p < 2, the distribution is a compound Poisson-Gamma: a Poisson number of Gamma-distributed claims. The key property is a point mass at zero plus a continuous right tail.

The Tweedie deviance loss (what XGBoost minimizes when you set objective='reg:tweedie') is:

L(y, F) = -y \frac{e^{(1-p)F}}{1-p} + \frac{e^{(2-p)F}}{2-p}

where F = \log(\mu) is the log-link prediction. The gradient and Hessian are:

\frac{\partial L}{\partial F} = -y \, e^{(1-p)F} + e^{(2-p)F}

\frac{\partial^2 L}{\partial F^2} = -y(1-p) \, e^{(1-p)F} + (2-p) \, e^{(2-p)F}

Why Tweedie works for tracker cost data. The Firefox tracker cost dataset has 39.5% exact zeros (beacons, pixels, empty responses) and a range spanning 5 orders of magnitude (0 to ~1.5MB). With p = 1.5, the variance scales as \mu^{1.5}. This means:

  • For near-zero predictions (beacons): the expected variance is tiny, so modest absolute errors still produce large deviance. But because the predictions themselves are small, the model does not waste capacity chasing them.
  • For large predictions (scripts, stylesheets): the allowed variance grows with \mu^{1.5}, so the model tolerates proportionally larger absolute errors. A 10KB error on a 500KB script is fine; a 10KB error on a 1KB resource is not.

This is exactly the right inductive bias. The model learns to be proportionally accurate across the entire range rather than minimizing absolute error (which would over-focus on the heavy tail) or relative error (which would over-focus on the near-zero mass).
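The gradient and Hessian formulas above drop directly into a custom objective. A numpy sketch that also checks the gradient against a finite difference (function names are mine; XGBoost's built-in reg:tweedie computes the equivalent internally):

```python
import numpy as np

def tweedie_loss(y, F, p=1.5):
    # Tweedie deviance loss at the log-link prediction F = log(mu)
    return -y * np.exp((1 - p) * F) / (1 - p) + np.exp((2 - p) * F) / (2 - p)

def tweedie_grad_hess(y, F, p=1.5):
    # Gradient and Hessian from the closed forms above
    grad = -y * np.exp((1 - p) * F) + np.exp((2 - p) * F)
    hess = -y * (1 - p) * np.exp((1 - p) * F) + (2 - p) * np.exp((2 - p) * F)
    return grad, hess

y = np.array([0.0, 0.5, 3.0])          # includes an exact zero
F = np.array([-1.0, 0.2, 1.1])
grad, hess = tweedie_grad_hess(y, F)

# Verify the gradient with a central finite difference
eps = 1e-6
num_grad = (tweedie_loss(y, F + eps) - tweedie_loss(y, F - eps)) / (2 * eps)
```

Note that for 1 < p < 2 the Hessian is strictly positive, which is what a second-order method needs.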

| Loss | Pseudo-residual | Best for |
| --- | --- | --- |
| Squared error | y - F | Symmetric, light-tailed data |
| Huber | Clipped at \delta | Outlier-contaminated data |
| Tweedie | -y e^{(1-p)F} + e^{(2-p)F} | Zero-inflated, right-skewed |
| Quantile | \tau - \mathbf{1}(y < F) | Prediction intervals |
| Log-cosh | \tanh(y - F) | Smooth Huber approximation |

Quantile Loss

For the \tau-th quantile, the pinball loss is:

L_\tau(y, F) = \begin{cases} \tau (y - F) & \text{if } y \ge F \\ (1-\tau)(F - y) & \text{if } y < F \end{cases}

The pseudo-residual is r_i = \tau if y_i > F(\mathbf{x}_i) and r_i = -(1-\tau) otherwise. Fitting models at \tau = 0.1 and \tau = 0.9 gives an 80% prediction interval. Useful for uncertainty quantification in deployment, where point predictions are insufficient.
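A minimal sketch of the pinball loss and its pseudo-residuals (names and values are illustrative):

```python
import numpy as np

def pinball_loss(y, F, tau):
    r = y - F
    return np.where(r >= 0, tau * r, (tau - 1) * r)

def quantile_pseudo_residuals(y, F, tau):
    # tau where the model under-predicts, -(1 - tau) otherwise
    return np.where(y > F, tau, -(1.0 - tau))

y = np.array([1.0, 5.0, 2.0])
F = np.array([2.0, 2.0, 2.0])
loss = pinball_loss(y, F, tau=0.9)
r = quantile_pseudo_residuals(y, F, tau=0.9)
```

Training two boosted models with tau=0.1 and tau=0.9 as the loss yields the 80% interval described above.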

Log-Cosh

L(y, F) = \log(\cosh(y - F))

The pseudo-residual is \tanh(y - F), which is approximately (y - F) for small errors and \pm 1 for large errors. This is a smooth, twice-differentiable approximation to Huber loss, which is convenient for second-order methods like XGBoost that need a well-behaved Hessian.


XGBoost: Regularized Gradient Boosting

XGBoost (Chen & Guestrin, 2016) adds three key innovations to Friedman’s gradient boosting: a regularized objective, a second-order Taylor approximation for efficient split finding, and systems-level optimizations.

Regularized Objective

The objective at round m is:

\mathcal{L}^{(m)} = \sum_{i=1}^n L(y_i, \hat{y}_i^{(m-1)} + h_m(\mathbf{x}_i)) + \Omega(h_m)

where the regularization term penalizes tree complexity:

\Omega(h) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 + \alpha \sum_{j=1}^T |w_j|

Here T is the number of leaves, w_j is the weight (prediction) of leaf j, \gamma penalizes adding more leaves, \lambda is L2 regularization on leaf weights, and \alpha is L1 regularization. This is the tree analogue of elastic net regularization on the leaf outputs.

Second-Order Approximation

XGBoost takes a second-order Taylor expansion of the loss around the current predictions \hat{y}_i^{(m-1)}:

\mathcal{L}^{(m)} \approx \sum_{i=1}^n \left[ L(y_i, \hat{y}_i^{(m-1)}) + g_i h_m(\mathbf{x}_i) + \frac{1}{2} h_i \, h_m(\mathbf{x}_i)^2 \right] + \Omega(h_m)

where g_i = \frac{\partial L}{\partial \hat{y}^{(m-1)}} is the gradient and h_i = \frac{\partial^2 L}{\partial (\hat{y}^{(m-1)})^2} is the Hessian, both evaluated at the current prediction. Dropping the constant term and grouping by leaf:

\tilde{\mathcal{L}}^{(m)} = \sum_{j=1}^T \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T

where G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i are the sum of gradients and Hessians for examples falling in leaf j.

Setting \frac{\partial \tilde{\mathcal{L}}}{\partial w_j} = 0 gives the optimal leaf weight:

w_j^* = -\frac{G_j}{H_j + \lambda}

Substituting back gives the optimal objective value:

\tilde{\mathcal{L}}^* = -\frac{1}{2}\sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T

The Split Gain Formula

When considering splitting a leaf into left (L) and right (R) children, the reduction in objective is:

\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma

This is elegant. The first two terms measure how well the left and right children fit (in a Newton-step sense), the third term measures how well the unsplit leaf fits, and \gamma is the cost of adding a split. If \text{Gain} \le 0, the split is not made — this provides built-in pre-pruning.

The Hessian H acts as a natural weighting: examples where the loss is more curved (higher h_i) contribute more to the denominator, effectively down-weighting uncertain regions. The min_child_weight parameter sets a threshold on H_j — a leaf must accumulate at least this much Hessian sum, preventing splits that rely on too few or too uncertain examples.
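A worked example makes the mechanics concrete. The sketch below computes the optimal leaf weight and the split gain from toy gradient/Hessian sums (all numbers are illustrative):

```python
def leaf_weight(G, H, lam):
    # w* = -G / (H + lambda)
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Left child: strongly negative gradient sum; right child: strongly positive.
# Separating them should produce a large positive gain.
gain = split_gain(G_L=-10.0, H_L=5.0, G_R=8.0, H_R=4.0, lam=1.0, gamma=0.5)
w_left = leaf_weight(-10.0, 5.0, 1.0)  # positive: pushes predictions up
```

When the children's gradient sums cancel in the parent, the parent score is small, the bracket is large, and the split pays for its \gamma cost.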

Systems Optimizations

  • Histogram-based splits. Rather than evaluating every possible split point, XGBoost (and LightGBM) bucket continuous features into discrete bins (default 256). This reduces split-finding complexity from O(n \cdot d) to O(\text{bins} \cdot d) and improves cache locality.
  • Column subsampling. Sampling a fraction of features at each tree or split level, analogous to random forests. Reduces correlation between trees and provides regularization.
  • Row subsampling. Training each tree on a random subset of examples (subsample < 1.0). Introduces stochasticity that helps generalization.
  • Missing value handling. For each split, XGBoost learns a default direction (left or right) for missing values by trying both and choosing the one that maximizes gain. This is learned, not heuristic.
  • Sparsity-aware computation. XGBoost only iterates over non-missing values when computing split gains, making it efficient on sparse data (common in NLP and recommender systems).

LightGBM and CatBoost

LightGBM

LightGBM (Ke et al., 2017) introduced two algorithmic innovations and a different tree growth strategy that make it significantly faster than XGBoost on large datasets.

Leaf-wise (best-first) growth. XGBoost grows trees level-wise: all nodes at depth d are split before moving to depth d+1. LightGBM grows leaf-wise: at each step, it splits the leaf with the largest gain reduction, regardless of depth. Leaf-wise growth produces more complex, asymmetric trees that can fit the same loss with fewer leaves. The risk is overfitting on small datasets, mitigated by max_depth and num_leaves constraints.

Gradient-based One-Side Sampling (GOSS). Not all training examples are equally informative. Examples with large gradients (under-fit by the current model) contribute more to the gain computation than examples with small gradients (already well-fit). GOSS keeps all examples with large gradients and randomly samples a fraction of examples with small gradients, up-weighting the sampled examples to maintain unbiased gradient estimates. This is stochastic gradient descent applied to the split-finding step.
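The sampling step can be sketched in a few lines (the roles of a, the keep fraction, and b, the sample fraction, follow the description above; the implementation details are mine, not LightGBM's):

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, seed=0):
    # Keep the top-a fraction by |gradient|; sample a b fraction of the rest
    # and up-weight it by (1 - a) / b to keep gradient sums roughly unbiased.
    rng = np.random.default_rng(seed)
    n = len(grad)
    order = np.argsort(-np.abs(grad))
    top, rest = order[:int(a * n)], order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.concatenate([np.ones(len(top)),
                              np.full(len(sampled), (1 - a) / b)])
    return idx, weights

grad = np.random.default_rng(1).standard_normal(1000)
idx, w = goss_sample(grad)
approx = np.sum(w * grad[idx])  # weighted estimate of grad.sum()
```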

Exclusive Feature Bundling (EFB). In high-dimensional sparse data, many features are mutually exclusive (rarely nonzero simultaneously). EFB bundles such features into single features by adding offsets, reducing the effective number of features and the cost of histogram construction. This is particularly effective for one-hot encoded categoricals.

CatBoost

CatBoost (Prokhorenkova et al., 2018) addresses a subtle but important source of overfitting: target leakage in categorical feature encoding.

The problem. Standard approaches compute target statistics (e.g., mean target per category) using the entire training set, then use these statistics as features. But the target value of example i was used to compute the statistic that example i then sees as a feature — this is target leakage, and it biases the model toward overfitting on frequent categories.

Ordered boosting. CatBoost uses a random permutation of training examples. For example i in the permutation, target statistics are computed using only examples that precede it in the permutation. This eliminates target leakage at the cost of increased variance, which CatBoost addresses by averaging over multiple permutations.
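A sketch of ordered target statistics under a single permutation (the smoothing rule, a prior with unit weight, is one common choice and not necessarily CatBoost's exact formula):

```python
import numpy as np

def ordered_target_stats(cats, y, prior, perm):
    # Encode example i with the smoothed mean target of examples that
    # PRECEDE it in the permutation and share its category.
    enc = np.empty(len(y))
    sums, counts = {}, {}
    for pos in perm:
        c = cats[pos]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[pos] = (s + prior) / (n + 1)   # running mean, smoothed by the prior
        sums[c], counts[c] = s + y[pos], n + 1
    return enc

cats = np.array(["a", "a", "b", "a", "b"])
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
enc = ordered_target_stats(cats, y, prior=0.5, perm=np.arange(5))
# enc[0] uses only the prior; enc[1] already sees y[0]; and so on
```

No example's own target ever enters its encoding, which is the whole point.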

Oblivious trees. CatBoost defaults to symmetric (oblivious) decision trees: at each depth level, the same split condition is applied to all nodes. This acts as strong regularization and enables efficient prediction via bit manipulation.

| Framework | Tree growth | Key innovation | Best for |
| --- | --- | --- | --- |
| XGBoost | Level-wise | Second-order splits, regularized objective | General-purpose, structured data |
| LightGBM | Leaf-wise | GOSS + EFB | Large datasets, high dimensionality |
| CatBoost | Symmetric | Ordered boosting | Categorical-heavy data |

Regularization

Gradient boosting is a greedy procedure with no built-in capacity constraint (you can always add more trees). Regularization is not optional — it is what makes the method work.

Shrinkage (learning rate \eta). Scale each tree’s contribution by \eta \in (0, 1]. Smaller \eta requires more trees but produces better generalization. The intuition: small steps in function space allow later trees to correct mistakes of earlier trees more effectively. A common heuristic is to set \eta between 0.01 and 0.1 and use early stopping to determine M.

Early stopping. Monitor validation loss and stop adding trees when it has not improved for k consecutive rounds. This is the most important regularizer in practice — it directly controls the bias-variance tradeoff by limiting the number of boosting rounds.
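A post-hoc version of early stopping can be sketched with scikit-learn's staged predictions (synthetic data; the patience value is illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(600)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                  max_depth=3, random_state=0).fit(X_tr, y_tr)

# Validation loss after each boosting round, via staged predictions
val_loss = [np.mean((y_val - p) ** 2) for p in model.staged_predict(X_val)]

patience, best, stop = 20, 0, len(val_loss)
for i, v in enumerate(val_loss):
    if v < val_loss[best]:
        best = i                       # new best round
    elif i - best >= patience:
        stop = i                       # no improvement for `patience` rounds
        break
n_trees_kept = best + 1
```

scikit-learn can also do this during fit via n_iter_no_change and validation_fraction; XGBoost and LightGBM expose early_stopping_rounds.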

Max depth. Limiting tree depth to d constrains the interaction order of features. A depth-d tree can model interactions between at most d features. Typical values range from 4 to 8. Deeper trees reduce bias but increase variance and training time.

L1 and L2 on leaf weights. The \alpha and \lambda terms in the XGBoost objective shrink leaf weights toward zero. L2 (\lambda) smooths predictions across leaves; L1 (\alpha) encourages sparse leaf weights (some leaves predict exactly zero after regularization). In the gain formula, \lambda appears in the denominator H_j + \lambda, damping the influence of leaves with low Hessian sums.

Minimum samples per leaf. In XGBoost, min_child_weight sets a threshold on the Hessian sum H_j. For squared error (h_i = 1 for all i), this is equivalent to a minimum sample count. For other losses where the Hessian varies, it is more nuanced — it requires minimum “confidence” rather than minimum count.

Subsampling. Row subsampling (subsample) and column subsampling (colsample_bytree, colsample_bylevel) inject stochasticity. This decorrelates the trees in the ensemble and reduces variance, similar in spirit to the random feature selection in random forests.


SHAP for Interpretability

A gradient boosting model with 500 trees of depth 8 has on the order of 500 \times 2^8 \approx 128{,}000 leaves. Understanding what such a model has learned requires a principled attribution method.

Shapley Values

SHAP (SHapley Additive exPlanations) applies Shapley values from cooperative game theory to machine learning predictions. For a prediction f(\mathbf{x}), the Shapley value of feature j is the average marginal contribution of feature j across all possible coalitions of features:

\phi_j = \sum_{S \subseteq \mathcal{F} \setminus \{j\}} \frac{|S|!\,(|\mathcal{F}| - |S| - 1)!}{|\mathcal{F}|!} \left[ f(S \cup \{j\}) - f(S) \right]

where \mathcal{F} is the set of all features and f(S) is the model’s expected output conditioned on the features in S being fixed. Computing this exactly requires 2^{|\mathcal{F}|} evaluations — exponential in the number of features.
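For tiny feature sets the formula can be evaluated directly. The sketch below enumerates all coalitions for a hypothetical three-feature value function and recovers attributions that satisfy local accuracy (the value function is invented for illustration):

```python
from itertools import combinations
from math import factorial

def exact_shapley(value, features):
    # Brute-force Shapley values: enumerate every coalition S not containing j
    # and average j's marginal contribution with the Shapley weights.
    n = len(features)
    phi = {}
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value(S | {j}) - value(S))
        phi[j] = total
    return phi

def value(S):
    # Hypothetical value function: additive effects plus one interaction
    v = 1.0                              # base value E[f(x)]
    if "a" in S: v += 2.0
    if "b" in S: v += 1.0
    if "a" in S and "c" in S: v += 4.0   # interaction shared between a and c
    return v

phi = exact_shapley(value, ["a", "b", "c"])
```

The a-c interaction is split evenly between the two features, and the attributions sum to value(full) minus value(empty).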

TreeSHAP

Lundberg et al. (2020) showed that for tree-based models, exact Shapley values can be computed in O(TLD^2) time, where T is the number of trees, L is the maximum number of leaves, and D is the maximum depth. The key insight is that tree structure constrains the coalitions that matter: a feature only interacts with features on its path from root to leaf.

TreeSHAP recursively tracks the proportion of all possible feature orderings that are consistent with each path through the tree. This avoids the exponential enumeration by exploiting the tree’s conditional independence structure.

SHAP Properties

SHAP satisfies three axiomatic properties:

  1. Local accuracy. f(\mathbf{x}) = \phi_0 + \sum_{j=1}^{|\mathcal{F}|} \phi_j(\mathbf{x}), where \phi_0 = \mathbb{E}[f(\mathbf{x})]. The attributions sum to the prediction.
  2. Missingness. If feature j is missing (not used by the model), then \phi_j = 0.
  3. Consistency. If a model changes so that feature j‘s marginal contribution increases or stays the same for all possible coalitions, then \phi_j does not decrease.

These properties are uniquely satisfied by Shapley values (Shapley, 1953). No other attribution method satisfies all three.

Application to Tracker Cost Prediction

In the Firefox tracker cost model, SHAP analysis revealed that domain_type_median — the historical median transfer size for each tracker domain — dominated all other features at 8.65% gain. This makes intuitive sense: a domain that historically serves 200KB scripts will probably continue to serve 200KB scripts. The SHAP dependence plot showed a near-linear relationship between domain_type_median and the SHAP value, with the model using other features (content type, domain frequency, compressed size) to adjust for deviations from the median.


Hyperparameter Tuning

Gradient boosting has many interacting hyperparameters. Grid search is infeasible; random search is wasteful. Bayesian optimization (e.g., Optuna, Hyperopt) is the standard approach.

Key Hyperparameters

ParameterRangeEffect
n_estimators100—5000Number of boosting rounds. Use early stopping.
learning_rate (η\eta)0.01—0.3Step size. Smaller = more rounds needed, better generalization.
max_depth3—10Maximum tree depth. Controls interaction order.
min_child_weight1—100Minimum Hessian sum per leaf. Prevents splits on small groups.
subsample0.5—1.0Row subsampling fraction.
colsample_bytree0.5—1.0Feature subsampling per tree.
lambda (L2 reg)0—10L2 penalty on leaf weights.
alpha (L1 reg)0—10L1 penalty on leaf weights.
gamma0—5Minimum gain for a split.

Bayesian Optimization with Optuna

Optuna uses a Tree-structured Parzen Estimator (TPE) to model the relationship between hyperparameters and validation loss. At each trial, TPE fits two density estimators — one over hyperparameter configurations that produced good results (l(\mathbf{x})) and one over configurations that produced bad results (g(\mathbf{x})) — and selects the next configuration by maximizing l(\mathbf{x}) / g(\mathbf{x}).

The typical setup:

import optuna
import xgboost as xgb

# Assumes dtrain is an xgb.DMatrix built from the training data,
# e.g. dtrain = xgb.DMatrix(X_train, label=y_train)
def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 100),
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'objective': 'reg:tweedie',
        'tweedie_variance_power': 1.5,
    }
    cv_results = xgb.cv(params, dtrain, num_boost_round=1000,
                        nfold=5, early_stopping_rounds=50,
                        metrics='mae')
    return cv_results['test-mae-mean'].min()

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=40)

With 40 trials and 5-fold cross-validation, this explores the hyperparameter space efficiently. The log-scale sampling for learning_rate and lambda reflects their multiplicative effect on model behavior.


Connection to the Tracker Cost Model

The Firefox Content Blocking team needed to estimate the network cost of third-party trackers blocked before they load. The challenge: predicting transfer sizes for resources that are never fetched, using only metadata available at block time (domain, content type, protocol, etc.).

Architecture

XGBoost with Tweedie loss (p = 1.5), trained on observed (non-blocked) tracker transfers and applied to blocked ones. The key configuration:

| Parameter | Value |
| --- | --- |
| n_estimators | 500 |
| max_depth | 8 |
| learning_rate | 0.05 |
| objective | reg:tweedie |
| tweedie_variance_power | 1.5 |
| subsample | 0.8 |
| Tuning method | Optuna, 40 trials, 5-fold CV |

Results

| Model | MAE (bytes) | Improvement over LUT |
| --- | --- | --- |
| XGBoost + Tweedie | 3,466 | 47.5% |
| XGBoost + Squared Error | 4,074 | 38.3% |
| XGBoost + Huber | 3,997 | 39.4% |
| Lookup Table (baseline) | 6,601 | — |

The loss function choice produced a 23% gap in MAE between Tweedie and squared error on the same architecture. This is a larger effect than any hyperparameter or feature engineering decision. The reason: Tweedie’s variance scaling matches the data-generating process. Tracker transfer sizes are not normally distributed — they are zero-inflated and right-skewed, with beacons at 0 bytes and scripts exceeding 1MB. Squared error wastes model capacity chasing large absolute errors on heavy scripts; Tweedie allocates capacity proportionally.

Aggregation

The per-request MAE of 3,466 bytes sounds large, but the model was evaluated on aggregate accuracy at the page level. Across 7,376 test pages, predictions of total tracker cost per page had a median error of 2.3%. Individual errors cancel under aggregation — a consequence of the law of large numbers when errors are approximately unbiased. This is the production-relevant metric: Firefox reports total tracker cost savings per page, not per-request estimates.
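This cancellation effect is easy to reproduce in a simulation (all numbers here are synthetic, not the production data):

```python
import numpy as np

# Per-request predictions with large but roughly unbiased errors,
# aggregated to the page level.
rng = np.random.default_rng(0)
n_pages, per_page = 2000, 50
true = rng.gamma(shape=0.3, scale=20_000, size=(n_pages, per_page))
pred = true + rng.normal(0, 5_000, size=true.shape)   # unbiased noise

per_request_mae = np.mean(np.abs(pred - true))
rel_per_request = per_request_mae / true.mean()
page_rel_err = np.abs(pred.sum(axis=1) - true.sum(axis=1)) / true.sum(axis=1)
median_page_err = np.median(page_rel_err)             # shrinks after summing
```

The per-request relative error is large, but summing 50 roughly independent, zero-mean errors per page shrinks the relative error of the page total by roughly a factor of the square root of the page size.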

The model runs in the Nimbus experimentation platform and serves cost estimates in the Enhanced Tracking Protection panel for approximately 250 million Firefox users.