# Gradient Boosting
Gradient boosting constructs an additive model by performing gradient descent in function space. Where neural networks optimize a fixed-architecture model by updating parameters, gradient boosting optimizes by adding new functions — each one a small tree that corrects the errors of the ensemble so far. This reframing, from parameter space to function space, is the central insight.
## Additive Modeling and Functional Gradient Descent

We want to find a function $F^*$ that minimizes the expected loss:

$$F^* = \arg\min_F \, \mathbb{E}_{x,y}\left[L(y, F(x))\right]$$

In practice we minimize the empirical risk over $n$ training examples:

$$\hat{F} = \arg\min_F \sum_{i=1}^{n} L(y_i, F(x_i))$$

Gradient boosting approximates $F$ as a sum of base learners:

$$F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} \gamma_m h_m(x)$$

where $F_0$ is a constant initialization, each $h_m$ is a regression tree, $\gamma_m$ is a step size found by line search, and $\nu$ is the shrinkage (learning rate).

The key insight is that at each iteration, we want to add the function that most reduces the loss. If we could take a functional gradient, the steepest descent direction at the current model $F_{m-1}$ is:

$$-g_m(x_i) = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$
This is a vector of values — one per training example — representing the direction in “prediction space” that most decreases the loss. Since we cannot store an arbitrary function, we fit a tree to these negative gradients. The tree acts as a parametric approximation of the functional gradient, one that generalizes to unseen points.
This is why gradient boosting is sometimes called “stage-wise additive modeling via gradient descent in function space” (Friedman, 2001). Each tree is not fitting the original targets — it is fitting the direction of steepest improvement.
## The Algorithm

Initialization. Set $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$. For squared error this is the mean $\bar{y}$. For Tweedie deviance it is the log of the mean.

For $m = 1, \dots, M$:

1. Compute pseudo-residuals. For each training example $i$:
   $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$
2. Fit a regression tree to the pseudo-residuals $\{(x_i, r_{im})\}_{i=1}^{n}$, yielding terminal regions $R_{jm}$, $j = 1, \dots, J_m$.
3. Compute optimal leaf values via line search within each terminal region:
   $$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\!\left(y_i, F_{m-1}(x_i) + \gamma\right)$$
4. Update the model:
   $$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, \mathbb{1}\!\left[x \in R_{jm}\right]$$

Output $F_M(x)$.
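The algorithm above is compact enough to implement directly. The sketch below is an illustrative toy, not a production implementation: it boosts one-split stumps on a single feature under squared error, where the pseudo-residuals are literal residuals and the line-search leaf value is simply the mean residual in each region. All function names are my own.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (one split) to residuals r."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, vl, vr = best
    return lambda q: np.where(q <= t, vl, vr)

def gradient_boost(x, y, n_rounds=50, nu=0.1):
    """Stage-wise additive modeling: each stump fits the negative gradient
    of squared error, i.e. the current residuals y - F(x)."""
    f0 = y.mean()                       # F_0 = argmin_c sum (y - c)^2
    pred = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred            # pseudo-residuals for L2 loss
        h = fit_stump(x, residuals)
        pred += nu * h(x)               # shrunken update F_m = F_{m-1} + nu * h
        stumps.append(h)
    return lambda q: f0 + nu * sum(h(q) for h in stumps)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)
model = gradient_boost(x, y)
print(np.mean((model(x) - y) ** 2))   # training MSE, far below var(y)
```

Each round fits only the direction of steepest improvement, never the original targets, exactly as in the functional-gradient view.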
## Squared Error Walkthrough

For $L(y, F(x)) = \frac{1}{2}\left(y - F(x)\right)^2$, the pseudo-residual is:

$$r_{im} = y_i - F_{m-1}(x_i)$$

These are literal residuals. The tree in each round fits the current errors, and the leaf values are simply the mean residual within each leaf. This is the special case that makes gradient boosting intuitive — but the framework generalizes to any differentiable loss.
## Loss Functions in Depth
The choice of loss function is the single most impactful decision in a gradient boosting pipeline. In my experience deploying XGBoost at scale — serving predictions to 250M Firefox users — the loss function mattered more than architecture, hyperparameter tuning, or feature engineering. A 23% gap in MAE separated the best and worst loss functions on the same model architecture.
### Squared Error (L2)
Simple and well-understood. The pseudo-residuals are literal residuals. The problem: it treats a 100KB error on a 50KB script the same as a 100KB error on a 500KB script. For data spanning multiple orders of magnitude, this is a poor inductive bias — the model chases large absolute errors, which are almost always from the heavy tail.
### Huber Loss

Quadratic near zero, linear in the tails. The transition point $\delta$ controls robustness to outliers. The pseudo-residual is:

$$r_{im} = \begin{cases} y_i - F_{m-1}(x_i) & \text{if } |y_i - F_{m-1}(x_i)| \le \delta \\ \delta \cdot \operatorname{sign}\!\left(y_i - F_{m-1}(x_i)\right) & \text{otherwise} \end{cases}$$

This clips the gradient for large residuals, preventing outliers from dominating the tree fits. In practice, set $\delta$ to the $\alpha$-quantile (e.g., 90th percentile) of the absolute residuals.
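The clipping rule is a one-liner in numpy. This is an illustrative sketch (the function name is my own) with $\delta$ set adaptively from a quantile of the absolute residuals, as described above:

```python
import numpy as np

def huber_pseudo_residuals(y, pred, q=0.9):
    """Negative gradient of Huber loss w.r.t. predictions.
    delta is set adaptively to the q-th quantile of |residuals|."""
    r = y - pred
    delta = np.quantile(np.abs(r), q)    # transition point
    # quadratic zone: residual itself; linear zone: clipped at +/- delta
    return np.clip(r, -delta, delta), delta

y = np.array([1.0, 2.0, 3.0, 100.0])     # one outlier
pred = np.zeros(4)
pr, delta = huber_pseudo_residuals(y, pred)
print(pr)   # the outlier's contribution is capped at delta
```

Small residuals pass through untouched; the outlier's pull on the next tree is bounded by $\delta$.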
### Tweedie Loss

This is the loss function that matters most for zero-inflated, right-skewed data — exactly the distribution seen in web resource transfer sizes.

The Tweedie family has variance function $V(\mu) = \mu^p$, where $p$ is the power parameter. For $1 < p < 2$, the distribution is a compound Poisson-Gamma: a Poisson number of Gamma-distributed claims. The key property is a point mass at zero plus a continuous right tail.

The Tweedie deviance loss (what XGBoost minimizes when you set `objective='reg:tweedie'`) is:

$$L(y, \hat{y}) = -y \, \frac{e^{(1-p)\hat{y}}}{1-p} + \frac{e^{(2-p)\hat{y}}}{2-p}$$

where $\hat{y} = \log \mu$ is the log-link prediction. The gradient and Hessian are:

$$g = -y\, e^{(1-p)\hat{y}} + e^{(2-p)\hat{y}}, \qquad h = -y(1-p)\, e^{(1-p)\hat{y}} + (2-p)\, e^{(2-p)\hat{y}}$$
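These formulas translate directly into the `(grad, hess)` pair that a custom XGBoost objective returns. The numpy sketch below is illustrative (the function name is my own; in practice the built-in `reg:tweedie` objective should be preferred):

```python
import numpy as np

def tweedie_grad_hess(y, score, rho=1.5):
    """Gradient and Hessian of the Tweedie deviance w.r.t. the raw
    (log-link) score, in the form a custom objective would return."""
    a = np.exp((1 - rho) * score)   # mu^(1 - rho)
    b = np.exp((2 - rho) * score)   # mu^(2 - rho)
    grad = -y * a + b
    hess = -y * (1 - rho) * a + (2 - rho) * b
    return grad, hess

y = np.array([0.0, 0.0, 5.0, 200.0])    # zero-inflated, right-skewed target
score = np.log(np.full(4, y.mean()))    # initialize at log of the mean
g, h = tweedie_grad_hess(y, score)
print(g, h)   # positive gradient pushes zeros down, negative pushes the tail up
```

For the exact zeros the gradient is positive (the prediction should fall); for the large observation it is negative (the prediction should rise); the Hessian stays positive for $1 < \rho < 2$, so Newton steps are well-defined.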
Why Tweedie works for tracker cost data. The Firefox tracker cost dataset has 39.5% exact zeros (beacons, pixels, empty responses) and a range spanning 5 orders of magnitude (0 to ~1.5MB). With $p = 1.5$, the variance scales as $\mu^{1.5}$. This means:
- For near-zero predictions (beacons): the expected variance is tiny, so modest absolute errors still produce large deviance. But because the predictions themselves are small, the model does not waste capacity chasing them.
- For large predictions (scripts, stylesheets): the allowed variance grows as $\mu^{1.5}$, so the model tolerates proportionally larger absolute errors. A 10KB error on a 500KB script is fine; a 10KB error on a 1KB resource is not.
This is exactly the right inductive bias. The model learns to be proportionally accurate across the entire range rather than minimizing absolute error (which would over-focus on the heavy tail) or relative error (which would over-focus on the near-zero mass).
| Loss | Pseudo-residual | Best for |
|---|---|---|
| Squared error | $y_i - F(x_i)$ | Symmetric, light-tailed data |
| Huber | Clipped at $\pm\delta$ | Outlier-contaminated data |
| Tweedie | $y\, e^{(1-p)\hat{y}} - e^{(2-p)\hat{y}}$ | Zero-inflated, right-skewed |
| Quantile | $\alpha$ or $\alpha - 1$ | Prediction intervals |
| Log-cosh | $\tanh\!\left(y_i - F(x_i)\right)$ | Smooth Huber approximation |
### Quantile Loss

For the $\alpha$-th quantile, the pinball loss is:

$$L_\alpha(y, F(x)) = \begin{cases} \alpha \left(y - F(x)\right) & \text{if } y \ge F(x) \\ (\alpha - 1)\left(y - F(x)\right) & \text{otherwise} \end{cases}$$

The pseudo-residual is $\alpha$ if $y_i > F(x_i)$ and $\alpha - 1$ otherwise. Fitting models at $\alpha = 0.1$ and $\alpha = 0.9$ gives an 80% prediction interval. Useful for uncertainty quantification in deployment, where point predictions are insufficient.
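A quick numerical check of why this works: the pinball loss is minimized at the $\alpha$-quantile of the targets. Illustrative numpy sketch (function names are my own):

```python
import numpy as np

def pinball_loss(y, pred, alpha):
    """Quantile (pinball) loss: alpha-weighted for under-prediction,
    (1 - alpha)-weighted for over-prediction."""
    r = y - pred
    return np.mean(np.where(r >= 0, alpha * r, (alpha - 1) * r))

# minimize over a grid of constant predictions
y = np.arange(1.0, 101.0)
candidates = np.linspace(1, 100, 200)
best = candidates[np.argmin([pinball_loss(y, c, 0.9) for c in candidates])]
print(best)   # close to the 90th percentile of y
```

The asymmetry (penalizing under-prediction 9x more heavily at $\alpha = 0.9$) is exactly what pushes the optimum to the upper tail.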
### Log-Cosh

The loss is $L(y, F(x)) = \log\cosh\!\left(F(x) - y\right)$. The gradient is $\tanh\!\left(F(x) - y\right)$, which is approximately $F(x) - y$ for small errors and $\pm 1$ for large errors. This is a smooth, twice-differentiable approximation to Huber loss, which is convenient for second-order methods like XGBoost that need a well-behaved Hessian.
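Because the gradient and Hessian are both bounded and smooth, log-cosh is straightforward to supply as a custom second-order objective. A minimal numpy sketch (the function name is my own):

```python
import numpy as np

def logcosh_grad_hess(y, pred):
    """Gradient tanh(pred - y) and Hessian 1/cosh^2(pred - y) of the
    log-cosh loss; both bounded, so Newton steps stay well-behaved."""
    d = pred - y
    grad = np.tanh(d)
    hess = 1.0 / np.cosh(d) ** 2
    return grad, hess

errs = np.array([-100.0, -0.01, 0.01, 100.0])
g, h = logcosh_grad_hess(np.zeros(4), errs)
print(g)   # saturates at -1 / +1 for huge errors, ~linear near zero
```

Unlike Huber, the Hessian never jumps discontinuously at a transition point; it decays smoothly toward zero in the tails.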
## XGBoost: Regularized Gradient Boosting
XGBoost (Chen & Guestrin, 2016) adds three key innovations to Friedman’s gradient boosting: a regularized objective, a second-order Taylor approximation for efficient split finding, and systems-level optimizations.
### Regularized Objective

The objective at round $t$ is:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} L\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

where the regularization term penalizes tree complexity:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|$$

Here $T$ is the number of leaves, $w_j$ is the weight (prediction) of leaf $j$, $\gamma$ penalizes adding more leaves, $\lambda$ is L2 regularization on leaf weights, and $\alpha$ is L1 regularization. This is the tree analogue of elastic net regularization on the leaf outputs.
### Second-Order Approximation

XGBoost takes a second-order Taylor expansion of the loss around the current predictions $\hat{y}_i^{(t-1)}$:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ L\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)$$

where $g_i = \partial_{\hat{y}} L(y_i, \hat{y})$ is the gradient and $h_i = \partial_{\hat{y}}^2 L(y_i, \hat{y})$ is the Hessian, both evaluated at the current prediction. Dropping the constant term and grouping by leaf:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ are the sums of gradients and Hessians for examples falling in leaf $j$.

Setting $\partial \tilde{\mathcal{L}}^{(t)} / \partial w_j = 0$ gives the optimal leaf weight:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

Substituting back gives the optimal objective value:

$$\tilde{\mathcal{L}}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
### The Split Gain Formula

When considering splitting a leaf into left ($L$) and right ($R$) children, the reduction in objective is:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda} \right] - \gamma$$

This is elegant. The first two terms measure how well the left and right children fit (in a Newton-step sense), the third term measures how well the unsplit leaf fits, and $\gamma$ is the cost of adding a split. If $\text{Gain} < 0$, the split is not made — this provides built-in pre-pruning.

The Hessian acts as a natural weighting: examples where the loss is more curved (higher $h_i$) contribute more to the denominator, effectively down-weighting uncertain regions. The `min_child_weight` parameter sets a threshold on $H_j$ — a leaf must accumulate at least this much Hessian sum, preventing splits that rely on too few or too uncertain examples.
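The leaf-weight and gain formulas are a few lines of numpy. The toy below is illustrative (function names are my own), using squared-error statistics $g_i = \hat{y}_i - y_i$, $h_i = 1$: a split that separates the two target clusters earns a positive gain, while a split that mixes them scores below the no-split baseline.

```python
import numpy as np

def leaf_weight(G, H, lam):
    """Optimal Newton-step leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(g, h, mask, lam=1.0, gamma=0.0):
    """XGBoost split gain for partitioning a leaf's examples by `mask`
    into left (True) and right (False) children."""
    GL, HL = g[mask].sum(), h[mask].sum()
    GR, HR = g[~mask].sum(), h[~mask].sum()
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# squared-error statistics at a current prediction of 0 for every example
y = np.array([1.0, 1.0, 10.0, 10.0])
g = 0.0 - y
h = np.ones(4)
good = split_gain(g, h, np.array([True, True, False, False]))  # separates clusters
bad = split_gain(g, h, np.array([True, False, True, False]))   # mixes them
print(good, bad)   # good is positive; bad is negative (pre-pruned)
```

With `gamma > 0` the threshold for accepting a split simply rises; the negative-gain split would be rejected either way.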
### Systems Optimizations
- Histogram-based splits. Rather than evaluating every possible split point, XGBoost (and LightGBM) bucket continuous features into discrete bins (default 256). This reduces split finding from $O(n)$ candidate thresholds per feature to $O(b)$ bin boundaries, with $b \ll n$, and improves cache locality.
- Column subsampling. Sampling a fraction of features at each tree or split level, analogous to random forests. Reduces correlation between trees and provides regularization.
- Row subsampling. Training each tree on a random subset of examples (the `subsample` fraction). Introduces stochasticity that helps generalization.
- Missing value handling. For each split, XGBoost learns a default direction (left or right) for missing values by trying both and choosing the one that maximizes gain. This is learned, not heuristic.
- Sparsity-aware computation. XGBoost only iterates over non-missing values when computing split gains, making it efficient on sparse data (common in NLP and recommender systems).
## LightGBM and CatBoost

### LightGBM
LightGBM (Ke et al., 2017) introduced two algorithmic innovations and a different tree growth strategy that make it significantly faster than XGBoost on large datasets.
Leaf-wise (best-first) growth. XGBoost grows trees level-wise: all nodes at depth $d$ are split before moving to depth $d+1$. LightGBM grows leaf-wise: at each step, it splits the leaf with the largest gain reduction, regardless of depth. Leaf-wise growth produces more complex, asymmetric trees that can fit the same loss with fewer leaves. The risk is overfitting on small datasets, mitigated by `max_depth` and `num_leaves` constraints.
Gradient-based One-Side Sampling (GOSS). Not all training examples are equally informative. Examples with large gradients (under-fit by the current model) contribute more to the gain computation than examples with small gradients (already well-fit). GOSS keeps all examples with large gradients and randomly samples a fraction of examples with small gradients, up-weighting the sampled examples to maintain unbiased gradient estimates. This is stochastic gradient descent applied to the split-finding step.
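The sampling step can be sketched in a few lines. This is an illustrative simplification of GOSS (the function name and defaults are my own), keeping a top fraction $a$ by gradient magnitude and a random fraction $b$ of the rest, re-weighted by $(1-a)/b$:

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, rng=None):
    """GOSS sketch: keep the top a-fraction of examples by |gradient|;
    sample a b-fraction of the rest and up-weight them by (1 - a) / b
    so that weighted gradient sums remain approximately unbiased."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(grad)
    order = np.argsort(-np.abs(grad))
    top = order[: int(a * n)]                 # large-gradient (under-fit) examples
    rest = order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.concatenate([np.ones(len(top)),
                              np.full(len(sampled), (1 - a) / b)])
    return idx, weights

rng = np.random.default_rng(42)
grad = rng.normal(0, 1, 10_000)
idx, w = goss_sample(grad, rng=rng)
print(len(idx), grad.sum(), (grad[idx] * w).sum())
```

Only 30% of the examples survive, yet the total effective weight equals the original sample size, so histogram statistics built from `(idx, w)` approximate those of the full data.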
Exclusive Feature Bundling (EFB). In high-dimensional sparse data, many features are mutually exclusive (rarely nonzero simultaneously). EFB bundles such features into single features by adding offsets, reducing the effective number of features and the cost of histogram construction. This is particularly effective for one-hot encoded categoricals.
### CatBoost

CatBoost (Prokhorenkova et al., 2018) addresses a subtle but important source of overfitting: target leakage in categorical feature encoding.

The problem. Standard approaches compute target statistics (e.g., mean target per category) using the entire training set, then use these statistics as features. But the target value of example $i$ was used to compute the statistic that example $i$ then sees as a feature — this is target leakage, and it biases the model toward overfitting on frequent categories.

Ordered boosting. CatBoost uses a random permutation of training examples. For example $i$ in the permutation, target statistics are computed using only examples that precede it in the permutation. This eliminates target leakage at the cost of increased variance, which CatBoost addresses by averaging over multiple permutations.
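The ordered statistic is simple to sketch. This is an illustrative single-permutation version with basic prior smoothing (names and the smoothing form are my own, not CatBoost's exact implementation):

```python
import numpy as np

def ordered_target_stats(cats, y, prior, rng=None):
    """Ordered target statistic sketch: encode each example's category
    using only examples that PRECEDE it in a random permutation,
    smoothed toward a prior -- its own target never leaks in."""
    if rng is None:
        rng = np.random.default_rng(0)
    perm = rng.permutation(len(cats))
    sums, counts = {}, {}
    enc = np.empty(len(cats))
    for i in perm:                        # walk the permutation in order
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + prior) / (n + 1)    # statistic from the "past" only
        sums[c] = s + y[i]
        counts[c] = n + 1
    return enc

cats = np.array(["a", "a", "b", "a", "b", "b"])
y = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
enc = ordered_target_stats(cats, y, prior=y.mean())
print(enc)   # no encoding uses the example's own target
```

The first example visited in the permutation has seen nothing and receives the pure prior; later examples of the same category converge toward the category mean.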
Oblivious trees. CatBoost defaults to symmetric (oblivious) decision trees: at each depth level, the same split condition is applied to all nodes. This acts as strong regularization and enables efficient prediction via bit manipulation.
| Framework | Tree growth | Key innovation | Best for |
|---|---|---|---|
| XGBoost | Level-wise | Second-order splits, regularized objective | General-purpose, structured data |
| LightGBM | Leaf-wise | GOSS + EFB | Large datasets, high dimensionality |
| CatBoost | Symmetric | Ordered boosting | Categorical-heavy data |
## Regularization
Gradient boosting is a greedy procedure with no built-in capacity constraint (you can always add more trees). Regularization is not optional — it is what makes the method work.
Shrinkage (learning rate $\nu$). Scale each tree's contribution by $\nu$. Smaller $\nu$ requires more trees but produces better generalization. The intuition: small steps in function space allow later trees to correct mistakes of earlier trees more effectively. A common heuristic is to set $\nu$ between 0.01 and 0.1 and use early stopping to determine $M$.

Early stopping. Monitor validation loss and stop adding trees when it has not improved for a fixed number of consecutive rounds (the patience). This is the most important regularizer in practice — it directly controls the bias-variance tradeoff by limiting the number of boosting rounds.
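The stopping rule itself is framework-independent. A minimal sketch (not the XGBoost API; function name and the synthetic validation curve are my own) over a recorded validation-loss sequence:

```python
def early_stop_round(val_losses, patience=50):
    """Return the number of boosting rounds to keep: stop once the
    validation loss has not improved for `patience` consecutive rounds."""
    best, best_round = float("inf"), 0
    for m, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_round = loss, m
        elif m - best_round >= patience:
            break                      # patience exhausted
    return best_round

# synthetic curve: loss falls, then creeps back up past round 100 (overfitting)
losses = [1.0 / m + 0.0005 * max(0, m - 100) for m in range(1, 501)]
print(early_stop_round(losses, patience=50))   # stops at the round-100 minimum
```

Returning the best round (rather than the round where patience ran out) is what `early_stopping_rounds` in XGBoost effectively gives you via the best-iteration bookkeeping.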
Max depth. Limiting tree depth to $d$ constrains the interaction order of features. A depth-$d$ tree can model interactions between at most $d$ features. Typical values range from 4 to 8. Deeper trees reduce bias but increase variance and training time.

L1 and L2 on leaf weights. The $\lambda$ and $\alpha$ terms in the XGBoost objective shrink leaf weights toward zero. L2 ($\lambda$) smooths predictions across leaves; L1 ($\alpha$) encourages sparse leaf weights (some leaves predict exactly zero after regularization). In the gain formula, $\lambda$ appears in the denominator $H_j + \lambda$, damping the influence of leaves with low Hessian sums.

Minimum samples per leaf. In XGBoost, `min_child_weight` sets a threshold on the Hessian sum $H_j$. For squared error ($h_i = 1$ for all $i$), this is equivalent to a minimum sample count. For other losses where the Hessian varies, it is more nuanced — it requires minimum "confidence" rather than minimum count.
Subsampling. Row subsampling (subsample) and column subsampling (colsample_bytree, colsample_bylevel) inject stochasticity. This decorrelates the trees in the ensemble and reduces variance, similar in spirit to the random feature selection in random forests.
## SHAP for Interpretability

A gradient boosting model with 500 trees of depth 8 has on the order of $500 \times 2^8 \approx 10^5$ leaves. Understanding what such a model has learned requires a principled attribution method.
### Shapley Values

SHAP (SHapley Additive exPlanations) applies Shapley values from cooperative game theory to machine learning predictions. For a prediction $f(x)$, the Shapley value of feature $i$ is the average marginal contribution of feature $i$ across all possible coalitions of features:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\left(|N| - |S| - 1\right)!}{|N|!} \left[ v(S \cup \{i\}) - v(S) \right]$$

where $N$ is the set of all features and $v(S)$ is the model's expected output conditioned on the features in $S$ being fixed. Computing this exactly requires $O(2^{|N|})$ evaluations — exponential in the number of features.
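For a handful of features the sum can be enumerated directly. The sketch below (an illustrative toy; the value function and coefficients are my own) computes exact Shapley values for a tiny additive model, where the attributions recover coefficient times feature deviation from the background mean:

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values for an n-player game by enumerating all
    coalitions per player -- feasible only for small n."""
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi.append(total)
    return phi

# toy additive model f(x) = 2*x0 + 3*x1; v(S) = f with features outside S
# replaced by their background means
x = [1.0, 1.0]          # instance to explain
mean = [0.0, 0.0]       # background expectation
def value(S):
    return sum((2, 3)[j] * (x[j] if j in S else mean[j]) for j in range(2))

phi = shapley_values(value, 2)
print(phi)   # [2.0, 3.0]: attributions sum to f(x) - E[f]
```

For an additive model the coalition ordering does not matter, so each feature simply receives its own term; interactions are what make the general weighted average necessary.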
### TreeSHAP

Lundberg et al. (2020) showed that for tree-based models, exact Shapley values can be computed in $O(TLD^2)$ time, where $T$ is the number of trees, $L$ is the maximum number of leaves, and $D$ is the maximum depth. The key insight is that tree structure constrains the coalitions that matter: a feature only interacts with features on its path from root to leaf.
TreeSHAP recursively tracks the proportion of all possible feature orderings that are consistent with each path through the tree. This avoids the exponential enumeration by exploiting the tree’s conditional independence structure.
### SHAP Properties
SHAP satisfies three axiomatic properties:
- Local accuracy. $f(x) = \phi_0 + \sum_{i=1}^{|N|} \phi_i$, where $\phi_0 = \mathbb{E}[f(x)]$. The attributions sum to the prediction.
- Missingness. If feature $i$ is missing (not used by the model), then $\phi_i = 0$.
- Consistency. If a model changes so that feature $i$'s marginal contribution increases or stays the same for all possible coalitions, then $\phi_i$ does not decrease.
These properties are uniquely satisfied by Shapley values (Shapley, 1953). No other attribution method satisfies all three.
### Application to Tracker Cost Prediction
In the Firefox tracker cost model, SHAP analysis revealed that domain_type_median — the historical median transfer size for each tracker domain — dominated all other features at 8.65% gain. This makes intuitive sense: a domain that historically serves 200KB scripts will probably continue to serve 200KB scripts. The SHAP dependence plot showed a near-linear relationship between domain_type_median and the SHAP value, with the model using other features (content type, domain frequency, compressed size) to adjust for deviations from the median.
## Hyperparameter Tuning
Gradient boosting has many interacting hyperparameters. Grid search is infeasible; random search is wasteful. Bayesian optimization (e.g., Optuna, Hyperopt) is the standard approach.
### Key Hyperparameters

| Parameter | Range | Effect |
|---|---|---|
| `n_estimators` | 100–5000 | Number of boosting rounds. Use early stopping. |
| `learning_rate` ($\nu$) | 0.01–0.3 | Step size. Smaller = more rounds needed, better generalization. |
| `max_depth` | 3–10 | Maximum tree depth. Controls interaction order. |
| `min_child_weight` | 1–100 | Minimum Hessian sum per leaf. Prevents splits on small groups. |
| `subsample` | 0.5–1.0 | Row subsampling fraction. |
| `colsample_bytree` | 0.5–1.0 | Feature subsampling per tree. |
| `lambda` (L2 reg) | 0–10 | L2 penalty on leaf weights. |
| `alpha` (L1 reg) | 0–10 | L1 penalty on leaf weights. |
| `gamma` | 0–5 | Minimum gain for a split. |
### Bayesian Optimization with Optuna

Optuna uses a Tree-structured Parzen Estimator (TPE) to model the relationship between hyperparameters and validation loss. At each trial, TPE fits two density estimators — one over hyperparameter configurations that produced good results ($l(x)$) and one over configurations that produced bad results ($g(x)$) — and selects the next configuration by maximizing the ratio $l(x)/g(x)$.
The typical setup:

```python
import optuna
import xgboost as xgb

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 100),
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'objective': 'reg:tweedie',
        'tweedie_variance_power': 1.5,
    }
    # dtrain: a prepared xgb.DMatrix of training features and targets
    cv_results = xgb.cv(params, dtrain, num_boost_round=1000,
                        nfold=5, early_stopping_rounds=50,
                        metrics='mae')
    return cv_results['test-mae-mean'].min()

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=40)
```

With 40 trials and 5-fold cross-validation, this explores the hyperparameter space efficiently. The log-scale sampling for `learning_rate` and `lambda` reflects their multiplicative effect on model behavior.
## Connection to the Tracker Cost Model
The Firefox Content Blocking team needed to estimate the network cost of third-party trackers blocked before they load. The challenge: predicting transfer sizes for resources that are never fetched, using only metadata available at block time (domain, content type, protocol, etc.).
### Architecture

XGBoost with Tweedie loss ($p = 1.5$), trained on observed (non-blocked) tracker transfers and applied to blocked ones. The key configuration:
| Parameter | Value |
|---|---|
| `n_estimators` | 500 |
| `max_depth` | 8 |
| `learning_rate` | 0.05 |
| `objective` | `reg:tweedie` |
| `tweedie_variance_power` | 1.5 |
| `subsample` | 0.8 |
| Tuning method | Optuna, 40 trials, 5-fold CV |
### Results
| Model | MAE (bytes) | Improvement over LUT |
|---|---|---|
| XGBoost + Tweedie | 3,466 | 47.5% |
| XGBoost + Squared Error | 4,074 | 38.3% |
| XGBoost + Huber | 3,997 | 39.4% |
| Lookup Table (baseline) | 6,601 | — |
The loss function choice produced a 23% gap in MAE between Tweedie and squared error on the same architecture. This is a larger effect than any hyperparameter or feature engineering decision. The reason: Tweedie’s variance scaling matches the data-generating process. Tracker transfer sizes are not normally distributed — they are zero-inflated and right-skewed, with beacons at 0 bytes and scripts exceeding 1MB. Squared error wastes model capacity chasing large absolute errors on heavy scripts; Tweedie allocates capacity proportionally.
### Aggregation
The per-request MAE of 3,466 bytes sounds large, but the model was evaluated on aggregate accuracy at the page level. Across 7,376 test pages, predictions of total tracker cost per page had a median error of 2.3%. Individual errors cancel under aggregation — a consequence of the law of large numbers when errors are approximately unbiased. This is the production-relevant metric: Firefox reports total tracker cost savings per page, not per-request estimates.
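The cancellation effect is easy to reproduce in simulation. The numbers below are synthetic (Gamma-distributed sizes and unbiased Gaussian per-request errors of my choosing, not the Firefox data), but the mechanism is the same:

```python
import numpy as np

rng = np.random.default_rng(7)

# many pages, each with many requests; per-request errors are unbiased
n_pages, requests_per_page = 1000, 40
true_sizes = rng.gamma(shape=0.5, scale=20_000, size=(n_pages, requests_per_page))
errors = rng.normal(0, 5_000, size=true_sizes.shape)
pred = true_sizes + errors

per_request_mae = np.abs(errors).mean()
page_totals = true_sizes.sum(axis=1)
page_rel_err = np.abs(pred.sum(axis=1) - page_totals) / page_totals

print(per_request_mae)           # thousands of bytes per request
print(np.median(page_rel_err))   # small relative error at the page level
```

Per-request errors of several KB shrink to a few percent relative error once summed over a page, because the page-total error grows like $\sqrt{k}$ while the page total itself grows like $k$.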
The model runs in the Nimbus experimentation platform and serves cost estimates in the Enhanced Tracking Protection panel for approximately 250 million Firefox users.