# Gradient Boosting
Gradient boosting constructs an additive model by performing gradient descent in function space. Where neural networks optimize a fixed-architecture model by updating parameters, gradient boosting optimizes by adding new functions — each one a small tree that corrects the errors of the ensemble so far. This reframing, from parameter space to function space, is the central insight.
## Additive Modeling and Functional Gradient Descent

We want to find a function $F^*$ that minimizes the expected loss:

$$F^* = \arg\min_F \, \mathbb{E}_{x,y}\left[L(y, F(x))\right]$$

In practice we minimize the empirical risk over $n$ training examples:

$$\hat{F} = \arg\min_F \sum_{i=1}^{n} L(y_i, F(x_i))$$

Gradient boosting approximates $F$ as a sum of base learners:

$$F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} \gamma_m h_m(x)$$

where $F_0$ is a constant initialization, each $h_m$ is a regression tree, $\gamma_m$ is a step size found by line search, and $\nu$ is the shrinkage (learning rate).

The key insight is that at each iteration, we want to add the function that most reduces the loss. If we could take a functional gradient, the steepest descent direction at the current model $F_{m-1}$ is:

$$-g_m(x_i) = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$
This is a vector of values — one per training example — representing the direction in “prediction space” that most decreases the loss. Since we cannot store an arbitrary function, we fit a tree to these negative gradients. The tree acts as a parametric approximation of the functional gradient, one that generalizes to unseen points.
This is why gradient boosting is sometimes called “stage-wise additive modeling via gradient descent in function space” (Friedman, 2001). Each tree is not fitting the original targets — it is fitting the direction of steepest improvement.
## The Algorithm

Initialization. Set $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$. For squared error this is the mean $\bar{y}$. For Tweedie deviance it is the log of the mean.

For $m = 1, \dots, M$:

1. Compute pseudo-residuals. For each training example $i$:
   $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$
2. Fit a regression tree to the pseudo-residuals $\{(x_i, r_{im})\}_{i=1}^{n}$, yielding terminal regions $R_{jm}$, $j = 1, \dots, J_m$.
3. Compute optimal leaf values via line search within each terminal region:
   $$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\!\left(y_i, F_{m-1}(x_i) + \gamma\right)$$
4. Update the model:
   $$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, \mathbb{1}\!\left[x \in R_{jm}\right]$$

Output $F_M(x)$.
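The algorithm above is compact enough to implement directly. The sketch below is an illustrative toy, not a production implementation: it boosts one-split stumps on a single feature under squared error, where the pseudo-residuals are literal residuals and the line-search leaf value is simply the mean residual in each region. All function names are my own.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (one split) to residuals r."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, vl, vr = best
    return lambda q: np.where(q <= t, vl, vr)

def gradient_boost(x, y, n_rounds=50, nu=0.1):
    """Stage-wise additive modeling: each stump fits the negative gradient
    of squared error, i.e. the current residuals y - F(x)."""
    f0 = y.mean()                       # F_0 = argmin_c sum (y - c)^2
    pred = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred            # pseudo-residuals for L2 loss
        h = fit_stump(x, residuals)
        pred += nu * h(x)               # shrunken update F_m = F_{m-1} + nu * h
        stumps.append(h)
    return lambda q: f0 + nu * sum(h(q) for h in stumps)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)
model = gradient_boost(x, y)
print(np.mean((model(x) - y) ** 2))   # training MSE, far below var(y)
```

Each round fits only the direction of steepest improvement, never the original targets, exactly as in the functional-gradient view.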
## Squared Error Walkthrough

For $L(y, F(x)) = \frac{1}{2}\left(y - F(x)\right)^2$, the pseudo-residual is:

$$r_{im} = y_i - F_{m-1}(x_i)$$

These are literal residuals. The tree in each round fits the current errors, and the leaf values are simply the mean residual within each leaf. This is the special case that makes gradient boosting intuitive — but the framework generalizes to any differentiable loss.
## Loss Functions in Depth
The choice of loss function is the single most impactful decision in a gradient boosting pipeline. In my experience deploying XGBoost at scale — serving predictions to 250M Firefox users — the loss function mattered more than architecture, hyperparameter tuning, or feature engineering. A 23% gap in MAE separated the best and worst loss functions on the same model architecture.
### Squared Error (L2)
Simple and well-understood. The pseudo-residuals are literal residuals. The problem: it treats a 100KB error on a 50KB script the same as a 100KB error on a 500KB script. For data spanning multiple orders of magnitude, this is a poor inductive bias — the model chases large absolute errors, which are almost always from the heavy tail.
### Huber Loss

Quadratic near zero, linear in the tails. The transition point $\delta$ controls robustness to outliers. The pseudo-residual is:

$$r_{im} = \begin{cases} y_i - F_{m-1}(x_i) & \text{if } |y_i - F_{m-1}(x_i)| \le \delta \\ \delta \cdot \operatorname{sign}\!\left(y_i - F_{m-1}(x_i)\right) & \text{otherwise} \end{cases}$$

This clips the gradient for large residuals, preventing outliers from dominating the tree fits. In practice, set $\delta$ to the $\alpha$-quantile (e.g., 90th percentile) of the absolute residuals.
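The clipping rule is a one-liner in numpy. This is an illustrative sketch (the function name is my own) with $\delta$ set adaptively from a quantile of the absolute residuals, as described above:

```python
import numpy as np

def huber_pseudo_residuals(y, pred, q=0.9):
    """Negative gradient of Huber loss w.r.t. predictions.
    delta is set adaptively to the q-th quantile of |residuals|."""
    r = y - pred
    delta = np.quantile(np.abs(r), q)    # transition point
    # quadratic zone: residual itself; linear zone: clipped at +/- delta
    return np.clip(r, -delta, delta), delta

y = np.array([1.0, 2.0, 3.0, 100.0])     # one outlier
pred = np.zeros(4)
pr, delta = huber_pseudo_residuals(y, pred)
print(pr)   # the outlier's contribution is capped at delta
```

Small residuals pass through untouched; the outlier's pull on the next tree is bounded by $\delta$.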
### Tweedie Loss

This is the loss function that matters most for zero-inflated, right-skewed data — exactly the distribution seen in web resource transfer sizes.

The Tweedie family has variance function $V(\mu) = \mu^p$, where $p$ is the power parameter. For $1 < p < 2$, the distribution is a compound Poisson-Gamma: a Poisson number of Gamma-distributed claims. The key property is a point mass at zero plus a continuous right tail.

The Tweedie deviance loss (what XGBoost minimizes when you set `objective='reg:tweedie'`) is:

$$L(y, \hat{y}) = -y \, \frac{e^{(1-p)\hat{y}}}{1-p} + \frac{e^{(2-p)\hat{y}}}{2-p}$$

where $\hat{y} = \log \mu$ is the log-link prediction. The gradient and Hessian are:

$$g = -y\, e^{(1-p)\hat{y}} + e^{(2-p)\hat{y}}, \qquad h = -y(1-p)\, e^{(1-p)\hat{y}} + (2-p)\, e^{(2-p)\hat{y}}$$
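These formulas translate directly into the `(grad, hess)` pair that a custom XGBoost objective returns. The numpy sketch below is illustrative (the function name is my own; in practice the built-in `reg:tweedie` objective should be preferred):

```python
import numpy as np

def tweedie_grad_hess(y, score, rho=1.5):
    """Gradient and Hessian of the Tweedie deviance w.r.t. the raw
    (log-link) score, in the form a custom objective would return."""
    a = np.exp((1 - rho) * score)   # mu^(1 - rho)
    b = np.exp((2 - rho) * score)   # mu^(2 - rho)
    grad = -y * a + b
    hess = -y * (1 - rho) * a + (2 - rho) * b
    return grad, hess

y = np.array([0.0, 0.0, 5.0, 200.0])    # zero-inflated, right-skewed target
score = np.log(np.full(4, y.mean()))    # initialize at log of the mean
g, h = tweedie_grad_hess(y, score)
print(g, h)   # positive gradient pushes zeros down, negative pushes the tail up
```

For the exact zeros the gradient is positive (the prediction should fall); for the large observation it is negative (the prediction should rise); the Hessian stays positive for $1 < \rho < 2$, so Newton steps are well-defined.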
Why Tweedie works for tracker cost data. The Firefox tracker cost dataset has 39.5% exact zeros (beacons, pixels, empty responses) and a range spanning 5 orders of magnitude (0 to ~1.5MB). With $p = 1.5$, the variance scales as $\mu^{1.5}$. This means:
- For near-zero predictions (beacons): the expected variance is tiny, so modest absolute errors still produce large deviance. But because the predictions themselves are small, the model does not waste capacity chasing them.
- For large predictions (scripts, stylesheets): the allowed variance grows as $\mu^{1.5}$, so the model tolerates proportionally larger absolute errors. A 10KB error on a 500KB script is fine; a 10KB error on a 1KB resource is not.
This is exactly the right inductive bias. The model learns to be proportionally accurate across the entire range rather than minimizing absolute error (which would over-focus on the heavy tail) or relative error (which would over-focus on the near-zero mass).
| Loss | Pseudo-residual | Best for |
|---|---|---|
| Squared error | $y_i - F(x_i)$ | Symmetric, light-tailed data |
| Huber | Clipped at $\pm\delta$ | Outlier-contaminated data |
| Tweedie | $y\, e^{(1-p)\hat{y}} - e^{(2-p)\hat{y}}$ | Zero-inflated, right-skewed |
| Quantile | $\alpha$ or $\alpha - 1$ | Prediction intervals |
| Log-cosh | $\tanh\!\left(y_i - F(x_i)\right)$ | Smooth Huber approximation |
### Quantile Loss

For the $\alpha$-th quantile, the pinball loss is:

$$L_\alpha(y, F(x)) = \begin{cases} \alpha \left(y - F(x)\right) & \text{if } y \ge F(x) \\ (\alpha - 1)\left(y - F(x)\right) & \text{otherwise} \end{cases}$$

The pseudo-residual is $\alpha$ if $y_i > F(x_i)$ and $\alpha - 1$ otherwise. Fitting models at $\alpha = 0.1$ and $\alpha = 0.9$ gives an 80% prediction interval. Useful for uncertainty quantification in deployment, where point predictions are insufficient.
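A quick numerical check of why this works: the pinball loss is minimized at the $\alpha$-quantile of the targets. Illustrative numpy sketch (function names are my own):

```python
import numpy as np

def pinball_loss(y, pred, alpha):
    """Quantile (pinball) loss: alpha-weighted for under-prediction,
    (1 - alpha)-weighted for over-prediction."""
    r = y - pred
    return np.mean(np.where(r >= 0, alpha * r, (alpha - 1) * r))

# minimize over a grid of constant predictions
y = np.arange(1.0, 101.0)
candidates = np.linspace(1, 100, 200)
best = candidates[np.argmin([pinball_loss(y, c, 0.9) for c in candidates])]
print(best)   # close to the 90th percentile of y
```

The asymmetry (penalizing under-prediction 9x more heavily at $\alpha = 0.9$) is exactly what pushes the optimum to the upper tail.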
### Log-Cosh

The loss is $L(y, F(x)) = \log\cosh\!\left(F(x) - y\right)$. The gradient is $\tanh\!\left(F(x) - y\right)$, which is approximately $F(x) - y$ for small errors and $\pm 1$ for large errors. This is a smooth, twice-differentiable approximation to Huber loss, which is convenient for second-order methods like XGBoost that need a well-behaved Hessian.
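Because the gradient and Hessian are both bounded and smooth, log-cosh is straightforward to supply as a custom second-order objective. A minimal numpy sketch (the function name is my own):

```python
import numpy as np

def logcosh_grad_hess(y, pred):
    """Gradient tanh(pred - y) and Hessian 1/cosh^2(pred - y) of the
    log-cosh loss; both bounded, so Newton steps stay well-behaved."""
    d = pred - y
    grad = np.tanh(d)
    hess = 1.0 / np.cosh(d) ** 2
    return grad, hess

errs = np.array([-100.0, -0.01, 0.01, 100.0])
g, h = logcosh_grad_hess(np.zeros(4), errs)
print(g)   # saturates at -1 / +1 for huge errors, ~linear near zero
```

Unlike Huber, the Hessian never jumps discontinuously at a transition point; it decays smoothly toward zero in the tails.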
## XGBoost: Regularized Gradient Boosting
XGBoost (Chen & Guestrin, 2016) adds three key innovations to Friedman’s gradient boosting: a regularized objective, a second-order Taylor approximation for efficient split finding, and systems-level optimizations.
### Regularized Objective

The objective at round $t$ is:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} L\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

where the regularization term penalizes tree complexity:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|$$

Here $T$ is the number of leaves, $w_j$ is the weight (prediction) of leaf $j$, $\gamma$ penalizes adding more leaves, $\lambda$ is L2 regularization on leaf weights, and $\alpha$ is L1 regularization. This is the tree analogue of elastic net regularization on the leaf outputs.
### Second-Order Approximation

XGBoost takes a second-order Taylor expansion of the loss around the current predictions $\hat{y}_i^{(t-1)}$:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ L\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)$$

where $g_i = \partial_{\hat{y}} L(y_i, \hat{y})$ is the gradient and $h_i = \partial_{\hat{y}}^2 L(y_i, \hat{y})$ is the Hessian, both evaluated at the current prediction. Dropping the constant term and grouping by leaf:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ are the sums of gradients and Hessians for examples falling in leaf $j$.

Setting $\partial \tilde{\mathcal{L}}^{(t)} / \partial w_j = 0$ gives the optimal leaf weight:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

Substituting back gives the optimal objective value:

$$\tilde{\mathcal{L}}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
### The Split Gain Formula

When considering splitting a leaf into left ($L$) and right ($R$) children, the reduction in objective is:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda} \right] - \gamma$$

This is elegant. The first two terms measure how well the left and right children fit (in a Newton-step sense), the third term measures how well the unsplit leaf fits, and $\gamma$ is the cost of adding a split. If $\text{Gain} < 0$, the split is not made — this provides built-in pre-pruning.

The Hessian acts as a natural weighting: examples where the loss is more curved (higher $h_i$) contribute more to the denominator, effectively down-weighting uncertain regions. The `min_child_weight` parameter sets a threshold on $H_j$ — a leaf must accumulate at least this much Hessian sum, preventing splits that rely on too few or too uncertain examples.
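The leaf-weight and gain formulas are a few lines of numpy. The toy below is illustrative (function names are my own), using squared-error statistics $g_i = \hat{y}_i - y_i$, $h_i = 1$: a split that separates the two target clusters earns a positive gain, while a split that mixes them scores below the no-split baseline.

```python
import numpy as np

def leaf_weight(G, H, lam):
    """Optimal Newton-step leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(g, h, mask, lam=1.0, gamma=0.0):
    """XGBoost split gain for partitioning a leaf's examples by `mask`
    into left (True) and right (False) children."""
    GL, HL = g[mask].sum(), h[mask].sum()
    GR, HR = g[~mask].sum(), h[~mask].sum()
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# squared-error statistics at a current prediction of 0 for every example
y = np.array([1.0, 1.0, 10.0, 10.0])
g = 0.0 - y
h = np.ones(4)
good = split_gain(g, h, np.array([True, True, False, False]))  # separates clusters
bad = split_gain(g, h, np.array([True, False, True, False]))   # mixes them
print(good, bad)   # good is positive; bad is negative (pre-pruned)
```

With `gamma > 0` the threshold for accepting a split simply rises; the negative-gain split would be rejected either way.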
### Systems Optimizations
- Histogram-based splits. Rather than evaluating every possible split point, XGBoost (and LightGBM) bucket continuous features into discrete bins (default 256). This reduces split finding from $O(n)$ candidate thresholds per feature to $O(b)$ bin boundaries, with $b \ll n$, and improves cache locality.
- Column subsampling. Sampling a fraction of features at each tree or split level, analogous to random forests. Reduces correlation between trees and provides regularization.
- Row subsampling. Training each tree on a random subset of examples (the `subsample` fraction). Introduces stochasticity that helps generalization.
- Missing value handling. For each split, XGBoost learns a default direction (left or right) for missing values by trying both and choosing the one that maximizes gain. This is learned, not heuristic.
- Sparsity-aware computation. XGBoost only iterates over non-missing values when computing split gains, making it efficient on sparse data (common in NLP and recommender systems).
## LightGBM and CatBoost

### LightGBM
LightGBM (Ke et al., 2017) introduced two algorithmic innovations and a different tree growth strategy that make it significantly faster than XGBoost on large datasets.
Leaf-wise (best-first) growth. XGBoost grows trees level-wise: all nodes at depth $d$ are split before moving to depth $d+1$. LightGBM grows leaf-wise: at each step, it splits the leaf with the largest gain reduction, regardless of depth. Leaf-wise growth produces more complex, asymmetric trees that can fit the same loss with fewer leaves. The risk is overfitting on small datasets, mitigated by `max_depth` and `num_leaves` constraints.
Gradient-based One-Side Sampling (GOSS). Not all training examples are equally informative. Examples with large gradients (under-fit by the current model) contribute more to the gain computation than examples with small gradients (already well-fit). GOSS keeps all examples with large gradients and randomly samples a fraction of examples with small gradients, up-weighting the sampled examples to maintain unbiased gradient estimates. This is stochastic gradient descent applied to the split-finding step.
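The sampling step can be sketched in a few lines. This is an illustrative simplification of GOSS (the function name and defaults are my own), keeping a top fraction $a$ by gradient magnitude and a random fraction $b$ of the rest, re-weighted by $(1-a)/b$:

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, rng=None):
    """GOSS sketch: keep the top a-fraction of examples by |gradient|;
    sample a b-fraction of the rest and up-weight them by (1 - a) / b
    so that weighted gradient sums remain approximately unbiased."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(grad)
    order = np.argsort(-np.abs(grad))
    top = order[: int(a * n)]                 # large-gradient (under-fit) examples
    rest = order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.concatenate([np.ones(len(top)),
                              np.full(len(sampled), (1 - a) / b)])
    return idx, weights

rng = np.random.default_rng(42)
grad = rng.normal(0, 1, 10_000)
idx, w = goss_sample(grad, rng=rng)
print(len(idx), grad.sum(), (grad[idx] * w).sum())
```

Only 30% of the examples survive, yet the total effective weight equals the original sample size, so histogram statistics built from `(idx, w)` approximate those of the full data.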
Exclusive Feature Bundling (EFB). In high-dimensional sparse data, many features are mutually exclusive (rarely nonzero simultaneously). EFB bundles such features into single features by adding offsets, reducing the effective number of features and the cost of histogram construction. This is particularly effective for one-hot encoded categoricals.
### CatBoost

CatBoost (Prokhorenkova et al., 2018) addresses a subtle but important source of overfitting: target leakage in categorical feature encoding.

The problem. Standard approaches compute target statistics (e.g., mean target per category) using the entire training set, then use these statistics as features. But the target value of example $i$ was used to compute the statistic that example $i$ then sees as a feature — this is target leakage, and it biases the model toward overfitting on frequent categories.

Ordered boosting. CatBoost uses a random permutation of training examples. For example $i$ in the permutation, target statistics are computed using only examples that precede it in the permutation. This eliminates target leakage at the cost of increased variance, which CatBoost addresses by averaging over multiple permutations.
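The ordered statistic is simple to sketch. This is an illustrative single-permutation version with basic prior smoothing (names and the smoothing form are my own, not CatBoost's exact implementation):

```python
import numpy as np

def ordered_target_stats(cats, y, prior, rng=None):
    """Ordered target statistic sketch: encode each example's category
    using only examples that PRECEDE it in a random permutation,
    smoothed toward a prior -- its own target never leaks in."""
    if rng is None:
        rng = np.random.default_rng(0)
    perm = rng.permutation(len(cats))
    sums, counts = {}, {}
    enc = np.empty(len(cats))
    for i in perm:                        # walk the permutation in order
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + prior) / (n + 1)    # statistic from the "past" only
        sums[c] = s + y[i]
        counts[c] = n + 1
    return enc

cats = np.array(["a", "a", "b", "a", "b", "b"])
y = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
enc = ordered_target_stats(cats, y, prior=y.mean())
print(enc)   # no encoding uses the example's own target
```

The first example visited in the permutation has seen nothing and receives the pure prior; later examples of the same category converge toward the category mean.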
Oblivious trees. CatBoost defaults to symmetric (oblivious) decision trees: at each depth level, the same split condition is applied to all nodes. This acts as strong regularization and enables efficient prediction via bit manipulation.
| Framework | Tree growth | Key innovation | Best for |
|---|---|---|---|
| XGBoost | Level-wise | Second-order splits, regularized objective | General-purpose, structured data |
| LightGBM | Leaf-wise | GOSS + EFB | Large datasets, high dimensionality |
| CatBoost | Symmetric | Ordered boosting | Categorical-heavy data |
## Regularization
Gradient boosting is a greedy procedure with no built-in capacity constraint (you can always add more trees). Regularization is not optional — it is what makes the method work.
Shrinkage (learning rate $\nu$). Scale each tree's contribution by $\nu$. Smaller $\nu$ requires more trees but produces better generalization. The intuition: small steps in function space allow later trees to correct mistakes of earlier trees more effectively. A common heuristic is to set $\nu$ between 0.01 and 0.1 and use early stopping to determine $M$.

Early stopping. Monitor validation loss and stop adding trees when it has not improved for a fixed number of consecutive rounds (the patience). This is the most important regularizer in practice — it directly controls the bias-variance tradeoff by limiting the number of boosting rounds.
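The stopping rule itself is framework-independent. A minimal sketch (not the XGBoost API; function name and the synthetic validation curve are my own) over a recorded validation-loss sequence:

```python
def early_stop_round(val_losses, patience=50):
    """Return the number of boosting rounds to keep: stop once the
    validation loss has not improved for `patience` consecutive rounds."""
    best, best_round = float("inf"), 0
    for m, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_round = loss, m
        elif m - best_round >= patience:
            break                      # patience exhausted
    return best_round

# synthetic curve: loss falls, then creeps back up past round 100 (overfitting)
losses = [1.0 / m + 0.0005 * max(0, m - 100) for m in range(1, 501)]
print(early_stop_round(losses, patience=50))   # stops at the round-100 minimum
```

Returning the best round (rather than the round where patience ran out) is what `early_stopping_rounds` in XGBoost effectively gives you via the best-iteration bookkeeping.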
Max depth. Limiting tree depth to $d$ constrains the interaction order of features. A depth-$d$ tree can model interactions between at most $d$ features. Typical values range from 4 to 8. Deeper trees reduce bias but increase variance and training time.

L1 and L2 on leaf weights. The $\lambda$ and $\alpha$ terms in the XGBoost objective shrink leaf weights toward zero. L2 ($\lambda$) smooths predictions across leaves; L1 ($\alpha$) encourages sparse leaf weights (some leaves predict exactly zero after regularization). In the gain formula, $\lambda$ appears in the denominator $H_j + \lambda$, damping the influence of leaves with low Hessian sums.

Minimum samples per leaf. In XGBoost, `min_child_weight` sets a threshold on the Hessian sum $H_j$. For squared error ($h_i = 1$ for all $i$), this is equivalent to a minimum sample count. For other losses where the Hessian varies, it is more nuanced — it requires minimum "confidence" rather than minimum count.
Subsampling. Row subsampling (subsample) and column subsampling (colsample_bytree, colsample_bylevel) inject stochasticity. This decorrelates the trees in the ensemble and reduces variance, similar in spirit to the random feature selection in random forests.
## SHAP for Interpretability

A gradient boosting model with 500 trees of depth 8 has on the order of $500 \times 2^8 \approx 10^5$ leaves. Understanding what such a model has learned requires a principled attribution method.
### Shapley Values

SHAP (SHapley Additive exPlanations) applies Shapley values from cooperative game theory to machine learning predictions. For a prediction $f(x)$, the Shapley value of feature $i$ is the average marginal contribution of feature $i$ across all possible coalitions of features:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\left(|N| - |S| - 1\right)!}{|N|!} \left[ v(S \cup \{i\}) - v(S) \right]$$

where $N$ is the set of all features and $v(S)$ is the model's expected output conditioned on the features in $S$ being fixed. Computing this exactly requires $O(2^{|N|})$ evaluations — exponential in the number of features.
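For a handful of features the sum can be enumerated directly. The sketch below (an illustrative toy; the value function and coefficients are my own) computes exact Shapley values for a tiny additive model, where the attributions recover coefficient times feature deviation from the background mean:

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values for an n-player game by enumerating all
    coalitions per player -- feasible only for small n."""
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi.append(total)
    return phi

# toy additive model f(x) = 2*x0 + 3*x1; v(S) = f with features outside S
# replaced by their background means
x = [1.0, 1.0]          # instance to explain
mean = [0.0, 0.0]       # background expectation
def value(S):
    return sum((2, 3)[j] * (x[j] if j in S else mean[j]) for j in range(2))

phi = shapley_values(value, 2)
print(phi)   # [2.0, 3.0]: attributions sum to f(x) - E[f]
```

For an additive model the coalition ordering does not matter, so each feature simply receives its own term; interactions are what make the general weighted average necessary.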
### TreeSHAP

Lundberg et al. (2020) showed that for tree-based models, exact Shapley values can be computed in $O(TLD^2)$ time, where $T$ is the number of trees, $L$ is the maximum number of leaves, and $D$ is the maximum depth. The key insight is that tree structure constrains the coalitions that matter: a feature only interacts with features on its path from root to leaf.
TreeSHAP recursively tracks the proportion of all possible feature orderings that are consistent with each path through the tree. This avoids the exponential enumeration by exploiting the tree’s conditional independence structure.
### SHAP Properties
SHAP satisfies three axiomatic properties:
- Local accuracy. $f(x) = \phi_0 + \sum_{i=1}^{|N|} \phi_i$, where $\phi_0 = \mathbb{E}[f(x)]$. The attributions sum to the prediction.
- Missingness. If feature $i$ is missing (not used by the model), then $\phi_i = 0$.
- Consistency. If a model changes so that feature $i$'s marginal contribution increases or stays the same for all possible coalitions, then $\phi_i$ does not decrease.
These properties are uniquely satisfied by Shapley values (Shapley, 1953). No other attribution method satisfies all three.
### Application to Tracker Cost Prediction
In the Firefox tracker cost model, SHAP analysis revealed that domain_type_median — the historical median transfer size for each tracker domain — dominated all other features at 8.65% gain. This makes intuitive sense: a domain that historically serves 200KB scripts will probably continue to serve 200KB scripts. The SHAP dependence plot showed a near-linear relationship between domain_type_median and the SHAP value, with the model using other features (content type, domain frequency, compressed size) to adjust for deviations from the median.
## Hyperparameter Tuning
Gradient boosting has many interacting hyperparameters. Grid search is infeasible; random search is wasteful. Bayesian optimization (e.g., Optuna, Hyperopt) is the standard approach.
### Key Hyperparameters

| Parameter | Range | Effect |
|---|---|---|
| `n_estimators` | 100–5000 | Number of boosting rounds. Use early stopping. |
| `learning_rate` ($\nu$) | 0.01–0.3 | Step size. Smaller = more rounds needed, better generalization. |
| `max_depth` | 3–10 | Maximum tree depth. Controls interaction order. |
| `min_child_weight` | 1–100 | Minimum Hessian sum per leaf. Prevents splits on small groups. |
| `subsample` | 0.5–1.0 | Row subsampling fraction. |
| `colsample_bytree` | 0.5–1.0 | Feature subsampling per tree. |
| `lambda` (L2 reg) | 0–10 | L2 penalty on leaf weights. |
| `alpha` (L1 reg) | 0–10 | L1 penalty on leaf weights. |
| `gamma` | 0–5 | Minimum gain for a split. |
### Bayesian Optimization with Optuna

Optuna uses a Tree-structured Parzen Estimator (TPE) to model the relationship between hyperparameters and validation loss. At each trial, TPE fits two density estimators — one over hyperparameter configurations that produced good results ($l(x)$) and one over configurations that produced bad results ($g(x)$) — and selects the next configuration by maximizing the ratio $l(x)/g(x)$.
The typical setup:

```python
import optuna
import xgboost as xgb

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 100),
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'objective': 'reg:tweedie',
        'tweedie_variance_power': 1.5,
    }
    # dtrain: a prepared xgb.DMatrix of training features and targets
    cv_results = xgb.cv(params, dtrain, num_boost_round=1000,
                        nfold=5, early_stopping_rounds=50,
                        metrics='mae')
    return cv_results['test-mae-mean'].min()

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=40)
```

With 40 trials and 5-fold cross-validation, this explores the hyperparameter space efficiently. The log-scale sampling for `learning_rate` and `lambda` reflects their multiplicative effect on model behavior.
## Connection to the Tracker Cost Model
The Firefox Content Blocking team needed to estimate the network cost of third-party trackers blocked before they load. The challenge: predicting transfer sizes for resources that are never fetched, using only metadata available at block time (domain, content type, protocol, etc.).
### Architecture

XGBoost with Tweedie loss ($p = 1.5$), trained on observed (non-blocked) tracker transfers and applied to blocked ones. The key configuration:
| Parameter | Value |
|---|---|
| `n_estimators` | 500 |
| `max_depth` | 8 |
| `learning_rate` | 0.05 |
| `objective` | `reg:tweedie` |
| `tweedie_variance_power` | 1.5 |
| `subsample` | 0.8 |
| Tuning method | Optuna, 40 trials, 5-fold CV |
### Results
| Model | MAE (bytes) | Improvement over LUT |
|---|---|---|
| XGBoost + Tweedie | 3,466 | 47.5% |
| XGBoost + Squared Error | 4,074 | 38.3% |
| XGBoost + Huber | 3,997 | 39.4% |
| Lookup Table (baseline) | 6,601 | — |
The loss function choice produced a 23% gap in MAE between Tweedie and squared error on the same architecture. This is a larger effect than any hyperparameter or feature engineering decision. The reason: Tweedie’s variance scaling matches the data-generating process. Tracker transfer sizes are not normally distributed — they are zero-inflated and right-skewed, with beacons at 0 bytes and scripts exceeding 1MB. Squared error wastes model capacity chasing large absolute errors on heavy scripts; Tweedie allocates capacity proportionally.
### Aggregation
The per-request MAE of 3,466 bytes sounds large, but the model was evaluated on aggregate accuracy at the page level. Across 7,376 test pages, predictions of total tracker cost per page had a median error of 2.3%. Individual errors cancel under aggregation — a consequence of the law of large numbers when errors are approximately unbiased. This is the production-relevant metric: Firefox reports total tracker cost savings per page, not per-request estimates.
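The cancellation effect is easy to reproduce in simulation. The numbers below are synthetic (Gamma-distributed sizes and unbiased Gaussian per-request errors of my choosing, not the Firefox data), but the mechanism is the same:

```python
import numpy as np

rng = np.random.default_rng(7)

# many pages, each with many requests; per-request errors are unbiased
n_pages, requests_per_page = 1000, 40
true_sizes = rng.gamma(shape=0.5, scale=20_000, size=(n_pages, requests_per_page))
errors = rng.normal(0, 5_000, size=true_sizes.shape)
pred = true_sizes + errors

per_request_mae = np.abs(errors).mean()
page_totals = true_sizes.sum(axis=1)
page_rel_err = np.abs(pred.sum(axis=1) - page_totals) / page_totals

print(per_request_mae)           # thousands of bytes per request
print(np.median(page_rel_err))   # small relative error at the page level
```

Per-request errors of several KB shrink to a few percent relative error once summed over a page, because the page-total error grows like $\sqrt{k}$ while the page total itself grows like $k$.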
The model runs in the Nimbus experimentation platform and serves cost estimates in the Enhanced Tracking Protection panel for approximately 250 million Firefox users.