Bias-Variance Tradeoff and Bagging

The bias-variance decomposition provides a theoretical framework for understanding generalization error. Bagging exploits this decomposition by reducing variance through ensemble averaging, and random forests extend bagging with feature randomization to decorrelate ensemble members.


The Bias-Variance Decomposition

Consider a regression problem where we observe data generated by $y = f(\mathbf{x}) + \epsilon$, with $\epsilon$ a zero-mean noise term of variance $\sigma^2$ representing irreducible noise. A learning algorithm $\mathcal{A}$ trained on a dataset $\mathcal{D}$ produces a hypothesis $\hat{f}_\mathcal{D}(\mathbf{x})$. The expected prediction error at a point $\mathbf{x}$ decomposes as:

$$\mathbb{E}_\mathcal{D}\left[(y - \hat{f}_\mathcal{D}(\mathbf{x}))^2\right] = \underbrace{\left(f(\mathbf{x}) - \mathbb{E}_\mathcal{D}[\hat{f}_\mathcal{D}(\mathbf{x})]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\mathcal{D}\left[\left(\hat{f}_\mathcal{D}(\mathbf{x}) - \mathbb{E}_\mathcal{D}[\hat{f}_\mathcal{D}(\mathbf{x})]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$

The expectation is over random training sets $\mathcal{D}$ drawn from the same distribution.

Bias measures how far the average prediction (across all possible training sets) is from the true function. High bias indicates that the model class is too restrictive to capture the true relationship. A linear model fit to a quadratic function has high bias.

Variance measures how much the prediction varies across different training sets. High variance indicates that the model is too sensitive to the specific training examples it saw. A high-degree polynomial fit to a small dataset has high variance.

Irreducible noise $\sigma^2$ sets the floor. No model can achieve lower error than this.
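The decomposition can be estimated empirically by refitting a model on many resampled training sets. A minimal sketch, assuming a hypothetical true function $f(x) = \sin(2\pi x)$ and decision trees of two depths as the model classes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed "true" function for the demo
sigma = 0.3                           # noise standard deviation
x_grid = np.linspace(0, 1, 200)[:, None]

def bias_variance(make_model, n_datasets=200, n=50):
    """Estimate bias^2 and variance by refitting on fresh training sets."""
    preds = np.empty((n_datasets, len(x_grid)))
    for d in range(n_datasets):
        x = rng.uniform(0, 1, (n, 1))
        y = f(x).ravel() + rng.normal(0, sigma, n)
        preds[d] = make_model().fit(x, y).predict(x_grid)
    bias2 = (preds.mean(axis=0) - f(x_grid).ravel()) ** 2
    return bias2.mean(), preds.var(axis=0).mean()

# A depth-1 stump is too restrictive (high bias); an unpruned tree is too
# sensitive to the particular sample it saw (high variance).
for depth in (1, None):
    b2, var = bias_variance(lambda: DecisionTreeRegressor(max_depth=depth))
    print(f"max_depth={depth}: bias^2={b2:.4f}, variance={var:.4f}")
```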

Derivation

Expanding the squared error:

$$\mathbb{E}_\mathcal{D}[(y - \hat{f})^2] = \mathbb{E}_\mathcal{D}[(f + \epsilon - \hat{f})^2] = \mathbb{E}_\mathcal{D}[(f - \hat{f})^2] + 2\,\mathbb{E}_\mathcal{D}[(f - \hat{f})\epsilon] + \mathbb{E}[\epsilon^2]$$

Since $\epsilon$ is independent of $\hat{f}$ and has zero mean, the cross term vanishes, and $\mathbb{E}[\epsilon^2] = \sigma^2$. For the first term, let $\bar{f} = \mathbb{E}_\mathcal{D}[\hat{f}]$:

$$\mathbb{E}_\mathcal{D}[(f - \hat{f})^2] = \mathbb{E}_\mathcal{D}[(f - \bar{f} + \bar{f} - \hat{f})^2] = (f - \bar{f})^2 + 2(f - \bar{f})\,\mathbb{E}_\mathcal{D}[\bar{f} - \hat{f}] + \mathbb{E}_\mathcal{D}[(\bar{f} - \hat{f})^2]$$

The middle term is zero since $\mathbb{E}_\mathcal{D}[\bar{f} - \hat{f}] = 0$. The remaining terms are the squared bias and the variance, which yields the decomposition.


The Tradeoff in Practice

Property          Low Complexity       High Complexity
Bias              High                 Low
Variance          Low                  High
Training error    High                 Low
Test error        Depends              Depends
Example           Linear regression    Deep decision tree

Model complexity controls the tradeoff. As complexity increases (more parameters, higher polynomial degree, deeper trees), bias decreases but variance increases. The optimal complexity minimizes total test error.

The classical view (Geman et al., 1992) posits a U-shaped test error curve: underfitting on the left (high bias), overfitting on the right (high variance), with an optimal point in between.
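The U-curve is easy to reproduce with polynomial regression. A minimal sketch, assuming a hypothetical cubic true function and sweeping the polynomial degree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
f = lambda x: x**3 - x                 # assumed true function for the demo
x_tr = rng.uniform(-1.5, 1.5, (30, 1))
y_tr = f(x_tr).ravel() + rng.normal(0, 0.2, 30)
x_te = rng.uniform(-1.5, 1.5, (500, 1))
y_te = f(x_te).ravel() + rng.normal(0, 0.2, 500)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(x_tr))
    te = mean_squared_error(y_te, model.predict(x_te))
    print(f"degree={degree:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
# Typical pattern: degree 1 underfits (both errors high), degree 3 sits near
# the bottom of the U, degree 15 overfits (train error low, test error high).
```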

The modern view (Belkin et al., 2019) observes that highly overparameterized models (neural networks with far more parameters than training examples) can achieve low test error despite interpolating the training data. This double descent phenomenon suggests that the classical U-curve is incomplete: beyond the interpolation threshold, test error can decrease again as model capacity continues to grow.


Bagging: Bootstrap Aggregating

Bagging (Breiman, 1996) reduces variance by training multiple models on bootstrap samples and averaging their predictions.

The Bootstrap

A bootstrap sample $\mathcal{D}_b^*$ is drawn from $\mathcal{D}$ by sampling $N$ examples with replacement. Each bootstrap sample:

  • Contains approximately $1 - 1/e \approx 63.2\%$ of the unique training examples
  • Leaves approximately $36.8\%$ as out-of-bag (OOB) examples
  • Has the same size $N$ as the original dataset

The 63.2% figure follows because each example is absent from a given sample with probability $(1 - 1/N)^N \to e^{-1} \approx 0.368$; the sketch below checks this empirically.
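A minimal numerical check of the unique-example fraction:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
unique_fracs = [
    len(np.unique(rng.integers(0, N, size=N))) / N   # one bootstrap sample
    for _ in range(100)
]
print(f"mean unique fraction: {np.mean(unique_fracs):.4f}")  # ~0.632
print(f"1 - 1/e             : {1 - 1/np.e:.4f}")
```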

The Bagging Algorithm

  1. Draw $B$ bootstrap samples $\mathcal{D}_1^*, \ldots, \mathcal{D}_B^*$
  2. Train a base model $\hat{f}_b$ on each $\mathcal{D}_b^*$
  3. Aggregate predictions (a from-scratch sketch follows this list):
    • Regression: $\hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(\mathbf{x})$
    • Classification: $\hat{f}_{\text{bag}}(\mathbf{x}) = \text{mode}\{\hat{f}_1(\mathbf{x}), \ldots, \hat{f}_B(\mathbf{x})\}$
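A minimal from-scratch sketch of the regression case; scikit-learn's BaggingRegressor provides the production version of the same idea:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleBaggedTrees:
    """Bagged regression trees: bootstrap sample + average, nothing more."""

    def __init__(self, n_estimators=100, seed=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n = len(X)
        self.trees_ = []
        for _ in range(self.n_estimators):
            idx = self.rng.integers(0, n, size=n)   # step 1: bootstrap sample
            self.trees_.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Step 3, regression case: average the B tree predictions
        return np.mean([t.predict(X) for t in self.trees_], axis=0)
```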

Why Bagging Reduces Variance

For $B$ models with predictions $\hat{f}_1, \ldots, \hat{f}_B$, each with variance $\sigma^2$ and pairwise correlation $\rho$:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^B \hat{f}_b\right) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2$$

As $B \to \infty$, the variance approaches $\rho \sigma^2$. If the models were independent ($\rho = 0$), variance would vanish. In practice, bootstrap samples overlap (roughly 63.2% shared data), so $\rho > 0$ and variance reduction is bounded. The key insight: reducing correlation between ensemble members is as important as increasing ensemble size.
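A minimal Monte Carlo check of this formula, assuming equicorrelated Gaussian predictions as a stand-in for the ensemble members:

```python
import numpy as np

rng = np.random.default_rng(3)
B, rho, sigma = 50, 0.3, 1.0
n_trials = 100_000

# Equicorrelated predictions: a shared component plus an independent component
# gives Var = sigma^2 and pairwise correlation rho by construction.
shared = rng.normal(size=(n_trials, 1))
indep = rng.normal(size=(n_trials, B))
preds = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * indep)

print(f"empirical Var of ensemble mean: {preds.mean(axis=1).var():.4f}")
print(f"rho*s^2 + (1-rho)*s^2/B:        "
      f"{rho * sigma**2 + (1 - rho) * sigma**2 / B:.4f}")   # 0.3 + 0.7/50
```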

Out-of-Bag Estimation

Each training example $(\mathbf{x}^{(i)}, y^{(i)})$ appears in approximately $63.2\%$ of bootstrap samples. The OOB prediction uses only the models that did not train on example $i$:

$$\hat{f}_{\text{OOB}}(\mathbf{x}^{(i)}) = \frac{1}{|\{b : i \notin \mathcal{D}_b^*\}|}\sum_{b : i \notin \mathcal{D}_b^*} \hat{f}_b(\mathbf{x}^{(i)})$$

The OOB error is a nearly unbiased estimate of test error, computed without needing a separate validation set. This is particularly valuable when data is scarce.
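In scikit-learn, OOB estimation is a single constructor flag. A minimal sketch on synthetic data (make_regression is just a stand-in dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
model.fit(X, y)
print(f"OOB R^2: {model.oob_score_:.3f}")  # test-error estimate, no held-out set
```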


Random Forests

Random forests (Breiman, 2001) extend bagging with feature randomization: at each split in each tree, only a random subset of $m$ features is considered as candidates.

Algorithm

For each tree $b = 1, \ldots, B$:

  1. Draw a bootstrap sample $\mathcal{D}_b^*$
  2. Grow a decision tree, but at each split:
    • Sample $m$ features uniformly at random from the $D$ total features
    • Find the best split among only these $m$ features
    • Split the node
  3. Grow the tree to full depth, with no pruning (see the sketch after this list)
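Because scikit-learn's DecisionTreeRegressor already exposes per-split feature subsampling via max_features, the algorithm reduces to a few lines. A minimal sketch; RandomForestRegressor does this, plus bookkeeping, internally:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, B=100, m="sqrt", seed=0):
    """Steps 1-3 above: bootstrap, m features per split, full-depth trees."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))     # 1. bootstrap sample
        tree = DecisionTreeRegressor(max_features=m)   # 2. m features per split
        trees.append(tree.fit(X[idx], y[idx]))         # 3. grown to full depth
    return trees

def predict_forest(trees, X):
    # Regression aggregation: average the per-tree predictions
    return np.mean([t.predict(X) for t in trees], axis=0)
```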

Feature Subset Size

The hyperparameter $m$ controls the bias-variance tradeoff within the ensemble:

Setting                    m      Effect
Bagging                    D      Maximum signal per tree, maximum correlation between trees
Default (classification)   √D     Good balance for most problems
Default (regression)       D/3    Standard recommendation
Extreme                    1      Minimum correlation, maximum bias per tree

Reducing $m$ increases the bias of individual trees (each tree sees less information) but decreases the correlation $\rho$ between trees. The variance reduction from decorrelation typically outweighs the bias increase, yielding lower ensemble error.
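The effect of $m$ can be seen directly by sweeping max_features on a synthetic task (hypothetical data; the best setting is problem dependent):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           random_state=0)

# 1 feature, sqrt(D), D/3 (as a fraction), and all D features
for m in [1, "sqrt", 0.33, None]:
    model = RandomForestClassifier(n_estimators=200, max_features=m,
                                   random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_features={m!r:8}  accuracy: {score:.3f}")
```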

Why Random Forests Work

Consider the ensemble variance formula $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$:

  • Bagging alone makes $B$ large, reducing the second term, but $\rho$ remains high because all trees see the same dominant features and make similar splits.
  • Feature randomization reduces $\rho$ by forcing different trees to discover different predictive patterns. A strong feature that would dominate every tree's root split is excluded from $\approx 1 - m/D$ of all split decisions.

The result is an ensemble where individual members are weak but collectively strong, and their errors are sufficiently uncorrelated to cancel in aggregation.
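The decorrelation is measurable: compare the average pairwise correlation of per-tree predictions under bagging versus feature randomization. A minimal sketch on synthetic data (exact numbers will vary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1500, n_features=30, noise=20.0, random_state=0)
X_tr, X_te, y_tr, _ = train_test_split(X, y, random_state=0)

for label, m in [("bagging (m = D)", None), ("random forest (m = sqrt(D))", "sqrt")]:
    forest = RandomForestRegressor(n_estimators=100, max_features=m,
                                   random_state=0).fit(X_tr, y_tr)
    # Per-tree predictions on held-out points, shape (n_trees, n_test)
    per_tree = np.array([t.predict(X_te) for t in forest.estimators_])
    corr = np.corrcoef(per_tree)
    mean_rho = corr[np.triu_indices_from(corr, k=1)].mean()
    print(f"{label:28s} mean pairwise tree correlation: {mean_rho:.3f}")
```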

Feature Importance

Random forests provide two measures of feature importance:

Mean Decrease in Impurity (MDI). Sum the impurity reduction (Gini or MSE) from all splits on feature $j$, averaged across trees. Fast to compute but biased toward high-cardinality features.

Permutation importance. For each feature $j$, randomly permute its values in the OOB data and measure the increase in OOB error. Unbiased but slower. This is the more reliable measure and directly estimates how much predictive information the feature carries.
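Both measures are available in scikit-learn. A minimal sketch; note that sklearn's permutation_importance uses a held-out set here rather than OOB data, a common approximation of the same idea:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

mdi = model.feature_importances_                   # Mean Decrease in Impurity
perm = permutation_importance(model, X_te, y_te,   # permutation importance
                              n_repeats=10, random_state=0)
for j in range(X.shape[1]):
    print(f"feature {j}: MDI={mdi[j]:.3f}  "
          f"permutation={perm.importances_mean[j]:.3f}")
```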

For production ML applications, SHAP (SHapley Additive exPlanations) values provide a theoretically grounded alternative based on cooperative game theory, decomposing each prediction into per-feature contributions. SHAP values satisfy local accuracy, missingness, and consistency axioms that MDI and permutation importance do not.


Practical Considerations

Number of trees. Random forest performance improves with $B$ and converges; adding more trees cannot cause overfitting. In practice, 100 to 500 trees suffice for most problems; beyond this, gains are marginal. The OOB error curve typically plateaus and can be used to determine when adding more trees is no longer beneficial.
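A minimal sketch of tracking OOB error as trees are added, using scikit-learn's warm_start to grow the forest incrementally rather than retraining from scratch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=25, warm_start=True, oob_score=True,
                              random_state=0)
for n in range(25, 501, 25):
    model.set_params(n_estimators=n)   # warm_start: only the new trees are fit
    model.fit(X, y)
    print(f"B={n:3d}  OOB R^2: {model.oob_score_:.4f}")  # watch for the plateau
```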

Tree depth. Unlike standalone decision trees, random forest trees are grown to full depth, until each leaf contains a single example or reaches a minimum node size, with no pruning. The ensemble averaging absorbs the variance that would make a single deep tree overfit.

Computational cost. Training a single tree costs roughly $O(m \cdot N \log N)$ ($m$ candidate features evaluated at each of $O(\log N)$ depth levels over $N$ examples), so $O(B \cdot m \cdot N \log N)$ for the whole forest. Trees are independent and trivially parallelizable.

When random forests excel. Tabular data with moderate dimensionality, heterogeneous features (mix of continuous and categorical), and no requirement for GPU infrastructure. Random forests remain highly competitive with gradient boosting methods on many tabular benchmarks, particularly when hyperparameter tuning budget is limited.

When to prefer gradient boosting. When maximum predictive accuracy is required and tuning budget is available. Gradient boosting (covered in the next article) builds trees sequentially, with each tree correcting the errors of the ensemble so far, enabling it to achieve lower bias than random forests at the cost of higher tuning sensitivity.


Connection to the Tracker Cost Model

The tracker cost estimation model evaluated random forests alongside gradient boosted trees. Random Forest achieved MAE 4,675 with Spearman $\rho = 0.913$, competitive with several gradient boosting variants despite no hyperparameter tuning. XGBoost with Tweedie loss (MAE 3,466, $\rho = 0.945$) outperformed it, primarily due to Tweedie's superior handling of the zero-inflated target distribution rather than architectural advantages. The roughly 26% gap in MAE is attributable to the loss function: random forests average leaf predictions, which naturally handles zero-inflation through leaf-level mixture, but cannot match Tweedie's magnitude-proportional error weighting.


Summary

Concept                        Key Insight
Bias-variance decomposition    Test error = bias² + variance + noise; model complexity trades between the first two
Bagging                        Averaging B models reduces variance by a factor of ≈1/B (limited by correlation)
Bootstrap                      Sampling with replacement produces diverse training sets; OOB provides free validation
Random forests                 Feature randomization decorrelates trees, enabling further variance reduction
Feature importance             Permutation importance and SHAP provide reliable feature attribution

The bias-variance framework explains when and why ensemble methods work. Bagging reduces variance by averaging; random forests reduce it further by decorrelating. Gradient boosting, covered next, takes the complementary approach: reducing bias by sequential correction.