Bias-Variance Tradeoff and Bagging

The bias-variance decomposition provides a theoretical framework for understanding generalization error. Bagging exploits this decomposition by reducing variance through ensemble averaging, and random forests extend bagging with feature randomization to decorrelate ensemble members.


The Bias-Variance Decomposition

Consider a regression problem where we observe data generated by $y = f(\mathbf{x}) + \epsilon$, with $\epsilon$ a zero-mean noise term of variance $\sigma^2$ representing irreducible noise. A learning algorithm $\mathcal{A}$ trained on a dataset $\mathcal{D}$ produces a hypothesis $\hat{f}_\mathcal{D}(\mathbf{x})$. The expected prediction error at a point $\mathbf{x}$ decomposes as:

$$\mathbb{E}_\mathcal{D}\left[(y - \hat{f}_\mathcal{D}(\mathbf{x}))^2\right] = \underbrace{\left(f(\mathbf{x}) - \mathbb{E}_\mathcal{D}[\hat{f}_\mathcal{D}(\mathbf{x})]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\mathcal{D}\left[\left(\hat{f}_\mathcal{D}(\mathbf{x}) - \mathbb{E}_\mathcal{D}[\hat{f}_\mathcal{D}(\mathbf{x})]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$

The expectation is over random training sets $\mathcal{D}$ drawn from the same distribution.

Bias measures how far the average prediction (across all possible training sets) is from the true function. High bias indicates that the model class is too restrictive to capture the true relationship. A linear model fit to a quadratic function has high bias.

Variance measures how much the prediction varies across different training sets. High variance indicates that the model is too sensitive to the specific training examples it saw. A high-degree polynomial fit to a small dataset has high variance.

Irreducible noise $\sigma^2$ sets the floor. No model can achieve lower error than this.
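The decomposition can be estimated empirically by refitting a model on many resampled training sets. A minimal sketch, assuming a hypothetical true function $f(x) = \sin(2\pi x)$ and decision trees of two depths as the model classes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed "true" function for the demo
sigma = 0.3                           # noise standard deviation
x_grid = np.linspace(0, 1, 200)[:, None]

def bias_variance(make_model, n_datasets=200, n=50):
    """Estimate bias^2 and variance by refitting on fresh training sets."""
    preds = np.empty((n_datasets, len(x_grid)))
    for d in range(n_datasets):
        x = rng.uniform(0, 1, (n, 1))
        y = f(x).ravel() + rng.normal(0, sigma, n)
        preds[d] = make_model().fit(x, y).predict(x_grid)
    bias2 = (preds.mean(axis=0) - f(x_grid).ravel()) ** 2
    return bias2.mean(), preds.var(axis=0).mean()

# A depth-1 stump is too restrictive (high bias); an unpruned tree is too
# sensitive to the particular sample it saw (high variance).
for depth in (1, None):
    b2, var = bias_variance(lambda: DecisionTreeRegressor(max_depth=depth))
    print(f"max_depth={depth}: bias^2={b2:.4f}, variance={var:.4f}")
```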

Derivation

Expanding the squared error:

$$\mathbb{E}_\mathcal{D}[(y - \hat{f})^2] = \mathbb{E}_\mathcal{D}[(f + \epsilon - \hat{f})^2] = \mathbb{E}_\mathcal{D}[(f - \hat{f})^2] + 2\,\mathbb{E}_\mathcal{D}[(f - \hat{f})\epsilon] + \mathbb{E}[\epsilon^2]$$

Since $\epsilon$ is independent of $\hat{f}$ and has zero mean, the cross term vanishes, and $\mathbb{E}[\epsilon^2] = \sigma^2$. For the first term, let $\bar{f} = \mathbb{E}_\mathcal{D}[\hat{f}]$:

$$\mathbb{E}_\mathcal{D}[(f - \hat{f})^2] = \mathbb{E}_\mathcal{D}[(f - \bar{f} + \bar{f} - \hat{f})^2] = (f - \bar{f})^2 + 2(f - \bar{f})\,\mathbb{E}_\mathcal{D}[\bar{f} - \hat{f}] + \mathbb{E}_\mathcal{D}[(\bar{f} - \hat{f})^2]$$

The middle term is zero since $\mathbb{E}_\mathcal{D}[\bar{f} - \hat{f}] = 0$. The remaining terms are the squared bias and the variance, which yields the decomposition.


The Tradeoff in Practice

Property          Low Complexity       High Complexity
Bias              High                 Low
Variance          Low                  High
Training error    High                 Low
Test error        Depends              Depends
Example           Linear regression    Deep decision tree

Model complexity controls the tradeoff. As complexity increases (more parameters, higher polynomial degree, deeper trees), bias decreases but variance increases. The optimal complexity minimizes total test error.

The classical view (Geman et al., 1992) posits a U-shaped test error curve: underfitting on the left (high bias), overfitting on the right (high variance), with an optimal point in between.
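The U-curve is easy to reproduce with polynomial regression. A minimal sketch, assuming a hypothetical cubic true function and sweeping the polynomial degree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
f = lambda x: x**3 - x                 # assumed true function for the demo
x_tr = rng.uniform(-1.5, 1.5, (30, 1))
y_tr = f(x_tr).ravel() + rng.normal(0, 0.2, 30)
x_te = rng.uniform(-1.5, 1.5, (500, 1))
y_te = f(x_te).ravel() + rng.normal(0, 0.2, 500)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(x_tr))
    te = mean_squared_error(y_te, model.predict(x_te))
    print(f"degree={degree:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
# Typical pattern: degree 1 underfits (both errors high), degree 3 sits near
# the bottom of the U, degree 15 overfits (train error low, test error high).
```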

The modern view (Belkin et al., 2019) observes that highly overparameterized models (neural networks with far more parameters than training examples) can achieve low test error despite interpolating the training data. This double descent phenomenon suggests that the classical U-curve is incomplete: beyond the interpolation threshold, test error can decrease again as model capacity continues to grow.


Bagging: Bootstrap Aggregating

Bagging (Breiman, 1996) reduces variance by training multiple models on bootstrap samples and averaging their predictions.

The Bootstrap

A bootstrap sample $\mathcal{D}_b^*$ is drawn from $\mathcal{D}$ by sampling $N$ examples with replacement. Each bootstrap sample:

  • Contains approximately $1 - 1/e \approx 63.2\%$ of the unique training examples
  • Leaves approximately $36.8\%$ as out-of-bag (OOB) examples
  • Has the same size $N$ as the original dataset

The 63.2% figure follows because each example is absent from a given sample with probability $(1 - 1/N)^N \to e^{-1} \approx 0.368$; the sketch below checks this empirically.
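A minimal numerical check of the unique-example fraction:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
unique_fracs = [
    len(np.unique(rng.integers(0, N, size=N))) / N   # one bootstrap sample
    for _ in range(100)
]
print(f"mean unique fraction: {np.mean(unique_fracs):.4f}")  # ~0.632
print(f"1 - 1/e             : {1 - 1/np.e:.4f}")
```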

The Bagging Algorithm

  1. Draw $B$ bootstrap samples $\mathcal{D}_1^*, \ldots, \mathcal{D}_B^*$
  2. Train a base model $\hat{f}_b$ on each $\mathcal{D}_b^*$
  3. Aggregate predictions (a from-scratch sketch follows this list):
    • Regression: $\hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(\mathbf{x})$
    • Classification: $\hat{f}_{\text{bag}}(\mathbf{x}) = \text{mode}\{\hat{f}_1(\mathbf{x}), \ldots, \hat{f}_B(\mathbf{x})\}$
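A minimal from-scratch sketch of the regression case; scikit-learn's BaggingRegressor provides the production version of the same idea:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleBaggedTrees:
    """Bagged regression trees: bootstrap sample + average, nothing more."""

    def __init__(self, n_estimators=100, seed=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n = len(X)
        self.trees_ = []
        for _ in range(self.n_estimators):
            idx = self.rng.integers(0, n, size=n)   # step 1: bootstrap sample
            self.trees_.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Step 3, regression case: average the B tree predictions
        return np.mean([t.predict(X) for t in self.trees_], axis=0)
```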

Why Bagging Reduces Variance

For $B$ models with predictions $\hat{f}_1, \ldots, \hat{f}_B$, each with variance $\sigma^2$ and pairwise correlation $\rho$:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^B \hat{f}_b\right) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2$$

As $B \to \infty$, the variance approaches $\rho \sigma^2$. If the models were independent ($\rho = 0$), variance would vanish. In practice, bootstrap samples overlap (roughly 63.2% shared data), so $\rho > 0$ and variance reduction is bounded. The key insight: reducing correlation between ensemble members is as important as increasing ensemble size.
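A minimal Monte Carlo check of this formula, assuming equicorrelated Gaussian predictions as a stand-in for the ensemble members:

```python
import numpy as np

rng = np.random.default_rng(3)
B, rho, sigma = 50, 0.3, 1.0
n_trials = 100_000

# Equicorrelated predictions: a shared component plus an independent component
# gives Var = sigma^2 and pairwise correlation rho by construction.
shared = rng.normal(size=(n_trials, 1))
indep = rng.normal(size=(n_trials, B))
preds = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * indep)

print(f"empirical Var of ensemble mean: {preds.mean(axis=1).var():.4f}")
print(f"rho*s^2 + (1-rho)*s^2/B:        "
      f"{rho * sigma**2 + (1 - rho) * sigma**2 / B:.4f}")   # 0.3 + 0.7/50
```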

Out-of-Bag Estimation

Each training example $(\mathbf{x}^{(i)}, y^{(i)})$ appears in approximately $63.2\%$ of bootstrap samples. The OOB prediction uses only the models that did not train on example $i$:

$$\hat{f}_{\text{OOB}}(\mathbf{x}^{(i)}) = \frac{1}{|\{b : i \notin \mathcal{D}_b^*\}|}\sum_{b : i \notin \mathcal{D}_b^*} \hat{f}_b(\mathbf{x}^{(i)})$$

The OOB error is a nearly unbiased estimate of test error, computed without needing a separate validation set. This is particularly valuable when data is scarce.
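In scikit-learn, OOB estimation is a single constructor flag. A minimal sketch on synthetic data (make_regression is just a stand-in dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
model.fit(X, y)
print(f"OOB R^2: {model.oob_score_:.3f}")  # test-error estimate, no held-out set
```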


Random Forests

Random forests (Breiman, 2001) extend bagging with feature randomization: at each split in each tree, only a random subset of $m$ features is considered as candidates.

Algorithm

For each tree $b = 1, \ldots, B$:

  1. Draw a bootstrap sample $\mathcal{D}_b^*$
  2. Grow a decision tree, but at each split:
    • Sample $m$ features uniformly at random from the $D$ total features
    • Find the best split among only these $m$ features
    • Split the node
  3. Grow the tree to full depth, with no pruning (see the sketch after this list)
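Because scikit-learn's DecisionTreeRegressor already exposes per-split feature subsampling via max_features, the algorithm reduces to a few lines. A minimal sketch; RandomForestRegressor does this, plus bookkeeping, internally:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, B=100, m="sqrt", seed=0):
    """Steps 1-3 above: bootstrap, m features per split, full-depth trees."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))     # 1. bootstrap sample
        tree = DecisionTreeRegressor(max_features=m)   # 2. m features per split
        trees.append(tree.fit(X[idx], y[idx]))         # 3. grown to full depth
    return trees

def predict_forest(trees, X):
    # Regression aggregation: average the per-tree predictions
    return np.mean([t.predict(X) for t in trees], axis=0)
```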

Feature Subset Size

The hyperparameter $m$ controls the bias-variance tradeoff within the ensemble:

Setting                    m      Effect
Bagging                    D      Maximum signal per tree, maximum correlation between trees
Default (classification)   √D     Good balance for most problems
Default (regression)       D/3    Standard recommendation
Extreme                    1      Minimum correlation, maximum bias per tree

Reducing $m$ increases the bias of individual trees (each tree sees less information) but decreases the correlation $\rho$ between trees. The variance reduction from decorrelation typically outweighs the bias increase, yielding lower ensemble error.
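The effect of $m$ can be seen directly by sweeping max_features on a synthetic task (hypothetical data; the best setting is problem dependent):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           random_state=0)

# 1 feature, sqrt(D), D/3 (as a fraction), and all D features
for m in [1, "sqrt", 0.33, None]:
    model = RandomForestClassifier(n_estimators=200, max_features=m,
                                   random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_features={m!r:8}  accuracy: {score:.3f}")
```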

Why Random Forests Work

Consider the ensemble variance formula $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$:

  • Bagging alone makes $B$ large, reducing the second term, but $\rho$ remains high because all trees see the same dominant features and make similar splits.
  • Feature randomization reduces $\rho$ by forcing different trees to discover different predictive patterns. A strong feature that would dominate every tree's root split is excluded from $\approx 1 - m/D$ of all split decisions.

The result is an ensemble where individual members are weak but collectively strong, and their errors are sufficiently uncorrelated to cancel in aggregation.
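The decorrelation is measurable: compare the average pairwise correlation of per-tree predictions under bagging versus feature randomization. A minimal sketch on synthetic data (exact numbers will vary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1500, n_features=30, noise=20.0, random_state=0)
X_tr, X_te, y_tr, _ = train_test_split(X, y, random_state=0)

for label, m in [("bagging (m = D)", None), ("random forest (m = sqrt(D))", "sqrt")]:
    forest = RandomForestRegressor(n_estimators=100, max_features=m,
                                   random_state=0).fit(X_tr, y_tr)
    # Per-tree predictions on held-out points, shape (n_trees, n_test)
    per_tree = np.array([t.predict(X_te) for t in forest.estimators_])
    corr = np.corrcoef(per_tree)
    mean_rho = corr[np.triu_indices_from(corr, k=1)].mean()
    print(f"{label:28s} mean pairwise tree correlation: {mean_rho:.3f}")
```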

Feature Importance

Random forests provide two measures of feature importance:

Mean Decrease in Impurity (MDI). Sum the impurity reduction (Gini or MSE) from all splits on feature $j$, averaged across trees. Fast to compute but biased toward high-cardinality features.

Permutation importance. For each feature $j$, randomly permute its values in the OOB data and measure the increase in OOB error. Unbiased but slower. This is the more reliable measure and directly estimates how much predictive information the feature carries.
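Both measures are available in scikit-learn. A minimal sketch; note that sklearn's permutation_importance uses a held-out set here rather than OOB data, a common approximation of the same idea:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

mdi = model.feature_importances_                   # Mean Decrease in Impurity
perm = permutation_importance(model, X_te, y_te,   # permutation importance
                              n_repeats=10, random_state=0)
for j in range(X.shape[1]):
    print(f"feature {j}: MDI={mdi[j]:.3f}  "
          f"permutation={perm.importances_mean[j]:.3f}")
```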

For production ML applications, SHAP (SHapley Additive exPlanations) values provide a theoretically grounded alternative based on cooperative game theory, decomposing each prediction into per-feature contributions. SHAP values satisfy local accuracy, missingness, and consistency axioms that MDI and permutation importance do not.


Practical Considerations

Number of trees. Random forest performance improves with $B$ and converges; adding more trees cannot cause overfitting. In practice, 100 to 500 trees suffice for most problems; beyond this, gains are marginal. The OOB error curve typically plateaus and can be used to determine when adding more trees is no longer beneficial.
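A minimal sketch of tracking OOB error as trees are added, using scikit-learn's warm_start to grow the forest incrementally rather than retraining from scratch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=25, warm_start=True, oob_score=True,
                              random_state=0)
for n in range(25, 501, 25):
    model.set_params(n_estimators=n)   # warm_start: only the new trees are fit
    model.fit(X, y)
    print(f"B={n:3d}  OOB R^2: {model.oob_score_:.4f}")  # watch for the plateau
```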

Tree depth. Unlike standalone decision trees, random forest trees are grown to full depth, until each leaf contains a single example or reaches a minimum node size, with no pruning. The ensemble averaging absorbs the variance that would make a single deep tree overfit.

Computational cost. Training a single tree costs roughly $O(m \cdot N \log N)$ ($m$ candidate features evaluated at each of $O(\log N)$ depth levels over $N$ examples), so $O(B \cdot m \cdot N \log N)$ for the whole forest. Trees are independent and trivially parallelizable.

When random forests excel. Tabular data with moderate dimensionality, heterogeneous features (mix of continuous and categorical), and no requirement for GPU infrastructure. Random forests remain highly competitive with gradient boosting methods on many tabular benchmarks, particularly when hyperparameter tuning budget is limited.

When to prefer gradient boosting. When maximum predictive accuracy is required and tuning budget is available. Gradient boosting (covered in the next article) builds trees sequentially, with each tree correcting the errors of the ensemble so far, enabling it to achieve lower bias than random forests at the cost of higher tuning sensitivity.


Connection to the Tracker Cost Model

The tracker cost estimation model evaluated random forests alongside gradient boosted trees. Random Forest achieved MAE 4,675 with Spearman $\rho = 0.913$, competitive with several gradient boosting variants despite no hyperparameter tuning. XGBoost with Tweedie loss (MAE 3,466, $\rho = 0.945$) outperformed it, primarily due to Tweedie's superior handling of the zero-inflated target distribution rather than architectural advantages. The roughly 26% gap in MAE is attributable to the loss function: random forests average leaf predictions, which naturally handles zero-inflation through leaf-level mixture, but cannot match Tweedie's magnitude-proportional error weighting.


Summary

Concept                        Key Insight
Bias-variance decomposition    Test error = bias² + variance + noise; model complexity trades between the first two
Bagging                        Averaging B models reduces variance by a factor of ≈1/B (limited by correlation)
Bootstrap                      Sampling with replacement produces diverse training sets; OOB provides free validation
Random forests                 Feature randomization decorrelates trees, enabling further variance reduction
Feature importance             Permutation importance and SHAP provide reliable feature attribution

The bias-variance framework explains when and why ensemble methods work. Bagging reduces variance by averaging; random forests reduce it further by decorrelating. Gradient boosting, covered next, takes the complementary approach: reducing bias by sequential correction.