Bias-Variance Tradeoff and Bagging
The bias-variance decomposition provides a theoretical framework for understanding generalization error. Bagging exploits this decomposition by reducing variance through ensemble averaging, and random forests extend bagging with feature randomization to decorrelate ensemble members.
The Bias-Variance Decomposition
Consider a regression problem where we observe data generated by $y = f(x) + \varepsilon$, with $\varepsilon$ representing irreducible noise ($\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$). A learning algorithm trained on a dataset $D$ produces a hypothesis $\hat{h}_D$. The expected prediction error at a point $x$ decomposes as:

$$\mathbb{E}_{D,\varepsilon}\big[(y - \hat{h}_D(x))^2\big] = \underbrace{\big(f(x) - \bar{h}(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\big[(\hat{h}_D(x) - \bar{h}(x))^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$

where $\bar{h}(x) = \mathbb{E}_D[\hat{h}_D(x)]$ is the average prediction.
The expectation is over random training sets drawn from the same distribution.
Bias measures how far the average prediction (across all possible training sets) is from the true function. High bias indicates that the model class is too restrictive to capture the true relationship. A linear model fit to a quadratic function has high bias.
Variance measures how much the prediction varies across different training sets. High variance indicates that the model is too sensitive to the specific training examples it saw. A high-degree polynomial fit to a small dataset has high variance.
Irreducible noise $\sigma^2$ sets the floor. No model can achieve lower expected error than this.
Derivation
Expanding the squared error:

$$\mathbb{E}\big[(y - \hat{h}_D(x))^2\big] = \mathbb{E}\big[(f(x) - \hat{h}_D(x))^2\big] + 2\,\mathbb{E}\big[\varepsilon\,(f(x) - \hat{h}_D(x))\big] + \mathbb{E}[\varepsilon^2]$$

Since $\varepsilon$ is independent of $\hat{h}_D(x)$ and has zero mean, the cross term vanishes, leaving $\mathbb{E}_D[(f(x) - \hat{h}_D(x))^2] + \sigma^2$. For the first term, let $\bar{h}(x) = \mathbb{E}_D[\hat{h}_D(x)]$:

$$\mathbb{E}_D\big[(f(x) - \hat{h}_D(x))^2\big] = \big(f(x) - \bar{h}(x)\big)^2 + 2\big(f(x) - \bar{h}(x)\big)\,\mathbb{E}_D\big[\bar{h}(x) - \hat{h}_D(x)\big] + \mathbb{E}_D\big[(\bar{h}(x) - \hat{h}_D(x))^2\big]$$

The middle term is zero since $\mathbb{E}_D[\hat{h}_D(x)] = \bar{h}(x)$. This yields the decomposition.
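The decomposition can be checked numerically: fit the same model class to many independently drawn training sets, estimate bias² and variance at a single point, and compare their sum (plus the noise floor) against the directly measured expected error. The quadratic target, linear model, and noise level below are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2          # true function (quadratic, illustrative)
sigma = 0.5                   # irreducible noise standard deviation
x0 = 0.7                      # evaluation point
n, trials = 30, 2000

# Fit a linear model (high bias for a quadratic target) to many training sets
preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coef = np.polyfit(x, y, deg=1)
    preds[t] = np.polyval(coef, x0)

bias_sq = (preds.mean() - f(x0)) ** 2   # squared distance of the average prediction
variance = preds.var()                  # spread of predictions across training sets

# Independent noisy targets at x0: the measured expected squared error
# should match bias^2 + variance + sigma^2
y0 = f(x0) + rng.normal(0, sigma, trials)
mse = np.mean((y0 - preds) ** 2)
```

With enough trials the three estimated terms sum to the measured error, up to Monte Carlo noise.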
The Tradeoff in Practice
| Property | Low Complexity | High Complexity |
|---|---|---|
| Bias | High | Low |
| Variance | Low | High |
| Training error | High | Low |
| Test error | Depends | Depends |
| Example | Linear regression | Deep decision tree |
Model complexity controls the tradeoff. As complexity increases (more parameters, higher polynomial degree, deeper trees), bias decreases but variance increases. The optimal complexity minimizes total test error.
The classical view (Geman et al., 1992) posits a U-shaped test error curve: underfitting on the left (high bias), overfitting on the right (high variance), with an optimal point in between.
The modern view (Belkin et al., 2019) observes that highly overparameterized models (neural networks with far more parameters than training examples) can achieve low test error despite interpolating the training data. This double descent phenomenon suggests that the classical U-curve is incomplete: beyond the interpolation threshold, test error can decrease again as model capacity continues to grow.
Bagging: Bootstrap Aggregating
Bagging (Breiman, 1996) reduces variance by training multiple models on bootstrap samples and averaging their predictions.
The Bootstrap
A bootstrap sample $D_b$ is drawn from the training set $D$ of size $n$ by sampling $n$ examples with replacement. Each bootstrap sample:
- Contains approximately $63.2\%$ ($1 - 1/e$) of the unique training examples
- Leaves approximately $36.8\%$ as out-of-bag (OOB) examples
- Has the same size as the original dataset
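The 63.2% figure follows from each example having probability $(1 - 1/n)^n \approx e^{-1}$ of being excluded from a given bootstrap sample. A quick simulation (sizes are illustrative) confirms it:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 10_000, 50        # dataset size and number of bootstrap draws

fracs = []
for _ in range(trials):
    sample = rng.integers(0, n, size=n)          # n draws with replacement
    fracs.append(np.unique(sample).size / n)     # fraction of unique examples

mean_unique = float(np.mean(fracs))              # converges to 1 - 1/e ~ 0.632
```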
The Bagging Algorithm
- Draw $B$ bootstrap samples $D_1, \dots, D_B$
- Train a base model $\hat{h}_b$ on each $D_b$
- Aggregate predictions:
  - Regression: $\hat{h}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{h}_b(x)$
  - Classification: majority vote over $\hat{h}_1(x), \dots, \hat{h}_B(x)$
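The steps above can be sketched in a few lines, using scikit-learn decision trees as the base model; the noisy-sine dataset and $B = 50$ are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, y_train, X_test, B=50, seed=0):
    """Train B full-depth trees on bootstrap samples and average predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(random_state=b)
        tree.fit(X_train[idx], y_train[idx])
        all_preds[b] = tree.predict(X_test)
    return all_preds.mean(axis=0)              # regression: average the B models

# Usage: noisy sine data, where a single deep tree badly overfits
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, (300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)
X_grid = np.linspace(0, 6, 100).reshape(-1, 1)
y_bagged = bagging_predict(X, y, X_grid)
```

Comparing the bagged prediction against a single full-depth tree on the same data shows the variance reduction directly: the average is far smoother than any individual tree.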
Why Bagging Reduces Variance
For $B$ models with predictions $\hat{h}_1(x), \dots, \hat{h}_B(x)$, each with variance $\sigma^2$ and pairwise correlation $\rho$:

$$\mathrm{Var}\big(\hat{h}_{\text{bag}}(x)\big) = \rho\sigma^2 + \frac{1 - \rho}{B}\sigma^2$$

As $B \to \infty$, the variance approaches $\rho\sigma^2$. If the models were independent ($\rho = 0$), the variance would vanish. In practice, bootstrap samples overlap (~63.2% shared data), so $\rho > 0$ and variance reduction is bounded. The key insight: reducing correlation between ensemble members is as important as increasing ensemble size.
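The variance formula can be verified by averaging synthetically correlated predictors. Equicorrelated Gaussians (a shared component plus independent components) are a simplifying assumption used here purely for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma, rho, trials = 25, 1.0, 0.4, 200_000   # illustrative values

# Shared component gives every pair of columns correlation rho;
# each column has total variance sigma^2
shared = rng.normal(0, sigma * np.sqrt(rho), (trials, 1))
indiv = rng.normal(0, sigma * np.sqrt(1 - rho), (trials, B))
preds = shared + indiv

empirical = preds.mean(axis=1).var()                           # variance of the average
theoretical = rho * sigma ** 2 + (1 - rho) * sigma ** 2 / B    # 0.4 + 0.6/25 = 0.424
```

Note that the $\rho\sigma^2$ term is exactly the variance of the shared component, which no amount of averaging can remove.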
Out-of-Bag Estimation
Each training example appears in approximately $63.2\%$ of bootstrap samples, so roughly a third of the models never see it. The OOB prediction for example $i$ uses only the models that did not train on it:

$$\hat{h}_{\text{OOB}}(x_i) = \frac{1}{|\{b : x_i \notin D_b\}|} \sum_{b\,:\,x_i \notin D_b} \hat{h}_b(x_i)$$
The OOB error is a nearly unbiased estimate of test error, computed without needing a separate validation set. This is particularly valuable when data is scarce.
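In scikit-learn, OOB estimation is a constructor flag rather than something you compute by hand. A sketch on synthetic data (the target function and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.2, 500)   # synthetic target

# oob_score=True scores each example using only the trees that never saw it
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
oob_r2 = forest.oob_score_    # R^2 on OOB predictions; no held-out split needed
```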
Random Forests
Random forests (Breiman, 2001) extend bagging with feature randomization: at each split in each tree, only a random subset of $m$ features is considered as candidates.
Algorithm
For each tree $b = 1, \dots, B$:
- Draw a bootstrap sample $D_b$
- Grow a decision tree on $D_b$, but at each split:
  - Sample $m$ features uniformly at random from the $p$ total features
  - Find the best split among only these $m$ features
  - Split the node
- Grow the tree to full depth (no pruning)
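The per-split feature sampling can be sketched directly. The toy split-finder below uses the weighted-child-variance criterion of regression trees; the function name and test data are illustrative, not from a library.

```python
import numpy as np

def best_split(X, y, m, rng):
    """Best axis-aligned split among m randomly sampled candidate features,
    scored by weighted child variance (the regression-tree criterion)."""
    n_samples, p = X.shape
    candidates = rng.choice(p, size=m, replace=False)   # the random-forest step
    best_score, best_feat, best_thresh = np.inf, None, None
    for j in candidates:
        for t in np.unique(X[:, j])[:-1]:               # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = left.size * left.var() + right.size * right.var()
            if score < best_score:
                best_score, best_feat, best_thresh = score, j, t
    return best_score, best_feat, best_thresh

# With m = p this is an ordinary greedy split; with m < p each call
# sees only a random subset of the features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)   # only feature 0 carries signal
_, feat, thresh = best_split(X, y, m=4, rng=rng)
```

When $m < p$, some calls will not even see feature 0, forcing those nodes to split on weaker features; this is precisely the decorrelation mechanism.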
Feature Subset Size
The hyperparameter $m$ controls the bias-variance tradeoff within the ensemble:

| Setting | Role | Effect |
|---|---|---|
| $m = p$ | Bagging | Maximum signal per tree, maximum correlation between trees |
| $m = \sqrt{p}$ | Default (classification) | Good balance for most problems |
| $m = p/3$ | Default (regression) | Standard recommendation |
| $m = 1$ | Extreme | Minimum correlation, maximum bias per tree |
Reducing $m$ increases the bias of individual trees (each tree sees less information) but decreases the correlation between trees. The variance reduction from decorrelation typically outweighs the bias increase, yielding lower ensemble error.
Why Random Forests Work
Consider the ensemble variance formula $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$:

- Bagging alone makes $B$ large, reducing the second term, but $\rho$ remains high because all trees see the same dominant features and make similar splits.
- Feature randomization reduces $\rho$ by forcing different trees to discover different predictive patterns. A strong feature that would dominate every tree's root split is excluded from a fraction $(p - m)/p$ of all split decisions.
The result is an ensemble where individual members are weak but collectively strong, and their errors are sufficiently uncorrelated to cancel in aggregation.
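The decorrelation effect is directly measurable: compare the average pairwise correlation of individual tree predictions under bagging (`max_features=1.0`) against feature subsampling (`max_features="sqrt"`). The synthetic dataset with one dominant feature is an assumption chosen to make the contrast visible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 400, 10
X = rng.normal(size=(n, p))
# One dominant feature: bagged trees will all split on it first
y = 3 * X[:, 0] + 0.3 * X[:, 1:].sum(axis=1) + rng.normal(0, 0.5, n)
X_new = rng.normal(size=(200, p))     # fixed evaluation set for both forests

def mean_tree_corr(max_features):
    forest = RandomForestRegressor(n_estimators=50, max_features=max_features,
                                   random_state=0).fit(X, y)
    preds = np.array([t.predict(X_new) for t in forest.estimators_])
    corr = np.corrcoef(preds)                          # pairwise tree correlations
    return corr[~np.eye(corr.shape[0], dtype=bool)].mean()

rho_bagging = mean_tree_corr(1.0)      # m = p: bagging
rho_forest = mean_tree_corr("sqrt")    # m = sqrt(p): random forest
```

With the dominant feature present, the bagged trees are highly correlated; restricting the candidate features at each split measurably lowers the average pairwise correlation.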
Feature Importance
Random forests provide two measures of feature importance:
Mean Decrease in Impurity (MDI). Sum the impurity reduction (Gini or MSE) from all splits on feature $j$, averaged across trees. Fast to compute but biased toward high-cardinality features.
Permutation importance. For each feature $j$, randomly permute its values in the OOB data and measure the increase in OOB error. Unbiased but slower. This is the more reliable measure and directly estimates how much predictive information the feature carries.
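scikit-learn ships both measures. A sketch on synthetic data (the two-feature setup is illustrative; note that `sklearn.inspection.permutation_importance` scores on a supplied dataset, here the training set for brevity, rather than the OOB variant described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(size=n),               # feature 0: real signal
    rng.normal(size=n),               # feature 1: pure noise
])
y = 2 * X[:, 0] + rng.normal(0, 0.5, n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

mdi = forest.feature_importances_     # mean decrease in impurity (fast, normalized)
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
perm_mean = perm.importances_mean     # drop in score when each feature is shuffled
```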
For production ML applications, SHAP (SHapley Additive exPlanations) values provide a theoretically grounded alternative based on cooperative game theory, decomposing each prediction into per-feature contributions. SHAP values satisfy local accuracy, missingness, and consistency axioms that MDI and permutation importance do not.
Practical Considerations
Number of trees. Random forest performance improves monotonically with $B$ (it cannot overfit by adding more trees). In practice, 100–500 trees suffice for most problems; beyond this, gains are marginal. The OOB error curve typically plateaus and can be used to determine when adding more trees is no longer beneficial.
Tree depth. Unlike individual decision trees, random forest trees are grown to full depth (each leaf contains a single example or minimum node size). The ensemble averaging handles the variance that would make a single deep tree overfit.
Computational cost. Training is $O(m\,n \log n)$ per tree (considering $m$ features at each of $O(\log n)$ depth levels for $n$ examples). Trees are independent and trivially parallelizable.
When random forests excel. Tabular data with moderate dimensionality, heterogeneous features (mix of continuous and categorical), and no requirement for GPU infrastructure. Random forests remain highly competitive with gradient boosting methods on many tabular benchmarks, particularly when hyperparameter tuning budget is limited.
When to prefer gradient boosting. When maximum predictive accuracy is required and tuning budget is available. Gradient boosting (covered in the next article) builds trees sequentially, with each tree correcting the errors of the ensemble so far, enabling it to achieve lower bias than random forests at the cost of higher tuning sensitivity.
Connection to the Tracker Cost Model
The tracker cost estimation model evaluated random forests alongside gradient boosted trees. Random Forest achieved MAE 4,675 with a Spearman rank correlation competitive with several gradient boosting variants, despite no hyperparameter tuning. XGBoost with Tweedie loss (MAE 3,466) outperformed it, primarily due to Tweedie's superior handling of the zero-inflated target distribution rather than architectural advantages. The 27% gap is attributable to the loss function: random forests average leaf predictions, which naturally handles zero-inflation through leaf-level mixing, but cannot match Tweedie's magnitude-proportional error weighting.
Summary
| Concept | Key Insight |
|---|---|
| Bias-variance decomposition | Test error = bias² + variance + noise; model complexity trades between the first two |
| Bagging | Averaging $B$ models reduces variance by up to a factor of $B$ (limited by inter-model correlation) |
| Bootstrap | Sampling with replacement produces diverse training sets; OOB provides free validation |
| Random forests | Feature randomization decorrelates trees, enabling further variance reduction |
| Feature importance | Permutation importance and SHAP provide reliable feature attribution |
The bias-variance framework explains when and why ensemble methods work. Bagging reduces variance by averaging; random forests reduce it further by decorrelating. Gradient boosting, covered next, takes the complementary approach: reducing bias by sequential correction.