Decision Trees
Decision trees are simple but powerful learning algorithms widely used in practice, including Kaggle competitions. They mimic human decision-making processes and can achieve excellent performance.
Learning Objectives
By the end of this topic, you should be able to:
- Classify a new data point using a decision tree
- Draw the decision boundaries defined by a decision tree
- Calculate the entropy of a distribution
- Calculate the expected information gain of splitting on a feature
- Construct a decision tree by recursively selecting features that maximize information gain
- Describe the advantages and limitations of decision trees
Components of a Decision Tree
A decision tree consists of three main components:
- Internal nodes: Each internal node tests a feature
- Branches: The feature test determines which branch to follow
- Leaf nodes: Each leaf node contains a prediction
Example: Citrus Fruit Classification
Consider a dataset of citrus fruits with weight and height features:
| Weight | Height | Fruit |
|---|---|---|
| 6.0 | 4.0 | Orange |
| 6.0 | 8.0 | Lemon |
| 7.0 | 8.0 | Orange |
| 7.0 | 9.0 | Orange |
| 8.0 | 9.0 | Orange |
| 7.0 | 10.0 | Lemon |
| 6.5 | 9.0 | Lemon |
A decision tree for this data might first test Height <= 9.5, then Weight <= 6.75, creating regions that separate lemons from oranges.
Classifying a Data Point
To classify a new point, start at the root and follow the branches based on feature tests until reaching a leaf node. The leaf’s prediction is the classification.
For example, given (weight=7, height=6):
- Height 6 <= 9.5? Yes → go left
- Weight 7 > 6.75? Yes → go right
- Prediction: Orange
Decision Boundaries
Decision trees create axis-aligned decision boundaries. Each split divides the feature space with a line parallel to one of the axes. The resulting regions form rectangles in 2D.
Classification vs. Regression Trees
Decision trees can handle both classification and regression tasks:
Classification Tree: At each leaf node, the prediction is the most common target value among all data points that reach that leaf.
Regression Tree: At each leaf node, the prediction is the mean target value among all data points that reach that leaf.
Building a Useful Tree
Why Not Find the Optimal Tree?
The optimal tree classifies all training examples correctly, but:
- This can require one leaf per example and won’t generalize
- Finding the smallest such tree is NP-complete
- Decision trees are universal function approximators (can represent any function)
Instead, we aim to learn a useful tree using a greedy strategy.
Greedy Feature Selection
At each step, choose the feature (and split) that most reduces a loss function.
Common loss functions:
- Uncertainty (entropy)
- Impurity (Gini index)
- Squared error (for regression)
Measuring Uncertainty with Entropy
Entropy quantifies the uncertainty inherent in a distribution’s possible outcomes.
Formula
For a random variable with possible values :
Properties of Entropy
High Entropy:
- Uniform-like distribution
- Flat histogram
- Sampled values are less predictable
Low Entropy:
- Concentrated on a few outcomes
- Peaked histogram
- Sampled values are more predictable
Binary Distribution Examples
For a binary distribution :
| Distribution | Entropy |
|---|---|
| (0.5, 0.5) | 1.0 |
| (0.3, 0.7) | 0.881 |
| (0.1, 0.9) | 0.469 |
| (0.01, 0.99) | 0.081 |
Maximum entropy occurs at (most uncertain), and entropy approaches 0 as the distribution becomes more skewed (more certain).
Information Gain
We want to know about a target variable , but instead of observing directly, we observe a feature . Information gain measures how much observing reduces our uncertainty about .
Formulas
Entropy of Y (before observing X):
Expected Conditional Entropy (after observing X):
Information Gain (also called mutual information):
Worked Example: Fruit Classification
Using our citrus dataset, let’s calculate the information gain from splitting on Weight <= 6.75.
Step 1: Calculate
Count: 3 lemons, 4 oranges (total 7)
Step 2: Build the joint probability table
| Orange | Lemon | |
|---|---|---|
| Weight <= 6.75 | 1/7 | 2/7 |
| Weight > 6.75 | 3/7 | 1/7 |
Step 3: Calculate conditional entropies
For Weight <= 6.75 (3 samples: 1 orange, 2 lemons):
For Weight > 6.75 (4 samples: 3 oranges, 1 lemon):
Step 4: Calculate expected conditional entropy
Step 5: Calculate information gain
Properties of Information Gain
| Property | Formula |
|---|---|
| Entropy is non-negative | |
| Observing never increases uncertainty | $H(Y |
| If X and Y are independent | $H(Y) = H(Y |
| If X fully determines Y | $H(Y |
Learning a Decision Tree Recursively
Algorithm
function BuildTree(data):
if should_stop_splitting(data):
return create_leaf_node(data)
best_feature, best_split = find_best_split(data) # maximize info gain
left_data, right_data = split_data(data, best_feature, best_split)
left_child = BuildTree(left_data)
right_child = BuildTree(right_data)
return create_internal_node(best_feature, best_split, left_child, right_child)Stopping Criteria
We stop splitting when:
- All examples have the same label (entropy = 0)
- No features left to split on
- A depth limit is reached
- The loss stops decreasing enough (minimum information gain threshold)
- Too few examples remain in a node
Preventing Overfitting
Decision trees are prone to overfitting, especially when grown to full depth.
Pruning Strategies
- Pre-pruning: Set maximum depth, minimum samples per leaf, minimum information gain
- Post-pruning: Grow full tree, then remove nodes that don’t improve validation performance
Hyperparameter Selection
Use a validation set to select:
- Stopping criteria
- Split thresholds
- Maximum depth
Ensemble Methods
Combining multiple trees often improves performance:
Bagging (Bootstrap Aggregating):
- Train many trees on bootstrap samples of the data
- Average predictions (regression) or vote (classification)
- Example: Random Forests
Boosting:
- Train trees sequentially
- Each tree focuses on examples the previous trees got wrong
- Examples: AdaBoost, Gradient Boosting, XGBoost
Continuous vs. Discrete Features
Continuous features: Split using thresholds (e.g., Height <= 9.5)
- Need to search for the best threshold
- Common approach: sort values and try splits between adjacent values
Discrete features: Can split on equality (e.g., Color == Red)
- Binary: creates two branches
- Multi-valued: can create multiple branches or use binary splits
Advantages and Limitations
Advantages
- Easy to understand and interpret
- Handles both numerical and categorical features
- Requires little data preprocessing
- Implicit feature selection
- Can capture non-linear relationships
Limitations
- Prone to overfitting
- Can be unstable (small data changes lead to different trees)
- Axis-aligned boundaries may not fit data well
- Greedy algorithm doesn’t guarantee global optimum
Summary
| Concept | Key Formula/Idea |
|---|---|
| Entropy | |
| Conditional Entropy | $H(Y |
| Information Gain | $IG(Y |
| Tree Building | Greedily maximize information gain |
| Stopping | Depth limit, min samples, min gain |
| Overfitting Prevention | Pruning, ensembles |
Connection to Modern Practice
Decision trees are rarely used alone in production ML. Their primary role is as the weak learner in ensemble methods:
Random Forests (Breiman, 2001) aggregate many decision trees trained on bootstrap samples with feature randomization. Each tree is high-variance but low-bias; averaging reduces variance while preserving the low bias. See the Bias-Variance and Bagging article.
Gradient Boosting (Friedman, 2001) builds trees sequentially, with each tree fitting the negative gradient of the loss. XGBoost, LightGBM, and CatBoost are the dominant implementations. The tracker cost estimation model uses XGBoost with Tweedie loss, achieving 47.5% MAE reduction over lookup table baselines. See the Gradient Boosting article.
Split criteria in practice. While this article derives information gain (entropy-based), scikit-learn’s default for classification is Gini impurity: . Gini and entropy produce nearly identical trees in practice; Gini is slightly faster to compute (no logarithm). For regression trees, the split criterion is variance reduction (MSE).
Computational complexity. Training a decision tree to depth with examples and features costs per split decision, or total for a balanced tree. This makes deep trees expensive on high-dimensional data. Histogram-based splitting (used in LightGBM) bins continuous features into 256 buckets, reducing the per-split cost from to .
Interpretability. Decision trees remain valued for their interpretability. In regulated industries (finance, healthcare), a single decision tree may be preferred over a black-box ensemble because each prediction can be traced through a sequence of human-readable rules. SHAP values provide post-hoc interpretability for ensembles while preserving the tree structure’s computational efficiency (TreeSHAP runs in polynomial time).