4: Feature Mapping, Regularization, and Linear Classification
This lecture covers how to extend linear models to capture non-linear relationships through feature mapping, how to prevent overfitting using regularization, and how to adapt linear models for classification tasks.
Feature Mapping (Basis Expansion)
What if the relationship between inputs and outputs is not linear? We can still use linear regression by transforming the inputs!
The Key Idea
- Transform the inputs: Map the original features $\mathbf{x}$ to a new feature space via a feature map $\psi(\mathbf{x})$
- Fit a linear model: Run linear regression with the new features $\psi(\mathbf{x})$ as inputs
The key insight: Create new features so that the non-linear relationship becomes linear in the transformed space.
Typically the feature map increases the dimension of the input, $\dim \psi(\mathbf{x}) > \dim \mathbf{x}$ (we expand to a higher-dimensional space).
Example: Polynomial Feature Mapping
Goal: Learn a degree-$M$ polynomial function of a scalar input $x$:

$$y = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$

The relationship is not linear in $x$, but we can make it linear by defining:

Polynomial feature map:

$$\psi(x) = [1,\ x,\ x^2,\ \dots,\ x^M]^\top$$

Now $y = \mathbf{w}^\top \psi(x)$ is linear in $\psi(x)$. The model is linear with respect to the parameters $\mathbf{w}$, even though it is non-linear in the original input $x$.
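As a quick illustration (my own sketch, not part of the lecture; the helper name `poly_features` is made up), the feature map takes only a few lines of NumPy:

```python
import numpy as np

def poly_features(x, M):
    """Polynomial feature map: each row is psi(x) = [1, x, x^2, ..., x^M]."""
    return np.stack([x ** j for j in range(M + 1)], axis=1)

x = np.array([0.0, 0.5, 1.0])
print(poly_features(x, 3))   # each row is [1, x, x^2, x^3] for one input
```

A linear regression fit on `poly_features(x, M)` is then an ordinary linear model in the transformed space.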
Polynomial Degree and Model Complexity
| Degree | Feature Map | Hypothesis | Behavior |
|---|---|---|---|
| Low (e.g. $M = 1$) | $\psi(x) = [1, x]^\top$ | $y = w_0 + w_1 x$ | Simple line (underfitting) |
| Moderate (e.g. $M = 3$) | $\psi(x) = [1, x, x^2, x^3]^\top$ | $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ | Good fit |
| High (e.g. $M = 9$) | $\psi(x) = [1, x, \dots, x^9]^\top$ | $y = w_0 + w_1 x + \dots + w_9 x^9$ | Overfitting |
Model Complexity and Generalization
As the polynomial degree $M$ increases:
| Degree $M$ | Training Error | Test Error | Diagnosis |
|---|---|---|---|
| Low | High | High | Underfitting - model too simple |
| Optimal | Low | Low | Good generalization |
| High | Very low (near 0) | High | Overfitting - model too complex |
Key observations:
- Training error decreases monotonically as model complexity increases
- Test error decreases initially, then increases (U-shaped curve)
- Weight magnitudes grow as $M$ increases (motivates regularization)
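These observations can be reproduced with a small experiment (an illustrative sketch on assumed toy data, not the lecture's own experiment): fit least-squares polynomials of increasing degree and compare training and test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def poly_features(x, M):
    # Feature map psi(x) = [1, x, x^2, ..., x^M], one row per input.
    return np.stack([x ** j for j in range(M + 1)], axis=1)

def make_data(n):
    # Toy non-linear relationship with a little noise (assumed for illustration).
    x = rng.uniform(-1, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

x_train, t_train = make_data(15)
x_test, t_test = make_data(200)

for M in (1, 3, 9):
    Psi = poly_features(x_train, M)
    w, *_ = np.linalg.lstsq(Psi, t_train, rcond=None)       # least-squares fit
    train_mse = np.mean((Psi @ w - t_train) ** 2)
    test_mse = np.mean((poly_features(x_test, M) @ w - t_test) ** 2)
    # Expect: training error falls with M, test error follows a U-shape,
    # and the weight norm grows for large M.
    print(f"M={M}: train={train_mse:.3f}  test={test_mse:.3f}  ||w||={np.linalg.norm(w):.1f}")
```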
Regularization
Why Regularize?
When overfitting occurs:
- Weights become very large (finely tuned to training data)
- The function oscillates wildly between data points
- Small changes in input cause large changes in output
Controlling Model Complexity
Two approaches to prevent overfitting:
- Tune the degree $M$ as a hyperparameter using a validation set
  - Decreases the number of parameters
- Use regularization to enforce simpler solutions
  - Keep the number of parameters large, but constrain their values
The Regularizer
$$\mathcal{R}(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \sum_j w_j^2$$

This is the squared Euclidean norm of the weight vector.
Regularized Cost Function
$$\mathcal{J}_{\text{reg}}(\mathbf{w}) = \mathcal{J}(\mathbf{w}) + \lambda\,\mathcal{R}(\mathbf{w})$$

This creates a tradeoff:
- Smaller $\mathcal{J}(\mathbf{w})$ → better fit to training data
- Smaller $\mathcal{R}(\mathbf{w})$ → smaller weights (simpler model)
- $\lambda$ controls the tradeoff between the two
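For concreteness, one common instantiation (assuming a squared-error training cost, which this section does not restate) looks like:

$$\mathcal{J}_{\text{reg}}(\mathbf{w}) \;=\; \underbrace{\frac{1}{2N}\sum_{i=1}^{N}\left(\mathbf{w}^\top\psi(\mathbf{x}^{(i)}) - t^{(i)}\right)^2}_{\mathcal{J}(\mathbf{w})\ \text{(fit)}} \;+\; \underbrace{\lambda\,\|\mathbf{w}\|_2^2}_{\lambda\,\mathcal{R}(\mathbf{w})\ \text{(simplicity)}}$$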
Tuning $\lambda$
| $\lambda$ | Effect on Weights | Effect on Fit | Risk |
|---|---|---|---|
| Too large ($\lambda \to \infty$) | All weights small | Poor fit to training data | Underfitting |
| Too small ($\lambda \to 0$) | Some weights large | Great fit to training data | Overfitting |
| Optimal | Balanced | Good generalization | - |
Intuition:
- $\lambda$ too large: $\lambda\,\mathcal{R}(\mathbf{w})$ dominates the cost, so the model tries only to keep the weights small
- $\lambda$ too small: $\mathcal{J}(\mathbf{w})$ dominates the cost, so the model effectively ignores regularization
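A minimal NumPy sketch of this tradeoff (my own illustration; it assumes the cost $\tfrac{1}{2}\|\Psi\mathbf{w}-\mathbf{t}\|^2 + \tfrac{\lambda}{2}\|\mathbf{w}\|^2$, whose minimizer is $\mathbf{w} = (\Psi^\top\Psi + \lambda I)^{-1}\Psi^\top\mathbf{t}$):

```python
import numpy as np

def poly_features(x, M):
    return np.stack([x ** j for j in range(M + 1)], axis=1)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 15)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(15)

M = 9
Psi = poly_features(x, M)

for lam in (1e-6, 1e-3, 1.0, 1e3):
    # Closed-form minimizer of 0.5*||Psi w - t||^2 + 0.5*lam*||w||^2.
    w = np.linalg.solve(Psi.T @ Psi + lam * np.eye(M + 1), Psi.T @ t)
    train_mse = np.mean((Psi @ w - t) ** 2)
    # Larger lambda -> smaller weights but worse training fit (toward underfitting);
    # smaller lambda -> larger weights and a tighter training fit (toward overfitting).
    print(f"lambda={lam:g}:  ||w||={np.linalg.norm(w):9.2f}  train MSE={train_mse:.4f}")
```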
Why Penalize Large Weights?
- A large weight $w_j$ → the prediction is very sensitive to feature $x_j$
- We expect output to depend on a combination of features
- Large weights often indicate the model is fitting noise
A Modular Approach to Machine Learning
| Component | Purpose |
|---|---|
| Model | Describes relationships between variables |
| Loss/Cost Function | Quantifies how badly a hypothesis fits the data |
| Regularizer | Expresses preferences over different hypotheses |
| Optimization Algorithm | Fits a model that minimizes the loss while respecting the regularizer |
Binary Linear Classification
From Regression to Classification
In classification, the target is discrete rather than continuous.
Classification setup:
- Dataset: $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
- Each target $t^{(i)}$ is discrete
- Binary classification: $t \in \{0, 1\}$
  - $t = 1$: positive example
  - $t = 0$: negative example
Linear Model for Binary Classification
Step 1: Compute a linear combination:

$$z = \mathbf{w}^\top\mathbf{x} + b$$

Step 2: Apply a threshold to generate the prediction:

$$y = \begin{cases} 1 & \text{if } z \geq r \\ 0 & \text{if } z < r \end{cases}$$

where $r$ is the threshold.
Simplifying the Model
Eliminating the threshold $r$:
Since $\mathbf{w}^\top\mathbf{x} + b \geq r \iff \mathbf{w}^\top\mathbf{x} + (b - r) \geq 0$, we can absorb $r$ into the bias by replacing $b$ with $b - r$ and thresholding at $0$.
Eliminating the bias $b$:
Add a dummy feature $x_0$ that is always $1$ and let $w_0 = b$ be its weight:

$$z = \mathbf{w}^\top\mathbf{x}, \qquad \mathbf{x} = [1, x_1, \dots, x_D]^\top$$
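Putting the two simplifications together, here is a small sketch (with made-up weights, not from the lecture) of the resulting classifier: prepend a dummy feature $x_0 = 1$ and threshold $\mathbf{w}^\top\mathbf{x}$ at $0$.

```python
import numpy as np

def predict(w, X):
    """Binary linear classifier: y = 1 if w . x >= 0, else 0.

    X holds raw inputs, one row per example; a dummy feature x0 = 1 is
    prepended so the bias is simply the first weight w[0].
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # add the dummy feature
    z = X_aug @ w                                       # linear combination
    return (z >= 0).astype(int)                         # threshold at 0

# Illustrative (made-up) weights: w0 = bias, then one weight per feature.
w = np.array([-0.5, 1.0, -2.0])
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(predict(w, X))   # [0 1 0]
```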
Decision Boundary
The decision boundary is the set of input points where $z = 0$:

$$\mathbf{w}^\top\mathbf{x} = 0$$
This defines a hyperplane that separates the two classes.
Example: Modeling the AND Function
Goal: Learn weights to classify the logical AND function perfectly.
| $x_0$ | $x_1$ | $x_2$ | $t$ |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |
System of inequalities for perfect classification:
The model predicts $y = 1$ if $z = w_0 + w_1 x_1 + w_2 x_2 \geq 0$ (recall $x_0 = 1$), so perfect classification requires:
- $(x_1, x_2) = (0, 0)$: $w_0 < 0$ (predict 0)
- $(x_1, x_2) = (0, 1)$: $w_0 + w_2 < 0$ (predict 0)
- $(x_1, x_2) = (1, 0)$: $w_0 + w_1 < 0$ (predict 0)
- $(x_1, x_2) = (1, 1)$: $w_0 + w_1 + w_2 \geq 0$ (predict 1)
One solution: $w_0 = -1.5$, $w_1 = 1$, $w_2 = 1$.
The decision boundary is $-1.5 + x_1 + x_2 = 0$, or equivalently $x_1 + x_2 = 1.5$.
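As a sanity check (assuming the example weights above), a few lines that evaluate $z$ on all four inputs and apply the threshold:

```python
import numpy as np

w = np.array([-1.5, 1.0, 1.0])           # example weights: w0 (bias), w1, w2
X = np.array([[1, 0, 0],                  # each row is [x0, x1, x2] with x0 = 1
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])
t = np.array([0, 0, 0, 1])                # AND targets

y = (X @ w >= 0).astype(int)              # threshold the linear combination at 0
print(y, np.array_equal(y, t))            # [0 0 0 1] True
```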
Linearly Separable Data
- If data is linearly separable, we can find weights that classify every point correctly
- In practice, data is rarely linearly separable
- A linear model will inevitably make mistakes
- We need a way to measure errors and adjust the model → leads to loss functions
Summary
Feature Mapping
| Concept | Description |
|---|---|
| Goal | Model non-linear relationships using linear regression |
| Method | Create new features so model is linear in transformed space |
| Tradeoff | Higher-degree features → more expressive but risk overfitting |
Regularization
| Concept | Formula |
|---|---|
| Regularizer | $\mathcal{R}(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \sum_j w_j^2$ |
| Regularized Cost | $\mathcal{J}_{\text{reg}}(\mathbf{w}) = \mathcal{J}(\mathbf{w}) + \lambda\,\mathcal{R}(\mathbf{w})$ |
| Hyperparameter | $\lambda$ controls the fit vs. simplicity tradeoff |
Binary Linear Classification
| Component | Formula |
|---|---|
| Linear Model | $z = \mathbf{w}^\top\mathbf{x}$ (with dummy feature $x_0 = 1$) |
| Threshold Activation | $y = 1$ if $z \geq 0$, else $y = 0$ |
| Decision Boundary | Hyperplane $\mathbf{w}^\top\mathbf{x} = 0$ |