Logistic Regression and Regularization
This lecture covers how to extend linear models to capture non-linear relationships through feature mapping, how to prevent overfitting using regularization, and how to adapt linear models for classification tasks.
Feature Mapping (Basis Expansion)
What if the relationship between inputs and outputs is not linear? We can still use linear regression by transforming the inputs!
The Key Idea
- Transform the inputs: Map the original features $\mathbf{x} \in \mathbb{R}^D$ to a new feature space $\boldsymbol{\phi}(\mathbf{x}) \in \mathbb{R}^{D'}$
- Fit a linear model: Run linear regression with the new features $\boldsymbol{\phi}(\mathbf{x})$ as inputs
The key insight: Create new features so that the non-linear relationship becomes linear in the transformed space.
Typically $D' > D$ (we expand to a higher-dimensional space).
Example: Polynomial Feature Mapping
Goal: Learn a polynomial function of a scalar input $x$:

$$y = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$

The relationship is not linear in $x$, but we can make it linear by defining:

Polynomial feature map:

$$\boldsymbol{\phi}(x) = (1, x, x^2, \dots, x^M)^T$$

Now $y = \mathbf{w}^T \boldsymbol{\phi}(x)$ is linear in $\mathbf{w}$. The model is linear with respect to the parameters $\mathbf{w}$, even though it's non-linear in the original input $x$.
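As a concrete sketch of this recipe (with made-up data and a plain NumPy least-squares fit, not the lecture's own notation), mapping $x$ to polynomial features and then running ordinary linear regression looks like this:

```python
import numpy as np

def poly_features(x, M):
    """Map a 1-D input array to the polynomial feature space (1, x, ..., x^M)."""
    return np.vander(x, M + 1, increasing=True)  # shape (N, M+1)

# Synthetic data from a known cubic: t = 1 + 2x - 3x^3 + small noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.0 + 2.0 * x - 3.0 * x**3 + rng.normal(0, 0.01, size=50)

# Linear regression in the transformed feature space
Phi = poly_features(x, M=3)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.round(w, 2))  # close to the true coefficients (1, 2, 0, -3)
```

The fit is linear in the weights even though the learned function is a cubic in $x$.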
Polynomial Degree and Model Complexity
| Degree | Feature Map | Hypothesis | Behavior |
|---|---|---|---|
| $M = 1$ | $\boldsymbol{\phi}(x) = (1, x)^T$ | $y = w_0 + w_1 x$ | Simple line (underfitting) |
| $M = 3$ | $\boldsymbol{\phi}(x) = (1, x, x^2, x^3)^T$ | $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ | Good fit |
| $M = 9$ | $\boldsymbol{\phi}(x) = (1, x, \dots, x^9)^T$ | $y = \sum_{j=0}^{9} w_j x^j$ | Overfitting |
Model Complexity and Generalization
As the polynomial degree increases:
| Degree $M$ | Training Error | Test Error | Diagnosis |
|---|---|---|---|
| Low | High | High | Underfitting - model too simple |
| Optimal | Low | Low | Good generalization |
| High | Very low (near 0) | High | Overfitting - model too complex |
Key observations:
- Training error decreases monotonically as model complexity increases
- Test error decreases initially, then increases (U-shaped curve)
- Weight magnitudes grow as $M$ increases (motivates regularization)
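These observations can be reproduced with a small simulation (synthetic data and degrees chosen purely for illustration): fit polynomials of increasing degree to a few noisy training points and evaluate on a held-out set.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples from a smooth non-linear function (made up for the demo)."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.1, n)

x_train, t_train = make_data(12)   # small training set: easy to overfit
x_test, t_test = make_data(200)    # large held-out set

errors = {}
for M in (1, 3, 9):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    train_mse = np.mean((Phi @ w - t_train) ** 2)
    test_mse = np.mean((np.vander(x_test, M + 1, increasing=True) @ w - t_test) ** 2)
    errors[M] = (train_mse, test_mse)
    print(f"M={M}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Training error only goes down as $M$ grows (the feature spaces are nested), while test error eventually turns back up.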
Regularization
Why Regularize?
When overfitting occurs:
- Weights become very large (finely tuned to training data)
- The function oscillates wildly between data points
- Small changes in input cause large changes in output
Controlling Model Complexity
Two approaches to prevent overfitting:
- Tune the degree $M$ as a hyperparameter using a validation set
  - Decreases the number of parameters
- Use regularization to enforce simpler solutions
  - Keep the number of parameters large, but constrain their values
The Regularizer

$$\mathcal{R}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2 = \frac{1}{2}\sum_{j} w_j^2$$

This is the squared Euclidean norm of the weight vector $\mathbf{w}$ (the factor $\frac{1}{2}$ simplifies the gradient).
Regularized Cost Function

$$\mathcal{J}_{\text{reg}}(\mathbf{w}) = \mathcal{J}(\mathbf{w}) + \lambda \mathcal{R}(\mathbf{w})$$

This creates a tradeoff:
- Smaller $\mathcal{J}(\mathbf{w})$ → better fit to training data
- Smaller $\mathcal{R}(\mathbf{w})$ → smaller weights (simpler model)
- $\lambda$ controls the tradeoff between the two
Tuning $\lambda$

| $\lambda$ | Effect on Weights | Effect on Fit | Risk |
|---|---|---|---|
| Too large ($\lambda \to \infty$) | All weights small | Poor fit to training data | Underfitting |
| Too small ($\lambda \to 0$) | Some weights large | Great fit to training data | Overfitting |
| Optimal | Balanced | Good generalization | - |

Intuition:
- $\lambda$ too large: $\mathcal{J}_{\text{reg}}(\mathbf{w}) \approx \lambda \mathcal{R}(\mathbf{w})$, model tries only to keep weights small
- $\lambda$ too small: $\mathcal{J}_{\text{reg}}(\mathbf{w}) \approx \mathcal{J}(\mathbf{w})$, model ignores regularization
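A minimal sketch of L2-regularized (ridge) regression, using the standard closed-form solution $\mathbf{w} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\mathbf{t}$; for simplicity this version also penalizes the bias weight, which practical implementations often exempt, and the data is made up:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize (1/2)||Phi w - t||^2 + (lam/2)||w||^2 in closed form."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

# High-degree polynomial features on a small noisy dataset: prone to overfitting
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 15)
t = np.sin(3 * x) + rng.normal(0, 0.1, 15)
Phi = np.vander(x, 10, increasing=True)  # degree-9 polynomial features

w_unreg = ridge_fit(Phi, t, lam=0.0)
w_reg = ridge_fit(Phi, t, lam=1.0)
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))  # regularization shrinks the weights
```

The weight norm drops sharply once $\lambda > 0$, which is exactly the "constrain their values" effect described above.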
Why Penalize Large Weights?
- A large weight $w_j$ → prediction is very sensitive to feature $x_j$
- We expect output to depend on a combination of features
- Large weights often indicate the model is fitting noise
A Modular Approach to Machine Learning
| Component | Purpose |
|---|---|
| Model | Describes relationships between variables |
| Loss/Cost Function | Quantifies how badly a hypothesis fits the data |
| Regularizer | Expresses preferences over different hypotheses |
| Optimization Algorithm | Fits a hypothesis that minimizes the loss while respecting the regularizer |
Binary Linear Classification
From Regression to Classification
In classification, the target is discrete rather than continuous.
Classification setup:
- Dataset: $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
- Each target $t^{(i)}$ is discrete
- Binary classification: $t \in \{0, 1\}$
  - $t = 1$: positive example
  - $t = 0$: negative example
Linear Model for Binary Classification
Step 1: Compute a linear combination

$$z = \mathbf{w}^T\mathbf{x} + b$$

Step 2: Apply a threshold to generate the prediction

$$y = \begin{cases} 1 & \text{if } z \ge r \\ 0 & \text{if } z < r \end{cases}$$

where $r$ is the threshold.
Simplifying the Model
Eliminating the threshold $r$:

Since $z \ge r \iff z - r \ge 0$, we can absorb $r$ into the bias:

$$z = \mathbf{w}^T\mathbf{x} + b', \quad b' = b - r, \quad \text{predict } y = 1 \text{ iff } z \ge 0$$

Eliminating the bias $b'$:

Add a dummy feature $x_0 = 1$ and let $w_0 = b'$ be its weight:

$$z = \mathbf{w}^T\mathbf{x}, \quad \mathbf{x} = (1, x_1, \dots, x_D)^T$$
Decision Boundary
The decision boundary is the set of points where $z = 0$:

$$\mathbf{w}^T\mathbf{x} = 0$$

This defines a hyperplane that separates the two classes.
Example: Modeling the AND Function
Goal: Learn weights to classify the logical AND function perfectly.
| $x_0$ | $x_1$ | $x_2$ | $t$ |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |
System of inequalities for perfect classification:

The model predicts $y = 1$ if $z = w_0 + w_1 x_1 + w_2 x_2 \ge 0$:
- $(x_1, x_2) = (0, 0)$: $w_0 < 0$ (predict 0)
- $(x_1, x_2) = (0, 1)$: $w_0 + w_2 < 0$ (predict 0)
- $(x_1, x_2) = (1, 0)$: $w_0 + w_1 < 0$ (predict 0)
- $(x_1, x_2) = (1, 1)$: $w_0 + w_1 + w_2 \ge 0$ (predict 1)
One solution: $w_0 = -1.5$, $w_1 = 1$, $w_2 = 1$.

The decision boundary is $-1.5 + x_1 + x_2 = 0$, or equivalently $x_1 + x_2 = 1.5$.
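One candidate solution is $(w_0, w_1, w_2) = (-1.5, 1, 1)$, which matches the stated boundary $x_1 + x_2 = 1.5$; it can be checked mechanically against the truth table:

```python
# Verify that w = (w0, w1, w2) = (-1.5, 1, 1) implements AND, with the dummy
# feature x0 = 1 and the rule: predict y = 1 iff z = w^T x >= 0.
w = (-1.5, 1.0, 1.0)

def predict(x1, x2):
    z = w[0] * 1 + w[1] * x1 + w[2] * x2
    return 1 if z >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, predict(x1, x2))  # matches x1 AND x2 on every row
```

Only the input $(1, 1)$ produces $z = 0.5 \ge 0$; the other three rows give negative $z$.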
Linearly Separable Data
- If the data is linearly separable, we can find weights that classify every point correctly
- In practice, data is rarely linearly separable, so a linear model will inevitably make some mistakes
- We need a way to measure errors and adjust the model → this leads to loss functions
Summary
Feature Mapping
| Concept | Description |
|---|---|
| Goal | Model non-linear relationships using linear regression |
| Method | Create new features so model is linear in transformed space |
| Tradeoff | Higher-degree features → more expressive but risk overfitting |
Regularization
| Concept | Formula |
|---|---|
| Regularizer | $\mathcal{R}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ |
| Regularized Cost | $\mathcal{J}_{\text{reg}}(\mathbf{w}) = \mathcal{J}(\mathbf{w}) + \lambda \mathcal{R}(\mathbf{w})$ |
| Hyperparameter | $\lambda$ controls fit vs. simplicity tradeoff |
Binary Linear Classification
| Component | Formula |
|---|---|
| Linear Model | $z = \mathbf{w}^T\mathbf{x}$ |
| Threshold Activation | $y = 1$ if $z \ge 0$, else $y = 0$ |
| Decision Boundary | Hyperplane $\mathbf{w}^T\mathbf{x} = 0$ |
Connection to Modern Practice
Logistic Regression as a Strong Baseline
Logistic regression remains the standard first model in production ML classification tasks. At companies like Google and Meta, logistic regression baselines are established before more complex models are tried. Reasons:
- Fast to train and serve. A single matrix-vector multiply at inference. Critical when serving millions of predictions per second.
- Interpretable. Each weight $w_j$ indicates how feature $x_j$ affects the log-odds. In regulated domains, this transparency is required.
- Well-calibrated. Because the sigmoid output is derived from the Bernoulli log-likelihood, logistic regression produces calibrated probabilities by construction (under correct specification). Many modern models require post-hoc calibration (Platt scaling, temperature scaling) to achieve what logistic regression provides natively.
- Convex optimization. The cross-entropy loss for logistic regression is convex in $\mathbf{w}$, so any minimum found is a global minimum. This means results are reproducible regardless of initialization.
L1 vs L2 Regularization
This article covers L2 (Ridge) regularization. L1 (Lasso) regularization induces sparsity: many weights become exactly zero, effectively performing feature selection. Elastic Net combines both penalties: $\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$.
From a Bayesian perspective: L2 regularization corresponds to a Gaussian prior on $\mathbf{w}$, L1 to a Laplace prior. The Laplace prior's sharp peak at zero is what produces sparsity.
Generalization to Neural Networks
A neural network with a sigmoid output and binary cross-entropy loss is logistic regression on learned features:

$$y = \sigma(\mathbf{w}^T\mathbf{h}(\mathbf{x}))$$

where $\mathbf{h}(\mathbf{x})$ is the representation from the hidden layers. The final layer is literally logistic regression applied to the learned representation instead of the raw features. This connection explains why cross-entropy is the standard classification loss in deep learning: it is the maximum likelihood objective for Bernoulli outputs.
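A tiny sketch of this view, with made-up values standing in for the hidden-layer representation and the final-layer weights:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([0.2, -1.3, 0.7])  # stand-in for hidden-layer outputs h(x) (made up)
w = np.array([1.5, -0.4, 2.0])  # stand-in for final-layer weights (made up)

# The output unit is exactly logistic regression applied to h instead of x
y = sigmoid(w @ h)
print(y)  # a probability strictly between 0 and 1
```

Swapping the fixed feature map $\boldsymbol{\phi}(\mathbf{x})$ for learned features $\mathbf{h}(\mathbf{x})$ is the only structural difference from the linear model in this lecture.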
Multi-Class Extension
Logistic regression generalizes to $K$ classes via the softmax function:

$$y_k = \frac{e^{z_k}}{\sum_{k'=1}^{K} e^{z_{k'}}}$$

The loss becomes categorical cross-entropy: $\mathcal{L} = -\sum_{k=1}^{K} t_k \log y_k$. This is the standard output layer for multi-class classification in neural networks.
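A minimal softmax and cross-entropy sketch (logit values chosen for illustration; the max-subtraction trick is the standard guard against overflow):

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of K logits."""
    z = z - np.max(z)            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, t):
    """Categorical cross-entropy; t is a one-hot target vector."""
    return -np.sum(t * np.log(y))

z = np.array([2.0, 1.0, 0.1])    # logits for K = 3 classes
y = softmax(z)
t = np.array([1.0, 0.0, 0.0])    # one-hot target: class 0 is correct
print(y)                         # probabilities summing to 1
print(cross_entropy(y, t))       # small when the correct class gets high probability
```

With $K = 2$ this reduces to the sigmoid/binary cross-entropy setup, which is why softmax is the natural multi-class generalization.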