Logistic Regression and Regularization
This lecture covers how to extend linear models to capture non-linear relationships through feature mapping, how to prevent overfitting using regularization, and how to adapt linear models for classification tasks.
Feature Mapping (Basis Expansion)
What if the relationship between inputs and outputs is not linear? We can still use linear regression by transforming the inputs!
The Key Idea
- Transform the inputs: Map the original features $\mathbf{x} \in \mathbb{R}^D$ to a new feature space $\boldsymbol{\phi}(\mathbf{x}) \in \mathbb{R}^{D'}$
- Fit a linear model: Run linear regression with the new features $\boldsymbol{\phi}(\mathbf{x})$ as inputs
The key insight: Create new features so that the non-linear relationship becomes linear in the transformed space.
Typically $D' > D$ (we expand to a higher-dimensional space).
Example: Polynomial Feature Mapping
Goal: Learn a polynomial function of a scalar input $x$:

$$y = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$

The relationship is not linear in $x$, but we can make it linear by defining:

Polynomial feature map:

$$\boldsymbol{\phi}(x) = (1, x, x^2, \dots, x^M)^T$$

Now $y = \mathbf{w}^T \boldsymbol{\phi}(x)$ is linear in $\mathbf{w}$. The model is linear with respect to the parameters $\mathbf{w}$, even though it's non-linear in the original input $x$.
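As a concrete sketch of this recipe (with made-up data and a plain NumPy least-squares fit, not the lecture's own notation), mapping $x$ to polynomial features and then running ordinary linear regression looks like this:

```python
import numpy as np

def poly_features(x, M):
    """Map a 1-D input array to the polynomial feature space (1, x, ..., x^M)."""
    return np.vander(x, M + 1, increasing=True)  # shape (N, M+1)

# Synthetic data from a known cubic: t = 1 + 2x - 3x^3 + small noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.0 + 2.0 * x - 3.0 * x**3 + rng.normal(0, 0.01, size=50)

# Linear regression in the transformed feature space
Phi = poly_features(x, M=3)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.round(w, 2))  # close to the true coefficients (1, 2, 0, -3)
```

The fit is linear in the weights even though the learned function is a cubic in $x$.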
Polynomial Degree and Model Complexity
| Degree | Feature Map | Hypothesis | Behavior |
|---|---|---|---|
| $M = 1$ | $\boldsymbol{\phi}(x) = (1, x)^T$ | $y = w_0 + w_1 x$ | Simple line (underfitting) |
| $M = 3$ | $\boldsymbol{\phi}(x) = (1, x, x^2, x^3)^T$ | $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ | Good fit |
| $M = 9$ | $\boldsymbol{\phi}(x) = (1, x, \dots, x^9)^T$ | $y = \sum_{j=0}^{9} w_j x^j$ | Overfitting |
Model Complexity and Generalization
As the polynomial degree increases:
| Degree $M$ | Training Error | Test Error | Diagnosis |
|---|---|---|---|
| Low | High | High | Underfitting - model too simple |
| Optimal | Low | Low | Good generalization |
| High | Very low (near 0) | High | Overfitting - model too complex |
Key observations:
- Training error decreases monotonically as model complexity increases
- Test error decreases initially, then increases (U-shaped curve)
- Weight magnitudes grow as $M$ increases (motivates regularization)
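These observations can be reproduced with a small simulation (synthetic data and degrees chosen purely for illustration): fit polynomials of increasing degree to a few noisy training points and evaluate on a held-out set.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples from a smooth non-linear function (made up for the demo)."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.1, n)

x_train, t_train = make_data(12)   # small training set: easy to overfit
x_test, t_test = make_data(200)    # large held-out set

errors = {}
for M in (1, 3, 9):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    train_mse = np.mean((Phi @ w - t_train) ** 2)
    test_mse = np.mean((np.vander(x_test, M + 1, increasing=True) @ w - t_test) ** 2)
    errors[M] = (train_mse, test_mse)
    print(f"M={M}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Training error only goes down as $M$ grows (the feature spaces are nested), while test error eventually turns back up.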
Regularization
Why Regularize?
When overfitting occurs:
- Weights become very large (finely tuned to training data)
- The function oscillates wildly between data points
- Small changes in input cause large changes in output
Controlling Model Complexity
Two approaches to prevent overfitting:
- Tune the degree $M$ as a hyperparameter using a validation set
  - Decreases the number of parameters
- Use regularization to enforce simpler solutions
  - Keep the number of parameters large, but constrain their values
The Regularizer

$$\mathcal{R}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2 = \frac{1}{2}\sum_{j} w_j^2$$

This is the squared Euclidean norm of the weight vector $\mathbf{w}$ (the factor $\frac{1}{2}$ simplifies the gradient).
Regularized Cost Function

$$\mathcal{J}_{\text{reg}}(\mathbf{w}) = \mathcal{J}(\mathbf{w}) + \lambda \mathcal{R}(\mathbf{w})$$

This creates a tradeoff:
- Smaller $\mathcal{J}(\mathbf{w})$ → better fit to training data
- Smaller $\mathcal{R}(\mathbf{w})$ → smaller weights (simpler model)
- $\lambda$ controls the tradeoff between the two
Tuning $\lambda$

| $\lambda$ | Effect on Weights | Effect on Fit | Risk |
|---|---|---|---|
| Too large ($\lambda \to \infty$) | All weights small | Poor fit to training data | Underfitting |
| Too small ($\lambda \to 0$) | Some weights large | Great fit to training data | Overfitting |
| Optimal | Balanced | Good generalization | - |

Intuition:
- $\lambda$ too large: $\mathcal{J}_{\text{reg}}(\mathbf{w}) \approx \lambda \mathcal{R}(\mathbf{w})$, model tries only to keep weights small
- $\lambda$ too small: $\mathcal{J}_{\text{reg}}(\mathbf{w}) \approx \mathcal{J}(\mathbf{w})$, model ignores regularization
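A minimal sketch of L2-regularized (ridge) regression, using the standard closed-form solution $\mathbf{w} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\mathbf{t}$; for simplicity this version also penalizes the bias weight, which practical implementations often exempt, and the data is made up:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize (1/2)||Phi w - t||^2 + (lam/2)||w||^2 in closed form."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

# High-degree polynomial features on a small noisy dataset: prone to overfitting
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 15)
t = np.sin(3 * x) + rng.normal(0, 0.1, 15)
Phi = np.vander(x, 10, increasing=True)  # degree-9 polynomial features

w_unreg = ridge_fit(Phi, t, lam=0.0)
w_reg = ridge_fit(Phi, t, lam=1.0)
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))  # regularization shrinks the weights
```

The weight norm drops sharply once $\lambda > 0$, which is exactly the "constrain their values" effect described above.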
Why Penalize Large Weights?
- A large weight $w_j$ → prediction is very sensitive to feature $x_j$
- We expect output to depend on a combination of features
- Large weights often indicate the model is fitting noise
A Modular Approach to Machine Learning
| Component | Purpose |
|---|---|
| Model | Describes relationships between variables |
| Loss/Cost Function | Quantifies how badly a hypothesis fits the data |
| Regularizer | Expresses preferences over different hypotheses |
| Optimization Algorithm | Fits a hypothesis that minimizes the loss while respecting the regularizer |
Binary Linear Classification
From Regression to Classification
In classification, the target is discrete rather than continuous.
Classification setup:
- Dataset: $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
- Each target $t^{(i)}$ is discrete
- Binary classification: $t \in \{0, 1\}$
  - $t = 1$: positive example
  - $t = 0$: negative example
Linear Model for Binary Classification
Step 1: Compute a linear combination

$$z = \mathbf{w}^T\mathbf{x} + b$$

Step 2: Apply a threshold to generate the prediction

$$y = \begin{cases} 1 & \text{if } z \ge r \\ 0 & \text{if } z < r \end{cases}$$

where $r$ is the threshold.
Simplifying the Model
Eliminating the threshold $r$:

Since $z \ge r \iff z - r \ge 0$, we can absorb $r$ into the bias:

$$z = \mathbf{w}^T\mathbf{x} + b', \quad b' = b - r, \quad \text{predict } y = 1 \text{ iff } z \ge 0$$

Eliminating the bias $b'$:

Add a dummy feature $x_0 = 1$ and let $w_0 = b'$ be its weight:

$$z = \mathbf{w}^T\mathbf{x}, \quad \mathbf{x} = (1, x_1, \dots, x_D)^T$$
Decision Boundary
The decision boundary is the set of points where $z = 0$:

$$\mathbf{w}^T\mathbf{x} = 0$$

This defines a hyperplane that separates the two classes.
Example: Modeling the AND Function
Goal: Learn weights to classify the logical AND function perfectly.
| $x_0$ | $x_1$ | $x_2$ | $t$ |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |
System of inequalities for perfect classification:

The model predicts $y = 1$ if $z = w_0 + w_1 x_1 + w_2 x_2 \ge 0$:
- $(x_1, x_2) = (0, 0)$: $w_0 < 0$ (predict 0)
- $(x_1, x_2) = (0, 1)$: $w_0 + w_2 < 0$ (predict 0)
- $(x_1, x_2) = (1, 0)$: $w_0 + w_1 < 0$ (predict 0)
- $(x_1, x_2) = (1, 1)$: $w_0 + w_1 + w_2 \ge 0$ (predict 1)
One solution: $w_0 = -1.5$, $w_1 = 1$, $w_2 = 1$.

The decision boundary is $-1.5 + x_1 + x_2 = 0$, or equivalently $x_1 + x_2 = 1.5$.
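One candidate solution is $(w_0, w_1, w_2) = (-1.5, 1, 1)$, which matches the stated boundary $x_1 + x_2 = 1.5$; it can be checked mechanically against the truth table:

```python
# Verify that w = (w0, w1, w2) = (-1.5, 1, 1) implements AND, with the dummy
# feature x0 = 1 and the rule: predict y = 1 iff z = w^T x >= 0.
w = (-1.5, 1.0, 1.0)

def predict(x1, x2):
    z = w[0] * 1 + w[1] * x1 + w[2] * x2
    return 1 if z >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, predict(x1, x2))  # matches x1 AND x2 on every row
```

Only the input $(1, 1)$ produces $z = 0.5 \ge 0$; the other three rows give negative $z$.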
Linearly Separable Data
- If the data is linearly separable, we can find weights that classify every point correctly
- In practice, data is rarely linearly separable, so a linear model will inevitably make some mistakes
- We need a way to measure errors and adjust the model → this leads to loss functions
Summary
Feature Mapping
| Concept | Description |
|---|---|
| Goal | Model non-linear relationships using linear regression |
| Method | Create new features so model is linear in transformed space |
| Tradeoff | Higher-degree features → more expressive but risk overfitting |
Regularization
| Concept | Formula |
|---|---|
| Regularizer | $\mathcal{R}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ |
| Regularized Cost | $\mathcal{J}_{\text{reg}}(\mathbf{w}) = \mathcal{J}(\mathbf{w}) + \lambda \mathcal{R}(\mathbf{w})$ |
| Hyperparameter | $\lambda$ controls fit vs. simplicity tradeoff |
Binary Linear Classification
| Component | Formula |
|---|---|
| Linear Model | $z = \mathbf{w}^T\mathbf{x}$ |
| Threshold Activation | $y = 1$ if $z \ge 0$, else $y = 0$ |
| Decision Boundary | Hyperplane $\mathbf{w}^T\mathbf{x} = 0$ |
Connection to Modern Practice
Logistic Regression as a Strong Baseline
Logistic regression remains the standard first model in production ML classification tasks. At companies like Google and Meta, logistic regression baselines are established before more complex models are tried. Reasons:
- Fast to train and serve. A single matrix-vector multiply at inference. Critical when serving millions of predictions per second.
- Interpretable. Each weight $w_j$ indicates how feature $x_j$ affects the log-odds. In regulated domains, this transparency is required.
- Well-calibrated. Because the sigmoid output is derived from the Bernoulli log-likelihood, logistic regression produces calibrated probabilities by construction (under correct specification). Many modern models require post-hoc calibration (Platt scaling, temperature scaling) to achieve what logistic regression provides natively.
- Convex optimization. The cross-entropy loss for logistic regression is convex in $\mathbf{w}$, so any minimum found is a global minimum. This means results are reproducible regardless of initialization.
L1 vs L2 Regularization
This article covers L2 (Ridge) regularization. L1 (Lasso) regularization induces sparsity: many weights become exactly zero, effectively performing feature selection. Elastic Net combines both penalties: $\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$.
From a Bayesian perspective: L2 regularization corresponds to a Gaussian prior on $\mathbf{w}$, L1 to a Laplace prior. The Laplace prior's sharp peak at zero is what produces sparsity.
Generalization to Neural Networks
A neural network with a sigmoid output and binary cross-entropy loss is logistic regression on learned features:

$$y = \sigma(\mathbf{w}^T\mathbf{h}(\mathbf{x}))$$

where $\mathbf{h}(\mathbf{x})$ is the representation from the hidden layers. The final layer is literally logistic regression applied to the learned representation instead of the raw features. This connection explains why cross-entropy is the standard classification loss in deep learning: it is the maximum likelihood objective for Bernoulli outputs.
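A tiny sketch of this view, with made-up values standing in for the hidden-layer representation and the final-layer weights:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([0.2, -1.3, 0.7])  # stand-in for hidden-layer outputs h(x) (made up)
w = np.array([1.5, -0.4, 2.0])  # stand-in for final-layer weights (made up)

# The output unit is exactly logistic regression applied to h instead of x
y = sigmoid(w @ h)
print(y)  # a probability strictly between 0 and 1
```

Swapping the fixed feature map $\boldsymbol{\phi}(\mathbf{x})$ for learned features $\mathbf{h}(\mathbf{x})$ is the only structural difference from the linear model in this lecture.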
Multi-Class Extension
Logistic regression generalizes to $K$ classes via the softmax function:

$$y_k = \frac{e^{z_k}}{\sum_{k'=1}^{K} e^{z_{k'}}}$$

The loss becomes categorical cross-entropy: $\mathcal{L} = -\sum_{k=1}^{K} t_k \log y_k$. This is the standard output layer for multi-class classification in neural networks.
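A minimal softmax and cross-entropy sketch (logit values chosen for illustration; the max-subtraction trick is the standard guard against overflow):

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of K logits."""
    z = z - np.max(z)            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, t):
    """Categorical cross-entropy; t is a one-hot target vector."""
    return -np.sum(t * np.log(y))

z = np.array([2.0, 1.0, 0.1])    # logits for K = 3 classes
y = softmax(z)
t = np.array([1.0, 0.0, 0.0])    # one-hot target: class 0 is correct
print(y)                         # probabilities summing to 1
print(cross_entropy(y, t))       # small when the correct class gets high probability
```

With $K = 2$ this reduces to the sigmoid/binary cross-entropy setup, which is why softmax is the natural multi-class generalization.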