
Logistic Regression and Regularization

This lecture covers how to extend linear models to capture non-linear relationships through feature mapping, how to prevent overfitting using regularization, and how to adapt linear models for classification tasks.


Feature Mapping (Basis Expansion)

What if the relationship between inputs and outputs is not linear? We can still use linear regression by transforming the inputs!

The Key Idea

  1. Transform the inputs: Map original features $\mathbf{x} \in \mathbb{R}^D$ to a new feature space $\mathbb{R}^{D'}$ via $\psi(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}^{D'}$

  2. Fit a linear model: Run linear regression with the new features as inputs

The key insight: Create new features so that the non-linear relationship becomes linear in the transformed space.

Typically $D' > D$ (we expand to a higher-dimensional space).

Example: Polynomial Feature Mapping

Goal: Learn a polynomial function $y = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M$

The relationship between $y$ and $x$ is not linear, but we can make it linear by defining:

Polynomial feature map: $\psi(x): \mathbb{R} \mapsto \mathbb{R}^{M+1}, \quad \psi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^M \end{bmatrix}$

Now $y = \mathbf{w}^\top \psi(x)$ is linear in $\psi(x)$. The model is linear with respect to the parameters, even though it's non-linear in the original input $x$.
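The transform-then-regress recipe can be sketched in NumPy. This is a minimal illustration, not the lecture's code: `poly_features` is a hypothetical helper name, and the noisy sine data is an invented example.

```python
import numpy as np

def poly_features(x, M):
    """Map 1-D inputs to the polynomial feature space [1, x, x^2, ..., x^M]."""
    return np.vander(x, M + 1, increasing=True)  # shape (N, M+1)

# Illustrative data: noisy samples of a non-linear function of x.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.shape)

# Step 1: transform the inputs. Step 2: ordinary least squares on psi(x).
Phi = poly_features(x, M=3)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w  # y = w^T psi(x): linear in the features, non-linear in x
```

The fitted cubic captures the sine's curvature even though the regression itself is linear in $\psi(x)$.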

Polynomial Degree and Model Complexity

| Degree $M$ | Feature Map | Hypothesis | Behavior |
|---|---|---|---|
| $M = 1$ | $\psi(x) = [1, x]^\top$ | $y = w_0 + w_1 x$ | Simple line (underfitting) |
| $M = 3$ | $\psi(x) = [1, x, x^2, x^3]^\top$ | $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ | Good fit |
| $M = 9$ | $\psi(x) = [1, x, \ldots, x^9]^\top$ | $y = w_0 + \cdots + w_9 x^9$ | Overfitting |

Model Complexity and Generalization

As the polynomial degree $M$ increases:

| $M$ | Training Error | Test Error | Diagnosis |
|---|---|---|---|
| Low | High | High | Underfitting: model too simple |
| Optimal | Low | Low | Good generalization |
| High | Very low (near 0) | High | Overfitting: model too complex |

Key observations:

  • Training error decreases monotonically as model complexity increases
  • Test error decreases initially, then increases (U-shaped curve)
  • Weight magnitudes grow as $M$ increases (motivates regularization)
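These trends can be checked numerically. Below is a sketch assuming NumPy, using `np.polyfit` (a least-squares polynomial fit) in place of an explicit feature map; the sine target, noise level, and data sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)

# Small training set, larger held-out test set, both with observation noise.
x_train = np.linspace(0.0, 1.0, 10)
t_train = f(x_train) + 0.2 * rng.standard_normal(x_train.shape)
x_test = np.linspace(0.0, 1.0, 100)
t_test = f(x_test) + 0.2 * rng.standard_normal(x_test.shape)

train_err, test_err = {}, {}
for M in (1, 3, 9):
    coeffs = np.polyfit(x_train, t_train, deg=M)  # least-squares fit of degree M
    train_err[M] = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
    test_err[M] = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)

# Training error can only shrink as M grows: every degree-M polynomial is
# also a degree-(M+1) polynomial, so the hypothesis spaces are nested.
```

With 10 training points, the degree-9 fit interpolates them almost exactly, which is precisely the regime where test error blows up.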

Regularization

Why Regularize?

When overfitting occurs:

  • Weights become very large (finely tuned to training data)
  • The function oscillates wildly between data points
  • Small changes in input cause large changes in output

Controlling Model Complexity

Two approaches to prevent overfitting:

  1. Tune $M$ as a hyperparameter using a validation set

    • Decreases the number of parameters
  2. Use regularization to enforce simpler solutions

    • Keep the number of parameters large, but constrain their values

The $L^2$ Regularizer

$R(\mathbf{w}) = \frac{1}{2} \lVert \mathbf{w} \rVert_2^2 = \frac{1}{2} \sum_j w_j^2$

This is the squared Euclidean norm of the weight vector.

Regularized Cost Function

$\mathcal{E}_{\text{reg}}(\mathbf{w}) = \mathcal{E}(\mathbf{w}) + \lambda R(\mathbf{w}) = \mathcal{E}(\mathbf{w}) + \frac{\lambda}{2} \lVert \mathbf{w} \rVert_2^2$

This creates a tradeoff:

  • Smaller $\mathcal{E}$ → better fit to training data
  • Smaller $R$ → smaller weights (simpler model)
  • $\lambda$ controls the tradeoff between the two
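For squared-error regression, this regularized cost has a closed-form minimizer, $\mathbf{w} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top \mathbf{t}$ (ridge regression), up to how $\mathcal{E}$ is scaled. A sketch assuming NumPy; `ridge_fit` is a hypothetical helper and the degree-9 setup is an invented example.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2 via the normal equations."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)
Phi = np.vander(x, 10, increasing=True)  # degree-9 polynomial features

w_loose = ridge_fit(Phi, t, lam=1e-6)  # nearly unregularized: weights blow up
w_tight = ridge_fit(Phi, t, lam=10.0)  # heavy shrinkage: small weights
```

Increasing $\lambda$ always shrinks $\lVert \mathbf{w} \rVert_2$, trading training fit for smaller weights.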

Tuning $\lambda$

| $\lambda$ | Effect on Weights | Effect on Fit | Risk |
|---|---|---|---|
| Too large ($\lambda \to \infty$) | All weights small | Poor fit to training data | Underfitting |
| Too small ($\lambda \to 0$) | Some weights large | Great fit to training data | Overfitting |
| Optimal | Balanced | Good generalization | - |

Intuition:

  • $\lambda$ too large: $\mathcal{E}_{\text{reg}} \approx \lambda R(\mathbf{w})$, so the model tries only to keep weights small
  • $\lambda$ too small: $\mathcal{E}_{\text{reg}} \approx \mathcal{E}(\mathbf{w})$, so the model ignores regularization

Why Penalize Large Weights?

  • A large weight → the prediction is very sensitive to that feature
  • We expect output to depend on a combination of features
  • Large weights often indicate the model is fitting noise

A Modular Approach to Machine Learning

| Component | Purpose |
|---|---|
| Model | Describes relationships between variables |
| Loss/Cost Function | Quantifies how badly a hypothesis fits the data |
| Regularizer | Expresses preferences over different hypotheses |
| Optimization Algorithm | Fits a model that minimizes loss and satisfies the regularizer |

Binary Linear Classification

From Regression to Classification

In classification, the target is discrete rather than continuous.

Classification setup:

  • Dataset: $\{(\mathbf{x}^{(1)}, t^{(1)}), (\mathbf{x}^{(2)}, t^{(2)}), \ldots, (\mathbf{x}^{(N)}, t^{(N)})\}$
  • Each target $t^{(i)}$ is discrete
  • Binary classification: $t^{(i)} \in \{0, 1\}$
    • $t^{(i)} = 1$: positive example
    • $t^{(i)} = 0$: negative example

Linear Model for Binary Classification

Step 1: Compute a linear combination $z = \mathbf{w}^\top \mathbf{x} + b$

Step 2: Apply a threshold to produce the prediction $y = \begin{cases} 1, & \text{if } z \geq r \\ 0, & \text{if } z < r \end{cases}$

where $r$ is the threshold.

Simplifying the Model

Eliminating the threshold $r$:

Since $\mathbf{w}^\top \mathbf{x} + b \geq r \iff \mathbf{w}^\top \mathbf{x} + (b - r) \geq 0$, we can absorb $r$ into the bias:

$z = \mathbf{w}^\top \mathbf{x} + b_0, \quad y = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases}$

Eliminating the bias $b_0$:

Add a dummy feature $x_0 = 1$ and let $b_0$ be its weight:

$z = \mathbf{w}^\top \mathbf{x}, \quad y = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases}$

Decision Boundary

The decision boundary is the set of points where $z = 0$: $\mathbf{w}^\top \mathbf{x} = 0$

This defines a hyperplane that separates the two classes.

Example: Modeling the AND Function

Goal: Learn weights to classify the logical AND function perfectly.

| $x_0$ | $x_1$ | $x_2$ | $t$ |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |

System of inequalities for perfect classification:

The model predicts $y = 1$ if $w_1 x_1 + w_2 x_2 + w_0 \geq 0$:

  • $(0,0)$: $w_0 < 0$ (predict 0)
  • $(0,1)$: $w_2 + w_0 < 0$ (predict 0)
  • $(1,0)$: $w_1 + w_0 < 0$ (predict 0)
  • $(1,1)$: $w_1 + w_2 + w_0 \geq 0$ (predict 1)

One solution: $w_1 = 1, w_2 = 1, w_0 = -1.5$

The decision boundary is $x_1 + x_2 - 1.5 = 0$, or equivalently $x_2 = -x_1 + 1.5$.
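A quick numerical check of this solution, with the dummy feature $x_0 = 1$ absorbed into the weight vector (assuming NumPy; the array layout is an illustrative choice):

```python
import numpy as np

# Weights [w0, w1, w2], with w0 = -1.5 acting on the dummy feature x0 = 1.
w = np.array([-1.5, 1.0, 1.0])

# All four AND inputs as rows [x0, x1, x2], and their targets.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])
t = np.array([0, 0, 0, 1])

z = X @ w                 # z = w^T x for each row: [-1.5, -0.5, -0.5, 0.5]
y = (z >= 0).astype(int)  # threshold at 0: only (1,1) crosses the boundary
```

Only the $(1,1)$ input gives $z \geq 0$, so the classifier reproduces AND exactly.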

Linearly Separable Data

  • If data is linearly separable, we can find weights that classify every point correctly
  • In practice, data is rarely linearly separable
  • A linear model will inevitably make mistakes
  • We need a way to measure errors and adjust the model → this leads to loss functions

Summary

Feature Mapping

| Concept | Description |
|---|---|
| Goal | Model non-linear relationships using linear regression |
| Method | Create new features $\psi(\mathbf{x})$ so the model is linear in the transformed space |
| Tradeoff | Higher-degree features are more expressive but risk overfitting |

Regularization

| Concept | Formula |
|---|---|
| $L^2$ Regularizer | $R(\mathbf{w}) = \frac{1}{2}\lVert \mathbf{w} \rVert_2^2$ |
| Regularized Cost | $\mathcal{E}_{\text{reg}}(\mathbf{w}) = \mathcal{E}(\mathbf{w}) + \frac{\lambda}{2}\lVert \mathbf{w} \rVert_2^2$ |
| Hyperparameter | $\lambda$ controls the fit vs. simplicity tradeoff |

Binary Linear Classification

| Component | Formula |
|---|---|
| Linear Model | $z = \mathbf{w}^\top \mathbf{x}$ |
| Threshold Activation | $y = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases}$ |
| Decision Boundary | Hyperplane $\mathbf{w}^\top \mathbf{x} = 0$ |