
4: Feature Mapping, Regularization, and Linear Classification

This lecture covers how to extend linear models to capture non-linear relationships through feature mapping, how to prevent overfitting using regularization, and how to adapt linear models for classification tasks.


Feature Mapping (Basis Expansion)

What if the relationship between inputs and outputs is not linear? We can still use linear regression by transforming the inputs!

The Key Idea

  1. Transform the inputs: Map the original features $\mathbf{x} \in \mathbb{R}^D$ to a new feature space $\mathbb{R}^{D'}$ using a feature map $\psi: \mathbb{R}^D \rightarrow \mathbb{R}^{D'}$

  2. Fit a linear model: Run linear regression with the new features as inputs

The key insight: Create new features so that the non-linear relationship becomes linear in the transformed space.

Typically $D' > D$ (we expand to a higher-dimensional space).

Example: Polynomial Feature Mapping

Goal: Learn a polynomial function $y = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M$

The relationship between $y$ and $x$ is not linear, but we can make it linear by defining:

Polynomial feature map: $$\psi: \mathbb{R} \rightarrow \mathbb{R}^{M+1}, \quad \psi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^M \end{bmatrix}$$

Now $y = \mathbf{w}^\top \psi(x)$ is linear in $\psi(x)$. The model is linear with respect to the parameters $\mathbf{w}$, even though it is non-linear in the original input $x$.
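To make this concrete, here is a minimal NumPy sketch of the polynomial feature map followed by a linear least-squares fit in the expanded space. The helper names (`poly_features`, `fit_linear`) and the synthetic data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def poly_features(x, M):
    """Map each scalar input to [1, x, x^2, ..., x^M] (rows of the design matrix)."""
    return np.vander(x, M + 1, increasing=True)

def fit_linear(Psi, t):
    """Ordinary least-squares fit on the transformed features."""
    w, *_ = np.linalg.lstsq(Psi, t, rcond=None)
    return w

# Synthetic 1-D data with a non-linear input-output relationship (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

Psi = poly_features(x, M=3)   # basis expansion: psi(x) = [1, x, x^2, x^3]
w = fit_linear(Psi, t)        # linear regression in the transformed space
y_pred = Psi @ w              # predictions are linear in psi(x), non-linear in x
```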

Polynomial Degree and Model Complexity

| Degree $M$ | Feature Map | Hypothesis | Behavior |
|---|---|---|---|
| $M = 1$ | $\psi(x) = [1, x]^\top$ | $y = w_0 + w_1 x$ | Simple line (underfitting) |
| $M = 3$ | $\psi(x) = [1, x, x^2, x^3]^\top$ | $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ | Good fit |
| $M = 9$ | $\psi(x) = [1, x, \ldots, x^9]^\top$ | $y = w_0 + \cdots + w_9 x^9$ | Overfitting |

Model Complexity and Generalization

As the polynomial degree $M$ increases:

| $M$ | Training Error | Test Error | Diagnosis |
|---|---|---|---|
| Low | High | High | Underfitting - model too simple |
| Optimal | Low | Low | Good generalization |
| High | Very low (near 0) | High | Overfitting - model too complex |

Key observations:

  • Training error decreases monotonically as model complexity increases
  • Test error decreases initially, then increases (U-shaped curve)
  • Weight magnitudes grow as $M$ increases (motivates regularization; see the sketch below)
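The sketch below reproduces this qualitative behavior on synthetic data: training error keeps dropping as $M$ grows, test error eventually rises, and the largest weight magnitude blows up. The data and the exact numbers are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples from a non-linear function (illustrative only)."""
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

x_train, t_train = make_data(15)
x_test, t_test = make_data(200)

for M in (1, 3, 9):
    Psi_train = np.vander(x_train, M + 1, increasing=True)
    Psi_test = np.vander(x_test, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Psi_train, t_train, rcond=None)
    train_mse = np.mean((Psi_train @ w - t_train) ** 2)
    test_mse = np.mean((Psi_test @ w - t_test) ** 2)
    print(f"M={M}: train MSE={train_mse:.4f}  test MSE={test_mse:.4f}  "
          f"max |w|={np.abs(w).max():.1f}")
```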

Regularization

Why Regularize?

When overfitting occurs:

  • Weights become very large (finely tuned to training data)
  • The function oscillates wildly between data points
  • Small changes in input cause large changes in output

Controlling Model Complexity

Two approaches to prevent overfitting:

  1. Tune $M$ as a hyperparameter using a validation set

    • Decreases the number of parameters
  2. Use regularization to enforce simpler solutions

    • Keep the number of parameters large, but constrain their values

The $L^2$ Regularizer

$$R(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|_2^2 = \frac{1}{2} \sum_j w_j^2$$

This is the squared Euclidean norm of the weight vector.

Regularized Cost Function

$$\mathcal{E}_{\text{reg}}(\mathbf{w}) = \mathcal{E}(\mathbf{w}) + \lambda R(\mathbf{w}) = \mathcal{E}(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$$

This creates a tradeoff:

  • Smaller $\mathcal{E}$ → better fit to the training data
  • Smaller $R$ → smaller weights (simpler model)
  • $\lambda$ controls the tradeoff between the two (see the sketch below)
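If we assume the unregularized cost $\mathcal{E}$ is the usual sum-of-squared-errors, setting the gradient of $\mathcal{E}_{\text{reg}}$ to zero gives the closed-form solution $\mathbf{w} = (\boldsymbol{\Psi}^\top \boldsymbol{\Psi} + \lambda \mathbf{I})^{-1} \boldsymbol{\Psi}^\top \mathbf{t}$ (with $\lambda$ rescaled if the cost is averaged over $N$). The sketch below works under that assumption; it also penalizes the bias weight, which is a simplifying choice rather than a requirement.

```python
import numpy as np

def fit_ridge(Psi, t, lam):
    """Minimize 0.5*||Psi @ w - t||^2 + 0.5*lam*||w||^2 in closed form."""
    D = Psi.shape[1]
    # Zero gradient gives the normal equations: (Psi^T Psi + lam*I) w = Psi^T t.
    return np.linalg.solve(Psi.T @ Psi + lam * np.eye(D), Psi.T @ t)

# Larger lambda shrinks the weights (illustrative synthetic data).
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=15)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(15)
Psi = np.vander(x, 10, increasing=True)   # degree-9 polynomial features

for lam in (1e-6, 1e-2, 1.0):
    w = fit_ridge(Psi, t, lam)
    print(f"lambda={lam:g}: max |w| = {np.abs(w).max():.2f}")
```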

Tuning $\lambda$

| $\lambda$ | Effect on Weights | Effect on Fit | Risk |
|---|---|---|---|
| Too large ($\lambda \to \infty$) | All weights small | Poor fit to training data | Underfitting |
| Too small ($\lambda \to 0$) | Some weights large | Great fit to training data | Overfitting |
| Optimal | Balanced | Good generalization | - |

Intuition:

  • $\lambda$ too large: $\mathcal{E}_{\text{reg}} \approx \lambda R(\mathbf{w})$, so the model only tries to keep the weights small
  • $\lambda$ too small: $\mathcal{E}_{\text{reg}} \approx \mathcal{E}(\mathbf{w})$, so the model effectively ignores the regularizer; in practice $\lambda$ is tuned on a validation set, as sketched below
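Choosing $\lambda$ works the same way as choosing $M$: evaluate candidate values on held-out data and keep the one with the lowest validation error. A minimal sketch of that grid search, again using the closed-form regularized fit and illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

x_tr, t_tr = make_data(20)
x_val, t_val = make_data(100)
Psi_tr = np.vander(x_tr, 10, increasing=True)    # degree-9 features
Psi_val = np.vander(x_val, 10, increasing=True)

best_lam, best_err = None, np.inf
for lam in [10.0 ** k for k in range(-8, 3)]:     # candidate regularization strengths
    w = np.linalg.solve(Psi_tr.T @ Psi_tr + lam * np.eye(10), Psi_tr.T @ t_tr)
    val_err = np.mean((Psi_val @ w - t_val) ** 2)  # validation error, not training error
    if val_err < best_err:
        best_lam, best_err = lam, val_err

print(f"chosen lambda = {best_lam:g}  (validation MSE = {best_err:.4f})")
```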

Why Penalize Large Weights?

  • A large weight → prediction is very sensitive to that feature
  • We expect output to depend on a combination of features
  • Large weights often indicate the model is fitting noise

A Modular Approach to Machine Learning

| Component | Purpose |
|---|---|
| Model | Describes relationships between variables |
| Loss/Cost Function | Quantifies how badly a hypothesis fits the data |
| Regularizer | Expresses preferences over different hypotheses |
| Optimization Algorithm | Fits the model by minimizing the loss while respecting the regularizer |

Binary Linear Classification

From Regression to Classification

In classification, the target is discrete rather than continuous.

Classification setup:

  • Dataset: $\{(\mathbf{x}^{(1)}, t^{(1)}), (\mathbf{x}^{(2)}, t^{(2)}), \ldots, (\mathbf{x}^{(N)}, t^{(N)})\}$
  • Each target $t^{(i)}$ is discrete
  • Binary classification: $t^{(i)} \in \{0, 1\}$
    • $t^{(i)} = 1$: positive example
    • $t^{(i)} = 0$: negative example

Linear Model for Binary Classification

Step 1: Compute a linear combination $z = \mathbf{w}^\top \mathbf{x} + b$

Step 2: Apply a threshold to produce the prediction $$y = \begin{cases} 1, & \text{if } z \geq r \\ 0, & \text{if } z < r \end{cases}$$

where $r$ is the threshold.

Simplifying the Model

Eliminating the threshold $r$:

Since $\mathbf{w}^\top \mathbf{x} + b \geq r \iff \mathbf{w}^\top \mathbf{x} + (b - r) \geq 0$, we can absorb $r$ into the bias by defining $b_0 = b - r$:

$$z = \mathbf{w}^\top \mathbf{x} + b_0, \quad y = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases}$$

Eliminating the bias $b_0$:

Add a dummy feature $x_0 = 1$ and let $b_0$ be its weight:

$$z = \mathbf{w}^\top \mathbf{x}, \quad y = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases}$$
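A minimal sketch of this final form of the model: prepend the dummy feature, take the dot product with the weights, and threshold at zero. The helper names (`add_dummy`, `predict`) are illustrative assumptions.

```python
import numpy as np

def add_dummy(X_raw):
    """Prepend the constant feature x_0 = 1 to every input vector."""
    return np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

def predict(w, X):
    """Binary linear classifier: y = 1 if w^T x >= 0, else 0.

    Assumes X already contains the dummy feature, so the bias b_0 is
    simply the first entry of w.
    """
    z = X @ w                      # linear combination for each input
    return (z >= 0).astype(int)    # threshold at zero
```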

Decision Boundary

The decision boundary is the set of points where $z = 0$, i.e. $\mathbf{w}^\top \mathbf{x} = 0$.

This defines a hyperplane that separates the two classes.

Example: Modeling the AND Function

Goal: Learn weights to classify the logical AND function perfectly.

| $x_0$ | $x_1$ | $x_2$ | $t$ |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |

System of inequalities for perfect classification:

The model predicts $y = 1$ if $w_1 x_1 + w_2 x_2 + w_0 \geq 0$:

  • $(0,0)$: $w_0 < 0$ (predict 0)
  • $(0,1)$: $w_2 + w_0 < 0$ (predict 0)
  • $(1,0)$: $w_1 + w_0 < 0$ (predict 0)
  • $(1,1)$: $w_1 + w_2 + w_0 \geq 0$ (predict 1)

One solution: $w_1 = 1, w_2 = 1, w_0 = -1.5$

The decision boundary is $x_1 + x_2 - 1.5 = 0$, or equivalently $x_2 = -x_1 + 1.5$.
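A quick check (sketch) that these weights reproduce the AND truth table, using the dummy-feature form of the classifier:

```python
import numpy as np

# AND truth table: inputs (x_1, x_2) and targets t.
X_raw = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])

# Weights from the example, with w_0 = -1.5 acting on the dummy feature x_0 = 1.
w = np.array([-1.5, 1.0, 1.0])
X = np.hstack([np.ones((4, 1)), X_raw])   # add the dummy feature

z = X @ w                    # [-1.5, -0.5, -0.5, 0.5]
y = (z >= 0).astype(int)     # [0, 0, 0, 1]
print(np.array_equal(y, t))  # True: the AND function is classified perfectly
```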

Linearly Separable Data

  • If data is linearly separable, we can find weights that classify every point correctly
  • In practice, data is rarely linearly separable
  • A linear model will inevitably make mistakes
  • We need a way to measure errors and adjust the model → leads to loss functions

Summary

Feature Mapping

| Concept | Description |
|---|---|
| Goal | Model non-linear relationships using linear regression |
| Method | Create new features $\psi(\mathbf{x})$ so the model is linear in the transformed space |
| Tradeoff | Higher-degree features → more expressive but risk overfitting |

Regularization

| Concept | Formula |
|---|---|
| $L^2$ Regularizer | $R(\mathbf{w}) = \frac{1}{2}\Vert\mathbf{w}\Vert_2^2$ |
| Regularized Cost | $\mathcal{E}_{\text{reg}}(\mathbf{w}) = \mathcal{E}(\mathbf{w}) + \frac{\lambda}{2}\Vert\mathbf{w}\Vert_2^2$ |
| Hyperparameter | $\lambda$ controls the fit vs. simplicity tradeoff |

Binary Linear Classification

| Component | Formula |
|---|---|
| Linear Model | $z = \mathbf{w}^\top \mathbf{x}$ |
| Threshold Activation | $y = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases}$ |
| Decision Boundary | Hyperplane $\mathbf{w}^\top \mathbf{x} = 0$ |