
3.1: Linear Regression

Linear regression is one of the simplest and most fundamental machine learning algorithms. Despite its simplicity, it serves as an excellent baseline model and foundation for understanding more complex algorithms.


Regression Problem Setup

In regression, we aim to learn a function $f: \mathbb{R}^D \mapsto \mathbb{R}$ that maps input features to continuous output values.

Dataset: $N$ training examples $\{(\mathbf{x}^{(1)}, t^{(1)}), (\mathbf{x}^{(2)}, t^{(2)}), \ldots, (\mathbf{x}^{(N)}, t^{(N)})\}$

For each example $i$:

  • $\mathbf{x}^{(i)} \in \mathbb{R}^D$ is the input feature vector ($D$ features)
  • $t^{(i)} \in \mathbb{R}$ is the scalar target (ground truth)

Goal: Learn $f$ such that $t^{(i)} \approx y^{(i)} = f(\mathbf{x}^{(i)})$ for all training examples.

Examples of Regression Problems

| Problem | Input $\mathbf{x}^{(i)}$ | Target $t^{(i)}$ |
| --- | --- | --- |
| Housing prices | Square footage, location, # bedrooms/bathrooms | House price |
| Weather prediction | Temperature, humidity, wind speed | Amount of rainfall |
| Revenue forecasting | Previous sales data | Company revenue |
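
To make the setup concrete, a dataset like the housing example can be stored as an $N \times D$ array of features plus a length-$N$ vector of targets. The following is a minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical housing dataset: each row is one example x^(i),
# with D = 3 numeric features (square footage, # bedrooms, # bathrooms).
X = np.array([
    [1400.0, 3, 2],
    [2100.0, 4, 3],
    [ 900.0, 2, 1],
])                                   # shape (N, D) = (3, 3)

# Corresponding scalar targets t^(i): house prices (in $1000s).
t = np.array([310.0, 480.0, 195.0])  # shape (N,)

print(X.shape, t.shape)              # (3, 3) (3,)
```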

The Linear Model

The simplest assumption we can make is that the relationship between inputs and outputs is linear.

Model Definition

$$y^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_D x_D^{(i)} + b$$

Or in compact form:

$$y^{(i)} = \sum_{j=1}^{D} w_j x_j^{(i)} + b$$

Where:

  • $w_j$ are the weights (model parameters)
  • $b$ is the bias term (intercept)

Why linear? Linear models are:

  • Simple to understand and interpret
  • Computationally efficient to train
  • Good baseline models
  • Easily extended with feature engineering (polynomial features, etc.)

Why a bias term? The bias allows the model to fit data that doesn't pass through the origin. Without it, we force $f(\mathbf{0}) = 0$.
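
Written out as a loop over features, a single prediction looks like the following sketch (the feature values, weights, and bias are arbitrary placeholders):

```python
import numpy as np

def predict_one(x, w, b):
    """Compute y = w_1*x_1 + ... + w_D*x_D + b for a single example."""
    y = b
    for j in range(len(w)):
        y += w[j] * x[j]
    return y

x = np.array([2.0, -1.0, 0.5])   # one example with D = 3 features
w = np.array([0.3,  1.2, -0.7])  # weights
b = 4.0                          # bias

print(predict_one(x, w, b))      # same as w @ x + b
```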

Vectorized Form

We can express this more compactly using vectors:

$$y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$$

where $\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}$ and $\mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix}$

Absorbing the Bias

To simplify notation, we absorb the bias $b$ into the weight vector by adding a dummy feature $x_0^{(i)} = 1$:

$$\mathbf{w} = \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_D \end{bmatrix}, \quad \mathbf{x}^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix}$$

Now our model becomes simply:

$$y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)}$$
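
The trick is easy to check numerically. A small sketch (with made-up values) showing that the augmented dot product reproduces $\mathbf{w}^\top \mathbf{x} + b$:

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])   # original features
w = np.array([0.3,  1.2, -0.7])  # original weights
b = 4.0                          # bias

# Absorb the bias: prepend a dummy feature of 1 and prepend b to the weights.
x_aug = np.concatenate(([1.0], x))   # [1, x_1, ..., x_D]
w_aug = np.concatenate(([b], w))     # [b, w_1, ..., w_D]

print(w @ x + b)        # original model
print(w_aug @ x_aug)    # same value from the augmented form
```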

Predictions for All Training Examples

Define the design matrix $\mathbf{X}$ (rows are transposed feature vectors):

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}^{(1)\top} \\ \mathbf{x}^{(2)\top} \\ \vdots \\ \mathbf{x}^{(N)\top} \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_D^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}$$

Then predictions for all examples:

$$\mathbf{y} = \mathbf{X}\mathbf{w}$$

where $\mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$
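
A minimal NumPy sketch of batch prediction with the design matrix, assuming the bias is absorbed via a leading column of ones (the data and weights are invented for illustration):

```python
import numpy as np

# Raw features: N = 4 examples, D = 2 features each.
X_raw = np.array([
    [1.0, 2.0],
    [3.0, 0.5],
    [0.0, 1.0],
    [2.0, 2.0],
])

# Design matrix: prepend a column of ones so the bias is absorbed into w.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # shape (N, D+1)

w = np.array([4.0, 0.3, 1.2])   # [b, w_1, w_2]

# Predictions for all examples at once: y = Xw.
y = X @ w
print(y)                        # shape (N,)
```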


Loss and Cost Functions

Loss Function

The loss function $L(y^{(i)}, t^{(i)})$ quantifies how bad a prediction is for a single example.

We use squared error loss:

$$L(y^{(i)}, t^{(i)}) = \frac{1}{2}(y^{(i)} - t^{(i)})^2 = \frac{1}{2}(\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)})^2$$

Why this form?

  • The square penalizes the magnitude of the residual $(y^{(i)} - t^{(i)})$ regardless of its sign
  • The factor of $\frac{1}{2}$ makes derivative calculations convenient (it cancels with the power of 2)

Cost Function

The cost function $\mathcal{E}(\mathbf{w})$ is the average loss across all training examples:

$$\mathcal{E}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(y^{(i)}, t^{(i)})$$

Expanding with squared error:

$$\mathcal{E}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} (y^{(i)} - t^{(i)})^2 = \frac{1}{2N} \sum_{i=1}^{N} (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)})^2$$

Note: In practice, “loss” and “cost” are often used interchangeably.

Vectorized Cost Function

Define the target vector $\mathbf{t} = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ \vdots \\ t^{(N)} \end{bmatrix}$.

Then:

$$\mathcal{E}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} (y^{(i)} - t^{(i)})^2 = \frac{1}{2N} (\mathbf{y} - \mathbf{t})^\top (\mathbf{y} - \mathbf{t}) = \frac{1}{2N} (\mathbf{X}\mathbf{w} - \mathbf{t})^\top (\mathbf{X}\mathbf{w} - \mathbf{t}) = \frac{1}{2N} \|\mathbf{X}\mathbf{w} - \mathbf{t}\|_2^2$$

This is called the Mean Squared Error (MSE) cost function.
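
As a sketch, the vectorized cost is essentially a one-liner in NumPy (the data and weights below are arbitrary placeholders):

```python
import numpy as np

def mse_cost(X, w, t):
    """Vectorized cost E(w) = 1/(2N) * ||Xw - t||^2."""
    N = X.shape[0]
    r = X @ w - t                 # residual vector y - t
    return (r @ r) / (2 * N)

# Toy data: design matrix with a bias column already included.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]])
t = np.array([1.0, 3.0, -0.5])
w = np.array([0.2, 1.1])

print(mse_cost(X, w, t))
```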


Why Vectorize?

Vectorized computations offer several advantages:

  1. Speed: Operations run in parallel on CPU/GPU → reduced computation time
  2. Cleaner code: No explicit loops → easier to read and maintain
  3. Memory efficiency: Better handling of large datasets
  4. Optimized support: Libraries (NumPy, PyTorch, etc.) are built for vectorized operations
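
The speed claim is easy to verify; here is a rough sketch comparing a Python loop with the vectorized computation on random data (the sizes are arbitrary and exact timings depend on your machine):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, D = 100_000, 50
X = rng.normal(size=(N, D))
w = rng.normal(size=D)
t = rng.normal(size=N)

# Loop version: one example at a time.
start = time.perf_counter()
cost_loop = sum((X[i] @ w - t[i]) ** 2 for i in range(N)) / (2 * N)
loop_time = time.perf_counter() - start

# Vectorized version: one matrix-vector product.
start = time.perf_counter()
r = X @ w - t
cost_vec = (r @ r) / (2 * N)
vec_time = time.perf_counter() - start

print(np.isclose(cost_loop, cost_vec))             # same result
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```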

Computing the Gradient

To find optimal weights, we minimize the cost function:

$$\min_{\mathbf{w}} \mathcal{E}(\mathbf{w})$$

Strategy:

  1. Take derivatives of $\mathcal{E}(\mathbf{w})$ with respect to $\mathbf{w}$
  2. Set the derivatives to zero
  3. Solve for $\mathbf{w}$

Gradient of the Loss Function

Since the cost is the average of the losses, we first compute $\nabla_{\mathbf{w}} L(\mathbf{w})$.

For a single example:

$$L(y^{(i)}, t^{(i)}) = \frac{1}{2}(y^{(i)} - t^{(i)})^2 = \frac{1}{2}(\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)})^2$$

Using the chain rule:

$$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial y^{(i)}} \cdot \frac{\partial y^{(i)}}{\partial w_j}$$

where:

  • $\frac{\partial L}{\partial y^{(i)}} = y^{(i)} - t^{(i)}$ (derivative of the squared error)
  • $\frac{\partial y^{(i)}}{\partial w_j} = x_j^{(i)}$ (derivative of the linear model)

Therefore:

$$\frac{\partial L}{\partial w_j} = (y^{(i)} - t^{(i)}) x_j^{(i)} = (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)}) x_j^{(i)}$$

The full gradient vector:

$$\nabla_{\mathbf{w}} L(\mathbf{w}) = \begin{bmatrix} \frac{\partial L}{\partial w_0} \\ \frac{\partial L}{\partial w_1} \\ \vdots \\ \frac{\partial L}{\partial w_D} \end{bmatrix} = (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)}) \mathbf{x}^{(i)}$$

Gradient of the Cost Function

The cost is the average of losses:

$$\nabla_{\mathbf{w}} \mathcal{E}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\mathbf{w}} L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)}) \mathbf{x}^{(i)}$$

Vectorized Gradient

To vectorize $\sum_{i=1}^{N} (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)}) \mathbf{x}^{(i)}$:

Define the residuals: $\mathbf{r} = \mathbf{X}\mathbf{w} - \mathbf{t}$ (an $N \times 1$ vector)

Then:

$$\sum_{i=1}^{N} r^{(i)} \mathbf{x}^{(i)} = \mathbf{X}^\top \mathbf{r} = \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{t})$$

Therefore, the vectorized gradient:

$$\boxed{\nabla_{\mathbf{w}} \mathcal{E}(\mathbf{w}) = \frac{1}{N} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{t})}$$
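
One way to sanity-check this formula is to compare the analytic gradient with a numerical finite-difference estimate of the cost; a minimal sketch on random data (sizes and seed are arbitrary):

```python
import numpy as np

def cost(X, w, t):
    r = X @ w - t
    return (r @ r) / (2 * X.shape[0])

def grad(X, w, t):
    """Analytic gradient: (1/N) X^T (Xw - t)."""
    return X.T @ (X @ w - t) / X.shape[0]

rng = np.random.default_rng(0)
N, D = 20, 4
X = rng.normal(size=(N, D))
t = rng.normal(size=N)
w = rng.normal(size=D)

# Central finite differences, one coordinate at a time.
eps = 1e-6
num_grad = np.zeros(D)
for j in range(D):
    e = np.zeros(D)
    e[j] = eps
    num_grad[j] = (cost(X, w + e, t) - cost(X, w - e, t)) / (2 * eps)

print(np.allclose(grad(X, w, t), num_grad, atol=1e-6))   # True
```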

Direct Solution (Normal Equations)

Set the gradient to zero and solve:

$$\nabla_{\mathbf{w}} \mathcal{E}(\mathbf{w}) = \frac{1}{N} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{t}) = \mathbf{0}$$
$$\Rightarrow \mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{t}$$
$$\Rightarrow \mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}$$

This is called the normal equations or closed-form solution.
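
In code it is usually better to solve the linear system than to form the inverse explicitly; the sketch below uses synthetic data, and np.linalg.lstsq is shown as a more numerically robust alternative when $\mathbf{X}^\top \mathbf{X}$ is ill-conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3

# Synthetic data: design matrix with a bias column, targets from a known w plus noise.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
w_true = np.array([2.0, -1.0, 0.5, 3.0])
t = X @ w_true + 0.1 * rng.normal(size=N)

# Normal equations: solve (X^T X) w = X^T t rather than inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ t)

# Equivalent, more robust least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(w_hat)                         # close to w_true
print(np.allclose(w_hat, w_lstsq))   # True
```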

Advantages and Disadvantages

Advantages:

  • No iteration required; the optimal weights are computed in one step
  • Guaranteed to find the global minimum (for convex problems)

Disadvantages:

  • Computing $(\mathbf{X}^\top \mathbf{X})^{-1}$ is expensive: $O(D^3)$ complexity
  • Requires $\mathbf{X}^\top \mathbf{X}$ to be invertible
  • Does not generalize to other models or loss functions
  • Impractical when $D$ (the number of features) is very large

Linear Regression Properties

Advantages

  • Interpretable: Weights directly show feature importance
  • Efficient: Fast to train on moderate-sized datasets
  • Good baseline: Establishes a performance floor for more complex models

Limitations

  • Assumes linearity: Cannot capture nonlinear relationships without feature engineering (see the sketch after this list)
  • Sensitive to outliers: Squared error heavily penalizes large residuals
  • Continuous features: Requires numerical inputs (categorical features need encoding)
  • Multicollinearity: Performance degrades when features are highly correlated
  • High bias: May underfit complex data (simple model)
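
To illustrate the feature-engineering point from the first limitation, the sketch below adds a squared feature to the design matrix so a linear-in-the-weights model can fit a quadratic relationship; the data is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data with a quadratic relationship plus noise.
x = rng.uniform(-3, 3, size=200)
t = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.normal(size=200)

# Plain linear features: [1, x].
X_lin = np.column_stack([np.ones_like(x), x])
# Polynomial features: [1, x, x^2] -- still linear in the weights.
X_poly = np.column_stack([np.ones_like(x), x, x**2])

def fit_and_cost(X, t):
    """Fit by least squares and return the weights and MSE-style cost."""
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    r = X @ w - t
    return w, (r @ r) / (2 * len(t))

w_lin, cost_lin = fit_and_cost(X_lin, t)
w_poly, cost_poly = fit_and_cost(X_poly, t)
print(cost_lin, cost_poly)   # the polynomial model fits far better
```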

Summary

| Component | Formula |
| --- | --- |
| Linear Model | $y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)}$, $\quad \mathbf{y} = \mathbf{X}\mathbf{w}$ |
| Loss Function | $L(y^{(i)}, t^{(i)}) = \frac{1}{2}(y^{(i)} - t^{(i)})^2$ |
| Cost Function (MSE) | $\mathcal{E}(\mathbf{w}) = \frac{1}{2N} \lVert \mathbf{X}\mathbf{w} - \mathbf{t} \rVert_2^2$ |
| Gradient | $\nabla_{\mathbf{w}} \mathcal{E}(\mathbf{w}) = \frac{1}{N} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{t})$ |
| Direct Solution | $\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}$ |

Direct Solution:

  • No iteration required
  • Computationally expensive ($O(D^3)$)
  • Does not generalize to other models