3.1: Linear Regression
Linear regression is one of the simplest and most fundamental machine learning algorithms. Despite its simplicity, it serves as an excellent baseline model and a foundation for understanding more complex algorithms.
Regression Problem Setup
In regression, we aim to learn a function $f: \mathbb{R}^D \mapsto \mathbb{R}$ that maps input features to continuous output values.
Dataset: $N$ training examples $\{(\mathbf{x}^{(1)}, t^{(1)}), (\mathbf{x}^{(2)}, t^{(2)}), \ldots, (\mathbf{x}^{(N)}, t^{(N)})\}$
For each example $i$:
$\mathbf{x}^{(i)} \in \mathbb{R}^D$ is the input feature vector ($D$ features)
$t^{(i)} \in \mathbb{R}$ is the scalar target (ground truth)
Goal: Learn $f$ such that $t^{(i)} \approx y^{(i)} = f(\mathbf{x}^{(i)})$ for all training examples.
Examples of Regression Problems
| Problem | Input $\mathbf{x}^{(i)}$ | Target $t^{(i)}$ |
| --- | --- | --- |
| Housing prices | Square footage, location, # bedrooms/bathrooms | House price |
| Weather prediction | Temperature, humidity, wind speed | Amount of rainfall |
| Revenue forecasting | Previous sales data | Company revenue |
The Linear Model
The simplest assumption we can make is that the relationship between inputs and outputs is linear.
Model Definition
$$y^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_D x_D^{(i)} + b$$
Or in compact form:
$$y^{(i)} = \sum_{j=1}^{D} w_j x_j^{(i)} + b$$
Where:
$w_j$ are the weights (model parameters)
$b$ is the bias term (intercept)
Why linear? Linear models are:
Simple to understand and interpret
Computationally efficient to train
Good baseline models
Easy to extend with feature engineering (polynomial features, etc.)
Why a bias term? The bias allows the model to fit data that doesn't pass through the origin. Without it, we force $f(\mathbf{0}) = 0$.
We can express this more compactly using vectors:
$$y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$$
where $\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}$ and $\mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix}$
Absorbing the Bias
To simplify notation, we absorb the bias $b$ into the weight vector by adding a dummy feature $x_0^{(i)} = 1$:
$$\mathbf{w} = \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_D \end{bmatrix}, \quad \mathbf{x}^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix}$$
Now our model becomes simply:
$$y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)}$$
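As a minimal NumPy sketch (the data and variable names are made up for illustration), absorbing the bias amounts to prepending a column of ones to the data matrix:

```python
import numpy as np

# Illustrative data: N = 4 examples, D = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])                         # shape (N, D)

# Absorb the bias: prepend a dummy feature x_0 = 1 to every example,
# so the first entry of the weight vector plays the role of b.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # shape (N, D + 1)
print(X)
```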
Predictions for All Training Examples
Define the design matrix $\mathbf{X}$ (rows are transposed feature vectors):
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}^{(1)\top} \\ \mathbf{x}^{(2)\top} \\ \vdots \\ \mathbf{x}^{(N)\top} \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_D^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}$$
Then predictions for all examples:
$$\mathbf{y} = \mathbf{X}\mathbf{w}$$
where $\mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$
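As a quick sketch (continuing the illustrative matrix from above, with arbitrary weights), all predictions come from a single matrix-vector product:

```python
import numpy as np

# Illustrative design matrix with the bias column already prepended (N = 4, D = 2).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.5],
              [1.0, 3.0, 1.5],
              [1.0, 4.0, 3.0]])
w = np.array([0.5, 1.0, -0.25])    # [b, w_1, w_2], arbitrary values

# y^(i) = w^T x^(i) for every example at once: y = X w.
y = X @ w                          # shape (N,)
print(y)
```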
Loss and Cost Functions
Loss Function
The loss function $L(y^{(i)}, t^{(i)})$ quantifies how bad a prediction is for a single example.
We use the squared error loss:
$$L(y^{(i)}, t^{(i)}) = \frac{1}{2}(y^{(i)} - t^{(i)})^2 = \frac{1}{2}(\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)})^2$$
Why this form?
The square ensures we minimize the magnitude of the residual $(y^{(i)} - t^{(i)})$
The factor of $\frac{1}{2}$ makes derivative calculations convenient (it cancels with the power of 2)
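For example (numbers chosen arbitrarily), a prediction of $y^{(i)} = 3.5$ against a target of $t^{(i)} = 3.0$ gives a loss of $\frac{1}{2}(3.5 - 3.0)^2 = 0.125$.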
Cost Function
The cost function $\mathcal{E}(\mathbf{w})$ is the average loss across all training examples:
$$\mathcal{E}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(y^{(i)}, t^{(i)})$$
Expanding with squared error:
$$\mathcal{E}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} (y^{(i)} - t^{(i)})^2 = \frac{1}{2N} \sum_{i=1}^{N} (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)})^2$$
Note: In practice, “loss” and “cost” are often used interchangeably.
Vectorized Cost Function
Define the target vector: $\mathbf{t} = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ \vdots \\ t^{(N)} \end{bmatrix}$
Then:
$$\mathcal{E}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} (y^{(i)} - t^{(i)})^2 = \frac{1}{2N} (\mathbf{y} - \mathbf{t})^\top (\mathbf{y} - \mathbf{t})$$
$$= \frac{1}{2N} (\mathbf{X}\mathbf{w} - \mathbf{t})^\top (\mathbf{X}\mathbf{w} - \mathbf{t}) = \frac{1}{2N} \|\mathbf{X}\mathbf{w} - \mathbf{t}\|_2^2$$
This is called the Mean Squared Error (MSE) cost function.
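A minimal sketch of this cost in NumPy (function name and numbers are illustrative):

```python
import numpy as np

def mse_cost(X, w, t):
    """Cost E(w) = ||Xw - t||^2 / (2N), computed without explicit loops."""
    residual = X @ w - t
    return (residual @ residual) / (2 * len(t))

# Illustrative numbers: bias column plus one feature, N = 3 examples.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
t = np.array([1.0, 2.0, 2.5])
w = np.array([0.5, 0.8])
print(mse_cost(X, w, t))
```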
Why Vectorize?
Vectorized computations offer several advantages:
Speed: Operations run in parallel on CPU/GPU → reduced computation time
Cleaner code: No explicit loops → easier to read and maintain
Memory efficiency: Better handling of large datasets
Optimized support: Libraries (NumPy, PyTorch, etc.) are built for vectorized operations
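As a rough illustration (sizes are arbitrary and timings depend on hardware), the same cost computed with a Python loop versus a single vectorized expression:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, D = 100_000, 50
X = rng.normal(size=(N, D))
t = rng.normal(size=N)
w = rng.normal(size=D)

# Loop version: accumulate squared residuals one example at a time.
start = time.perf_counter()
total = 0.0
for i in range(N):
    total += (X[i] @ w - t[i]) ** 2
loop_cost = total / (2 * N)
loop_time = time.perf_counter() - start

# Vectorized version: one matrix-vector product and one dot product.
start = time.perf_counter()
r = X @ w - t
vec_cost = (r @ r) / (2 * N)
vec_time = time.perf_counter() - start

print(loop_cost, vec_cost)   # same value up to floating-point error
print(loop_time, vec_time)   # the vectorized version is typically much faster
```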
Computing the Gradient
To find optimal weights, we minimize the cost function:
$$\min_{\mathbf{w}} \mathcal{E}(\mathbf{w})$$
Strategy:
Take derivatives of $\mathcal{E}(\mathbf{w})$ with respect to $\mathbf{w}$
Set derivatives to zero
Solve for $\mathbf{w}$
Gradient of the Loss Function
Since the cost is the average of losses, we first compute $\nabla_{\mathbf{w}} L(\mathbf{w})$.
For a single example:
$$L(y^{(i)}, t^{(i)}) = \frac{1}{2}(y^{(i)} - t^{(i)})^2 = \frac{1}{2}(\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)})^2$$
Using the chain rule:
$$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial y^{(i)}} \cdot \frac{\partial y^{(i)}}{\partial w_j}$$
where:
$\frac{\partial L}{\partial y^{(i)}} = y^{(i)} - t^{(i)}$ (derivative of the squared error)
$\frac{\partial y^{(i)}}{\partial w_j} = x_j^{(i)}$ (derivative of the linear model)
Therefore:
$$\frac{\partial L}{\partial w_j} = (y^{(i)} - t^{(i)}) x_j^{(i)} = (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)}) x_j^{(i)}$$
The full gradient vector:
$$\nabla_{\mathbf{w}} L(\mathbf{w}) = \begin{bmatrix} \frac{\partial L}{\partial w_0} \\ \frac{\partial L}{\partial w_1} \\ \vdots \\ \frac{\partial L}{\partial w_D} \end{bmatrix} = (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)}) \mathbf{x}^{(i)}$$
Gradient of the Cost Function
Since the cost is the average of losses, its gradient is the average of the per-example gradients:
$$\nabla_{\mathbf{w}} \mathcal{E}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\mathbf{w}} L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)}) \mathbf{x}^{(i)}$$
Vectorized Gradient
To vectorize $\sum_{i=1}^{N} (\mathbf{w}^\top \mathbf{x}^{(i)} - t^{(i)}) \mathbf{x}^{(i)}$:
Define the residuals: $\mathbf{r} = \mathbf{X}\mathbf{w} - \mathbf{t}$ (an $N \times 1$ vector)
Then:
$$\sum_{i=1}^{N} r^{(i)} \mathbf{x}^{(i)} = \mathbf{X}^\top \mathbf{r} = \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{t})$$
Therefore, the vectorized gradient:
$$\boxed{\nabla_{\mathbf{w}} \mathcal{E}(\mathbf{w}) = \frac{1}{N} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{t})}$$
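A minimal NumPy sketch (random illustrative data) checking that the vectorized expression matches the per-example sum:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # bias column + D features
t = rng.normal(size=N)
w = rng.normal(size=D + 1)

# Average of per-example gradients: (w^T x^(i) - t^(i)) x^(i).
grad_loop = sum((X[i] @ w - t[i]) * X[i] for i in range(N)) / N

# Vectorized gradient: (1/N) X^T (Xw - t).
grad_vec = X.T @ (X @ w - t) / N

print(np.allclose(grad_loop, grad_vec))   # True
```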
Direct Solution (Normal Equations)
Set the gradient to zero and solve:
$$\nabla_{\mathbf{w}} \mathcal{E}(\mathbf{w}) = \frac{1}{N} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{t}) = \mathbf{0}$$
$$\Rightarrow \mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{t}$$
$$\Rightarrow \mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}$$
This is called the normal equations or closed-form solution.
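A minimal sketch of the direct solution on synthetic data (all numbers here are made up; solving the linear system with `np.linalg.solve` avoids forming the explicit inverse and is numerically more stable):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 200, 3
true_w = np.array([2.0, -1.0, 0.5, 3.0])                    # [b, w_1, w_2, w_3], made up
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # design matrix with bias column
t = X @ true_w + 0.1 * rng.normal(size=N)                   # targets with a little noise

# Normal equations: X^T X w = X^T t.
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
print(w_hat)   # close to true_w
```

If $\mathbf{X}^\top \mathbf{X}$ is singular or ill-conditioned, a least-squares routine such as `np.linalg.lstsq` can be used instead.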
Advantages and Disadvantages
Advantages:
No iteration required, computes optimal weights in one step
Guaranteed to find the global minimum (for convex problems)
Disadvantages:
Computing $(\mathbf{X}^\top \mathbf{X})^{-1}$ is expensive: $O(D^3)$ complexity
Requires $\mathbf{X}^\top \mathbf{X}$ to be invertible
Does not generalize to other models or loss functions
Impractical when $D$ (number of features) is very large
Linear Regression Properties
Advantages
Interpretable: Weights directly show feature importance
Efficient: Fast to train on moderate-sized datasets
Good baseline: Establishes performance floor for more complex models
Limitations
Assumes linearity: Cannot capture nonlinear relationships without feature engineering
Sensitive to outliers: Squared error heavily penalizes large residuals
Continuous features: Requires numerical inputs (categorical features need encoding)
Multicollinearity: Performance degrades when features are highly correlated
High bias: May underfit complex data (simple model)
Summary
| Component | Formula |
| --- | --- |
| Linear Model | $y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)}$, $\mathbf{y} = \mathbf{X}\mathbf{w}$ |
| Loss Function | $L(y^{(i)}, t^{(i)}) = \frac{1}{2}(y^{(i)} - t^{(i)})^2$ |
| Cost Function (MSE) | $\mathcal{E}(\mathbf{w}) = \frac{1}{2N} \lVert \mathbf{X}\mathbf{w} - \mathbf{t} \rVert_2^2$ |
| Gradient | $\nabla_{\mathbf{w}} \mathcal{E}(\mathbf{w}) = \frac{1}{N} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{t})$ |
| Direct Solution | $\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}$ |
Direct Solution:
No iteration required
Computationally expensive ($O(D^3)$)
Does not generalize to other models
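Putting the pieces together, a short sketch (synthetic data, illustrative names) verifying that the gradient vanishes at the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])    # design matrix with bias column
t = X @ np.array([1.0, 2.0, -0.5]) + 0.05 * rng.normal(size=N)

# Direct solution of the normal equations.
w_star = np.linalg.solve(X.T @ X, X.T @ t)

# At the optimum, the gradient (1/N) X^T (Xw - t) should be numerically zero.
grad = X.T @ (X @ w_star - t) / N
cost = np.sum((X @ w_star - t) ** 2) / (2 * N)
print(np.allclose(grad, 0.0, atol=1e-8), cost)
```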