Linear regression is one of the simplest and most fundamental machine learning algorithms. Despite its simplicity, it serves as an excellent baseline model and foundation for understanding more complex algorithms.
Regression Problem Setup
In regression, we aim to learn a function $f: \mathbb{R}^D \to \mathbb{R}$ that maps input features to continuous output values.
Dataset: $N$ training examples $\{(\mathbf{x}^{(1)}, t^{(1)}), (\mathbf{x}^{(2)}, t^{(2)}), \ldots, (\mathbf{x}^{(N)}, t^{(N)})\}$
For each example i:
$\mathbf{x}^{(i)} \in \mathbb{R}^D$ is the input feature vector ($D$ features)
$t^{(i)} \in \mathbb{R}$ is the scalar target (ground truth)
Goal: Learn $f$ such that $t^{(i)} \approx y^{(i)} = f(\mathbf{x}^{(i)})$ for all training examples.
Examples of Regression Problems
| Problem | Input $\mathbf{x}^{(i)}$ | Target $t^{(i)}$ |
|---|---|---|
| Housing prices | Square footage, location, # bedrooms/bathrooms | House price |
| Weather prediction | Temperature, humidity, wind speed | Amount of rainfall |
| Revenue forecasting | Previous sales data | Company revenue |
The Linear Model
The simplest assumption we can make is that the relationship between inputs and outputs is linear.
Model Definition
$$y^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_D x_D^{(i)} + b$$
Or in compact form:
$$y^{(i)} = \sum_{j=1}^{D} w_j x_j^{(i)} + b$$
Where:
$w_j$ are the weights (model parameters)
$b$ is the bias term (intercept)
Why linear? Linear models are:
Simple to understand and interpret
Computationally efficient to train
Good baselines for more complex models
Easy to extend with feature engineering (polynomial features, etc.)
Why a bias term? The bias allows the model to fit data that doesn’t pass through the origin. Without it, we force f(0)=0.
Vectorized Form
We can express this more compactly using vectors:
$$y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$$
where $\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}$ and $\mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix}$
Absorbing the Bias
To simplify notation, we absorb the bias $b$ into the weight vector by adding a dummy feature $x_0^{(i)} = 1$:
$$\mathbf{w} = \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_D \end{bmatrix}, \qquad \mathbf{x}^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix}$$
Now our model becomes simply:
$$y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)}$$
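A minimal NumPy sketch of the bias-absorption trick (the feature values and weights below are made up for illustration): prepending a dummy feature equal to 1 and folding $b$ into the weight vector gives exactly the same prediction as the explicit $\mathbf{w}^\top\mathbf{x} + b$ form.

```python
import numpy as np

# Toy example with D = 3 features (made-up values for illustration).
x = np.array([2.0, -1.0, 0.5])   # one input vector x^(i)
w = np.array([0.3, 1.2, -0.7])   # weights w_1, ..., w_D
b = 4.0                          # bias

# Explicit form: y = w^T x + b
y_explicit = w @ x + b

# Absorbed-bias form: prepend the dummy feature x_0 = 1 and fold b into the weights
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))
y_absorbed = w_aug @ x_aug

assert np.isclose(y_explicit, y_absorbed)  # identical predictions
```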
Predictions for All Training Examples
Define the design matrix $\mathbf{X}$ whose rows are the transposed feature vectors $(\mathbf{x}^{(i)})^\top$, and let $\mathbf{t}$ be the vector of targets. All predictions can then be written at once as $\mathbf{y} = \mathbf{X}\mathbf{w}$, and the MSE cost becomes

$$E(\mathbf{w}) = \frac{1}{2N}\lVert\mathbf{X}\mathbf{w} - \mathbf{t}\rVert_2^2$$

Setting the gradient $\nabla_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{N}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{t})$ to zero and solving for $\mathbf{w}$ gives

$$\mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{t}$$

This is called the normal equations or closed-form solution.
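A short sketch of the closed-form solution, assuming NumPy and a design matrix that already includes the dummy bias column (all names and values below are illustrative). Solving the linear system directly avoids forming $(\mathbf{X}^\top\mathbf{X})^{-1}$ explicitly.

```python
import numpy as np

def fit_linear_regression(X, t):
    """Closed-form least squares: solve the normal equations X^T X w = X^T t.

    X is assumed to already include the dummy bias column of ones.
    Solving the linear system is cheaper and more stable than forming
    the explicit inverse (X^T X)^{-1}.
    """
    return np.linalg.solve(X.T @ X, X.T @ t)

# Synthetic sanity check with made-up weights and noiseless targets.
rng = np.random.default_rng(0)
N, D = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # bias column + D features
w_true = np.array([4.0, 0.3, 1.2, -0.7])
t = X @ w_true

w_hat = fit_linear_regression(X, t)
print(np.allclose(w_hat, w_true))  # True
```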
Advantages and Disadvantages
Advantages:
No iteration required, computes optimal weights in one step
Guaranteed to find the global minimum (for convex problems)
Disadvantages:
Computing $(\mathbf{X}^\top\mathbf{X})^{-1}$ is expensive: $O(D^3)$ complexity
Requires $\mathbf{X}^\top\mathbf{X}$ to be invertible
Does not generalize to other models or loss functions
Impractical when D (number of features) is very large
Linear Regression Properties
Advantages
Interpretable: Weights directly show feature importance
Efficient: Fast to train on moderate-sized datasets
Good baseline: Establishes performance floor for more complex models
Limitations
Assumes linearity: Cannot capture nonlinear relationships without feature engineering
Sensitive to outliers: Squared error heavily penalizes large residuals
Continuous features: Requires numerical inputs (categorical features need encoding)
Multicollinearity: Performance degrades when features are highly correlated
High bias: May underfit complex data (simple model)
Summary
| Component | Formula |
|---|---|
| Linear Model | $y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)}$, $\mathbf{y} = \mathbf{X}\mathbf{w}$ |
| Loss Function | $\mathcal{L}(y^{(i)}, t^{(i)}) = \frac{1}{2}(y^{(i)} - t^{(i)})^2$ |
| Cost Function (MSE) | $E(\mathbf{w}) = \frac{1}{2N}\lVert\mathbf{X}\mathbf{w} - \mathbf{t}\rVert_2^2$ |
| Gradient | $\nabla_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{N}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{t})$ |
| Direct Solution | $\mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{t}$ |
Direct Solution:
No iteration required
Computationally expensive ($O(D^3)$)
Does not generalize to other models
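When the direct solution is too expensive or $\mathbf{X}^\top\mathbf{X}$ is not invertible, the same cost can be minimized iteratively. A minimal sketch of gradient descent in NumPy, using the gradient formula from the summary table (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def fit_gradient_descent(X, t, lr=0.1, n_iters=1000):
    """Minimize E(w) = 1/(2N) ||Xw - t||^2 by gradient descent.

    Uses the gradient from the summary table: grad = (1/N) X^T (Xw - t).
    The learning rate and iteration count are illustrative defaults only.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - t) / N
        w -= lr * grad
    return w
```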
Connection to Modern Practice
MSE as Gaussian MLE
Minimizing the MSE cost function is equivalent to maximum likelihood estimation under the assumption $t \mid \mathbf{x} \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}, \sigma^2)$. The negative log-likelihood is

$$-\ell(\mathbf{w}) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(t^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\right)^2$$

Minimizing over $\mathbf{w}$ reduces to minimizing the sum of squared residuals. This statistical perspective explains why MSE is the "default" regression loss: it is optimal when errors are Gaussian. For non-Gaussian errors (zero-inflated, heavy-tailed), alternative losses such as Tweedie, Huber, or quantile loss are more appropriate. See the Loss Functions article.
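Spelled out, this reduction just drops the terms that do not depend on $\mathbf{w}$:

$$\arg\min_{\mathbf{w}} \left[\frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(t^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\right)^2\right] = \arg\min_{\mathbf{w}} \sum_{i=1}^{N}\left(t^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\right)^2$$

since the first term is constant in $\mathbf{w}$ and the positive factor $\frac{1}{2\sigma^2}$ only rescales the objective.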
Regularized Variants
| Method | Penalty | Effect |
|---|---|---|
| Ridge (L2) | $\lambda\lVert\mathbf{w}\rVert_2^2$ | Shrinks all weights toward zero, handles multicollinearity |
| Lasso (L1) | $\lambda\lVert\mathbf{w}\rVert_1$ | Induces sparsity (feature selection) |
| Elastic Net | $\alpha\lVert\mathbf{w}\rVert_1 + \frac{1-\alpha}{2}\lVert\mathbf{w}\rVert_2^2$ | Combines L1 sparsity with L2 stability |
Ridge regression has a closed-form solution: $\mathbf{w} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{t}$. The $\lambda\mathbf{I}$ term ensures invertibility even when $\mathbf{X}^\top\mathbf{X}$ is singular or ill-conditioned (more features than observations).
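A minimal NumPy sketch of this ridge closed form (the regularization strength lam is an arbitrary illustrative value):

```python
import numpy as np

def fit_ridge(X, t, lam=1.0):
    """Ridge closed form: w = (X^T X + lambda * I)^{-1} X^T t.

    The lam * np.eye(D) term keeps the system solvable even when X^T X
    is singular, e.g. with more features than observations. lam = 1.0
    is an arbitrary illustrative default.
    """
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)
```

Note that this version penalizes every weight, including the absorbed bias; in practice the bias column is usually excluded from the penalty.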
Beyond Linear Features
Linear regression on nonlinear features is the basis for many modern methods:
Polynomial regression: features are $[1, x, x^2, \ldots, x^d]$
Kernel regression / Gaussian processes: features are kernel evaluations $[K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_n)]$
Neural network linear heads: the final layer of a neural network is linear regression on the learned hidden representation
The power of linear regression lies not in the linearity of the input-output relationship but in the linearity of the parameters, which guarantees convex optimization and closed-form solutions.
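As a small illustration of this point, the sketch below fits a nonlinear 1-D function with ordinary least squares on polynomial features (the data, degree, and noise level are made up for the example):

```python
import numpy as np

# Made-up 1-D data from a nonlinear function.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
t = np.sin(x) + 0.1 * rng.normal(size=x.shape)

# Polynomial feature expansion: [1, x, x^2, x^3].
degree = 3
X_poly = np.column_stack([x**j for j in range(degree + 1)])

# Ordinary least squares on the expanded features: nonlinear in x,
# but still linear in the parameters w, so the closed form applies unchanged.
w, *_ = np.linalg.lstsq(X_poly, t, rcond=None)
print(w)  # coefficients of 1, x, x^2, x^3
```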