9: The Linear Model
Linear regression is the most important model in statistics. It is the foundation on which regularized regression, generalized linear models, and much of feature engineering are built. This article develops the linear model from a statistical perspective: the OLS estimator as the MLE under Gaussian errors, the geometry of the hat matrix, goodness-of-fit measures, and the Gauss-Markov theorem that establishes OLS as the best linear unbiased estimator.
The Model
The linear model assumes:

$$y = X\beta + \varepsilon$$

where:
- $y \in \mathbb{R}^n$ is the response vector
- $X \in \mathbb{R}^{n \times p}$ is the design matrix (assumed fixed and of full column rank $p$)
- $\beta \in \mathbb{R}^p$ is the coefficient vector
- $\varepsilon \in \mathbb{R}^n$ is the error vector with i.i.d. components $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$
The assumptions encoded in this model:
- Linearity: $\mathbb{E}[y \mid X] = X\beta$
- Independence: errors are independent across observations
- Homoscedasticity: $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i$
- Normality: errors are Gaussian (needed for exact finite-sample inference)
The first three are the Gauss-Markov conditions. Normality is an additional assumption that enables $t$-tests and $F$-tests with exact distributions, not just asymptotic approximations.
Ordinary Least Squares
The OLS estimator minimizes the sum of squared residuals:

$$\hat\beta = \arg\min_{\beta} \|y - X\beta\|^2$$

Taking the gradient and setting it to zero:

$$\nabla_\beta \|y - X\beta\|^2 = -2X^\top(y - X\beta) = 0$$

This gives the normal equations:

$$X^\top X \beta = X^\top y$$

When $X$ has full column rank, $X^\top X$ is invertible, and:

$$\hat\beta = (X^\top X)^{-1} X^\top y$$
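The derivation above can be checked numerically. The sketch below uses simulated data (all names, sizes, and the true coefficient vector are illustrative choices, not from the article) and solves the normal equations directly, cross-checking against NumPy's least-squares solver:

```python
import numpy as np

# Simulated data: n = 50 observations, an intercept column plus 3 predictors.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # design matrix
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                 # arbitrary truth
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations X'X beta = X'y.
# np.linalg.solve is preferred over explicitly inverting X'X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's dedicated least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
print(beta_hat)
```

With low noise, the estimate lands close to `beta_true`; explicitly forming $(X^\top X)^{-1}$ is avoided because a linear solve is more numerically stable.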
OLS as Maximum Likelihood
Under the Gaussian error assumption, $y \sim \mathcal{N}(X\beta, \sigma^2 I)$. The log-likelihood is:

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|^2$$

Maximizing over $\beta$ for fixed $\sigma^2$ is equivalent to minimizing $\|y - X\beta\|^2$, which gives the OLS estimator. The MLE of $\sigma^2$ is:

$$\hat\sigma^2_{\text{MLE}} = \frac{1}{n}\|y - X\hat\beta\|^2 = \frac{\text{RSS}}{n}$$

This is biased. The unbiased estimator is $s^2 = \text{RSS}/(n - p)$, which accounts for the $p$ degrees of freedom used in estimating $\beta$.
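The bias is easy to see by simulation. A minimal sketch (simulated data; the sample sizes and true variance are arbitrary choices) comparing $\text{RSS}/n$ against $\text{RSS}/(n-p)$ over repeated draws:

```python
import numpy as np

# Monte Carlo check: RSS/n underestimates sigma^2, RSS/(n - p) does not.
# Small n is used so the bias factor (n - p)/n is visibly below 1.
rng = np.random.default_rng(1)
n, p, sigma2 = 20, 5, 4.0
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

mle_estimates, unbiased_estimates = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)  # residual vector
    rss = e @ e
    mle_estimates.append(rss / n)                  # biased MLE
    unbiased_estimates.append(rss / (n - p))       # df-corrected estimator

# E[RSS] = sigma^2 (n - p), so the MLE's mean is near sigma^2 * (n - p)/n.
print(np.mean(mle_estimates), np.mean(unbiased_estimates))
```

With $n = 20$, $p = 5$, $\sigma^2 = 4$, the MLE averages near $4 \cdot 15/20 = 3$ while the corrected estimator averages near 4.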
Connection to ML. Minimizing squared error loss under a linear model is exactly maximum likelihood under Gaussian noise. This is why MSE is the default loss for regression: it has a clear probabilistic justification. When the noise is not Gaussian, other loss functions (Huber, quantile) may be more appropriate, corresponding to MLE under different error distributions.
The Hat Matrix
The fitted values are:

$$\hat y = X\hat\beta = X(X^\top X)^{-1}X^\top y = Hy$$

where $H = X(X^\top X)^{-1}X^\top$ is the hat matrix (it "puts the hat on $y$"). The hat matrix is the orthogonal projection onto the column space of $X$.
Properties of $H$:
- Symmetric: $H^\top = H$
- Idempotent: $H^2 = H$
- Eigenvalues are 0 or 1
- Trace equals the number of parameters: $\operatorname{tr}(H) = p$
Leverage. The diagonal entries $h_{ii}$ are the leverage values:

$$h_{ii} = x_i^\top (X^\top X)^{-1} x_i$$

The leverage measures how far $x_i$ is from the center of the predictor space. High-leverage points have disproportionate influence on the fitted regression. A common rule of thumb flags observations with $h_{ii} > 2p/n$, twice the average leverage $p/n$.
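The projection properties and the leverage rule of thumb can be verified directly. A sketch on simulated data (the design and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X'X)^{-1} X', computed via a linear solve.
H = X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(H, H.T)          # symmetric
assert np.allclose(H @ H, H)        # idempotent
assert np.isclose(np.trace(H), p)   # trace = number of parameters

leverage = np.diag(H)
flagged = np.where(leverage > 2 * p / n)[0]  # rule-of-thumb threshold
print(flagged)
```

Because $\operatorname{tr}(H) = p$, the average leverage is $p/n$, which is what makes $2p/n$ a natural cutoff.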
Residuals
The residuals are:

$$e = y - \hat y = (I - H)y$$

Key properties:
- $\operatorname{Var}(e) = \sigma^2(I - H)$ (residuals are not independent, even though errors are)
- $\operatorname{Var}(e_i) = \sigma^2(1 - h_{ii})$ (high-leverage points have smaller residual variance)
- $X^\top e = 0$ (residuals are orthogonal to every predictor column)

Standardized residuals adjust for unequal variance:

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

where $s^2 = \text{RSS}/(n - p)$. Under the model, $r_i$ is approximately $t$-distributed.
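A short sketch computing residuals and standardized residuals on simulated data (all names and sizes are illustrative), verifying the orthogonality property along the way:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
e = (np.eye(n) - H) @ y                 # residuals e = (I - H) y
s2 = (e @ e) / (n - p)                  # unbiased variance estimate
r = e / np.sqrt(s2 * (1 - np.diag(H)))  # standardized residuals

# Residuals are orthogonal to every column of X.
assert np.allclose(X.T @ e, 0)
print(r[:5])
```

Dividing by $\sqrt{1 - h_{ii}}$ puts all residuals on a common scale, so a fixed cutoff (e.g. $|r_i| > 2$) is meaningful even for high-leverage points.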
Goodness of Fit
$R^2$ (coefficient of determination):

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$

where $\text{TSS} = \sum_i (y_i - \bar y)^2$ is the total sum of squares. $R^2$ is the fraction of variance in $y$ explained by the model. It satisfies $0 \le R^2 \le 1$ (when an intercept is included).
The problem with $R^2$. Adding any predictor can only increase $R^2$ (or leave it unchanged), regardless of whether the predictor is meaningful. A model with $p = n$ achieves $R^2 = 1$ by perfectly interpolating the data.
Adjusted $R^2$ penalizes model complexity:

$$\bar R^2 = 1 - \frac{\text{RSS}/(n - p)}{\text{TSS}/(n - 1)}$$
Adjusted $R^2$ can decrease when an uninformative predictor is added, providing a rough guard against overfitting. It is not, however, a formal model selection criterion; AIC and BIC (covered in article 10) are preferable.
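Both behaviors can be demonstrated in a few lines. A sketch (simulated data; `r2_stats` is a hypothetical helper, not a library function) that fits a model, then adds five pure-noise predictors:

```python
import numpy as np

def r2_stats(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X includes an intercept."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss, 1 - (rss / (n - p)) / (tss / (n - 1))

rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

X_base = np.column_stack([np.ones(n), x])
X_junk = np.column_stack([X_base, rng.normal(size=(n, 5))])  # 5 noise columns

r2_b, adj_b = r2_stats(X_base, y)
r2_j, adj_j = r2_stats(X_junk, y)
assert r2_j >= r2_b  # plain R^2 never decreases when predictors are added
print(r2_b, adj_b)
print(r2_j, adj_j)
```

Plain $R^2$ is guaranteed not to drop; adjusted $R^2$ typically falls when the added columns carry no signal.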
The F-Test for Overall Significance
The global F-test tests whether any predictor has a nonzero coefficient:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 \quad \text{(all coefficients except the intercept)}$$

The test statistic is:

$$F = \frac{(\text{TSS} - \text{RSS})/(p - 1)}{\text{RSS}/(n - p)}$$

Under $H_0$, $F \sim F_{p-1,\,n-p}$. Reject if $F$ is large. This test compares the full model to the intercept-only model using the ratio of explained variance per degree of freedom to unexplained variance per degree of freedom.

More generally, to compare a reduced model (with $p_0$ parameters) to a full model (with $p_1$ parameters, $p_0 < p_1$):

$$F = \frac{(\text{RSS}_0 - \text{RSS}_1)/(p_1 - p_0)}{\text{RSS}_1/(n - p_1)} \sim F_{p_1 - p_0,\, n - p_1} \quad \text{under } H_0$$
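The nested-model comparison can be sketched as follows (simulated data; `rss` is a hypothetical helper, and the p-value uses SciPy's $F$ survival function):

```python
import numpy as np
from scipy import stats

# Nested-model F-test: intercept-only (reduced) vs full model.
rng = np.random.default_rng(5)
n = 80
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 + rng.normal(size=n)  # x2 is truly irrelevant

def rss(X, y):
    """Residual sum of squares for an OLS fit."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ beta
    return r @ r

X_red = np.ones((n, 1))                         # p0 = 1 (intercept only)
X_full = np.column_stack([np.ones(n), x1, x2])  # p1 = 3
p0, p1 = X_red.shape[1], X_full.shape[1]

F = ((rss(X_red, y) - rss(X_full, y)) / (p1 - p0)) / (rss(X_full, y) / (n - p1))
p_value = stats.f.sf(F, p1 - p0, n - p1)  # upper-tail probability
print(F, p_value)
```

Because `x1` carries a strong signal, the test soundly rejects the intercept-only model even though `x2` contributes nothing.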
The Gauss-Markov Theorem
Theorem (Gauss-Markov). Under the Gauss-Markov conditions (linearity, independence, and homoscedasticity; normality is not required), the OLS estimator is the Best Linear Unbiased Estimator (BLUE): among all estimators that are linear functions of $y$ and unbiased for $\beta$, OLS has the smallest variance.
Formally, if $\tilde\beta = Ay$ is any other linear unbiased estimator of $\beta$, then:

$$\operatorname{Var}(\tilde\beta) - \operatorname{Var}(\hat\beta) \succeq 0 \quad \text{(positive semidefinite)}$$
What Gauss-Markov does not say:
- It does not claim OLS is the best among all estimators, only among linear unbiased ones
- Biased estimators (ridge, lasso) can have lower MSE by trading bias for variance
- If errors are non-Gaussian, nonlinear estimators may dominate OLS
Connection to ML. The Gauss-Markov theorem justifies OLS as the starting point, but ML practice regularly moves beyond it. Ridge regression ($\ell_2$ penalty) and lasso ($\ell_1$ penalty) are biased but can achieve lower prediction error when predictors are correlated or numerous. The theorem tells us the price of unbiasedness: OLS is optimal under that constraint, but relaxing it opens the door to regularization, which is nearly always beneficial in high-dimensional settings.
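The bias-variance trade-off that Gauss-Markov leaves open can be demonstrated by simulation. A sketch with strongly correlated predictors (the design, penalty `lam = 1.0`, and `make_X` helper are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 40, 10, 1.0
beta = np.zeros(p)
beta[0] = 1.0  # sparse truth

def make_X():
    """Highly collinear design: shared factor plus small independent noise."""
    z = rng.normal(size=(n, 1))
    return z + 0.1 * rng.normal(size=(n, p))

ols_err, ridge_err = [], []
for _ in range(500):
    X = make_X()
    y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)                    # unbiased
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # biased
    ols_err.append(np.sum((b_ols - beta) ** 2))
    ridge_err.append(np.sum((b_ridge - beta) ** 2))

print(np.mean(ols_err), np.mean(ridge_err))
```

Under this collinear design, OLS is still unbiased but its variance explodes, and the biased ridge estimator achieves a much smaller mean squared error, exactly the trade Gauss-Markov does not rule out.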
Geometric Interpretation
OLS has a clean geometric meaning. The column space of $X$, denoted $\mathcal{C}(X)$, is a $p$-dimensional subspace of $\mathbb{R}^n$. The fitted values $\hat y = Hy$ are the orthogonal projection of $y$ onto $\mathcal{C}(X)$, and the residual vector $e = y - \hat y$ is orthogonal to $\mathcal{C}(X)$.
This gives the Pythagorean decomposition:

$$\|y\|^2 = \|\hat y\|^2 + \|e\|^2$$

OLS picks the point of $\mathcal{C}(X)$ closest to $y$, so $\|e\|$ is the distance between $y$ and its projection. The geometry makes it clear why $R^2$ can only increase with more predictors: projecting onto a larger subspace can only reduce the residual.
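The decomposition above holds exactly, which a quick numerical check confirms (simulated data; sizes are arbitrary):

```python
import numpy as np

# Verify ||y||^2 = ||y_hat||^2 + ||e||^2 and the orthogonality behind it.
rng = np.random.default_rng(7)
n, p = 25, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)  # projection onto the column space
y_hat = H @ y
e = y - y_hat

assert np.isclose(y @ y, y_hat @ y_hat + e @ e)  # Pythagoras
assert np.isclose(y_hat @ e, 0)                  # fitted values ⟂ residuals
```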
Summary
| Concept | Key Result | ML Connection |
|---|---|---|
| OLS estimator | $\hat\beta = (X^\top X)^{-1} X^\top y$ | Equivalent to MSE minimization |
| OLS = MLE | Under $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ | MSE loss = Gaussian log-likelihood |
| Hat matrix | $H = X(X^\top X)^{-1}X^\top$, projection onto $\mathcal{C}(X)$ | Leverage identifies influential points |
| $R^2$ | Fraction of variance explained | Always increases with more features |
| Adjusted $R^2$ | Penalizes for number of predictors | Rough complexity control |
| $F$-test | Compares nested models | Significance of feature groups |
| Gauss-Markov | OLS is BLUE | Regularization beats OLS by accepting bias |