9: The Linear Model

Linear regression is the most important model in statistics. It is the foundation on which regularized regression, generalized linear models, and much of feature engineering are built. This article develops the linear model from a statistical perspective: the OLS estimator as the MLE under Gaussian errors, the geometry of the hat matrix, goodness-of-fit measures, and the Gauss-Markov theorem that establishes OLS as the best linear unbiased estimator.


The Model

The linear model assumes:

$$Y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$$

where:

  • $Y \in \mathbb{R}^n$ is the response vector
  • $X \in \mathbb{R}^{n \times p}$ is the design matrix (assumed fixed and of full column rank $p < n$)
  • $\beta \in \mathbb{R}^p$ is the coefficient vector
  • $\varepsilon \in \mathbb{R}^n$ is the error vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ components

The assumptions encoded in this model:

  1. Linearity: $E[Y \mid X] = X\beta$
  2. Independence: errors are independent across observations
  3. Homoscedasticity: $\text{Var}(\varepsilon_i) = \sigma^2$ for all $i$
  4. Normality: errors are Gaussian (needed for exact finite-sample inference)

The first three are the Gauss-Markov conditions. Normality is an additional assumption that enables $t$-tests and $F$-tests with exact distributions, not just asymptotic approximations.


Ordinary Least Squares

The OLS estimator minimizes the sum of squared residuals:

$$\hat{\beta} = \arg\min_\beta \|Y - X\beta\|^2 = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i^T\beta)^2$$

Taking the gradient and setting it to zero:

$$\nabla_\beta \|Y - X\beta\|^2 = -2X^T(Y - X\beta) = 0$$

This gives the normal equations:

$$X^TX\hat{\beta} = X^TY$$

When $X$ has full column rank, $X^TX$ is invertible, and:

$$\hat{\beta} = (X^TX)^{-1}X^TY$$
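As a quick sketch (synthetic data, all names illustrative), the normal equations can be solved directly and checked against NumPy's least-squares routine. In practice, prefer `np.linalg.lstsq` or a QR factorization over forming $(X^TX)^{-1}$ explicitly, for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)  # QR/SVD-based solver

print(np.allclose(beta_hat, beta_lstsq))  # the two agree for well-conditioned X
```

The residual vector from either fit is orthogonal to every column of $X$, which is exactly the normal-equations condition $X^T(Y - X\hat{\beta}) = 0$.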

OLS as Maximum Likelihood

Under the Gaussian error assumption, $Y \sim \mathcal{N}(X\beta, \sigma^2 I)$. The log-likelihood is:

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|Y - X\beta\|^2$$

Maximizing over $\beta$ for fixed $\sigma^2$ is equivalent to minimizing $\|Y - X\beta\|^2$, which gives the OLS estimator. The MLE of $\sigma^2$ is:

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\|Y - X\hat{\beta}\|^2 = \frac{\text{RSS}}{n}$$

This is biased. The unbiased estimator is $s^2 = \text{RSS}/(n - p)$, which accounts for the $p$ degrees of freedom used in estimating $\beta$.
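The bias is easy to see by simulation. The sketch below (synthetic data, true $\sigma^2 = 1$; taking $\beta = 0$ is without loss of generality for estimating $\sigma^2$, since $(I - H)X\beta = 0$) compares the two estimators over repeated datasets:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 30, 5, 1.0
X = rng.normal(size=(n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix (fixed design)

mle, unbiased = [], []
for _ in range(2000):
    Y = rng.normal(scale=np.sqrt(sigma2), size=n)  # beta = 0 WLOG for sigma^2
    rss = Y @ (np.eye(n) - H) @ Y                  # residual sum of squares
    mle.append(rss / n)
    unbiased.append(rss / (n - p))

print(np.mean(mle), np.mean(unbiased))  # MLE averages near (n-p)/n; s^2 near 1
```

With $n = 30$ and $p = 5$, the MLE underestimates $\sigma^2$ by roughly the factor $(n-p)/n \approx 0.83$, matching $E[\text{RSS}] = (n - p)\sigma^2$.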

Connection to ML. Minimizing squared error loss under a linear model is exactly maximum likelihood under Gaussian noise. This is why MSE is the default loss for regression: it has a clear probabilistic justification. When the noise is not Gaussian, other loss functions (Huber, quantile) may be more appropriate, corresponding to MLE under different error distributions.


The Hat Matrix

The fitted values are:

$$\hat{Y} = X\hat{\beta} = X(X^TX)^{-1}X^TY = HY$$

where $H = X(X^TX)^{-1}X^T$ is the hat matrix (it “puts the hat on $Y$”). The hat matrix is the orthogonal projection onto the column space of $X$.

Properties of $H$:

  • Symmetric: $H^T = H$
  • Idempotent: $H^2 = H$
  • $\text{rank}(H) = \text{tr}(H) = p$
  • Eigenvalues are 0 or 1

Leverage. The diagonal entries $h_{ii}$ are the leverage values:

$$h_{ii} = X_i^T(X^TX)^{-1}X_i, \quad 0 \leq h_{ii} \leq 1, \quad \sum_{i=1}^n h_{ii} = p$$

The leverage $h_{ii}$ measures how far $X_i$ is from the center of the predictor space. High-leverage points have disproportionate influence on the fitted regression. A common rule of thumb flags observations with $h_{ii} > 2p/n$.
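A minimal sketch of the rule of thumb, on synthetic simple-regression data with one deliberately extreme $x$-value:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
x = rng.normal(size=n)
x[0] = 8.0                              # an outlying x-value -> high leverage
X = np.column_stack([np.ones(n), x])    # intercept + one predictor
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                          # leverage values h_ii

print(np.isclose(h.sum(), p))           # trace(H) = p
print(np.flatnonzero(h > 2 * p / n))    # indices flagged by the 2p/n rule
```

The extreme observation at index 0 is flagged: its leverage dominates because $(x_0 - \bar{x})^2$ is a large share of $\sum_j (x_j - \bar{x})^2$.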


Residuals

The residuals are:

$$e = Y - \hat{Y} = (I - H)Y$$

Key properties:

  • $E[e] = 0$
  • $\text{Cov}(e) = \sigma^2(I - H)$ (residuals are not independent, even though errors are)
  • $\text{Var}(e_i) = \sigma^2(1 - h_{ii})$ (high-leverage points have smaller residual variance)
  • $X^Te = 0$ (residuals are orthogonal to every predictor column)

Standardized residuals adjust for unequal variance:

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

where $s = \sqrt{\text{RSS}/(n-p)}$. Under the model, $r_i$ is approximately $t_{n-p}$-distributed.
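Putting the pieces together on synthetic data, the standardized residuals follow directly from the hat matrix and $s^2$ (a sketch, not a full diagnostics routine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
e = (np.eye(n) - H) @ Y                   # residuals e = (I - H)Y
s2 = e @ e / (n - p)                      # unbiased variance estimate
r = e / np.sqrt(s2 * (1 - np.diag(H)))    # standardized residuals

print(np.mean(np.abs(r) < 2))             # roughly 95% under the model
```

Note the division by $\sqrt{1 - h_{ii}}$: raw residuals at high-leverage points are artificially small, and standardizing puts all observations on a comparable scale.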


Goodness of Fit

$R^2$ (coefficient of determination):

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}$$

where $\text{TSS} = \sum(Y_i - \bar{Y})^2$ is the total sum of squares. $R^2$ is the fraction of variance in $Y$ explained by the model. It satisfies $0 \leq R^2 \leq 1$ (when an intercept is included).

The problem with $R^2$. Adding any predictor can only increase $R^2$ (or leave it unchanged), regardless of whether the predictor is meaningful. A model with $p = n$ achieves $R^2 = 1$ by perfectly interpolating the data.

Adjusted $R^2$ penalizes model complexity:

$$R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n-p)}{\text{TSS}/(n-1)} = 1 - \frac{n-1}{n-p}(1 - R^2)$$

Adjusted $R^2$ can decrease when an uninformative predictor is added, providing a rough guard against overfitting. It is not, however, a formal model selection criterion; AIC and BIC (covered in article 10) are preferable.
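A sketch of the monotonicity of $R^2$: appending a pure-noise column to the design matrix never lowers it (synthetic data; `fit_r2` is an illustrative helper, not a library function):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
Y = 2.0 + x + rng.normal(size=n)

def fit_r2(X, Y):
    # Fit OLS and return (R^2, adjusted R^2).
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss = np.sum((Y - X @ beta) ** 2)
    tss = np.sum((Y - Y.mean()) ** 2)
    p = X.shape[1]
    r2 = 1 - rss / tss
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
    return r2, r2_adj

X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, rng.normal(size=n)])   # append a noise feature

r2_small, adj_small = fit_r2(X1, Y)
r2_big, adj_big = fit_r2(X2, Y)
print(r2_big >= r2_small)   # always True for nested designs
```

$R^2$ cannot drop because the larger column space contains the smaller one; whether adjusted $R^2$ drops depends on how much the noise column happens to correlate with $Y$ in the sample.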


The F-Test for Overall Significance

The global F-test tests whether any predictor has a nonzero coefficient:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 \quad \text{(intercept-only model)}$$

The test statistic is:

$$F = \frac{(\text{TSS} - \text{RSS})/(p - 1)}{\text{RSS}/(n - p)} = \frac{R^2/(p-1)}{(1-R^2)/(n-p)}$$

Under $H_0$, $F \sim F_{p-1,\, n-p}$. Reject $H_0$ if $F$ is large. This test compares the full model to the intercept-only model using the ratio of explained variance per degree of freedom to unexplained variance per degree of freedom.

More generally, to compare a reduced model (with $q$ parameters) to a full model (with $p$ parameters, $q < p$):

$$F = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}})/(p - q)}{\text{RSS}_{\text{full}}/(n - p)} \sim F_{p-q,\, n-p} \quad \text{under } H_0$$
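The nested-model comparison can be sketched directly (synthetic data: the reduced model is intercept-only with $q = 1$, the full model adds two predictors, one of which is genuinely active):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, q = 50, 3, 1
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X_full @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)

def rss(X, Y):
    # Residual sum of squares of an OLS fit.
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sum((Y - X @ beta) ** 2)

rss_red = rss(X_full[:, :q], Y)      # intercept-only fit
rss_full = rss(X_full, Y)

F = ((rss_red - rss_full) / (p - q)) / (rss_full / (n - p))
p_value = stats.f.sf(F, p - q, n - p)  # upper tail of F_{p-q, n-p}
print(F, p_value)
```

Because one coefficient is truly nonzero, $F$ comes out large and the p-value small; under a true null, $F$ would hover around 1.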

The Gauss-Markov Theorem

Theorem (Gauss-Markov). Under the Gauss-Markov conditions (linearity, independence, homoscedasticity; normality is not required), the OLS estimator $\hat{\beta}$ is the Best Linear Unbiased Estimator (BLUE): among all estimators that are linear functions of $Y$ and unbiased for $\beta$, OLS has the smallest variance.

Formally, if $\tilde{\beta} = CY$ is any other linear unbiased estimator of $\beta$, then:

$$\text{Var}(a^T\tilde{\beta}) \geq \text{Var}(a^T\hat{\beta}) \quad \text{for all } a \in \mathbb{R}^p$$

What Gauss-Markov does not say:

  • It does not claim OLS is the best among all estimators, only among linear unbiased ones
  • Biased estimators (ridge, lasso) can have lower MSE by trading bias for variance
  • If errors are non-Gaussian, nonlinear estimators may dominate OLS

Connection to ML. The Gauss-Markov theorem justifies OLS as the starting point, but ML practice regularly moves beyond it. Ridge regression ($\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^TY$) and lasso ($L_1$ penalty) are biased but can achieve lower prediction error when predictors are correlated or numerous. The theorem tells us the price of unbiasedness: OLS is optimal under that constraint, but relaxing it opens the door to regularization, which is nearly always beneficial in high-dimensional settings.
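A sketch of ridge's closed form next to OLS on nearly collinear predictors (synthetic data; the penalty `lam` is a hypothetical value chosen for illustration, not a tuned one):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 2
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])  # nearly collinear columns
Y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Ridge shrinks: its coefficient norm never exceeds the OLS norm.
print(np.linalg.norm(beta_ridge) <= np.linalg.norm(beta_ols))
```

With near-collinear columns, the OLS coefficients can be wildly unstable (huge and opposite-signed), while adding $\lambda I$ conditions $X^TX$ and pulls the estimate toward something sensible, at the cost of bias.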


Geometric Interpretation

OLS has a clean geometric meaning. The column space of $X$, denoted $\mathcal{C}(X)$, is a $p$-dimensional subspace of $\mathbb{R}^n$. The fitted values $\hat{Y} = HY$ are the orthogonal projection of $Y$ onto $\mathcal{C}(X)$, and the residual vector $e = Y - \hat{Y}$ is orthogonal to $\mathcal{C}(X)$.

This gives the Pythagorean decomposition:

$$\|Y - \bar{Y}\mathbf{1}\|^2 = \|\hat{Y} - \bar{Y}\mathbf{1}\|^2 + \|e\|^2, \qquad \text{TSS} = \text{RegSS} + \text{RSS}$$

(The decomposition requires an intercept in the model, so that $\bar{Y}\mathbf{1} \in \mathcal{C}(X)$.) Then $R^2 = \text{RegSS}/\text{TSS} = \cos^2\theta$, where $\theta$ is the angle between $Y - \bar{Y}\mathbf{1}$ and its projection $\hat{Y} - \bar{Y}\mathbf{1}$. The geometry makes it clear why $R^2$ can only increase with more predictors: projecting onto a larger subspace can only reduce the residual.
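The decomposition is easy to verify numerically on synthetic data with an intercept in the design:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y
ybar = Y.mean()

tss = np.sum((Y - ybar) ** 2)
regss = np.sum((Y_hat - ybar) ** 2)
rss = np.sum((Y - Y_hat) ** 2)
print(np.isclose(tss, regss + rss))  # True: orthogonal decomposition
```

The identity holds to floating-point precision because $e$ is orthogonal to everything in $\mathcal{C}(X)$, including $\hat{Y} - \bar{Y}\mathbf{1}$.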


Summary

| Concept | Key Result | ML Connection |
| --- | --- | --- |
| OLS estimator | $\hat{\beta} = (X^TX)^{-1}X^TY$ | Equivalent to MSE minimization |
| OLS = MLE | Under $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ | MSE loss = Gaussian log-likelihood |
| Hat matrix | $H = X(X^TX)^{-1}X^T$, projection onto $\mathcal{C}(X)$ | Leverage identifies influential points |
| $R^2$ | Fraction of variance explained | Always increases with more features |
| Adjusted $R^2$ | Penalizes for number of predictors | Rough complexity control |
| F-test | Compares nested models | Significance of feature groups |
| Gauss-Markov | OLS is BLUE | Regularization beats OLS by accepting bias |