9: The Linear Model

Linear regression is the most important model in statistics. It is the foundation on which regularized regression, generalized linear models, and much of feature engineering are built. This article develops the linear model from a statistical perspective: the OLS estimator as the MLE under Gaussian errors, the geometry of the hat matrix, goodness-of-fit measures, and the Gauss-Markov theorem that establishes OLS as the best linear unbiased estimator.


The Model

The linear model assumes:

$$Y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$$

where:

  • $Y \in \mathbb{R}^n$ is the response vector
  • $X \in \mathbb{R}^{n \times p}$ is the design matrix (assumed fixed and of full column rank $p < n$)
  • $\beta \in \mathbb{R}^p$ is the coefficient vector
  • $\varepsilon \in \mathbb{R}^n$ is the error vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ components

The assumptions encoded in this model:

  1. Linearity: $E[Y \mid X] = X\beta$
  2. Independence: errors are independent across observations
  3. Homoscedasticity: $\text{Var}(\varepsilon_i) = \sigma^2$ for all $i$
  4. Normality: errors are Gaussian (needed for exact finite-sample inference)

The first three are the Gauss-Markov conditions. Normality is an additional assumption that enables $t$-tests and $F$-tests with exact distributions, not just asymptotic approximations.


Ordinary Least Squares

The OLS estimator minimizes the sum of squared residuals:

$$\hat{\beta} = \arg\min_\beta \|Y - X\beta\|^2 = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i^T\beta)^2$$

Taking the gradient and setting it to zero:

$$\nabla_\beta \|Y - X\beta\|^2 = -2X^T(Y - X\beta) = 0$$

This gives the normal equations:

$$X^TX\hat{\beta} = X^TY$$

When $X$ has full column rank, $X^TX$ is invertible, and:

$$\hat{\beta} = (X^TX)^{-1}X^TY$$
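As a quick sketch (synthetic data, all names illustrative), the normal equations can be solved directly and checked against NumPy's least-squares routine. In practice, prefer `np.linalg.lstsq` or a QR factorization over forming $(X^TX)^{-1}$ explicitly, for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)  # QR/SVD-based solver

print(np.allclose(beta_hat, beta_lstsq))  # the two agree for well-conditioned X
```

The residual vector from either fit is orthogonal to every column of $X$, which is exactly the normal-equations condition $X^T(Y - X\hat{\beta}) = 0$.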

OLS as Maximum Likelihood

Under the Gaussian error assumption, $Y \sim \mathcal{N}(X\beta, \sigma^2 I)$. The log-likelihood is:

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|Y - X\beta\|^2$$

Maximizing over $\beta$ for fixed $\sigma^2$ is equivalent to minimizing $\|Y - X\beta\|^2$, which gives the OLS estimator. The MLE of $\sigma^2$ is:

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\|Y - X\hat{\beta}\|^2 = \frac{\text{RSS}}{n}$$

This is biased. The unbiased estimator is $s^2 = \text{RSS}/(n - p)$, which accounts for the $p$ degrees of freedom used in estimating $\beta$.
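The bias is easy to see by simulation. The sketch below (synthetic data, true $\sigma^2 = 1$; taking $\beta = 0$ is without loss of generality for estimating $\sigma^2$, since $(I - H)X\beta = 0$) compares the two estimators over repeated datasets:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 30, 5, 1.0
X = rng.normal(size=(n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix (fixed design)

mle, unbiased = [], []
for _ in range(2000):
    Y = rng.normal(scale=np.sqrt(sigma2), size=n)  # beta = 0 WLOG for sigma^2
    rss = Y @ (np.eye(n) - H) @ Y                  # residual sum of squares
    mle.append(rss / n)
    unbiased.append(rss / (n - p))

print(np.mean(mle), np.mean(unbiased))  # MLE averages near (n-p)/n; s^2 near 1
```

With $n = 30$ and $p = 5$, the MLE underestimates $\sigma^2$ by roughly the factor $(n-p)/n \approx 0.83$, matching $E[\text{RSS}] = (n - p)\sigma^2$.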

Connection to ML. Minimizing squared error loss under a linear model is exactly maximum likelihood under Gaussian noise. This is why MSE is the default loss for regression: it has a clear probabilistic justification. When the noise is not Gaussian, other loss functions (Huber, quantile) may be more appropriate, corresponding to MLE under different error distributions.


The Hat Matrix

The fitted values are:

$$\hat{Y} = X\hat{\beta} = X(X^TX)^{-1}X^TY = HY$$

where $H = X(X^TX)^{-1}X^T$ is the hat matrix (it “puts the hat on $Y$”). The hat matrix is the orthogonal projection onto the column space of $X$.

Properties of $H$:

  • Symmetric: $H^T = H$
  • Idempotent: $H^2 = H$
  • $\text{rank}(H) = \text{tr}(H) = p$
  • Eigenvalues are 0 or 1

Leverage. The diagonal entries $h_{ii}$ are the leverage values:

$$h_{ii} = X_i^T(X^TX)^{-1}X_i, \quad 0 \leq h_{ii} \leq 1, \quad \sum_{i=1}^n h_{ii} = p$$

The leverage $h_{ii}$ measures how far $X_i$ is from the center of the predictor space. High-leverage points have disproportionate influence on the fitted regression. A common rule of thumb flags observations with $h_{ii} > 2p/n$.
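A minimal sketch of the rule of thumb, on synthetic simple-regression data with one deliberately extreme $x$-value:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
x = rng.normal(size=n)
x[0] = 8.0                              # an outlying x-value -> high leverage
X = np.column_stack([np.ones(n), x])    # intercept + one predictor
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                          # leverage values h_ii

print(np.isclose(h.sum(), p))           # trace(H) = p
print(np.flatnonzero(h > 2 * p / n))    # indices flagged by the 2p/n rule
```

The extreme observation at index 0 is flagged: its leverage dominates because $(x_0 - \bar{x})^2$ is a large share of $\sum_j (x_j - \bar{x})^2$.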


Residuals

The residuals are:

$$e = Y - \hat{Y} = (I - H)Y$$

Key properties:

  • $E[e] = 0$
  • $\text{Cov}(e) = \sigma^2(I - H)$ (residuals are not independent, even though errors are)
  • $\text{Var}(e_i) = \sigma^2(1 - h_{ii})$ (high-leverage points have smaller residual variance)
  • $X^Te = 0$ (residuals are orthogonal to every predictor column)

Standardized residuals adjust for unequal variance:

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

where $s = \sqrt{\text{RSS}/(n-p)}$. Under the model, $r_i$ is approximately $t_{n-p}$-distributed.
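Putting the pieces together on synthetic data, the standardized residuals follow directly from the hat matrix and $s^2$ (a sketch, not a full diagnostics routine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
e = (np.eye(n) - H) @ Y                   # residuals e = (I - H)Y
s2 = e @ e / (n - p)                      # unbiased variance estimate
r = e / np.sqrt(s2 * (1 - np.diag(H)))    # standardized residuals

print(np.mean(np.abs(r) < 2))             # roughly 95% under the model
```

Note the division by $\sqrt{1 - h_{ii}}$: raw residuals at high-leverage points are artificially small, and standardizing puts all observations on a comparable scale.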


Goodness of Fit

$R^2$ (coefficient of determination):

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}$$

where $\text{TSS} = \sum(Y_i - \bar{Y})^2$ is the total sum of squares. $R^2$ is the fraction of variance in $Y$ explained by the model. It satisfies $0 \leq R^2 \leq 1$ (when an intercept is included).

The problem with $R^2$. Adding any predictor can only increase $R^2$ (or leave it unchanged), regardless of whether the predictor is meaningful. A model with $p = n$ achieves $R^2 = 1$ by perfectly interpolating the data.

Adjusted $R^2$ penalizes model complexity:

$$R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n-p)}{\text{TSS}/(n-1)} = 1 - \frac{n-1}{n-p}(1 - R^2)$$

Adjusted $R^2$ can decrease when an uninformative predictor is added, providing a rough guard against overfitting. It is not, however, a formal model selection criterion; AIC and BIC (covered in article 10) are preferable.
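A sketch of the monotonicity of $R^2$: appending a pure-noise column to the design matrix never lowers it (synthetic data; `fit_r2` is an illustrative helper, not a library function):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
Y = 2.0 + x + rng.normal(size=n)

def fit_r2(X, Y):
    # Fit OLS and return (R^2, adjusted R^2).
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss = np.sum((Y - X @ beta) ** 2)
    tss = np.sum((Y - Y.mean()) ** 2)
    p = X.shape[1]
    r2 = 1 - rss / tss
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
    return r2, r2_adj

X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, rng.normal(size=n)])   # append a noise feature

r2_small, adj_small = fit_r2(X1, Y)
r2_big, adj_big = fit_r2(X2, Y)
print(r2_big >= r2_small)   # always True for nested designs
```

$R^2$ cannot drop because the larger column space contains the smaller one; whether adjusted $R^2$ drops depends on how much the noise column happens to correlate with $Y$ in the sample.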


The F-Test for Overall Significance

The global F-test tests whether any predictor has a nonzero coefficient:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 \quad \text{(intercept-only model)}$$

The test statistic is:

$$F = \frac{(\text{TSS} - \text{RSS})/(p - 1)}{\text{RSS}/(n - p)} = \frac{R^2/(p-1)}{(1-R^2)/(n-p)}$$

Under $H_0$, $F \sim F_{p-1,\, n-p}$. Reject $H_0$ if $F$ is large. This test compares the full model to the intercept-only model using the ratio of explained variance per degree of freedom to unexplained variance per degree of freedom.

More generally, to compare a reduced model (with $q$ parameters) to a full model (with $p$ parameters, $q < p$):

$$F = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}})/(p - q)}{\text{RSS}_{\text{full}}/(n - p)} \sim F_{p-q,\, n-p} \quad \text{under } H_0$$
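The nested-model comparison can be sketched directly (synthetic data: the reduced model is intercept-only with $q = 1$, the full model adds two predictors, one of which is genuinely active):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, q = 50, 3, 1
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X_full @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)

def rss(X, Y):
    # Residual sum of squares of an OLS fit.
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sum((Y - X @ beta) ** 2)

rss_red = rss(X_full[:, :q], Y)      # intercept-only fit
rss_full = rss(X_full, Y)

F = ((rss_red - rss_full) / (p - q)) / (rss_full / (n - p))
p_value = stats.f.sf(F, p - q, n - p)  # upper tail of F_{p-q, n-p}
print(F, p_value)
```

Because one coefficient is truly nonzero, $F$ comes out large and the p-value small; under a true null, $F$ would hover around 1.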

The Gauss-Markov Theorem

Theorem (Gauss-Markov). Under the Gauss-Markov conditions (linearity, independence, homoscedasticity; normality is not required), the OLS estimator $\hat{\beta}$ is the Best Linear Unbiased Estimator (BLUE): among all estimators that are linear functions of $Y$ and unbiased for $\beta$, OLS has the smallest variance.

Formally, if $\tilde{\beta} = CY$ is any other linear unbiased estimator of $\beta$, then:

$$\text{Var}(a^T\tilde{\beta}) \geq \text{Var}(a^T\hat{\beta}) \quad \text{for all } a \in \mathbb{R}^p$$

What Gauss-Markov does not say:

  • It does not claim OLS is the best among all estimators, only among linear unbiased ones
  • Biased estimators (ridge, lasso) can have lower MSE by trading bias for variance
  • If errors are non-Gaussian, nonlinear estimators may dominate OLS

Connection to ML. The Gauss-Markov theorem justifies OLS as the starting point, but ML practice regularly moves beyond it. Ridge regression ($\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^TY$) and lasso ($L_1$ penalty) are biased but can achieve lower prediction error when predictors are correlated or numerous. The theorem tells us the price of unbiasedness: OLS is optimal under that constraint, but relaxing it opens the door to regularization, which is nearly always beneficial in high-dimensional settings.
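A sketch of ridge's closed form next to OLS on nearly collinear predictors (synthetic data; the penalty `lam` is a hypothetical value chosen for illustration, not a tuned one):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 2
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])  # nearly collinear columns
Y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Ridge shrinks: its coefficient norm never exceeds the OLS norm.
print(np.linalg.norm(beta_ridge) <= np.linalg.norm(beta_ols))
```

With near-collinear columns, the OLS coefficients can be wildly unstable (huge and opposite-signed), while adding $\lambda I$ conditions $X^TX$ and pulls the estimate toward something sensible, at the cost of bias.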


Geometric Interpretation

OLS has a clean geometric meaning. The column space of $X$, denoted $\mathcal{C}(X)$, is a $p$-dimensional subspace of $\mathbb{R}^n$. The fitted values $\hat{Y} = HY$ are the orthogonal projection of $Y$ onto $\mathcal{C}(X)$, and the residual vector $e = Y - \hat{Y}$ is orthogonal to $\mathcal{C}(X)$.

This gives the Pythagorean decomposition:

$$\|Y - \bar{Y}\mathbf{1}\|^2 = \|\hat{Y} - \bar{Y}\mathbf{1}\|^2 + \|e\|^2, \qquad \text{TSS} = \text{RegSS} + \text{RSS}$$

(The decomposition requires an intercept in the model, so that $\bar{Y}\mathbf{1} \in \mathcal{C}(X)$.) Then $R^2 = \text{RegSS}/\text{TSS} = \cos^2\theta$, where $\theta$ is the angle between $Y - \bar{Y}\mathbf{1}$ and its projection $\hat{Y} - \bar{Y}\mathbf{1}$. The geometry makes it clear why $R^2$ can only increase with more predictors: projecting onto a larger subspace can only reduce the residual.
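The decomposition is easy to verify numerically on synthetic data with an intercept in the design:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y
ybar = Y.mean()

tss = np.sum((Y - ybar) ** 2)
regss = np.sum((Y_hat - ybar) ** 2)
rss = np.sum((Y - Y_hat) ** 2)
print(np.isclose(tss, regss + rss))  # True: orthogonal decomposition
```

The identity holds to floating-point precision because $e$ is orthogonal to everything in $\mathcal{C}(X)$, including $\hat{Y} - \bar{Y}\mathbf{1}$.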


Summary

| Concept | Key Result | ML Connection |
| --- | --- | --- |
| OLS estimator | $\hat{\beta} = (X^TX)^{-1}X^TY$ | Equivalent to MSE minimization |
| OLS = MLE | Under $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ | MSE loss = Gaussian log-likelihood |
| Hat matrix | $H = X(X^TX)^{-1}X^T$, projection onto $\mathcal{C}(X)$ | Leverage identifies influential points |
| $R^2$ | Fraction of variance explained | Always increases with more features |
| Adjusted $R^2$ | Penalizes for number of predictors | Rough complexity control |
| F-test | Compares nested models | Significance of feature groups |
| Gauss-Markov | OLS is BLUE | Regularization beats OLS by accepting bias |