10: Inference for Linear Models

With the linear model $Y = X\beta + \varepsilon$ established, we turn to inference: what can we say about individual coefficients, groups of coefficients, and future observations? This article covers the sampling distribution of $\hat{\beta}$, hypothesis tests, confidence and prediction intervals, model diagnostics, and variable selection. These tools determine whether the patterns found by a regression are statistically meaningful and whether the model's assumptions are reasonable.


Sampling Distribution of $\hat{\beta}$

Under the linear model with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$:

$$\hat{\beta} = (X^TX)^{-1}X^TY$$

Since $\hat{\beta}$ is a linear function of $Y$ and $Y$ is Gaussian:

$$\hat{\beta} \sim \mathcal{N}\left(\beta,\; \sigma^2 (X^TX)^{-1}\right)$$

The covariance matrix $\sigma^2(X^TX)^{-1}$ governs the precision of each coefficient estimate. Its diagonal entries give the variances:

$$\text{Var}(\hat{\beta}_j) = \sigma^2 \left[(X^TX)^{-1}\right]_{jj}$$

When $\sigma^2$ is unknown (the usual case), we replace it with $s^2 = \text{RSS}/(n-p)$. The estimated covariance matrix is $\widehat{\text{Cov}}(\hat{\beta}) = s^2(X^TX)^{-1}$, and the standard error of the $j$-th coefficient is:

$$\text{SE}(\hat{\beta}_j) = s\sqrt{[(X^TX)^{-1}]_{jj}}$$

An important distributional result: the RSS and $\hat{\beta}$ are independent, and

$$\frac{(n-p)s^2}{\sigma^2} = \frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$$
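As a concrete check, the estimate and its standard errors can be computed directly with NumPy. This is a minimal sketch on simulated data; the sample sizes, coefficients, and noise scale are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3  # n observations, p coefficients (intercept + 2 predictors)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y            # OLS estimate
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)            # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(XtX_inv))     # SE(beta_hat_j) = s * sqrt([(X^T X)^{-1}]_jj)
```

With this setup `beta_hat` lands within a few standard errors of `beta_true`, and `s2` estimates the true $\sigma^2 = 0.64$.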

$t$-Tests for Individual Coefficients

To test whether a single predictor contributes to the model:

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0$$

The test statistic is:

$$t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{s\sqrt{[(X^TX)^{-1}]_{jj}}}$$

Under $H_0$, $t_j \sim t_{n-p}$. Reject $H_0$ at level $\alpha$ if $|t_j| > t_{n-p,\,\alpha/2}$.

Interpretation. A small $p$-value for $\beta_j$ indicates that, conditional on all other predictors being in the model, $X_j$ provides statistically significant additional explanatory power. This is a partial test: the same variable can be significant in one model and insignificant in another, depending on which other predictors are included.

Multiple testing caveat. With $p - 1$ predictors, running $p - 1$ individual $t$-tests inflates the family-wise error rate. If the predictors are independent and all null, the probability of at least one false positive is $1 - (1 - \alpha)^{p-1}$. Bonferroni or Benjamini-Hochberg corrections can be applied, but the $F$-test for groups of coefficients (below) is often more appropriate.
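A sketch of the individual $t$-tests, including a Bonferroni-adjusted per-test level. The simulated design is illustrative; the third coefficient is truly zero, so its test is a true null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.5, 0.0]) + rng.normal(size=n)  # third coefficient truly zero

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
se = np.sqrt(resid @ resid / (n - p) * np.diag(XtX_inv))

t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)  # two-sided p-values

alpha, m = 0.05, p - 1                   # m individual tests on the slopes
bonferroni_level = alpha / m             # per-test level keeping FWER <= alpha
```

The strong slope ($\beta = 1.5$ with $\text{SE} \approx 0.11$) gives a tiny $p$-value, while the null coefficient's $p$-value is a roughly uniform draw.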


Confidence Intervals for Coefficients

A $100(1-\alpha)\%$ confidence interval for $\beta_j$:

$$\hat{\beta}_j \pm t_{n-p,\, \alpha/2} \cdot \text{SE}(\hat{\beta}_j)$$

This interval has the interpretation: if we repeated the experiment many times and constructed an interval each time, $100(1-\alpha)\%$ of those intervals would contain the true $\beta_j$.

For a confidence region for the entire coefficient vector $\beta$:

$$(\hat{\beta} - \beta)^T X^TX (\hat{\beta} - \beta) \leq p \cdot s^2 \cdot F_{p,\, n-p,\, \alpha}$$

This defines an ellipsoid in $\mathbb{R}^p$. The shape of the ellipsoid reflects the correlation structure among the predictors.
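The per-coefficient interval is straightforward to compute once the estimates, standard errors, and residual degrees of freedom are in hand. The numbers below are hypothetical stand-ins for a fitted model:

```python
import numpy as np
from scipy import stats

beta_hat = np.array([1.02, 1.96, -0.48])  # hypothetical estimates
se = np.array([0.09, 0.08, 0.10])         # hypothetical standard errors
df = 97                                   # residual degrees of freedom, n - p
t_crit = stats.t.ppf(0.975, df)           # critical value for a 95% interval
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
```

With ~97 degrees of freedom the critical value is close to the familiar 1.96, so each interval is roughly estimate ± 2 standard errors.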


$F$-Tests for Groups of Coefficients

To test whether a group of predictors contributes to the model, partition $\beta = (\beta_1^T, \beta_2^T)^T$ where $\beta_2 \in \mathbb{R}^q$ are the coefficients being tested:

$$H_0: \beta_2 = 0 \quad \text{vs.} \quad H_1: \beta_2 \neq 0$$

Fit the full model (all predictors, $\text{RSS}_F$) and the reduced model (without the $q$ predictors, $\text{RSS}_R$):

$$F = \frac{(\text{RSS}_R - \text{RSS}_F)/q}{\text{RSS}_F/(n-p)} \sim F_{q,\, n-p} \quad \text{under } H_0$$

The numerator measures the additional variance explained by the $q$ predictors, normalized by degrees of freedom. The denominator is the usual variance estimate from the full model.

Special cases:

  • $q = 1$: reduces to the $t$-test (in fact, $F = t^2$ and $F_{1,\, n-p} \equiv t_{n-p}^2$)
  • $q = p - 1$: the global $F$-test from article 9

ANOVA as an $F$-test. One-way ANOVA tests whether group means are equal: $H_0: \mu_1 = \cdots = \mu_K$. This is exactly an $F$-test comparing the intercept-only model to a model with group indicators.
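The nested-model comparison can be sketched as follows. In this simulated setup (illustrative throughout) the last three coefficients are truly zero, so under $H_0$ the statistic behaves like a draw from $F_{3,\,n-p}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 120
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X_full @ np.array([1.0, 2.0, 0.0, 0.0, 0.0]) + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares of an OLS fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

p, q = X_full.shape[1], 3                # test the last q coefficients
rss_full = rss(X_full, y)
rss_red = rss(X_full[:, :p - q], y)      # reduced model drops the tested columns
F = ((rss_red - rss_full) / q) / (rss_full / (n - p))
p_value = stats.f.sf(F, q, n - p)
```

Note that $\text{RSS}_R \geq \text{RSS}_F$ always holds, since the reduced model is a constrained version of the full one, so $F \geq 0$.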


Prediction Intervals

For a new observation $X_{\text{new}}$ with true response $Y_{\text{new}} = X_{\text{new}}^T\beta + \varepsilon_{\text{new}}$:

Point prediction: $\hat{Y}_{\text{new}} = X_{\text{new}}^T\hat{\beta}$

Confidence interval for the mean response $E[Y_{\text{new}}] = X_{\text{new}}^T\beta$:

$$X_{\text{new}}^T\hat{\beta} \pm t_{n-p,\, \alpha/2} \cdot s\sqrt{X_{\text{new}}^T(X^TX)^{-1}X_{\text{new}}}$$

Prediction interval for a new observation $Y_{\text{new}}$:

$$X_{\text{new}}^T\hat{\beta} \pm t_{n-p,\, \alpha/2} \cdot s\sqrt{1 + X_{\text{new}}^T(X^TX)^{-1}X_{\text{new}}}$$

The prediction interval is always wider than the confidence interval because it accounts for two sources of uncertainty: estimation error in $\hat{\beta}$ and the irreducible noise $\varepsilon_{\text{new}}$. The extra "$1 +$" under the square root comes from $\text{Var}(\varepsilon_{\text{new}}) = \sigma^2$.
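Both intervals in code, on illustrative simulated data; the only difference between the two half-widths is the $1 +$ under the square root:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s = np.sqrt(resid @ resid / (n - p))
t_crit = stats.t.ppf(0.975, n - p)

x_new = np.array([1.0, 0.5])             # new point (intercept, predictor value)
h = x_new @ XtX_inv @ x_new              # X_new^T (X^T X)^{-1} X_new
point = x_new @ beta_hat
ci_half = t_crit * s * np.sqrt(h)        # half-width of the mean-response CI
pi_half = t_crit * s * np.sqrt(1 + h)    # half-width of the prediction interval
```

Because $h$ shrinks toward zero as $n$ grows, the confidence interval collapses, while the prediction interval's width stays bounded below by the irreducible $2\,t\,s$ term.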

Connection to ML. Prediction intervals quantify uncertainty in individual predictions, which is critical for decision-making. Most ML models produce point predictions without uncertainty estimates. Bayesian regression and conformal prediction are two modern approaches to generating prediction intervals for more complex models.


Model Diagnostics

The validity of inference depends on the model assumptions being approximately correct. Diagnostics help identify violations.

Residual plots. Plot residuals $e_i$ against fitted values $\hat{Y}_i$. Under the model, this plot should show no pattern:

  • Funnel shape (residual spread increasing with $\hat{Y}$): heteroscedasticity
  • Curvature: nonlinearity (model misspecification)
  • Clusters or outliers: subpopulations or influential points

QQ plot. Plot the ordered standardized residuals against theoretical quantiles of $\mathcal{N}(0,1)$. Departures from the diagonal indicate non-normality:

  • Heavy tails (S-shape): $t$-distribution-like errors, outliers
  • Light tails (inverted S): bounded error distribution
  • Skewness (curvature): asymmetric error distribution

Heteroscedasticity. When $\text{Var}(\varepsilon_i)$ depends on $i$ (or on $X_i$), OLS remains unbiased but is no longer BLUE, and the standard errors are wrong. Formal tests include the Breusch-Pagan test (regress squared residuals on $X$) and White's test. Remedies: weighted least squares (WLS), heteroscedasticity-consistent (HC) standard errors (also called sandwich or robust standard errors).
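A minimal Breusch-Pagan sketch, implementing the auxiliary regression just described via its LM form $n R^2_{\text{aux}} \sim \chi^2_{p-1}$ under $H_0$. The simulated data has error standard deviation growing with the predictor, so the test should reject:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=x)  # error sd grows with x: heteroscedastic

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e2 = (y - X @ beta_hat) ** 2             # squared OLS residuals

# auxiliary regression: squared residuals on the predictors
g = np.linalg.lstsq(X, e2, rcond=None)[0]
aux_fit = X @ g
r2_aux = 1 - np.sum((e2 - aux_fit) ** 2) / np.sum((e2 - e2.mean()) ** 2)
lm = n * r2_aux                          # LM statistic
p_value = stats.chi2.sf(lm, df=1)        # df = number of slopes in the auxiliary fit
```

In practice one would reach for a packaged implementation (e.g. in statsmodels), but the mechanics are just these two regressions.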

Influential observations. Beyond leverage ($h_{ii}$, discussed in article 9), Cook's distance measures the influence of observation $i$ on all fitted values simultaneously:

$$D_i = \frac{(\hat{Y} - \hat{Y}_{(i)})^T(\hat{Y} - \hat{Y}_{(i)})}{p \cdot s^2} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $\hat{Y}_{(i)}$ denotes fitted values with observation $i$ deleted and $r_i$ is the standardized (internally studentized) residual. Cook's distance combines leverage (how unusual $X_i$ is) with the residual (how unusual $Y_i$ is given $X_i$). Values $D_i > 1$ or $D_i > 4/n$ are commonly flagged.
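The closed form makes Cook's distance cheap to compute for all observations at once, without refitting $n$ deleted models. A sketch on illustrative simulated data with one planted high-leverage outlier:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + rng.normal(scale=0.5, size=n)
X[0, 1], y[0] = 6.0, -6.0                # plant one high-leverage outlier

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
h = np.diag(H)                           # leverages h_ii
resid = y - H @ y
s2 = resid @ resid / (n - p)
r = resid / np.sqrt(s2 * (1 - h))        # standardized residuals
D = r ** 2 / p * h / (1 - h)             # Cook's distance for every observation
```

The planted point combines extreme leverage with a huge residual, so its $D_i$ dwarfs the rest and far exceeds the $D_i > 1$ flag.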


Multicollinearity

When predictors are highly correlated, $(X^TX)^{-1}$ has large entries, inflating the variance of $\hat{\beta}$.

The variance inflation factor (VIF) for predictor $j$ is:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the $R^2$ from regressing $X_j$ on all other predictors. A VIF of 10 means the variance of $\hat{\beta}_j$ is 10 times what it would be if $X_j$ were uncorrelated with the other predictors.

Common thresholds: VIF $> 5$ warrants attention, VIF $> 10$ indicates serious collinearity. Remedies include removing redundant predictors, combining correlated predictors (e.g., via PCA), or using regularization (ridge regression directly addresses collinearity by replacing $(X^TX)^{-1}$ with the better-conditioned $(X^TX + \lambda I)^{-1}$).
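The definition translates directly into code: one auxiliary regression per predictor. A sketch with two nearly collinear columns and one independent one (the setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                  # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2): regress column j on the rest (with intercept)."""
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef = np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    resid = X[:, j] - Z @ coef
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)
```

Here the correlated pair each get a VIF above 10 while the independent column sits near 1, matching the thresholds above.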

Connection to ML. Multicollinearity is less problematic for prediction than for inference. Correlated features inflate coefficient variance but may not degrade predictive accuracy. Ridge regression handles collinearity naturally. For interpretation, however, large VIFs make individual coefficients unreliable, which is why feature selection or dimensionality reduction is standard practice.


Variable Selection

With many candidate predictors, we need systematic methods to select which ones to include.

Stepwise methods:

  • Forward selection: start with no predictors, add the one with the smallest $p$-value (from the partial $F$-test), repeat until no predictor has a $p$-value below a threshold
  • Backward elimination: start with all predictors, remove the one with the largest $p$-value, repeat until all remaining predictors are significant
  • Stepwise (bidirectional): at each step, consider both adding and removing predictors

Stepwise methods are greedy and do not guarantee finding the best subset. They are sensitive to the order of operations and can produce unstable selections.

Information criteria provide a more principled approach. For a model with $k$ parameters and maximized log-likelihood $\hat{\ell}$:

$$\text{AIC} = -2\hat{\ell} + 2k \qquad \text{BIC} = -2\hat{\ell} + k\log n$$

Both balance fit (through $\hat{\ell}$) against complexity (through the penalty on $k$). BIC's penalty grows with $n$, so it selects simpler models asymptotically and is consistent (it selects the true model with probability approaching 1 if the true model is among the candidates). AIC is not consistent but minimizes prediction error in an asymptotic sense, making it preferable when the goal is forecasting rather than identifying the true model.

For linear regression with Gaussian errors, $-2\hat{\ell} = n\log(\text{RSS}/n) + \text{const}$, so both criteria can be computed directly from the RSS without explicitly evaluating the likelihood.
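Using the RSS form, both criteria (up to the same additive constant, which cancels in comparisons) can be computed for any pair of candidate fits. A sketch on illustrative simulated data where the slope model is clearly correct:

```python
import numpy as np

def aic_bic(rss, n, k):
    """Gaussian-linear-model AIC and BIC, up to a shared additive constant."""
    neg2ll = n * np.log(rss / n)         # -2 * max log-likelihood + const
    return neg2ll + 2 * k, neg2ll + k * np.log(n)

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X0 = np.ones((n, 1))                     # intercept-only model
X1 = np.column_stack([np.ones(n), x])    # intercept + slope

def rss(X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ b) ** 2)

aic0, bic0 = aic_bic(rss(X0), n, k=1)
aic1, bic1 = aic_bic(rss(X1), n, k=2)    # the slope model should win decisively
```

The drop in $n\log(\text{RSS}/n)$ from adding the true slope dwarfs both penalties, so AIC and BIC agree here; they diverge mainly on marginal predictors.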

Best subset selection evaluates all $2^{p-1}$ possible models (the intercept is always included), which is computationally feasible only for small $p$ (say, $p \leq 20$). Mixed integer optimization has pushed this boundary to $p \approx 1000$ in recent work.

Connection to ML. Regularization-based selection (lasso, elastic net) dominates in high-dimensional settings where $p$ is large or exceeds $n$. The lasso performs variable selection implicitly by shrinking some coefficients exactly to zero. Cross-validation replaces information criteria as the primary tool for tuning the regularization strength. These methods scale to $p \gg n$ and handle correlated predictors more gracefully than stepwise procedures.
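One way to see the "exactly zero" behavior: in the special case of an orthonormal design, each lasso coefficient is just the OLS coefficient passed through a soft-thresholding operator. This is a textbook identity, not a general-purpose lasso solver:

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso coefficient under an orthonormal design: shrink |z| by lam,
    clamping to exactly zero once |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ols = np.array([2.0, -1.2, 0.3, -0.05])  # hypothetical OLS coefficients
lasso = soft_threshold(ols, 0.5)         # small coefficients are zeroed out
```

Coordinate-descent lasso solvers apply this same operator repeatedly, one coordinate at a time, on partial residuals.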


Summary

| Concept | Key Result |
| --- | --- |
| $\hat{\beta}$ distribution | $\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2(X^TX)^{-1})$ |
| $t$-test | Tests $H_0: \beta_j = 0$ conditional on other predictors |
| $F$-test | Tests significance of groups of predictors |
| Prediction interval | Wider than CI due to irreducible noise $\sigma^2$ |
| Diagnostics | Residual plots, QQ plots, Cook's distance |
| VIF | $1/(1 - R_j^2)$; detects multicollinearity |
| AIC/BIC | Model selection balancing fit and complexity |