10: Inference for Linear Models

With the linear model $Y = X\beta + \varepsilon$ established, we turn to inference: what can we say about individual coefficients, groups of coefficients, and future observations? This article covers the sampling distribution of $\hat{\beta}$, hypothesis tests, confidence and prediction intervals, model diagnostics, and variable selection. These tools determine whether the patterns found by a regression are statistically meaningful and whether the model's assumptions are reasonable.


Sampling Distribution of $\hat{\beta}$

Under the linear model with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$:

$$\hat{\beta} = (X^TX)^{-1}X^TY$$

Since $\hat{\beta}$ is a linear function of $Y$ and $Y$ is Gaussian:

$$\hat{\beta} \sim \mathcal{N}\left(\beta,\; \sigma^2 (X^TX)^{-1}\right)$$

The covariance matrix $\sigma^2(X^TX)^{-1}$ governs the precision of each coefficient estimate. Its diagonal entries give the variances:

$$\text{Var}(\hat{\beta}_j) = \sigma^2 \left[(X^TX)^{-1}\right]_{jj}$$

When $\sigma^2$ is unknown (the usual case), we replace it with $s^2 = \text{RSS}/(n-p)$. The estimated covariance matrix is $\widehat{\text{Cov}}(\hat{\beta}) = s^2(X^TX)^{-1}$, and the standard error of the $j$-th coefficient is:

$$\text{SE}(\hat{\beta}_j) = s\sqrt{[(X^TX)^{-1}]_{jj}}$$

An important distributional result: the RSS and $\hat{\beta}$ are independent, and

$$\frac{(n-p)s^2}{\sigma^2} = \frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$$
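As a concrete check, the estimate and its standard errors can be computed directly with NumPy. This is a minimal sketch on simulated data; the sample sizes, coefficients, and noise scale are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3  # n observations, p coefficients (intercept + 2 predictors)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y            # OLS estimate
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)            # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(XtX_inv))     # SE(beta_hat_j) = s * sqrt([(X^T X)^{-1}]_jj)
```

With this setup `beta_hat` lands within a few standard errors of `beta_true`, and `s2` estimates the true $\sigma^2 = 0.64$.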

$t$-Tests for Individual Coefficients

To test whether a single predictor contributes to the model:

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0$$

The test statistic is:

$$t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{s\sqrt{[(X^TX)^{-1}]_{jj}}}$$

Under $H_0$, $t_j \sim t_{n-p}$. Reject $H_0$ at level $\alpha$ if $|t_j| > t_{n-p,\,\alpha/2}$.

Interpretation. A small $p$-value for $\beta_j$ indicates that, conditional on all other predictors being in the model, $X_j$ provides statistically significant additional explanatory power. This is a partial test: the same variable can be significant in one model and insignificant in another, depending on which other predictors are included.

Multiple testing caveat. With $p - 1$ predictors, running $p - 1$ individual $t$-tests inflates the family-wise error rate. If the predictors are independent and all null, the probability of at least one false positive is $1 - (1 - \alpha)^{p-1}$. Bonferroni or Benjamini-Hochberg corrections can be applied, but the $F$-test for groups of coefficients (below) is often more appropriate.
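A sketch of the individual $t$-tests, including a Bonferroni-adjusted per-test level. The simulated design is illustrative; the third coefficient is truly zero, so its test is a true null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.5, 0.0]) + rng.normal(size=n)  # third coefficient truly zero

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
se = np.sqrt(resid @ resid / (n - p) * np.diag(XtX_inv))

t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)  # two-sided p-values

alpha, m = 0.05, p - 1                   # m individual tests on the slopes
bonferroni_level = alpha / m             # per-test level keeping FWER <= alpha
```

The strong slope ($\beta = 1.5$ with $\text{SE} \approx 0.11$) gives a tiny $p$-value, while the null coefficient's $p$-value is a roughly uniform draw.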


Confidence Intervals for Coefficients

A $100(1-\alpha)\%$ confidence interval for $\beta_j$:

$$\hat{\beta}_j \pm t_{n-p,\, \alpha/2} \cdot \text{SE}(\hat{\beta}_j)$$

This interval has the interpretation: if we repeated the experiment many times and constructed an interval each time, $100(1-\alpha)\%$ of those intervals would contain the true $\beta_j$.

For a confidence region for the entire coefficient vector $\beta$:

$$(\hat{\beta} - \beta)^T X^TX (\hat{\beta} - \beta) \leq p \cdot s^2 \cdot F_{p,\, n-p,\, \alpha}$$

This defines an ellipsoid in $\mathbb{R}^p$. The shape of the ellipsoid reflects the correlation structure among the predictors.
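The per-coefficient interval is straightforward to compute once the estimates, standard errors, and residual degrees of freedom are in hand. The numbers below are hypothetical stand-ins for a fitted model:

```python
import numpy as np
from scipy import stats

beta_hat = np.array([1.02, 1.96, -0.48])  # hypothetical estimates
se = np.array([0.09, 0.08, 0.10])         # hypothetical standard errors
df = 97                                   # residual degrees of freedom, n - p
t_crit = stats.t.ppf(0.975, df)           # critical value for a 95% interval
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
```

With ~97 degrees of freedom the critical value is close to the familiar 1.96, so each interval is roughly estimate ± 2 standard errors.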


$F$-Tests for Groups of Coefficients

To test whether a group of predictors contributes to the model, partition $\beta = (\beta_1^T, \beta_2^T)^T$ where $\beta_2 \in \mathbb{R}^q$ are the coefficients being tested:

$$H_0: \beta_2 = 0 \quad \text{vs.} \quad H_1: \beta_2 \neq 0$$

Fit the full model (all predictors, $\text{RSS}_F$) and the reduced model (without the $q$ predictors, $\text{RSS}_R$):

$$F = \frac{(\text{RSS}_R - \text{RSS}_F)/q}{\text{RSS}_F/(n-p)} \sim F_{q,\, n-p} \quad \text{under } H_0$$

The numerator measures the additional variance explained by the $q$ predictors, normalized by degrees of freedom. The denominator is the usual variance estimate from the full model.

Special cases:

  • $q = 1$: reduces to the $t$-test (in fact, $F = t^2$ and $F_{1,\, n-p} \equiv t_{n-p}^2$)
  • $q = p - 1$: the global $F$-test from article 9

ANOVA as an $F$-test. One-way ANOVA tests whether group means are equal: $H_0: \mu_1 = \cdots = \mu_K$. This is exactly an $F$-test comparing the intercept-only model to a model with group indicators.
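The nested-model comparison can be sketched as follows. In this simulated setup (illustrative throughout) the last three coefficients are truly zero, so under $H_0$ the statistic behaves like a draw from $F_{3,\,n-p}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 120
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X_full @ np.array([1.0, 2.0, 0.0, 0.0, 0.0]) + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares of an OLS fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

p, q = X_full.shape[1], 3                # test the last q coefficients
rss_full = rss(X_full, y)
rss_red = rss(X_full[:, :p - q], y)      # reduced model drops the tested columns
F = ((rss_red - rss_full) / q) / (rss_full / (n - p))
p_value = stats.f.sf(F, q, n - p)
```

Note that $\text{RSS}_R \geq \text{RSS}_F$ always holds, since the reduced model is a constrained version of the full one, so $F \geq 0$.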


Prediction Intervals

For a new observation $X_{\text{new}}$ with true response $Y_{\text{new}} = X_{\text{new}}^T\beta + \varepsilon_{\text{new}}$:

Point prediction: $\hat{Y}_{\text{new}} = X_{\text{new}}^T\hat{\beta}$

Confidence interval for the mean response $E[Y_{\text{new}}] = X_{\text{new}}^T\beta$:

$$X_{\text{new}}^T\hat{\beta} \pm t_{n-p,\, \alpha/2} \cdot s\sqrt{X_{\text{new}}^T(X^TX)^{-1}X_{\text{new}}}$$

Prediction interval for a new observation $Y_{\text{new}}$:

$$X_{\text{new}}^T\hat{\beta} \pm t_{n-p,\, \alpha/2} \cdot s\sqrt{1 + X_{\text{new}}^T(X^TX)^{-1}X_{\text{new}}}$$

The prediction interval is always wider than the confidence interval because it accounts for two sources of uncertainty: estimation error in $\hat{\beta}$ and the irreducible noise $\varepsilon_{\text{new}}$. The extra "$1 +$" under the square root comes from $\text{Var}(\varepsilon_{\text{new}}) = \sigma^2$.
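Both intervals in code, on illustrative simulated data; the only difference between the two half-widths is the $1 +$ under the square root:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s = np.sqrt(resid @ resid / (n - p))
t_crit = stats.t.ppf(0.975, n - p)

x_new = np.array([1.0, 0.5])             # new point (intercept, predictor value)
h = x_new @ XtX_inv @ x_new              # X_new^T (X^T X)^{-1} X_new
point = x_new @ beta_hat
ci_half = t_crit * s * np.sqrt(h)        # half-width of the mean-response CI
pi_half = t_crit * s * np.sqrt(1 + h)    # half-width of the prediction interval
```

Because $h$ shrinks toward zero as $n$ grows, the confidence interval collapses, while the prediction interval's width stays bounded below by the irreducible $2\,t\,s$ term.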

Connection to ML. Prediction intervals quantify uncertainty in individual predictions, which is critical for decision-making. Most ML models produce point predictions without uncertainty estimates. Bayesian regression and conformal prediction are two modern approaches to generating prediction intervals for more complex models.


Model Diagnostics

The validity of inference depends on the model assumptions being approximately correct. Diagnostics help identify violations.

Residual plots. Plot residuals $e_i$ against fitted values $\hat{Y}_i$. Under the model, this plot should show no pattern:

  • Funnel shape (residual spread increasing with $\hat{Y}$): heteroscedasticity
  • Curvature: nonlinearity (model misspecification)
  • Clusters or outliers: subpopulations or influential points

QQ plot. Plot the ordered standardized residuals against theoretical quantiles of $\mathcal{N}(0,1)$. Departures from the diagonal indicate non-normality:

  • Heavy tails (S-shape): $t$-distribution-like errors, outliers
  • Light tails (inverted S): bounded error distribution
  • Skewness (curvature): asymmetric error distribution

Heteroscedasticity. When $\text{Var}(\varepsilon_i)$ depends on $i$ (or on $X_i$), OLS remains unbiased but is no longer BLUE, and the standard errors are wrong. Formal tests include the Breusch-Pagan test (regress squared residuals on $X$) and White's test. Remedies: weighted least squares (WLS), heteroscedasticity-consistent (HC) standard errors (also called sandwich or robust standard errors).
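A minimal Breusch-Pagan sketch, implementing the auxiliary regression just described via its LM form $n R^2_{\text{aux}} \sim \chi^2_{p-1}$ under $H_0$. The simulated data has error standard deviation growing with the predictor, so the test should reject:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=x)  # error sd grows with x: heteroscedastic

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e2 = (y - X @ beta_hat) ** 2             # squared OLS residuals

# auxiliary regression: squared residuals on the predictors
g = np.linalg.lstsq(X, e2, rcond=None)[0]
aux_fit = X @ g
r2_aux = 1 - np.sum((e2 - aux_fit) ** 2) / np.sum((e2 - e2.mean()) ** 2)
lm = n * r2_aux                          # LM statistic
p_value = stats.chi2.sf(lm, df=1)        # df = number of slopes in the auxiliary fit
```

In practice one would reach for a packaged implementation (e.g. in statsmodels), but the mechanics are just these two regressions.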

Influential observations. Beyond leverage ($h_{ii}$, discussed in article 9), Cook's distance measures the influence of observation $i$ on all fitted values simultaneously:

$$D_i = \frac{(\hat{Y} - \hat{Y}_{(i)})^T(\hat{Y} - \hat{Y}_{(i)})}{p \cdot s^2} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $\hat{Y}_{(i)}$ denotes fitted values with observation $i$ deleted and $r_i$ is the standardized (internally studentized) residual. Cook's distance combines leverage (how unusual $X_i$ is) with the residual (how unusual $Y_i$ is given $X_i$). Values $D_i > 1$ or $D_i > 4/n$ are commonly flagged.
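The closed form makes Cook's distance cheap to compute for all observations at once, without refitting $n$ deleted models. A sketch on illustrative simulated data with one planted high-leverage outlier:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + rng.normal(scale=0.5, size=n)
X[0, 1], y[0] = 6.0, -6.0                # plant one high-leverage outlier

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
h = np.diag(H)                           # leverages h_ii
resid = y - H @ y
s2 = resid @ resid / (n - p)
r = resid / np.sqrt(s2 * (1 - h))        # standardized residuals
D = r ** 2 / p * h / (1 - h)             # Cook's distance for every observation
```

The planted point combines extreme leverage with a huge residual, so its $D_i$ dwarfs the rest and far exceeds the $D_i > 1$ flag.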


Multicollinearity

When predictors are highly correlated, $(X^TX)^{-1}$ has large entries, inflating the variance of $\hat{\beta}$.

The variance inflation factor (VIF) for predictor $j$ is:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the $R^2$ from regressing $X_j$ on all other predictors. A VIF of 10 means the variance of $\hat{\beta}_j$ is 10 times what it would be if $X_j$ were uncorrelated with the other predictors.

Common thresholds: VIF $> 5$ warrants attention, VIF $> 10$ indicates serious collinearity. Remedies include removing redundant predictors, combining correlated predictors (e.g., via PCA), or using regularization (ridge regression directly addresses collinearity by replacing $(X^TX)^{-1}$ with the better-conditioned $(X^TX + \lambda I)^{-1}$).
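The definition translates directly into code: one auxiliary regression per predictor. A sketch with two nearly collinear columns and one independent one (the setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                  # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2): regress column j on the rest (with intercept)."""
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef = np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    resid = X[:, j] - Z @ coef
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)
```

Here the correlated pair each get a VIF above 10 while the independent column sits near 1, matching the thresholds above.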

Connection to ML. Multicollinearity is less problematic for prediction than for inference. Correlated features inflate coefficient variance but may not degrade predictive accuracy. Ridge regression handles collinearity naturally. For interpretation, however, large VIFs make individual coefficients unreliable, which is why feature selection or dimensionality reduction is standard practice.


Variable Selection

With many candidate predictors, we need systematic methods to select which ones to include.

Stepwise methods:

  • Forward selection: start with no predictors, add the one with the smallest $p$-value (from the partial $F$-test), repeat until no predictor has a $p$-value below a threshold
  • Backward elimination: start with all predictors, remove the one with the largest $p$-value, repeat until all remaining predictors are significant
  • Stepwise (bidirectional): at each step, consider both adding and removing predictors

Stepwise methods are greedy and do not guarantee finding the best subset. They are sensitive to the order of operations and can produce unstable selections.

Information criteria provide a more principled approach. For a model with $k$ parameters and maximized log-likelihood $\hat{\ell}$:

$$\text{AIC} = -2\hat{\ell} + 2k \qquad \text{BIC} = -2\hat{\ell} + k\log n$$

Both balance fit (through $\hat{\ell}$) against complexity (through the penalty on $k$). BIC's penalty grows with $n$, so it selects simpler models asymptotically and is consistent (it selects the true model with probability approaching 1 if the true model is among the candidates). AIC is not consistent but minimizes prediction error in an asymptotic sense, making it preferable when the goal is forecasting rather than identifying the true model.

For linear regression with Gaussian errors, $-2\hat{\ell} = n\log(\text{RSS}/n) + \text{const}$, so both criteria can be computed directly from the RSS without explicitly evaluating the likelihood.
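Using the RSS form, both criteria (up to the same additive constant, which cancels in comparisons) can be computed for any pair of candidate fits. A sketch on illustrative simulated data where the slope model is clearly correct:

```python
import numpy as np

def aic_bic(rss, n, k):
    """Gaussian-linear-model AIC and BIC, up to a shared additive constant."""
    neg2ll = n * np.log(rss / n)         # -2 * max log-likelihood + const
    return neg2ll + 2 * k, neg2ll + k * np.log(n)

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X0 = np.ones((n, 1))                     # intercept-only model
X1 = np.column_stack([np.ones(n), x])    # intercept + slope

def rss(X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ b) ** 2)

aic0, bic0 = aic_bic(rss(X0), n, k=1)
aic1, bic1 = aic_bic(rss(X1), n, k=2)    # the slope model should win decisively
```

The drop in $n\log(\text{RSS}/n)$ from adding the true slope dwarfs both penalties, so AIC and BIC agree here; they diverge mainly on marginal predictors.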

Best subset selection evaluates all $2^{p-1}$ possible models (the intercept is always included), which is computationally feasible only for small $p$ (say, $p \leq 20$). Mixed integer optimization has pushed this boundary to $p \approx 1000$ in recent work.

Connection to ML. Regularization-based selection (lasso, elastic net) dominates in high-dimensional settings where $p$ is large or exceeds $n$. The lasso performs variable selection implicitly by shrinking some coefficients exactly to zero. Cross-validation replaces information criteria as the primary tool for tuning the regularization strength. These methods scale to $p \gg n$ and handle correlated predictors more gracefully than stepwise procedures.
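One way to see the "exactly zero" behavior: in the special case of an orthonormal design, each lasso coefficient is just the OLS coefficient passed through a soft-thresholding operator. This is a textbook identity, not a general-purpose lasso solver:

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso coefficient under an orthonormal design: shrink |z| by lam,
    clamping to exactly zero once |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ols = np.array([2.0, -1.2, 0.3, -0.05])  # hypothetical OLS coefficients
lasso = soft_threshold(ols, 0.5)         # small coefficients are zeroed out
```

Coordinate-descent lasso solvers apply this same operator repeatedly, one coordinate at a time, on partial residuals.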


Summary

| Concept | Key Result |
| --- | --- |
| $\hat{\beta}$ distribution | $\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2(X^TX)^{-1})$ |
| $t$-test | Tests $H_0: \beta_j = 0$ conditional on other predictors |
| $F$-test | Tests significance of groups of predictors |
| Prediction interval | Wider than CI due to irreducible noise $\sigma^2$ |
| Diagnostics | Residual plots, QQ plots, Cook's distance |
| VIF | $1/(1 - R_j^2)$; detects multicollinearity |
| AIC/BIC | Model selection balancing fit and complexity |