10: Inference for Linear Models
With the linear model established, we turn to inference: what can we say about individual coefficients, groups of coefficients, and future observations? This article covers the sampling distribution of $\hat\beta$, hypothesis tests, confidence and prediction intervals, model diagnostics, and variable selection. These tools determine whether the patterns found by a regression are statistically meaningful and whether the model’s assumptions are reasonable.
Sampling Distribution of $\hat\beta$
Under the linear model $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$:

$$\hat\beta = (X^T X)^{-1} X^T y$$

Since $\hat\beta$ is a linear function of $y$ and $y$ is Gaussian:

$$\hat\beta \sim \mathcal{N}\big(\beta,\ \sigma^2 (X^T X)^{-1}\big)$$

The covariance matrix $\sigma^2 (X^T X)^{-1}$ governs the precision of each coefficient estimate. Its diagonal entries give the variances:

$$\operatorname{Var}(\hat\beta_j) = \sigma^2 \big[(X^T X)^{-1}\big]_{jj}$$

When $\sigma^2$ is unknown (the usual case), we replace it with $\hat\sigma^2 = \mathrm{RSS}/(n - p - 1)$. The estimated covariance matrix is $\hat\sigma^2 (X^T X)^{-1}$, and the standard error of the $j$-th coefficient is:

$$\operatorname{se}(\hat\beta_j) = \hat\sigma \sqrt{\big[(X^T X)^{-1}\big]_{jj}}$$

An important distributional result: the RSS and $\hat\beta$ are independent, and

$$\frac{(n - p - 1)\,\hat\sigma^2}{\sigma^2} = \frac{\mathrm{RSS}}{\sigma^2} \sim \chi^2_{n-p-1}$$
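As a sanity check, these quantities can be computed directly with NumPy. The sketch below uses synthetic data; the design, true coefficients, and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)            # sigma = 0.5

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y           # OLS estimate
resid = y - X @ beta_hat
rss = resid @ resid
sigma2_hat = rss / (n - p - 1)         # unbiased estimate of sigma^2
cov_hat = sigma2_hat * XtX_inv         # estimated covariance of beta_hat
se = np.sqrt(np.diag(cov_hat))         # standard error of each coefficient
```

With $n = 100$ and $\sigma = 0.5$, each standard error is on the order of 0.05, so `beta_hat` should land close to `beta_true`.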
$t$-Tests for Individual Coefficients
To test whether a single predictor contributes to the model:

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0$$

The test statistic is:

$$t_j = \frac{\hat\beta_j}{\operatorname{se}(\hat\beta_j)}$$

Under $H_0$, $t_j \sim t_{n-p-1}$. Reject at level $\alpha$ if $|t_j| > t_{n-p-1,\,1-\alpha/2}$.
Interpretation. A small $p$-value for $\hat\beta_j$ indicates that, conditional on all other predictors being in the model, $x_j$ provides statistically significant additional explanatory power. This is a partial test: the same variable can be significant in one model and insignificant in another, depending on which other predictors are included.
Multiple testing caveat. With $p$ predictors, running $p$ individual $t$-tests inflates the family-wise error rate. If the predictors are independent and all null, the probability of at least one false positive is $1 - (1 - \alpha)^p$. Bonferroni or Benjamini-Hochberg corrections can be applied, but the $F$-test for groups of coefficients (below) is often more appropriate.
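A minimal sketch of the $t$-test, assuming SciPy is available; the data are synthetic and the second predictor is deliberately null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                       # no true effect
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
dof = n - X.shape[1]                          # n - p - 1
sigma2_hat = resid @ resid / dof
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)   # two-sided p-values
```

The $p$-value for `x1` is essentially zero, while the one for `x2` behaves like a uniform draw under the null.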
Confidence Intervals for Coefficients
A $(1 - \alpha)$ confidence interval for $\beta_j$:

$$\hat\beta_j \pm t_{n-p-1,\,1-\alpha/2} \cdot \operatorname{se}(\hat\beta_j)$$
This interval has the interpretation: if we repeated the experiment many times and constructed an interval each time, a fraction $1 - \alpha$ of those intervals would contain the true $\beta_j$.
For a joint $(1 - \alpha)$ confidence region for the entire coefficient vector $\beta$:

$$(\hat\beta - \beta)^T X^T X (\hat\beta - \beta) \le (p + 1)\,\hat\sigma^2\, F_{p+1,\,n-p-1,\,1-\alpha}$$

This defines an ellipsoid in $\mathbb{R}^{p+1}$. The shape of the ellipsoid reflects the correlation structure among the predictors.
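Computing the per-coefficient interval is a one-liner once the standard errors are in hand; a sketch with synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
dof = n - 2
se = np.sqrt((resid @ resid / dof) * np.diag(XtX_inv))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, dof)      # two-sided critical value
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se                # 95% CI for each coefficient
```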
$F$-Tests for Groups of Coefficients
To test whether a group of $q$ predictors contributes to the model, partition $\beta = (\beta_{(1)}, \beta_{(2)})$ where $\beta_{(2)} \in \mathbb{R}^q$ are the coefficients being tested:

$$H_0: \beta_{(2)} = 0$$

Fit the full model (all predictors, $\mathrm{RSS}_{\text{full}}$) and the reduced model (without the $q$ predictors, $\mathrm{RSS}_{\text{reduced}}$):

$$F = \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}})/q}{\mathrm{RSS}_{\text{full}}/(n - p - 1)} \sim F_{q,\,n-p-1} \quad \text{under } H_0$$

The numerator measures the additional variance explained by the $q$ predictors, normalized by degrees of freedom. The denominator is the usual variance estimate from the full model.
Special cases:
- $q = 1$: reduces to the $t$-test (in fact, $F = t^2$ and $F_{1,\,m} = t_m^2$ as distributions)
- $q = p$: the global $F$-test from article 9
ANOVA as an $F$-test. One-way ANOVA tests whether group means are equal: $H_0: \mu_1 = \mu_2 = \cdots = \mu_K$. This is exactly an $F$-test comparing the intercept-only model to a model with group indicators.
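The partial $F$-test reduces to two fits and one ratio. A sketch with a synthetic design, where the tested group contains one real and two null predictors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 150
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
beta = np.array([1.0, 2.0, 0.0, 0.0, 1.5])   # last predictor (in tested group) matters
y = X_full @ beta + rng.normal(size=n)

def rss(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

X_reduced = X_full[:, :2]                    # drop the q = 3 tested predictors
q = 3
dof_full = n - X_full.shape[1]               # n - p - 1
F = ((rss(X_reduced, y) - rss(X_full, y)) / q) / (rss(X_full, y) / dof_full)
p_value = stats.f.sf(F, q, dof_full)
```

Because one tested coefficient is large, $F$ is far out in the tail and the group is overwhelmingly significant.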
Prediction Intervals
For a new observation $x_0$ with true response $y_0 = x_0^T \beta + \varepsilon_0$:

Point prediction: $\hat y_0 = x_0^T \hat\beta$

Confidence interval for the mean response $\mathbb{E}[y_0] = x_0^T \beta$:

$$\hat y_0 \pm t_{n-p-1,\,1-\alpha/2}\; \hat\sigma \sqrt{x_0^T (X^T X)^{-1} x_0}$$

Prediction interval for a new observation $y_0$:

$$\hat y_0 \pm t_{n-p-1,\,1-\alpha/2}\; \hat\sigma \sqrt{1 + x_0^T (X^T X)^{-1} x_0}$$
The prediction interval is always wider than the confidence interval because it accounts for two sources of uncertainty: estimation error in $\hat\beta$ and the irreducible noise $\varepsilon_0$. The extra "$1$" under the square root comes from $\operatorname{Var}(\varepsilon_0) = \sigma^2$.
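Both intervals share the same structure and differ only by that extra "1" under the root; a sketch with a single synthetic predictor:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 80
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
dof = n - 2
sigma_hat = np.sqrt(resid @ resid / dof)

x0 = np.array([1.0, 5.0])                        # new point at x = 5 (with intercept)
y0_hat = x0 @ beta_hat                           # point prediction
t_crit = stats.t.ppf(0.975, dof)
h0 = x0 @ XtX_inv @ x0                           # x0' (X'X)^-1 x0

ci_half = t_crit * sigma_hat * np.sqrt(h0)       # half-width: CI for mean response
pi_half = t_crit * sigma_hat * np.sqrt(1 + h0)   # half-width: PI for new observation
```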
Connection to ML. Prediction intervals quantify uncertainty in individual predictions, which is critical for decision-making. Most ML models produce point predictions without uncertainty estimates. Bayesian regression and conformal prediction are two modern approaches to generating prediction intervals for more complex models.
Model Diagnostics
The validity of inference depends on the model assumptions being approximately correct. Diagnostics help identify violations.
Residual plots. Plot residuals $e_i = y_i - \hat y_i$ against fitted values $\hat y_i$. Under the model, this plot should show no pattern:
- Funnel shape (residual spread increasing with $\hat y_i$): heteroscedasticity
- Curvature: nonlinearity (model misspecification)
- Clusters or outliers: subpopulations or influential points
QQ plot. Plot the ordered standardized residuals against theoretical quantiles of $\mathcal{N}(0, 1)$. Departures from the diagonal indicate non-normality:
- Heavy tails (S-shape): $t$-distribution-like errors, outliers
- Light tails (inverted S): bounded error distribution
- Skewness (curvature): asymmetric error distribution
Heteroscedasticity. When $\operatorname{Var}(\varepsilon_i)$ depends on $x_i$ (or on $\mathbb{E}[y_i]$), OLS remains unbiased but is no longer BLUE, and the usual standard errors are wrong. Formal tests include the Breusch-Pagan test (regress squared residuals on $X$) and White’s test. Remedies: weighted least squares (WLS), or heteroscedasticity-consistent (HC) standard errors (also called sandwich or robust standard errors).
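HC0 sandwich standard errors need only the residuals and the design matrix. A sketch with deliberately heteroscedastic synthetic noise (HC0 is the simplest variant; finite-sample corrections such as HC1–HC3 rescale it):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(1, 5, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=x)      # noise sd grows with x
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# Classical covariance (assumes constant variance)
cov_ols = (e @ e / (n - 2)) * XtX_inv

# HC0 sandwich: (X'X)^-1 X' diag(e_i^2) X (X'X)^-1
meat = X.T @ (X * e[:, None] ** 2)
cov_hc0 = XtX_inv @ meat @ XtX_inv

se_ols = np.sqrt(np.diag(cov_ols))
se_hc0 = np.sqrt(np.diag(cov_hc0))
```

When the noise really is heteroscedastic, the two sets of standard errors disagree, and only the sandwich version is (asymptotically) valid.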
Influential observations. Beyond leverage ($h_{ii}$, discussed in article 9), Cook’s distance measures the influence of observation $i$ on all fitted values simultaneously:

$$D_i = \frac{\sum_{j=1}^n \big(\hat y_j - \hat y_{j(i)}\big)^2}{(p + 1)\,\hat\sigma^2} = \frac{e_i^2}{(p + 1)\,\hat\sigma^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where $\hat y_{j(i)}$ denotes fitted values with observation $i$ deleted. Cook’s distance combines leverage (how unusual $x_i$ is) with the residual (how unusual $y_i$ is given $x_i$). Values $D_i > 1$ or $D_i > 4/n$ are commonly flagged.
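The closed-form identity means Cook’s distance needs no refitting. A sketch that plants one high-leverage outlier in synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 5.0, -10.0                      # plant a high-leverage outlier

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverages h_ii
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
p_params = X.shape[1]                        # p + 1
sigma2_hat = e @ e / (n - p_params)

# D_i = e_i^2 / ((p+1) sigma2_hat) * h_ii / (1 - h_ii)^2
D = e**2 / (p_params * sigma2_hat) * h / (1 - h) ** 2
```

The planted point dominates: its Cook’s distance is the maximum and well above the $D_i > 1$ rule of thumb.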
Multicollinearity
When predictors are highly correlated, $(X^T X)^{-1}$ has large entries, inflating the variance of $\hat\beta$.
The variance inflation factor (VIF) for predictor $j$ is:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the $R^2$ from regressing $x_j$ on all the other predictors. A VIF of 10 means the variance of $\hat\beta_j$ is 10 times what it would be if $x_j$ were uncorrelated with the other predictors.
Common thresholds: VIF $> 5$ warrants attention, VIF $> 10$ indicates serious collinearity. Remedies include removing redundant predictors, combining correlated predictors (e.g., via PCA), or using regularization (ridge regression directly addresses collinearity by shrinking $\hat\beta$).
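VIFs can be computed with $p$ auxiliary regressions; a sketch with two nearly collinear synthetic predictors (the data-generating setup is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)            # x1 and x2 share the factor z
x2 = z + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)                      # independent predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the others."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ b
    r2 = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

Here `x1` and `x2` get VIFs far above 10, while the independent `x3` stays near 1.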
Connection to ML. Multicollinearity is less problematic for prediction than for inference. Correlated features inflate coefficient variance but may not degrade predictive accuracy. Ridge regression handles collinearity naturally. For interpretation, however, large VIFs make individual coefficients unreliable, which is why feature selection or dimensionality reduction is standard practice.
Variable Selection
With many candidate predictors, we need systematic methods to select which ones to include.
Stepwise methods:
- Forward selection: start with no predictors, add the one with the smallest $p$-value (from the partial $F$-test), repeat until no candidate has a $p$-value below a threshold
- Backward elimination: start with all predictors, remove the one with the largest $p$-value, repeat until all remaining predictors are significant
- Stepwise (bidirectional): at each step, consider both adding and removing predictors
Stepwise methods are greedy and do not guarantee finding the best subset. They are sensitive to the order of operations and can produce unstable selections.
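Forward selection by partial $F$-test takes only a few lines; the sketch below (synthetic data, 0.05 entry threshold as an illustrative choice) makes the greediness visible: each step conditions only on the current set.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)   # predictors 0 and 3 matter

def rss(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ b
    return r @ r

selected, remaining = [], list(range(p))
while remaining:
    rss_cur = rss(selected)
    best_j, best_p = None, 1.0
    for j in remaining:
        rss_new = rss(selected + [j])
        dof = n - len(selected) - 2               # n minus (intercept + new set)
        F = (rss_cur - rss_new) / (rss_new / dof)  # partial F with q = 1
        pval = stats.f.sf(F, 1, dof)
        if pval < best_p:
            best_j, best_p = j, pval
    if best_p >= 0.05:                            # entry threshold
        break
    selected.append(best_j)
    remaining.remove(best_j)
```

With strong true effects, the two real predictors enter first; a noise predictor can still slip in, which is exactly the instability the text describes.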
Information criteria provide a more principled approach. For a model with $k$ parameters and maximized log-likelihood $\hat\ell$:

$$\mathrm{AIC} = -2\hat\ell + 2k, \qquad \mathrm{BIC} = -2\hat\ell + k \log n$$

Both balance fit (through $\hat\ell$) against complexity (through the penalty on $k$). BIC’s penalty grows with $n$, so it selects simpler models asymptotically and is consistent (selects the true model with probability approaching 1 if the true model is among the candidates). AIC is not consistent but minimizes prediction error in an asymptotic sense, making it preferable when the goal is forecasting rather than identifying the true model.

For linear regression with Gaussian errors, $-2\hat\ell = n \log(\mathrm{RSS}/n) + \text{const}$, so these criteria can be computed directly from the RSS without evaluating the likelihood.
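For Gaussian linear models, both criteria follow from the RSS of each fit. A sketch comparing the true one-predictor model against a model padded with noise predictors (the additive constant is dropped, so only differences are meaningful):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 120
x = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)     # only the first predictor matters

def aic_bic(X_pred, y):
    """AIC/BIC up to an additive constant, via -2*loglik = n*log(RSS/n) + const."""
    X = np.column_stack([np.ones(len(y)), X_pred])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    k = X.shape[1] + 1                           # coefficients plus noise variance
    fit = len(y) * np.log(rss / len(y))
    return fit + 2 * k, fit + k * np.log(len(y))

aic_small, bic_small = aic_bic(x[:, :1], y)      # true model
aic_big, bic_big = aic_bic(x, y)                 # plus 4 noise predictors
```

BIC’s heavier penalty ($\log 120 \approx 4.8$ per parameter) reliably favors the smaller true model here.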
Best subset selection evaluates all $2^p$ possible models (the intercept is always kept), which is computationally feasible only for small $p$ (say, $p \lesssim 30$). Mixed integer optimization has pushed this boundary to $p$ in the thousands in recent work.
Connection to ML. Regularization-based selection (lasso, elastic net) dominates in high-dimensional settings where $p$ is large or exceeds $n$. The lasso performs variable selection implicitly by shrinking some coefficients exactly to zero. Cross-validation replaces information criteria as the primary tool for tuning the regularization strength. These methods scale to $p \gg n$ and handle correlated predictors more gracefully than stepwise procedures.
Summary
| Concept | Key Result |
|---|---|
| $\hat\beta$ distribution | $\hat\beta \sim \mathcal{N}(\beta,\ \sigma^2 (X^T X)^{-1})$ |
| $t$-test | Tests $H_0: \beta_j = 0$ conditional on other predictors |
| $F$-test | Tests significance of groups of predictors |
| Prediction interval | Wider than CI due to irreducible noise |
| Diagnostics | Residual plots, QQ plots, Cook’s distance |
| VIF | $\mathrm{VIF}_j = 1/(1 - R_j^2)$; detects multicollinearity |
| AIC/BIC | Model selection balancing fit and complexity |