8: Likelihood Ratio Tests

The likelihood ratio test (LRT) is a general-purpose method for comparing nested statistical models. It provides an optimal test for simple hypotheses (Neyman-Pearson lemma) and extends to composite hypotheses via Wilks’ theorem. LRTs are the foundation for model comparison in both classical statistics and modern ML.


Simple Hypotheses: Neyman-Pearson Lemma

Consider testing:

  • $H_0$: $X \sim f_0(x)$
  • $H_1$: $X \sim f_1(x)$

where both $f_0$ and $f_1$ are completely specified (simple hypotheses).

Neyman-Pearson Lemma. The most powerful test at significance level $\alpha$ rejects $H_0$ when:

$$\Lambda(\mathbf{x}) = \frac{L_1(\mathbf{x})}{L_0(\mathbf{x})} = \frac{\prod_{i=1}^n f_1(x_i)}{\prod_{i=1}^n f_0(x_i)} > c$$

where the threshold $c$ is chosen such that $P(\Lambda > c \mid H_0) = \alpha$.

“Most powerful” means no other test at the same significance level has higher power (probability of rejecting $H_0$ when $H_1$ is true). The Neyman-Pearson lemma guarantees that the likelihood ratio is the optimal test statistic for simple-vs-simple hypothesis testing.

Intuition. The likelihood ratio measures how much more likely the data are under $H_1$ than under $H_0$. Large values indicate the data strongly favor $H_1$.
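As a small illustrative sketch (the specific distributions and sample size are assumptions, not from the text), the Neyman-Pearson test for $H_0: X \sim N(0,1)$ vs $H_1: X \sim N(1,1)$ can be simulated: the threshold $c$ is calibrated on the null distribution of the log likelihood ratio, and the power is estimated under the alternative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def log_lr(x):
    """Log likelihood ratio log(L1/L0) for H0: N(0,1) vs H1: N(1,1)."""
    return stats.norm.logpdf(x, loc=1).sum() - stats.norm.logpdf(x, loc=0).sum()

n, alpha = 20, 0.05

# Calibrate the threshold c so that P(Lambda > c | H0) ~= alpha,
# using the simulated null distribution of the log likelihood ratio.
null_stats = np.array([log_lr(rng.normal(0, 1, n)) for _ in range(5000)])
c = np.quantile(null_stats, 1 - alpha)

# Power: probability of rejecting H0 when H1 is true.
alt_stats = np.array([log_lr(rng.normal(1, 1, n)) for _ in range(5000)])
power = (alt_stats > c).mean()
```

With a mean shift of one standard deviation and $n = 20$, the test is close to certain to reject under $H_1$, matching the lemma's claim that no other level-$\alpha$ test does better.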


Generalized Likelihood Ratio Test (GLRT)

For composite hypotheses (parameters not fully specified), we replace the simple likelihoods with maximized likelihoods:

$$\Lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta; \mathbf{x})}{\sup_{\theta \in \Theta} L(\theta; \mathbf{x})} = \frac{L(\hat{\theta}_0; \mathbf{x})}{L(\hat{\theta}; \mathbf{x})}$$

where:

  • $\hat{\theta}_0$ is the MLE under the null (restricted parameter space $\Theta_0$)
  • $\hat{\theta}$ is the unrestricted MLE (full parameter space $\Theta$)

Since $\Theta_0 \subseteq \Theta$, the unrestricted MLE always achieves at least as high a likelihood, so $\Lambda \in [0, 1]$. Values of $\Lambda$ near 0 indicate the null is a poor fit compared to the unrestricted model.

Reject $H_0$ when $\Lambda < c_\alpha$, or equivalently when $-2\log\Lambda > c'_\alpha$.


Wilks’ Theorem

Theorem (Wilks, 1938). Under $H_0$ and regularity conditions, as $n \to \infty$:

$$-2\log\Lambda \xrightarrow{d} \chi^2_k$$

where $k = \dim(\Theta) - \dim(\Theta_0)$ is the number of parameters constrained by $H_0$.

This is powerful: regardless of the specific distributions involved, the test statistic follows a chi-squared distribution with degrees of freedom equal to the difference in dimensionality between the full and null models. The p-value is:

$$p = P(\chi^2_k \geq -2\log\Lambda)$$
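In practice this p-value is one line of code: the survival function of the chi-squared distribution evaluated at the observed statistic. The statistic value below is made up for illustration.

```python
from scipy.stats import chi2

stat = 7.38  # observed -2 log Lambda (hypothetical value)
k = 2        # parameters constrained by H0

# p = P(chi^2_k >= -2 log Lambda), via the chi-squared survival function
p = chi2.sf(stat, df=k)
```

For $k = 2$ degrees of freedom, $\chi^2_2$ is an exponential distribution with mean 2, so $p = e^{-\text{stat}/2} \approx 0.025$ here.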

Examples

Testing a Normal Mean

$H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$ with known $\sigma^2$.

  • Restricted MLE: $\hat{\mu}_0 = \mu_0$ (fixed)
  • Unrestricted MLE: $\hat{\mu} = \bar{X}$
  • $-2\log\Lambda = \frac{n(\bar{X} - \mu_0)^2}{\sigma^2} = Z^2 \sim \chi^2_1$

This recovers the z-test.
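This identity is easy to verify numerically. A sketch with simulated data (the seed, mean, and sample size are arbitrary): compute $-2\log\Lambda$ from the two maximized log-likelihoods and compare it to the closed-form $n(\bar{X} - \mu_0)^2/\sigma^2$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, sigma, n = 0.0, 1.0, 50
x = rng.normal(0.3, sigma, n)  # true mean 0.3, so H0: mu = 0 is false

# -2 log Lambda from the restricted and unrestricted log-likelihoods
ll_null = stats.norm.logpdf(x, loc=mu0, scale=sigma).sum()
ll_alt = stats.norm.logpdf(x, loc=x.mean(), scale=sigma).sum()
lrt = -2 * (ll_null - ll_alt)

# Closed form: n (xbar - mu0)^2 / sigma^2, the squared z-statistic
z2 = n * (x.mean() - mu0) ** 2 / sigma**2
```

The two quantities agree exactly (up to floating point), confirming that the LRT for a normal mean with known variance is the z-test.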

Comparing Nested Regression Models

Model 1 (null): $Y = \beta_0 + \beta_1 X_1 + \epsilon$ ($p_0 = 2$ parameters)
Model 2 (full): $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$ ($p = 4$ parameters)

The LRT statistic $-2(\ell_1 - \ell_2) \sim \chi^2_{p - p_0} = \chi^2_2$ under $H_0$. If $\beta_2 = \beta_3 = 0$, the additional features do not improve the model.
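A minimal sketch of this comparison with synthetic data (the data-generating process, where only $X_1$ matters, is an assumption for illustration). For Gaussian errors, the maximized log-likelihood of an OLS fit has a closed form in terms of the residual sum of squares.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)  # X2, X3 are truly irrelevant

def gaussian_loglik(y, Xmat):
    """Maximized Gaussian log-likelihood of an OLS fit (MLE variance = RSS/n)."""
    Xd = np.column_stack([np.ones(len(y)), Xmat])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = ((y - Xd @ beta) ** 2).sum()
    s2 = rss / len(y)
    return -0.5 * len(y) * (np.log(2 * np.pi * s2) + 1)

ll_null = gaussian_loglik(y, X[:, :1])  # intercept + X1
ll_full = gaussian_loglik(y, X)         # intercept + X1 + X2 + X3
lrt = -2 * (ll_null - ll_full)
p = chi2.sf(lrt, df=2)                  # chi^2_2 under H0
```

Since the data were generated with $\beta_2 = \beta_3 = 0$, the p-value should typically be unremarkable, and the test (correctly) gives no strong reason to prefer the full model.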

Testing Independence in Contingency Tables

$H_0$: two categorical variables are independent. The LRT statistic:

$$G^2 = 2\sum_{i,j} O_{ij} \log\frac{O_{ij}}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}$$

where $O_{ij}$ are observed counts and $E_{ij}$ are expected counts under independence, for a table with $r$ rows and $c$ columns.
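SciPy computes $G^2$ directly via `chi2_contingency` with `lambda_="log-likelihood"` (the counts below are made up; `correction=False` disables the Yates continuity correction so the statistic matches the formula above exactly).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts
table = np.array([[30, 10],
                  [20, 40]])

# lambda_="log-likelihood" selects the G^2 (likelihood-ratio) statistic
g2, p, dof, expected = chi2_contingency(table, correction=False,
                                        lambda_="log-likelihood")
```

Here `expected` holds the $E_{ij}$ under independence (row total times column total over $n$), and `dof` is $(r-1)(c-1) = 1$.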


Connection to Information Criteria

The LRT compares two specific models. Information criteria extend this to model selection among non-nested models:

AIC (Akaike Information Criterion):

$$\text{AIC} = -2\ell(\hat{\theta}) + 2k$$

BIC (Bayesian Information Criterion):

$$\text{BIC} = -2\ell(\hat{\theta}) + k\log n$$

where $k$ is the number of parameters and $n$ is the sample size. Both penalize model complexity: AIC asymptotically selects the model that minimizes prediction error (KL divergence), while BIC selects the true model (if it’s among the candidates) with probability approaching 1.

For nested models, the AIC difference reduces to the LRT statistic plus a fixed penalty: $\text{AIC}_0 - \text{AIC}_1 = -2\log\Lambda - 2\,\Delta k$, where $\Delta k$ is the number of extra parameters in the full model. AIC therefore prefers the full model exactly when $-2\log\Lambda > 2\Delta k$, a flat charge of 2 per parameter, whereas the LRT compares $-2\log\Lambda$ against a $\chi^2$ quantile.


Application to ML Model Comparison

The likelihood ratio framework extends to comparing ML models:

Deviance. For generalized linear models, the deviance $D = -2(\ell_{\text{model}} - \ell_{\text{saturated}})$ is a likelihood ratio statistic comparing the fitted model to a saturated model (one parameter per observation). The difference in deviance between two nested models with $p_1 < p_2$ parameters follows $\chi^2_{p_2 - p_1}$.
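As a concrete sketch (the counts are made up), the deviance of an intercept-only Poisson model has the closed form $D = 2\sum_i [y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]$, with $\hat{\mu}_i = \bar{y}$ for every observation:

```python
import numpy as np
from scipy.special import xlogy

y = np.array([2, 4, 3, 7, 5, 1, 6, 4])

# Intercept-only Poisson model: the fitted mean is ybar for every observation
mu = np.full_like(y, y.mean(), dtype=float)

# Poisson deviance: 2 * sum[ y*log(y/mu) - (y - mu) ]; xlogy handles y = 0
deviance = 2 * (xlogy(y, y / mu) - (y - mu)).sum()
```

Adding covariates can only reduce this quantity, and the drop in deviance between nested fits is exactly the $-2\log\Lambda$ statistic of the section above.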

The tracker cost model uses a different evaluation paradigm: bootstrap confidence intervals on MAE rather than an LRT. This is because the comparison is between a lookup table (not a parametric model) and XGBoost, which are not nested. The bootstrap CI approach (intervals of [3,314, 3,627] vs. [3,623, 3,984]) provides the same kind of rigorous comparison without requiring nested model structure.

Cross-validation vs LRT. LRT requires nested models and correct specification. Cross-validation works for any two models (nested or not, parametric or not) and directly estimates prediction performance. For this reason, cross-validation is the standard comparison method in ML, while LRT is standard in parametric statistics.


Summary

| Concept | Key Result |
| --- | --- |
| Neyman-Pearson | Likelihood ratio is the most powerful test for simple hypotheses |
| GLRT | $\Lambda = L(\hat{\theta}_0)/L(\hat{\theta})$; reject $H_0$ when $\Lambda$ is small |
| Wilks’ theorem | $-2\log\Lambda \to \chi^2_k$ asymptotically |
| Degrees of freedom | $k$ = number of parameters constrained by $H_0$ |
| AIC/BIC | LRT + complexity penalty for non-nested model selection |

The likelihood ratio provides a principled framework for deciding whether a more complex model is justified by the data. Wilks’ theorem gives a universal asymptotic distribution, making LRTs applicable across a wide range of parametric models.