8: Likelihood Ratio Tests

The likelihood ratio test (LRT) is a general-purpose method for comparing nested statistical models. It provides an optimal test for simple hypotheses (Neyman-Pearson lemma) and extends to composite hypotheses via Wilks’ theorem. LRTs are the foundation for model comparison in both classical statistics and modern ML.


Simple Hypotheses: Neyman-Pearson Lemma

Consider testing:

  • $H_0$: $X \sim f_0(x)$
  • $H_1$: $X \sim f_1(x)$

where both $f_0$ and $f_1$ are completely specified (simple hypotheses).

Neyman-Pearson Lemma. The most powerful test at significance level $\alpha$ rejects $H_0$ when:

$$\Lambda(\mathbf{x}) = \frac{L_1(\mathbf{x})}{L_0(\mathbf{x})} = \frac{\prod_{i=1}^n f_1(x_i)}{\prod_{i=1}^n f_0(x_i)} > c$$

where the threshold $c$ is chosen such that $P(\Lambda > c \mid H_0) = \alpha$.

“Most powerful” means no other test at the same significance level has higher power (probability of rejecting $H_0$ when $H_1$ is true). The Neyman-Pearson lemma guarantees that the likelihood ratio is the optimal test statistic for simple-vs-simple hypothesis testing.

Intuition. The likelihood ratio measures how much more likely the data are under $H_1$ than under $H_0$. Large values indicate the data strongly favor $H_1$.
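As a small illustrative sketch (the specific distributions and sample size are assumptions, not from the text), the Neyman-Pearson test for $H_0: X \sim N(0,1)$ vs $H_1: X \sim N(1,1)$ can be simulated: the threshold $c$ is calibrated on the null distribution of the log likelihood ratio, and the power is estimated under the alternative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def log_lr(x):
    """Log likelihood ratio log(L1/L0) for H0: N(0,1) vs H1: N(1,1)."""
    return stats.norm.logpdf(x, loc=1).sum() - stats.norm.logpdf(x, loc=0).sum()

n, alpha = 20, 0.05

# Calibrate the threshold c so that P(Lambda > c | H0) ~= alpha,
# using the simulated null distribution of the log likelihood ratio.
null_stats = np.array([log_lr(rng.normal(0, 1, n)) for _ in range(5000)])
c = np.quantile(null_stats, 1 - alpha)

# Power: probability of rejecting H0 when H1 is true.
alt_stats = np.array([log_lr(rng.normal(1, 1, n)) for _ in range(5000)])
power = (alt_stats > c).mean()
```

With a mean shift of one standard deviation and $n = 20$, the test is close to certain to reject under $H_1$, matching the lemma's claim that no other level-$\alpha$ test does better.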


Generalized Likelihood Ratio Test (GLRT)

For composite hypotheses (parameters not fully specified), we replace the simple likelihoods with maximized likelihoods:

$$\Lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta; \mathbf{x})}{\sup_{\theta \in \Theta} L(\theta; \mathbf{x})} = \frac{L(\hat{\theta}_0; \mathbf{x})}{L(\hat{\theta}; \mathbf{x})}$$

where:

  • $\hat{\theta}_0$ is the MLE under the null (restricted parameter space $\Theta_0$)
  • $\hat{\theta}$ is the unrestricted MLE (full parameter space $\Theta$)

Since $\Theta_0 \subseteq \Theta$, the unrestricted MLE always achieves at least as high a likelihood, so $\Lambda \in [0, 1]$. Values of $\Lambda$ near 0 indicate the null is a poor fit compared to the unrestricted model.

Reject $H_0$ when $\Lambda < c_\alpha$, or equivalently when $-2\log\Lambda > c'_\alpha$.


Wilks’ Theorem

Theorem (Wilks, 1938). Under $H_0$ and regularity conditions, as $n \to \infty$:

$$-2\log\Lambda \xrightarrow{d} \chi^2_k$$

where $k = \dim(\Theta) - \dim(\Theta_0)$ is the number of parameters constrained by $H_0$.

This is powerful: regardless of the specific distributions involved, the test statistic follows a chi-squared distribution with degrees of freedom equal to the difference in dimensionality between the full and null models. The p-value is:

$$p = P(\chi^2_k \geq -2\log\Lambda)$$
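In practice this p-value is one line of code: the survival function of the chi-squared distribution evaluated at the observed statistic. The statistic value below is made up for illustration.

```python
from scipy.stats import chi2

stat = 7.38  # observed -2 log Lambda (hypothetical value)
k = 2        # parameters constrained by H0

# p = P(chi^2_k >= -2 log Lambda), via the chi-squared survival function
p = chi2.sf(stat, df=k)
```

For $k = 2$ degrees of freedom, $\chi^2_2$ is an exponential distribution with mean 2, so $p = e^{-\text{stat}/2} \approx 0.025$ here.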

Examples

Testing a Normal Mean

$H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$ with known $\sigma^2$.

  • Restricted MLE: $\hat{\mu}_0 = \mu_0$ (fixed)
  • Unrestricted MLE: $\hat{\mu} = \bar{X}$
  • $-2\log\Lambda = \frac{n(\bar{X} - \mu_0)^2}{\sigma^2} = Z^2 \sim \chi^2_1$

This recovers the z-test.
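This identity is easy to verify numerically. A sketch with simulated data (the seed, mean, and sample size are arbitrary): compute $-2\log\Lambda$ from the two maximized log-likelihoods and compare it to the closed-form $n(\bar{X} - \mu_0)^2/\sigma^2$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, sigma, n = 0.0, 1.0, 50
x = rng.normal(0.3, sigma, n)  # true mean 0.3, so H0: mu = 0 is false

# -2 log Lambda from the restricted and unrestricted log-likelihoods
ll_null = stats.norm.logpdf(x, loc=mu0, scale=sigma).sum()
ll_alt = stats.norm.logpdf(x, loc=x.mean(), scale=sigma).sum()
lrt = -2 * (ll_null - ll_alt)

# Closed form: n (xbar - mu0)^2 / sigma^2, the squared z-statistic
z2 = n * (x.mean() - mu0) ** 2 / sigma**2
```

The two quantities agree exactly (up to floating point), confirming that the LRT for a normal mean with known variance is the z-test.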

Comparing Nested Regression Models

Model 1 (null): $Y = \beta_0 + \beta_1 X_1 + \epsilon$ ($p_0 = 2$ parameters)
Model 2 (full): $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$ ($p = 4$ parameters)

The LRT statistic $-2(\ell_1 - \ell_2) \sim \chi^2_{p - p_0} = \chi^2_2$ under $H_0$. If $\beta_2 = \beta_3 = 0$, the additional features do not improve the model.
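A minimal sketch of this comparison with synthetic data (the data-generating process, where only $X_1$ matters, is an assumption for illustration). For Gaussian errors, the maximized log-likelihood of an OLS fit has a closed form in terms of the residual sum of squares.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)  # X2, X3 are truly irrelevant

def gaussian_loglik(y, Xmat):
    """Maximized Gaussian log-likelihood of an OLS fit (MLE variance = RSS/n)."""
    Xd = np.column_stack([np.ones(len(y)), Xmat])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = ((y - Xd @ beta) ** 2).sum()
    s2 = rss / len(y)
    return -0.5 * len(y) * (np.log(2 * np.pi * s2) + 1)

ll_null = gaussian_loglik(y, X[:, :1])  # intercept + X1
ll_full = gaussian_loglik(y, X)         # intercept + X1 + X2 + X3
lrt = -2 * (ll_null - ll_full)
p = chi2.sf(lrt, df=2)                  # chi^2_2 under H0
```

Since the data were generated with $\beta_2 = \beta_3 = 0$, the p-value should typically be unremarkable, and the test (correctly) gives no strong reason to prefer the full model.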

Testing Independence in Contingency Tables

$H_0$: two categorical variables are independent. The LRT statistic:

$$G^2 = 2\sum_{i,j} O_{ij} \log\frac{O_{ij}}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}$$

where $O_{ij}$ are observed counts and $E_{ij}$ are expected counts under independence, for a table with $r$ rows and $c$ columns.
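SciPy computes $G^2$ directly via `chi2_contingency` with `lambda_="log-likelihood"` (the counts below are made up; `correction=False` disables the Yates continuity correction so the statistic matches the formula above exactly).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts
table = np.array([[30, 10],
                  [20, 40]])

# lambda_="log-likelihood" selects the G^2 (likelihood-ratio) statistic
g2, p, dof, expected = chi2_contingency(table, correction=False,
                                        lambda_="log-likelihood")
```

Here `expected` holds the $E_{ij}$ under independence (row total times column total over $n$), and `dof` is $(r-1)(c-1) = 1$.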


Connection to Information Criteria

The LRT compares two specific models. Information criteria extend this to model selection among non-nested models:

AIC (Akaike Information Criterion):

$$\text{AIC} = -2\ell(\hat{\theta}) + 2k$$

BIC (Bayesian Information Criterion):

$$\text{BIC} = -2\ell(\hat{\theta}) + k\log n$$

where $k$ is the number of parameters and $n$ is the sample size. Both penalize model complexity: AIC asymptotically selects the model that minimizes prediction error (KL divergence), while BIC selects the true model (if it’s among the candidates) with probability approaching 1.

For nested models, the AIC difference reduces to the LRT statistic plus a fixed penalty: $\text{AIC}_0 - \text{AIC}_1 = -2\log\Lambda - 2\,\Delta k$, where $\Delta k$ is the number of extra parameters in the full model. AIC therefore prefers the full model exactly when $-2\log\Lambda > 2\Delta k$, a flat charge of 2 per parameter, whereas the LRT compares $-2\log\Lambda$ against a $\chi^2$ quantile.


Application to ML Model Comparison

The likelihood ratio framework extends to comparing ML models:

Deviance. For generalized linear models, the deviance $D = -2(\ell_{\text{model}} - \ell_{\text{saturated}})$ is a likelihood ratio statistic comparing the fitted model to a saturated model (one parameter per observation). The difference in deviance between two nested models with $p_1 < p_2$ parameters follows $\chi^2_{p_2 - p_1}$.
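As a concrete sketch (the counts are made up), the deviance of an intercept-only Poisson model has the closed form $D = 2\sum_i [y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]$, with $\hat{\mu}_i = \bar{y}$ for every observation:

```python
import numpy as np
from scipy.special import xlogy

y = np.array([2, 4, 3, 7, 5, 1, 6, 4])

# Intercept-only Poisson model: the fitted mean is ybar for every observation
mu = np.full_like(y, y.mean(), dtype=float)

# Poisson deviance: 2 * sum[ y*log(y/mu) - (y - mu) ]; xlogy handles y = 0
deviance = 2 * (xlogy(y, y / mu) - (y - mu)).sum()
```

Adding covariates can only reduce this quantity, and the drop in deviance between nested fits is exactly the $-2\log\Lambda$ statistic of the section above.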

The tracker cost model uses a different evaluation paradigm: bootstrap confidence intervals on MAE rather than an LRT. This is because the comparison is between a lookup table (not a parametric model) and XGBoost, which are not nested. The bootstrap CI approach (intervals of [3,314, 3,627] vs. [3,623, 3,984]) provides the same kind of rigorous comparison without requiring nested model structure.

Cross-validation vs LRT. LRT requires nested models and correct specification. Cross-validation works for any two models (nested or not, parametric or not) and directly estimates prediction performance. For this reason, cross-validation is the standard comparison method in ML, while LRT is standard in parametric statistics.


Summary

| Concept | Key Result |
| --- | --- |
| Neyman-Pearson | Likelihood ratio is the most powerful test for simple hypotheses |
| GLRT | $\Lambda = L(\hat{\theta}_0)/L(\hat{\theta})$; reject $H_0$ when $\Lambda$ is small |
| Wilks’ theorem | $-2\log\Lambda \to \chi^2_k$ asymptotically |
| Degrees of freedom | $k$ = number of parameters constrained by $H_0$ |
| AIC/BIC | LRT + complexity penalty for non-nested model selection |

The likelihood ratio provides a principled framework for deciding whether a more complex model is justified by the data. Wilks’ theorem gives a universal asymptotic distribution, making LRTs applicable across a wide range of parametric models.