7: Hypothesis Testing

Hypothesis testing provides a framework for making binary decisions from data: is the observed effect real, or could it have arisen by chance? This framework underpins A/B testing, model comparison, and scientific inference.


Framework

A hypothesis test has four components:

  1. Null hypothesis H_0: the default assumption (no effect, no difference)
  2. Alternative hypothesis H_1: the claim we want to support
  3. Test statistic T(\mathbf{X}): a function of the data that measures evidence against H_0
  4. Decision rule: reject H_0 if T falls in the rejection region

Example Setup

Testing whether a model improves over a baseline:

  • H_0: \mu_{\text{model}} - \mu_{\text{baseline}} = 0 (no improvement)
  • H_1: \mu_{\text{model}} - \mu_{\text{baseline}} < 0 (model has lower error)
  • Test statistic: T = \frac{\bar{X}_{\text{model}} - \bar{X}_{\text{baseline}}}{SE}
  • Reject H_0 if T < -z_\alpha
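The setup above can be sketched directly in code. This is a minimal illustration, not a production test: the error-rate data are hypothetical, and with small samples the statistic is treated as approximately normal for simplicity.

```python
import math
import statistics
from statistics import NormalDist

def one_sided_z_test(model_errors, baseline_errors, alpha=0.05):
    """Test H1: mu_model - mu_baseline < 0 (model has lower error).

    Returns the test statistic and whether H0 is rejected at level alpha.
    """
    n1, n2 = len(model_errors), len(baseline_errors)
    # Standard error of the difference in means (unequal variances allowed)
    se = math.sqrt(statistics.variance(model_errors) / n1
                   + statistics.variance(baseline_errors) / n2)
    t = (statistics.mean(model_errors) - statistics.mean(baseline_errors)) / se
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return t, t < -z_alpha  # reject H0 if T falls below -z_alpha

# Hypothetical per-run error rates for the two systems
model = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.11, 0.09]
baseline = [0.20, 0.22, 0.19, 0.21, 0.20, 0.23, 0.21, 0.20]
t, reject = one_sided_z_test(model, baseline)
```

Here the observed difference is large relative to its standard error, so T is far below -z_\alpha and H_0 is rejected.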

Error Types

| | H_0 True | H_0 False |
| --- | --- | --- |
| Reject H_0 | Type I error (\alpha) | Correct (power) |
| Fail to reject H_0 | Correct | Type II error (\beta) |

Type I error rate \alpha: probability of rejecting H_0 when it is true (false positive). Conventionally set to 0.05.

Type II error rate \beta: probability of failing to reject H_0 when it is false (false negative).

Power = 1 - \beta: probability of correctly rejecting a false H_0. Depends on:

  • Effect size: larger effects are easier to detect
  • Sample size: more data increases power
  • Significance level: higher \alpha increases power (at the cost of more false positives)
  • Variance: lower noise increases power
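These four dependencies can be checked numerically. A sketch for a one-sided two-sample z-test, assuming known common standard deviation \sigma and n observations per group (the closed form is power = \Phi(\delta/SE - z_\alpha)):

```python
import math
from statistics import NormalDist

def power_two_sample_z(delta, sigma, n, alpha=0.05):
    """Power of a one-sided two-sample z-test, n observations per group."""
    se = sigma * math.sqrt(2.0 / n)            # SE of the difference in means
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return NormalDist().cdf(delta / se - z_alpha)
```

For example, power_two_sample_z(0.5, 1.0, 50) is about 0.80, and increasing n, \delta, or \alpha (or decreasing \sigma) increases it, matching the bullet list above.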

P-Values

The p-value is the probability of observing a test statistic as extreme as or more extreme than the observed value, assuming H_0 is true:

p = P(T \geq t_{\text{obs}} \mid H_0)

for a one-sided test (or P(|T| \geq |t_{\text{obs}}| \mid H_0) for a two-sided test).

Decision rule. Reject H_0 if p \leq \alpha.

What a p-value is: the probability of the data (or more extreme) under H_0.

What a p-value is not:

  • The probability that H_0 is true
  • The probability that the result is due to chance
  • A measure of effect size

A small p-value with a large sample can correspond to a trivially small effect. Always report effect sizes alongside p-values.
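For a statistic that is standard normal under H_0, both tail probabilities are one-line computations (a sketch; the z statistics passed in are hypothetical):

```python
from statistics import NormalDist

def p_value_z(t_obs, two_sided=True):
    """P-value for an observed z statistic under H0: T ~ N(0, 1)."""
    tail = 1 - NormalDist().cdf(abs(t_obs))
    return 2 * tail if two_sided else tail
```

Note that p_value_z(1.96) is about 0.05 two-sided, which is why 1.96 is the familiar two-sided critical value at \alpha = 0.05; the p-value says nothing about how large the underlying effect is.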


Common Tests

One-Sample t-Test

Test H_0: \mu = \mu_0 with unknown variance:

T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0
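The statistic is straightforward to compute by hand; a sketch with hypothetical measurements (the critical value t_{7, 0.975} \approx 2.365 comes from standard t tables):

```python
import math
import statistics

def one_sample_t(x, mu0):
    """t statistic for H0: mu = mu0; compare to t_{n-1} quantiles."""
    n = len(x)
    return (statistics.mean(x) - mu0) / (statistics.stdev(x) / math.sqrt(n))

x = [5.1, 4.9, 5.3, 5.2, 4.8, 5.0, 5.4, 5.1]  # hypothetical measurements
t = one_sample_t(x, mu0=5.0)                   # about 1.41
# |t| < t_{7, 0.975} ~ 2.365, so fail to reject H0 at alpha = 0.05 (two-sided)
```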

Two-Sample t-Test

Test H_0: \mu_1 = \mu_2 (independent groups):

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}

with Welch’s approximation for degrees of freedom (does not assume equal variances).

Paired t-Test

For paired observations (X_i, Y_i), test H_0: \mu_D = 0 where D_i = X_i - Y_i:

T = \frac{\bar{D}}{S_D/\sqrt{n}}

Paired tests are more powerful than two-sample tests when pairs are correlated (e.g., two models evaluated on the same test set).

Chi-Squared Test

Test independence in a contingency table or goodness-of-fit:

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

where O_i are observed counts and E_i are expected counts under H_0.
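The statistic is a one-line sum. A goodness-of-fit sketch with hypothetical die-roll counts (the critical value \chi^2_{5, 0.95} \approx 11.07 is from standard tables):

```python
def chi_squared(observed, expected):
    """Chi-squared statistic from parallel lists of counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [10, 12, 8, 11, 9, 10]   # 60 hypothetical rolls of a die
expected = [10.0] * 6               # fair-die expectation under H0
chi2 = chi_squared(observed, expected)  # 1.0
# chi2 < 11.07 (the 0.95 quantile of chi^2 with 5 df): fail to reject fairness
```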

Permutation Test

A nonparametric test that computes the test statistic under all (or many random) permutations of group labels:

  1. Compute the observed test statistic t_{\text{obs}}
  2. Randomly permute the group labels B times
  3. Compute t^*_b for each permutation
  4. p = \frac{1}{B}\sum_{b=1}^B \mathbb{I}[t^*_b \geq t_{\text{obs}}]

No distributional assumptions required. Exact when all permutations are enumerated.
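The four steps translate almost directly into code. A sketch using the difference in means as the test statistic (group data and B are hypothetical; a fixed seed makes the Monte Carlo estimate reproducible):

```python
import random
import statistics

def permutation_test(x, y, n_perm=10_000, seed=0):
    """One-sided permutation p-value for mean(x) > mean(y)."""
    rng = random.Random(seed)
    t_obs = statistics.mean(x) - statistics.mean(y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled observations
        t_star = (statistics.mean(pooled[:len(x)])
                  - statistics.mean(pooled[len(x):]))
        if t_star >= t_obs:
            count += 1
    return count / n_perm
```

Because only label exchangeability is assumed, the same function works for medians, trimmed means, or any other statistic by swapping out the mean.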


Multiple Testing

Testing m hypotheses simultaneously at level \alpha each gives a family-wise error rate of 1 - (1-\alpha)^m \approx m\alpha for small \alpha. Testing 20 metrics at \alpha = 0.05 gives a 64% chance of at least one false positive.

Bonferroni Correction

Test each hypothesis at level \alpha/m. Controls the family-wise error rate (FWER): P(\text{any false rejection}) \leq \alpha. Conservative: may miss real effects.

Benjamini-Hochberg (BH)

Controls the false discovery rate (FDR): E[\text{false rejections}/\text{total rejections}] \leq q.

Algorithm:

  1. Sort the p-values: p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}
  2. Find the largest k such that p_{(k)} \leq \frac{k}{m} q
  3. Reject hypotheses 1, \ldots, k

BH is less conservative than Bonferroni and is preferred when controlling the proportion of false discoveries (rather than any false discovery) is acceptable.
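The three steps above can be sketched in a few lines (the p-values in the example are hypothetical):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the BH procedure at FDR level q."""
    m = len(p_values)
    # Indices sorted by p-value: order[0] has the smallest p
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:  # step-up threshold (k/m) * q
            k = rank                     # largest rank passing so far
    return sorted(order[:k])

benjamini_hochberg([0.001, 0.01, 0.02, 0.04, 0.2])  # rejects the first four
```

On this example Bonferroni at \alpha/m = 0.01 would reject only the first two hypotheses, illustrating that BH is less conservative.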


Application to A/B Testing

A/B testing is hypothesis testing applied to product metrics:

| Statistical concept | A/B testing analog |
| --- | --- |
| H_0 | Treatment has no effect |
| H_1 | Treatment improves the metric |
| \alpha | False positive rate (shipping a bad change) |
| \beta | False negative rate (missing a good change) |
| Power | Probability of detecting a real improvement |
| Multiple testing | Testing many metrics simultaneously |

Sample size calculation. For a two-sample z-test detecting effect size \delta at significance \alpha and power 1-\beta, the required sample size per group is:

n = \frac{(z_{\alpha/2} + z_\beta)^2 \cdot 2\sigma^2}{\delta^2}
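The formula can be evaluated directly; a sketch (the effect size and standard deviation in the example are hypothetical):

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, beta=0.2):
    """n per group for a two-sided two-sample z-test at power 1 - beta."""
    z = NormalDist().inv_cdf
    n = (z(1 - alpha / 2) + z(1 - beta)) ** 2 * 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)  # round up: sample sizes are integers

sample_size_per_group(delta=0.5, sigma=1.0)  # 63 per group at 80% power
```

Halving the detectable effect size quadruples the required n, which is why detecting small metric movements in A/B tests is expensive.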

Sequential testing. Standard tests assume a fixed sample size. “Peeking” (checking results before the planned sample size) inflates the false positive rate. Group sequential methods and always-valid p-values allow early stopping while maintaining error guarantees.


The Kolmogorov-Smirnov Test

The KS test compares two empirical distributions (or one empirical vs a reference):

D = \sup_x |F_1(x) - F_2(x)|

The tracker model uses KS tests to diagnose feature-space disjointness between labeled and unlabeled domain populations, finding p < 0.001 on every feature. This confirmed that the domain-level formulation was structurally inappropriate.
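The two-sample statistic itself can be computed in one merge pass over the sorted samples; a minimal sketch, not the tracker's actual implementation (sample values are hypothetical):

```python
def ks_statistic(sample1, sample2):
    """Two-sample KS statistic: largest gap between empirical CDFs."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(s1), len(s2)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        x = min(s1[i], s2[j])
        # Advance both pointers past all copies of x, then compare CDFs
        while i < n1 and s1[i] == x:
            i += 1
        while j < n2 and s2[j] == x:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    return d

ks_statistic([1, 2, 3, 4], [5, 6, 7, 8])  # 1.0: fully disjoint supports
```

A statistic near 1 is the signature of disjoint feature distributions described above; identical samples give 0.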


Summary

| Concept | Key point |
| --- | --- |
| P-value | Probability of the data under H_0, not the probability that H_0 is true |
| Type I error | Rejecting a true H_0 (false positive), controlled at \alpha |
| Type II error | Failing to reject a false H_0 (false negative) |
| Power | 1 - \beta; increases with effect size, sample size, and \alpha |
| Multiple testing | Bonferroni (FWER) and BH (FDR) corrections |
| Paired tests | More powerful when observations are correlated |
| Permutation tests | Nonparametric, exact, distribution-free |

Hypothesis testing provides the decision framework; confidence intervals provide the uncertainty quantification. Report both: the CI conveys the range of plausible effects, the p-value conveys whether the effect is distinguishable from chance.