# 7: Hypothesis Testing
Hypothesis testing provides a framework for making binary decisions from data: is the observed effect real, or could it have arisen by chance? This framework underpins A/B testing, model comparison, and scientific inference.
## Framework

A hypothesis test has four components:

- Null hypothesis $H_0$: the default assumption (no effect, no difference)
- Alternative hypothesis $H_1$: the claim we want to support
- Test statistic $T$: a function of the data that measures evidence against $H_0$
- Decision rule: reject $H_0$ if $T$ falls in the rejection region
### Example Setup

Testing whether a model improves over a baseline:

- $H_0$: $\mu_{\text{new}} = \mu_{\text{base}}$ (no improvement)
- $H_1$: $\mu_{\text{new}} < \mu_{\text{base}}$ (model has lower error)
- Test statistic: $T = \dfrac{\bar{X}_{\text{new}} - \bar{X}_{\text{base}}}{\mathrm{SE}}$
- Reject $H_0$ if $T$ falls below the critical value $-t_{\alpha}$
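The setup above can be sketched directly. This is a minimal illustration, not a library API: the function name is invented, and it uses a normal approximation to the $t$ statistic, which is adequate for moderately large samples.

```python
import math
from statistics import NormalDist

def one_sided_test(errors_new, errors_base, alpha=0.05):
    """Test H1: mean(errors_new) < mean(errors_base), normal approximation."""
    n1, n2 = len(errors_new), len(errors_base)
    m1, m2 = sum(errors_new) / n1, sum(errors_base) / n2
    v1 = sum((x - m1) ** 2 for x in errors_new) / (n1 - 1)   # sample variances
    v2 = sum((x - m2) ** 2 for x in errors_base) / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)
    t = (m1 - m2) / se                 # negative when the new model is better
    p = NormalDist().cdf(t)            # one-sided p-value: P(T <= t_obs)
    return t, p, p <= alpha
```

For an exact $t$-based p-value with small samples, `scipy.stats.ttest_ind` with `alternative="less"` is the standard choice.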
## Error Types

| | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (power) |
| Fail to reject $H_0$ | Correct | Type II error ($\beta$) |
Type I error rate $\alpha$: probability of rejecting $H_0$ when it is true (false positive). Conventionally set to 0.05.

Type II error rate $\beta$: probability of failing to reject $H_0$ when it is false (false negative).

Power $1 - \beta$: probability of correctly rejecting a false $H_0$. Depends on:

- Effect size: larger effects are easier to detect
- Sample size: more data increases power
- Significance level: higher $\alpha$ increases power (at the cost of more false positives)
- Variance: lower noise increases power
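Each of these dependencies can be checked by Monte Carlo simulation. A sketch (a one-sided z-test with known variance; the function name and defaults are illustrative):

```python
import random
from statistics import NormalDist, mean

def estimate_power(effect, n, alpha=0.05, sigma=1.0, trials=2000, seed=0):
    """Monte Carlo power of a one-sided z-test of H0: mu = 0 vs H1: mu > 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        xs = [rng.gauss(effect, sigma) for _ in range(n)]
        z = mean(xs) / (sigma / n ** 0.5)     # known-variance z statistic
        rejections += z > z_crit
    return rejections / trials
```

Raising `effect`, `n`, or `alpha` (or lowering `sigma`) increases the estimated power, matching the list above.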
## P-Values

The p-value is the probability of observing a test statistic as extreme as or more extreme than the observed value, assuming $H_0$ is true:

$$p = P(T \geq t_{\text{obs}} \mid H_0)$$

for a one-sided test (or $p = P(|T| \geq |t_{\text{obs}}| \mid H_0)$ for a two-sided test).

Decision rule. Reject $H_0$ if $p \leq \alpha$.
What a p-value is: the probability of the data (or more extreme) under $H_0$.

What a p-value is not:

- The probability that $H_0$ is true
- The probability that the result is due to chance
- A measure of effect size
A small p-value with a large sample can correspond to a trivially small effect. Always report effect sizes alongside p-values.
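A quick numerical illustration (hypothetical numbers, z-test with known variance): a shift of 0.01 standard deviations is negligible by any practical standard, yet with a million samples it is "highly significant".

```python
from statistics import NormalDist

# Cohen's d = 0.01: a trivially small effect by any practical standard.
effect, sigma, n = 0.01, 1.0, 1_000_000
z = effect / (sigma / n ** 0.5)            # 10 standard errors from zero
p = 2 * (1 - NormalDist().cdf(z))          # two-sided p-value, essentially 0
```

The p-value answers "is the effect distinguishable from zero?", not "is the effect large enough to matter?".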
## Common Tests

### One-Sample t-Test

Test $H_0: \mu = \mu_0$ with unknown variance:

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n-1} \text{ under } H_0$$
### Two-Sample t-Test

Test $H_0: \mu_1 = \mu_2$ (independent groups):

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$$

with Welch's approximation for the degrees of freedom (does not assume equal variances).
### Paired t-Test

For paired observations $(X_i, Y_i)$, test $H_0: \mu_D = 0$ where $D_i = X_i - Y_i$:

$$t = \frac{\bar{D}}{s_D / \sqrt{n}} \sim t_{n-1} \text{ under } H_0$$
Paired tests are more powerful than two-sample tests when pairs are correlated (e.g., two models evaluated on the same test set).
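A small simulation of that effect (synthetic losses; the shared "difficulty" term induces the correlation between paired observations):

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(1)
difficulty = [rng.gauss(0, 1.0) for _ in range(100)]        # shared per-example variance
loss_a = [d + rng.gauss(0.00, 0.1) for d in difficulty]     # model A
loss_b = [d + rng.gauss(0.05, 0.1) for d in difficulty]     # model B, slightly worse

# Paired: difference first, so the shared difficulty term cancels.
diffs = [a - b for a, b in zip(loss_a, loss_b)]
t_paired = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Two-sample: the difficulty variance stays in the standard error.
se = math.sqrt(stdev(loss_a) ** 2 / 100 + stdev(loss_b) ** 2 / 100)
t_two_sample = (mean(loss_a) - mean(loss_b)) / se
```

The numerators are identical, but the paired standard error is roughly ten times smaller here, so `t_paired` is far larger in magnitude than `t_two_sample`.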
### Chi-Squared Test

Test independence in a contingency table or goodness-of-fit:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ are observed counts and $E_i$ are expected counts under $H_0$.
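For a concrete 2x2 case (hypothetical counts), the expected count in each cell under independence is row total times column total over the grand total:

```python
# Hypothetical 2x2 contingency table: rows = variant, columns = outcome.
observed = [[30, 10],
            [20, 40]]
row = [sum(r) for r in observed]                  # row totals: [40, 60]
col = [sum(c) for c in zip(*observed)]            # column totals: [50, 50]
total = sum(row)                                  # grand total: 100

def expected(i, j):
    """Expected count in cell (i, j) under H0 (independence)."""
    return row[i] * col[j] / total

chi2 = sum((observed[i][j] - expected(i, j)) ** 2 / expected(i, j)
           for i in range(2) for j in range(2))
```

Here `chi2` is about 16.7 with 1 degree of freedom, far beyond the 0.05 critical value of 3.84, so independence is rejected.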
### Permutation Test

A nonparametric test that computes the test statistic under all (or many random) permutations of group labels:

- Compute the observed test statistic $T_{\text{obs}}$
- Randomly permute group labels $B$ times
- Compute $T_b$ for each permutation $b = 1, \dots, B$
- Estimate the p-value as the fraction of permutations at least as extreme as $T_{\text{obs}}$

No distributional assumptions are required. The test is exact when all permutations are enumerated.
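The steps above translate to a few lines (a sketch using a difference-in-means statistic; the add-one correction keeping the estimate strictly positive is a common convention, not part of the text):

```python
import random

def permutation_test(x, y, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Estimates P(|T_perm| >= |T_obs|) under random relabeling of the groups.
    """
    rng = random.Random(seed)
    t_obs = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                              # permute the labels
        perm_x, perm_y = pooled[:len(x)], pooled[len(x):]
        t = sum(perm_x) / len(x) - sum(perm_y) / len(y)
        hits += abs(t) >= abs(t_obs)
    return (hits + 1) / (n_perm + 1)
```

Replacing the mean difference with any other statistic (median difference, AUC gap) requires no new theory, which is the main appeal of the method.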
## Multiple Testing

Testing $m$ hypotheses simultaneously at level $\alpha$ each gives a family-wise error rate of $1 - (1 - \alpha)^m \approx m\alpha$ for small $\alpha$. Testing 20 metrics at $\alpha = 0.05$ gives a $1 - 0.95^{20} \approx 64\%$ chance of at least one false positive.
### Bonferroni Correction

Test each hypothesis at level $\alpha / m$. Controls the family-wise error rate (FWER): $P(\text{at least one false rejection}) \leq \alpha$. Conservative: may miss real effects.
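As code the correction is a one-liner (the helper name is illustrative):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m, controlling FWER at level alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]
```

With three tests the per-test threshold drops to $0.05 / 3 \approx 0.0167$, so a p-value of 0.04 that would pass on its own is no longer rejected.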
### Benjamini-Hochberg (BH)

Controls the false discovery rate (FDR): $\mathbb{E}[\text{fraction of rejections that are false}] \leq \alpha$.

Algorithm:

- Sort p-values: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
- Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \alpha$
- Reject hypotheses $(1), \dots, (k)$
BH is less conservative than Bonferroni and is preferred when controlling the proportion of false discoveries (rather than any false discovery) is acceptable.
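The procedure is short to implement (a sketch; `statsmodels.stats.multitest.multipletests` provides a production version):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Per-hypothesis reject flags, controlling the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank                     # largest rank passing its threshold
    reject = [False] * m
    for i in order[:k]:                  # reject the k smallest p-values
        reject[i] = True
    return reject
```

Note the step-up behavior: for p-values $(0.001, 0.008, 0.039, 0.041)$ at $\alpha = 0.05$, the third fails its own threshold $3 \cdot 0.05 / 4 = 0.0375$, but because the fourth satisfies $0.041 \leq 0.05$ all four are rejected; Bonferroni at the same level would reject only the two smallest.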
## Application to A/B Testing
A/B testing is hypothesis testing applied to product metrics:
| Statistical Concept | A/B Testing Analog |
|---|---|
| $H_0$ | Treatment has no effect |
| $H_1$ | Treatment improves the metric |
| $\alpha$ | False positive rate (shipping a bad change) |
| $\beta$ | False negative rate (missing a good change) |
| Power | Probability of detecting a real improvement |
| Multiple testing | Testing many metrics simultaneously |
Sample size calculation. For a two-sample z-test detecting effect size $\delta$ at significance $\alpha$ and power $1 - \beta$:

$$n \approx \frac{2 (z_{1 - \alpha/2} + z_{1 - \beta})^2 \sigma^2}{\delta^2} \text{ per group}$$
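The formula translates directly, with `NormalDist.inv_cdf` supplying the z quantiles (the function name is illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """n per group for a two-sided two-sample z-test detecting a shift delta."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 at alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # 0.84 at power = 0.8
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)
```

For a half-standard-deviation effect ($\delta = 0.5\sigma$) at the defaults, this gives 63 per group; halving the effect size quadruples the requirement.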
Sequential testing. Standard tests assume a fixed sample size. “Peeking” (checking results before the planned sample size) inflates the false positive rate. Group sequential methods and always-valid p-values allow early stopping while maintaining error guarantees.
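The inflation from peeking is easy to simulate (hypothetical parameters; each look runs a nominal two-sided z-test on the data so far under a true null):

```python
import math
import random
from statistics import NormalDist

def peeking_fpr(peeks, n_per_peek=100, trials=2000, alpha=0.05, seed=0):
    """Fraction of null experiments declared significant at ANY interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    wins = 0
    for _ in range(trials):
        total, n = 0.0, 0
        for _ in range(peeks):
            total += sum(rng.gauss(0, 1) for _ in range(n_per_peek))
            n += n_per_peek
            if abs(total) / math.sqrt(n) > z_crit:   # z statistic, sigma = 1
                wins += 1                            # stop and "ship"
                break
    return wins / trials
```

With a single look the realized rate sits near the nominal 5%; with five looks it is well above 5%, even though no effect exists.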
## The Kolmogorov-Smirnov Test

The KS test compares two empirical distributions (or one empirical distribution against a reference):

$$D = \sup_x \left| F_1(x) - F_2(x) \right|$$

The tracker model uses KS tests to diagnose feature-space disjointness between labeled and unlabeled domain populations, finding large $D$ on every feature. This confirmed that the domain-level formulation was structurally inappropriate.
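A direct computation of $D$ (a sketch; since the empirical CDFs are step functions, the supremum is attained at a sample point):

```python
from bisect import bisect_right

def ks_statistic(xs, ys):
    """Two-sample KS statistic: sup over x of |F1(x) - F2(x)|."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:                          # check every jump point
        f1 = bisect_right(xs, v) / len(xs)     # empirical CDF of xs at v
        f2 = bisect_right(ys, v) / len(ys)     # empirical CDF of ys at v
        d = max(d, abs(f1 - f2))
    return d
```

Disjoint samples give $D = 1$ (the diagnosis above) and identical samples give $D = 0$; `scipy.stats.ks_2samp` additionally returns a p-value.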
## Summary

| Concept | Key Point |
|---|---|
| P-value | Probability of the data under $H_0$, not the probability $H_0$ is true |
| Type I error | Rejecting a true $H_0$ (false positive), controlled at $\alpha$ |
| Type II error | Failing to reject a false $H_0$ (false negative) |
| Power | $1 - \beta$; increases with effect size, sample size, and $\alpha$ |
| Multiple testing | Bonferroni (FWER) and BH (FDR) corrections |
| Paired tests | More powerful when observations are correlated |
| Permutation tests | Nonparametric, exact, distribution-free |
Hypothesis testing provides the decision framework; confidence intervals provide the uncertainty quantification. Report both: the CI conveys the range of plausible effects, the p-value conveys whether the effect is distinguishable from chance.