# 7: Hypothesis Testing
Hypothesis testing provides a framework for making binary decisions from data: is the observed effect real, or could it have arisen by chance? This framework underpins A/B testing, model comparison, and scientific inference.
## Framework

A hypothesis test has four components:

- Null hypothesis $H_0$: the default assumption (no effect, no difference)
- Alternative hypothesis $H_1$: the claim we want to support
- Test statistic $T$: a function of the data that measures evidence against $H_0$
- Decision rule: reject $H_0$ if $T$ falls in the rejection region
### Example Setup

Testing whether a model improves over a baseline:

- $H_0$: $\mu_{\text{new}} = \mu_{\text{base}}$ (no improvement)
- $H_1$: $\mu_{\text{new}} < \mu_{\text{base}}$ (model has lower error)
- Test statistic: $T = \dfrac{\bar{X}_{\text{new}} - \bar{X}_{\text{base}}}{\mathrm{SE}}$
- Reject $H_0$ if $T$ falls below the critical value $-t_{\alpha}$
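The setup above can be sketched directly. This is a minimal illustration, not a library API: the function name is invented, and it uses a normal approximation to the $t$ statistic, which is adequate for moderately large samples.

```python
import math
from statistics import NormalDist

def one_sided_test(errors_new, errors_base, alpha=0.05):
    """Test H1: mean(errors_new) < mean(errors_base), normal approximation."""
    n1, n2 = len(errors_new), len(errors_base)
    m1, m2 = sum(errors_new) / n1, sum(errors_base) / n2
    v1 = sum((x - m1) ** 2 for x in errors_new) / (n1 - 1)   # sample variances
    v2 = sum((x - m2) ** 2 for x in errors_base) / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)
    t = (m1 - m2) / se                 # negative when the new model is better
    p = NormalDist().cdf(t)            # one-sided p-value: P(T <= t_obs)
    return t, p, p <= alpha
```

For an exact $t$-based p-value with small samples, `scipy.stats.ttest_ind` with `alternative="less"` is the standard choice.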
## Error Types

| | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (power) |
| Fail to reject $H_0$ | Correct | Type II error ($\beta$) |
Type I error rate $\alpha$: probability of rejecting $H_0$ when it is true (false positive). Conventionally set to 0.05.

Type II error rate $\beta$: probability of failing to reject $H_0$ when it is false (false negative).

Power $1 - \beta$: probability of correctly rejecting a false $H_0$. Depends on:

- Effect size: larger effects are easier to detect
- Sample size: more data increases power
- Significance level: higher $\alpha$ increases power (at the cost of more false positives)
- Variance: lower noise increases power
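Each of these dependencies can be checked by Monte Carlo simulation. A sketch (a one-sided z-test with known variance; the function name and defaults are illustrative):

```python
import random
from statistics import NormalDist, mean

def estimate_power(effect, n, alpha=0.05, sigma=1.0, trials=2000, seed=0):
    """Monte Carlo power of a one-sided z-test of H0: mu = 0 vs H1: mu > 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        xs = [rng.gauss(effect, sigma) for _ in range(n)]
        z = mean(xs) / (sigma / n ** 0.5)     # known-variance z statistic
        rejections += z > z_crit
    return rejections / trials
```

Raising `effect`, `n`, or `alpha` (or lowering `sigma`) increases the estimated power, matching the list above.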
## P-Values

The p-value is the probability of observing a test statistic as extreme as or more extreme than the observed value, assuming $H_0$ is true:

$$p = P(T \geq t_{\text{obs}} \mid H_0)$$

for a one-sided test (or $p = P(|T| \geq |t_{\text{obs}}| \mid H_0)$ for a two-sided test).

Decision rule. Reject $H_0$ if $p \leq \alpha$.
What a p-value is: the probability of the data (or more extreme) under $H_0$.

What a p-value is not:

- The probability that $H_0$ is true
- The probability that the result is due to chance
- A measure of effect size
A small p-value with a large sample can correspond to a trivially small effect. Always report effect sizes alongside p-values.
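A quick numerical illustration (hypothetical numbers, z-test with known variance): a shift of 0.01 standard deviations is negligible by any practical standard, yet with a million samples it is "highly significant".

```python
from statistics import NormalDist

# Cohen's d = 0.01: a trivially small effect by any practical standard.
effect, sigma, n = 0.01, 1.0, 1_000_000
z = effect / (sigma / n ** 0.5)            # 10 standard errors from zero
p = 2 * (1 - NormalDist().cdf(z))          # two-sided p-value, essentially 0
```

The p-value answers "is the effect distinguishable from zero?", not "is the effect large enough to matter?".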
## Common Tests

### One-Sample t-Test

Test $H_0: \mu = \mu_0$ with unknown variance:

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n-1} \text{ under } H_0$$
### Two-Sample t-Test

Test $H_0: \mu_1 = \mu_2$ (independent groups):

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$$

with Welch's approximation for the degrees of freedom (does not assume equal variances).
### Paired t-Test

For paired observations $(X_i, Y_i)$, test $H_0: \mu_D = 0$ where $D_i = X_i - Y_i$:

$$t = \frac{\bar{D}}{s_D / \sqrt{n}} \sim t_{n-1} \text{ under } H_0$$
Paired tests are more powerful than two-sample tests when pairs are correlated (e.g., two models evaluated on the same test set).
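A small simulation of that effect (synthetic losses; the shared "difficulty" term induces the correlation between paired observations):

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(1)
difficulty = [rng.gauss(0, 1.0) for _ in range(100)]        # shared per-example variance
loss_a = [d + rng.gauss(0.00, 0.1) for d in difficulty]     # model A
loss_b = [d + rng.gauss(0.05, 0.1) for d in difficulty]     # model B, slightly worse

# Paired: difference first, so the shared difficulty term cancels.
diffs = [a - b for a, b in zip(loss_a, loss_b)]
t_paired = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Two-sample: the difficulty variance stays in the standard error.
se = math.sqrt(stdev(loss_a) ** 2 / 100 + stdev(loss_b) ** 2 / 100)
t_two_sample = (mean(loss_a) - mean(loss_b)) / se
```

The numerators are identical, but the paired standard error is roughly ten times smaller here, so `t_paired` is far larger in magnitude than `t_two_sample`.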
### Chi-Squared Test

Test independence in a contingency table or goodness-of-fit:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ are observed counts and $E_i$ are expected counts under $H_0$.
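For a concrete 2x2 case (hypothetical counts), the expected count in each cell under independence is row total times column total over the grand total:

```python
# Hypothetical 2x2 contingency table: rows = variant, columns = outcome.
observed = [[30, 10],
            [20, 40]]
row = [sum(r) for r in observed]                  # row totals: [40, 60]
col = [sum(c) for c in zip(*observed)]            # column totals: [50, 50]
total = sum(row)                                  # grand total: 100

def expected(i, j):
    """Expected count in cell (i, j) under H0 (independence)."""
    return row[i] * col[j] / total

chi2 = sum((observed[i][j] - expected(i, j)) ** 2 / expected(i, j)
           for i in range(2) for j in range(2))
```

Here `chi2` is about 16.7 with 1 degree of freedom, far beyond the 0.05 critical value of 3.84, so independence is rejected.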
### Permutation Test

A nonparametric test that computes the test statistic under all (or many random) permutations of group labels:

- Compute the observed test statistic $T_{\text{obs}}$
- Randomly permute group labels $B$ times
- Compute $T_b$ for each permutation $b = 1, \dots, B$
- Estimate the p-value as the fraction of permutations at least as extreme as $T_{\text{obs}}$

No distributional assumptions are required. The test is exact when all permutations are enumerated.
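The steps above translate to a few lines (a sketch using a difference-in-means statistic; the add-one correction keeping the estimate strictly positive is a common convention, not part of the text):

```python
import random

def permutation_test(x, y, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Estimates P(|T_perm| >= |T_obs|) under random relabeling of the groups.
    """
    rng = random.Random(seed)
    t_obs = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                              # permute the labels
        perm_x, perm_y = pooled[:len(x)], pooled[len(x):]
        t = sum(perm_x) / len(x) - sum(perm_y) / len(y)
        hits += abs(t) >= abs(t_obs)
    return (hits + 1) / (n_perm + 1)
```

Replacing the mean difference with any other statistic (median difference, AUC gap) requires no new theory, which is the main appeal of the method.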
## Multiple Testing

Testing $m$ hypotheses simultaneously at level $\alpha$ each gives a family-wise error rate of $1 - (1 - \alpha)^m \approx m\alpha$ for small $\alpha$. Testing 20 metrics at $\alpha = 0.05$ gives a $1 - 0.95^{20} \approx 64\%$ chance of at least one false positive.
### Bonferroni Correction

Test each hypothesis at level $\alpha / m$. Controls the family-wise error rate (FWER): $P(\text{at least one false rejection}) \leq \alpha$. Conservative: may miss real effects.
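As code the correction is a one-liner (the helper name is illustrative):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m, controlling FWER at level alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]
```

With three tests the per-test threshold drops to $0.05 / 3 \approx 0.0167$, so a p-value of 0.04 that would pass on its own is no longer rejected.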
### Benjamini-Hochberg (BH)

Controls the false discovery rate (FDR): $\mathbb{E}[\text{fraction of rejections that are false}] \leq \alpha$.

Algorithm:

- Sort p-values: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
- Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \alpha$
- Reject hypotheses $(1), \dots, (k)$
BH is less conservative than Bonferroni and is preferred when controlling the proportion of false discoveries (rather than any false discovery) is acceptable.
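The procedure is short to implement (a sketch; `statsmodels.stats.multitest.multipletests` provides a production version):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Per-hypothesis reject flags, controlling the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank                     # largest rank passing its threshold
    reject = [False] * m
    for i in order[:k]:                  # reject the k smallest p-values
        reject[i] = True
    return reject
```

Note the step-up behavior: for p-values $(0.001, 0.008, 0.039, 0.041)$ at $\alpha = 0.05$, the third fails its own threshold $3 \cdot 0.05 / 4 = 0.0375$, but because the fourth satisfies $0.041 \leq 0.05$ all four are rejected; Bonferroni at the same level would reject only the two smallest.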
## Application to A/B Testing
A/B testing is hypothesis testing applied to product metrics:
| Statistical Concept | A/B Testing Analog |
|---|---|
| $H_0$ | Treatment has no effect |
| $H_1$ | Treatment improves the metric |
| $\alpha$ | False positive rate (shipping a bad change) |
| $\beta$ | False negative rate (missing a good change) |
| Power | Probability of detecting a real improvement |
| Multiple testing | Testing many metrics simultaneously |
Sample size calculation. For a two-sample z-test detecting effect size $\delta$ at significance $\alpha$ and power $1 - \beta$:

$$n \approx \frac{2 (z_{1 - \alpha/2} + z_{1 - \beta})^2 \sigma^2}{\delta^2} \text{ per group}$$
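The formula translates directly, with `NormalDist.inv_cdf` supplying the z quantiles (the function name is illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """n per group for a two-sided two-sample z-test detecting a shift delta."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 at alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # 0.84 at power = 0.8
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)
```

For a half-standard-deviation effect ($\delta = 0.5\sigma$) at the defaults, this gives 63 per group; halving the effect size quadruples the requirement.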
Sequential testing. Standard tests assume a fixed sample size. “Peeking” (checking results before the planned sample size) inflates the false positive rate. Group sequential methods and always-valid p-values allow early stopping while maintaining error guarantees.
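The inflation from peeking is easy to simulate (hypothetical parameters; each look runs a nominal two-sided z-test on the data so far under a true null):

```python
import math
import random
from statistics import NormalDist

def peeking_fpr(peeks, n_per_peek=100, trials=2000, alpha=0.05, seed=0):
    """Fraction of null experiments declared significant at ANY interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    wins = 0
    for _ in range(trials):
        total, n = 0.0, 0
        for _ in range(peeks):
            total += sum(rng.gauss(0, 1) for _ in range(n_per_peek))
            n += n_per_peek
            if abs(total) / math.sqrt(n) > z_crit:   # z statistic, sigma = 1
                wins += 1                            # stop and "ship"
                break
    return wins / trials
```

With a single look the realized rate sits near the nominal 5%; with five looks it is well above 5%, even though no effect exists.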
## The Kolmogorov-Smirnov Test

The KS test compares two empirical distributions (or one empirical distribution against a reference):

$$D = \sup_x \left| F_1(x) - F_2(x) \right|$$

The tracker model uses KS tests to diagnose feature-space disjointness between labeled and unlabeled domain populations, finding large $D$ on every feature. This confirmed that the domain-level formulation was structurally inappropriate.
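A direct computation of $D$ (a sketch; since the empirical CDFs are step functions, the supremum is attained at a sample point):

```python
from bisect import bisect_right

def ks_statistic(xs, ys):
    """Two-sample KS statistic: sup over x of |F1(x) - F2(x)|."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:                          # check every jump point
        f1 = bisect_right(xs, v) / len(xs)     # empirical CDF of xs at v
        f2 = bisect_right(ys, v) / len(ys)     # empirical CDF of ys at v
        d = max(d, abs(f1 - f2))
    return d
```

Disjoint samples give $D = 1$ (the diagnosis above) and identical samples give $D = 0$; `scipy.stats.ks_2samp` additionally returns a p-value.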
## Summary

| Concept | Key Point |
|---|---|
| P-value | Probability of the data under $H_0$, not the probability $H_0$ is true |
| Type I error | Rejecting a true $H_0$ (false positive), controlled at $\alpha$ |
| Type II error | Failing to reject a false $H_0$ (false negative) |
| Power | $1 - \beta$; increases with effect size, sample size, and $\alpha$ |
| Multiple testing | Bonferroni (FWER) and BH (FDR) corrections |
| Paired tests | More powerful when observations are correlated |
| Permutation tests | Nonparametric, exact, distribution-free |
Hypothesis testing provides the decision framework; confidence intervals provide the uncertainty quantification. Report both: the CI conveys the range of plausible effects, the p-value conveys whether the effect is distinguishable from chance.