6: Confidence Intervals
A point estimate provides a single best guess for an unknown parameter. A confidence interval quantifies the uncertainty of that estimate, providing a range of plausible values. Confidence intervals are essential for evaluating ML models, reporting A/B test results, and comparing model performance.
Definition
A confidence interval for a parameter $\theta$ is a random interval $[L(X), U(X)]$ such that:

$$P\big(L(X) \le \theta \le U(X)\big) = 1 - \alpha$$

before observing the data. The standard choice is $\alpha = 0.05$ (95% confidence interval).
Interpretation. If we repeated the experiment many times and constructed a 95% CI each time, approximately 95% of those intervals would contain the true $\theta$. A specific observed interval either contains $\theta$ or does not; the probability statement is about the procedure, not the specific interval.
Common misinterpretation. “There is a 95% probability that $\theta$ is in this interval” treats $\theta$ as random, which is a Bayesian (credible interval) interpretation, not a frequentist one.
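The procedure-level interpretation can be checked by simulation. A minimal sketch (assuming a normal population with known $\sigma$ and the standard $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ interval; the parameter values are illustrative):

```python
import math
import random

random.seed(0)
TRUE_MU, SIGMA, N, TRIALS = 5.0, 2.0, 50, 2000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MU, SIGMA) for _ in range(N)]
    xbar = sum(sample) / N
    half = 1.96 * SIGMA / math.sqrt(N)  # known-sigma 95% half-width
    if xbar - half <= TRUE_MU <= xbar + half:
        covered += 1

print(covered / TRIALS)  # close to 0.95
```

Each individual interval either covers the true mean or misses it; only the long-run fraction is pinned near 95%.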
Construction via Pivotal Quantities
A pivotal quantity is a function of the data and the parameter whose distribution does not depend on $\theta$.
Normal Mean, Known Variance
If $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0, 1)$$

The 95% CI:

$$\bar{X} \pm 1.96 \, \frac{\sigma}{\sqrt{n}}$$
Normal Mean, Unknown Variance
Replace $\sigma$ with the sample standard deviation $S$. The pivotal quantity changes distribution:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$

The 95% CI:

$$\bar{X} \pm t_{n-1,\,0.975} \, \frac{S}{\sqrt{n}}$$
For large $n$, $t_{n-1} \to \mathcal{N}(0, 1)$ and the two intervals converge.
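A $t$-interval sketch using only the standard library (the data are hypothetical measurements; the critical value $t_{19,\,0.975} \approx 2.093$ is hard-coded from a $t$-table since the stdlib has no $t$ quantile function):

```python
import math
import statistics

# hypothetical measurements (n = 20)
data = [9.8, 10.2, 10.1, 9.9, 10.4, 9.7, 10.0, 10.3, 9.6, 10.1,
        10.2, 9.9, 10.0, 10.5, 9.8, 10.1, 9.7, 10.2, 10.0, 9.9]
n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)

T_CRIT = 2.093              # t_{19, 0.975}, from a t-table
half = T_CRIT * s / math.sqrt(n)
print(f"95% CI: [{xbar - half:.3f}, {xbar + half:.3f}]")
```

In practice `scipy.stats.t.ppf(0.975, n - 1)` would replace the hard-coded constant.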
Proportion
For $X \sim \text{Binomial}(n, p)$ with $\hat{p} = X/n$, the Wald interval uses the CLT:

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

This performs poorly for small $n$ or $p$ near 0 or 1. The Wilson interval is preferred (with $z = 1.96$):

$$\frac{\hat{p} + z^2/(2n)}{1 + z^2/n} \pm \frac{z}{1 + z^2/n} \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}$$
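A sketch comparing the two intervals on an extreme case (1 success out of 20), which exposes the Wald interval's failure mode:

```python
import math

def wald_ci(x, n, z=1.96):
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

def wilson_ci(x, n, z=1.96):
    p = x / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

print(wald_ci(1, 20))    # lower bound is negative -- impossible for a proportion
print(wilson_ci(1, 20))  # stays inside [0, 1]
```

The Wilson interval pulls the center toward 1/2 and widens near the boundary, which is why its coverage is much closer to nominal for small $n$.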
Asymptotic Intervals (MLE-based)
From the asymptotic normality of the MLE:

$$\hat{\theta} \pm 1.96 \sqrt{I_n(\hat{\theta})^{-1}}$$

where $I_n(\hat{\theta})$ is the observed Fisher information. This is the standard approach for parameters estimated by maximum likelihood when exact distributions are unavailable.
Bootstrap Confidence Intervals
When the sampling distribution is unknown or intractable, the bootstrap provides a nonparametric alternative.
Algorithm (percentile bootstrap):
- From the original sample of size $n$, draw $B$ bootstrap samples (sample with replacement)
- Compute $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$ (one statistic per bootstrap sample)
- The 95% CI is $[\hat{\theta}^*_{(0.025)}, \hat{\theta}^*_{(0.975)}]$, the 2.5th and 97.5th percentiles of the bootstrap distribution
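The steps above can be sketched directly (here for the median of skewed toy data, with $B = 2000$ resamples):

```python
import random
import statistics

random.seed(1)
sample = [random.expovariate(1.0) for _ in range(200)]  # skewed toy data

B = 2000
boot_medians = []
for _ in range(B):
    resample = random.choices(sample, k=len(sample))  # sample with replacement
    boot_medians.append(statistics.median(resample))

boot_medians.sort()
lo = boot_medians[int(0.025 * B)]  # 2.5th percentile
hi = boot_medians[int(0.975 * B)]  # 97.5th percentile
print(f"95% bootstrap CI for the median: [{lo:.3f}, {hi:.3f}]")
```

No distributional form was assumed for the median; the resampling distribution stands in for the unknown sampling distribution.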
The tracker cost model uses bootstrap CIs: 1,000 resamples of the 523,624-request test set produce a 95% CI of [3,314, 3,627] for the XGBoost Tweedie MAE. This interval does not overlap with the path LUT’s CI of [3,623, 3,984], confirming statistically significant improvement.
Variants:
- Percentile bootstrap (above): simple, works well for symmetric distributions
- BCa bootstrap (bias-corrected and accelerated): adjusts for bias and skewness, more accurate for small samples
- Studentized bootstrap: bootstraps the $t$-statistic, providing better coverage for asymmetric distributions
Confidence vs Prediction Intervals
| Type | What It Covers | Width as $n \to \infty$ |
|---|---|---|
| Confidence interval | Uncertainty in $\theta$ | Shrinks to zero |
| Prediction interval | Uncertainty in a future observation $Y_{n+1}$ | Bounded below by the noise level $\sigma$ |
For regression: the 95% prediction interval for a new observation $Y_{n+1}$ includes both estimation uncertainty and irreducible noise, so it is always wider than the CI for the conditional mean $E[Y \mid x]$.
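The gap is easy to see numerically. A sketch for the simplest case, a normal mean with known $\sigma$, where the CI half-width is $1.96\,\sigma\sqrt{1/n}$ and the prediction-interval half-width is $1.96\,\sigma\sqrt{1 + 1/n}$:

```python
import math
import random
import statistics

random.seed(2)
SIGMA = 1.0
sample = [random.gauss(0.0, SIGMA) for _ in range(100)]
n = len(sample)
xbar = statistics.mean(sample)

ci_half = 1.96 * SIGMA * math.sqrt(1 / n)      # uncertainty in the mean only
pi_half = 1.96 * SIGMA * math.sqrt(1 + 1 / n)  # adds the irreducible noise
print(f"CI half-width: {ci_half:.3f}")  # shrinks like 1/sqrt(n)
print(f"PI half-width: {pi_half:.3f}")  # never below 1.96 * sigma
```

As $n$ grows, `ci_half` goes to zero while `pi_half` approaches $1.96\,\sigma$.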
Relationship to Hypothesis Testing
A 95% CI and a two-sided test at $\alpha = 0.05$ are dual:
- If the null value $\theta_0$ is inside the 95% CI, the test fails to reject $H_0: \theta = \theta_0$
- If $\theta_0$ is outside the 95% CI, the test rejects $H_0$
CIs are strictly more informative than p-values: they convey both statistical significance (does the CI exclude the null?) and effect size (how far is the estimate from the null?).
Practical Guidelines
Report CIs, not just point estimates. A model with MAE 3,466 and 95% CI [3,314, 3,627] is more informative than MAE 3,466 alone. The width conveys measurement precision.
Use bootstrap CIs for complex statistics. Metrics like median error, Spearman $\rho$, and aggregation accuracy don’t have simple parametric distributions. Bootstrap CIs are always available and require only weak assumptions (roughly, that the observations are exchangeable).
Sample size determines width. CI width scales as $1/\sqrt{n}$. Halving the width requires quadrupling the sample size.
Non-overlapping CIs imply significance, but overlapping CIs don’t imply non-significance. Two CIs can overlap and the difference can still be significant. For comparing two estimates, construct a CI for the difference directly.
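The last guideline can be sketched as a paired bootstrap on the difference itself (the error arrays here are hypothetical; `errs_a` and `errs_b` stand for per-request absolute errors of two models on the same test set):

```python
import random
import statistics

random.seed(3)
# hypothetical paired absolute errors for two models on the same test set
errs_a = [random.expovariate(1 / 3.0) for _ in range(500)]
errs_b = [max(0.0, e + random.gauss(0.3, 0.5)) for e in errs_a]  # B slightly worse

n, B = len(errs_a), 2000
diffs = []
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]  # resample pairs, not models
    mae_a = statistics.mean(errs_a[i] for i in idx)
    mae_b = statistics.mean(errs_b[i] for i in idx)
    diffs.append(mae_b - mae_a)

diffs.sort()
lo, hi = diffs[int(0.025 * B)], diffs[int(0.975 * B)]
print(f"95% CI for MAE difference: [{lo:.3f}, {hi:.3f}]")
```

If the CI for the difference excludes zero, the difference is significant, regardless of whether the two individual CIs overlap. Resampling pairs preserves the correlation between the models' errors, which is what makes the paired comparison sharper.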
Summary
| Method | Assumptions | Use When |
|---|---|---|
| Normal/t-interval | Known parametric family | Simple settings (means, proportions) |
| Wald (MLE-based) | Asymptotic normality of MLE | Large samples, parametric models |
| Bootstrap | Exchangeability | Complex statistics, nonparametric settings |
| Bayesian credible interval | Prior distribution | Want probability statements about $\theta$ |