6: Confidence Intervals

A point estimate $\hat{\theta}$ provides a single best guess for an unknown parameter. A confidence interval quantifies the uncertainty of that estimate, providing a range of plausible values. Confidence intervals are essential for evaluating ML models, reporting A/B test results, and comparing model performance.


Definition

A $100(1-\alpha)\%$ confidence interval for parameter $\theta$ is a random interval $[L, U]$ such that:

$$P(L \leq \theta \leq U) = 1 - \alpha$$

before observing the data. The standard choice is $\alpha = 0.05$ (a 95% confidence interval).

Interpretation. If we repeated the experiment many times and constructed a 95% CI each time, approximately 95% of those intervals would contain the true $\theta$. A specific observed interval either contains $\theta$ or does not; the probability statement is about the procedure, not the specific interval.

Common misinterpretation. “There is a 95% probability that $\theta$ is in this interval” treats $\theta$ as random, which is a Bayesian (credible interval) interpretation, not a frequentist one.


Construction via Pivotal Quantities

A pivotal quantity is a function of the data and the parameter whose distribution does not depend on $\theta$.

Normal Mean, Known Variance

If $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0, 1)$$

The 95% CI:

$$\bar{x} \pm z_{0.025} \cdot \frac{\sigma}{\sqrt{n}} = \bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}$$
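
A minimal sketch of this interval in Python, using hypothetical simulated data (all names and constants here are illustrative):

```python
import numpy as np

# 95% z-interval for a normal mean with known sigma (hypothetical data).
rng = np.random.default_rng(0)
sigma = 2.0                                   # known population SD
x = rng.normal(loc=5.0, scale=sigma, size=100)

half_width = 1.96 * sigma / np.sqrt(x.size)   # z_{0.025} * sigma / sqrt(n)
ci = (x.mean() - half_width, x.mean() + half_width)
print(ci)  # interval width is 2 * 0.392 = 0.784
```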

Normal Mean, Unknown Variance

Replace $\sigma$ with $S$ (the sample standard deviation). The pivotal quantity changes distribution:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$

The 95% CI:

$$\bar{x} \pm t_{n-1,\,0.025} \cdot \frac{s}{\sqrt{n}}$$

For large $n$, $t_{n-1} \approx \mathcal{N}(0,1)$ and the intervals converge.
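
A sketch of the $t$-interval with SciPy, again on hypothetical data; the final two lines show the $t$ critical value converging to $z_{0.025} = 1.96$:

```python
import numpy as np
from scipy import stats

# 95% t-interval when sigma is unknown (hypothetical data).
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=20)
n = len(x)

t_crit = stats.t.ppf(0.975, df=n - 1)          # t_{n-1, 0.025}
half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
ci = (x.mean() - half_width, x.mean() + half_width)

# The t critical value approaches z_{0.025} = 1.96 as n grows:
print(stats.t.ppf(0.975, df=19))               # ≈ 2.093
print(stats.t.ppf(0.975, df=10_000))           # ≈ 1.960
```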

Proportion

For $\hat{p} = \bar{X}$ with $X_i \sim \text{Bernoulli}(p)$, the Wald interval uses the CLT:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

This performs poorly for small $n$ or $p$ near 0 or 1. The Wilson interval is preferred:

$$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$
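
Both intervals can be sketched directly from the formulas above (the helper names `wald_ci` and `wilson_ci` are illustrative, not a library API). The degenerate case $\hat{p} = 0$ shows why Wilson is preferred:

```python
import numpy as np
from scipy import stats

def wald_ci(k, n, alpha=0.05):
    """Wald interval: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = k / n
    hw = z * np.sqrt(p * (1 - p) / n)
    return p - hw, p + hw

def wilson_ci(k, n, alpha=0.05):
    """Wilson score interval; better coverage for small n or extreme p."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = k / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    hw = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return center - hw, center + hw

# With k = 0 successes the Wald interval collapses to a point at 0;
# Wilson still yields an informative upper bound (~0.161 for n = 20).
print(wald_ci(0, 20))
print(wilson_ci(0, 20))
```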

Asymptotic Intervals (MLE-based)

From the asymptotic normality of the MLE:

$$\hat{\theta} \pm z_{\alpha/2} \cdot \frac{1}{\sqrt{I_n(\hat{\theta})}}$$

where $I_n(\hat{\theta})$ is the observed Fisher information. This is the standard approach for parameters estimated by maximum likelihood when exact distributions are unavailable.
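
As a sketch, consider the rate $\lambda$ of an exponential distribution (a hypothetical example; the MLE is $\hat{\lambda} = 1/\bar{x}$ and the Fisher information is $I_n(\lambda) = n/\lambda^2$):

```python
import numpy as np

# Wald interval for an exponential rate lambda (hypothetical data).
# MLE: lambda_hat = 1 / xbar; Fisher information: I_n = n / lambda**2.
rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 3.0, size=200)   # true rate = 3

n = len(x)
lam_hat = 1 / x.mean()
se = 1 / np.sqrt(n / lam_hat**2)               # equals lam_hat / sqrt(n)
ci = (lam_hat - 1.96 * se, lam_hat + 1.96 * se)
print(ci)
```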


Bootstrap Confidence Intervals

When the sampling distribution is unknown or intractable, the bootstrap provides a nonparametric alternative.

Algorithm (percentile bootstrap):

  1. From the original sample of size $n$, draw $B$ bootstrap samples (sample with replacement)
  2. Compute $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$ (one statistic per bootstrap sample)
  3. The $100(1-\alpha)\%$ CI is $[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}]$
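
The steps above can be sketched in a few lines (the function name is illustrative; the lognormal data are hypothetical):

```python
import numpy as np

def percentile_bootstrap_ci(data, stat, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # One statistic per bootstrap resample (drawn with replacement).
    boot = np.array([stat(rng.choice(data, size=n, replace=True))
                     for _ in range(B)])
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example: 95% CI for the median of skewed (hypothetical) data.
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)
print(percentile_bootstrap_ci(data, np.median))
```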

The tracker cost model uses bootstrap CIs: 1,000 resamples of the 523,624-request test set produce a 95% CI of [3,314, 3,627] for the XGBoost Tweedie MAE. This interval does not overlap with the path LUT’s CI of [3,623, 3,984], confirming statistically significant improvement.

Variants:

  • Percentile bootstrap (above): simple, works well for symmetric distributions
  • BCa bootstrap (bias-corrected and accelerated): adjusts for bias and skewness, more accurate for small samples
  • Studentized bootstrap: bootstraps the $t$-statistic, providing better coverage for asymmetric distributions

Confidence vs Prediction Intervals

| Type | What It Covers | Width as $n \to \infty$ |
| --- | --- | --- |
| Confidence interval | Uncertainty in $\hat{\theta}$ | Shrinks to zero |
| Prediction interval | Uncertainty in a future $Y_{\text{new}}$ | Bounded below (irreducible noise $\sigma^2$) |

For regression: the 95% prediction interval for $Y_{\text{new}} \mid \mathbf{x}$ includes both estimation uncertainty and irreducible noise, so it is always wider than the CI for $E[Y \mid \mathbf{x}]$.


Relationship to Hypothesis Testing

A 95% CI and a two-sided test at $\alpha = 0.05$ are dual:

  • If $\theta_0$ is inside the 95% CI, the test fails to reject $H_0: \theta = \theta_0$
  • If $\theta_0$ is outside the 95% CI, the test rejects $H_0$

CIs are strictly more informative than p-values: they convey both statistical significance (does the CI exclude the null?) and effect size (how far is the estimate from the null?).
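
This duality can be checked numerically with SciPy's one-sample $t$-test (hypothetical data; the null values probed are illustrative):

```python
import numpy as np
from scipy import stats

# Duality of the 95% t-interval and the two-sided one-sample t-test
# at alpha = 0.05 (hypothetical data).
rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=30)

lo, hi = stats.t.interval(0.95, df=len(x) - 1,
                          loc=x.mean(), scale=stats.sem(x))

# Probe one null inside the CI and two well outside it.
for theta0 in (x.mean(), lo - 1.0, hi + 1.0):
    p = stats.ttest_1samp(x, popmean=theta0).pvalue
    inside = lo <= theta0 <= hi
    assert inside == (p > 0.05)   # reject exactly when theta0 is outside
```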


Practical Guidelines

Report CIs, not just point estimates. A model with MAE 3,466 and 95% CI [3,314, 3,627] is more informative than MAE 3,466 alone. The width conveys measurement precision.

Use bootstrap CIs for complex statistics. Metrics like median error, Spearman $\rho$, and aggregation accuracy don’t have simple parametric distributions. Bootstrap CIs are always available and require only weak assumptions (essentially exchangeability of the data).

Sample size determines width. CI width scales as $O(1/\sqrt{n})$. Doubling precision (halving the width) requires quadrupling the sample size.
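
The $O(1/\sqrt{n})$ scaling in one line (illustrative constants):

```python
import numpy as np

# z-interval width = 2 * z * sigma / sqrt(n): quadrupling n halves it.
sigma, z = 2.0, 1.96

def width(n):
    return 2 * z * sigma / np.sqrt(n)

print(width(100))   # → 0.784
print(width(400))   # → 0.392, half as wide with 4x the data
```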

Non-overlapping CIs imply significance, but overlapping CIs don’t imply non-significance. Two CIs can overlap and the difference can still be significant. For comparing two estimates, construct a CI for the difference directly.
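
A sketch of the recommended approach, bootstrapping the difference of two means directly (hypothetical data):

```python
import numpy as np

# Overlapping per-group CIs do not settle significance; bootstrap the
# difference of means directly instead (hypothetical data).
rng = np.random.default_rng(7)
a = rng.normal(loc=10.0, scale=3.0, size=200)
b = rng.normal(loc=11.0, scale=3.0, size=200)

diffs = np.array([
    rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    for _ in range(2000)
])
lo, hi = np.quantile(diffs, [0.025, 0.975])
# If [lo, hi] excludes 0, the difference in means is significant at the
# 5% level, regardless of whether the two individual CIs overlap.
print(lo, hi)
```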


Summary

| Method | Assumptions | Use When |
| --- | --- | --- |
| Normal/$t$-interval | Known parametric family | Simple settings (means, proportions) |
| Wald (MLE-based) | Asymptotic normality of MLE | Large samples, parametric models |
| Bootstrap | Exchangeability | Complex statistics, nonparametric settings |
| Bayesian credible interval | Prior distribution | Want probability statements about $\theta$ |