6: Confidence Intervals

A point estimate $\hat{\theta}$ provides a single best guess for an unknown parameter. A confidence interval quantifies the uncertainty of that estimate, providing a range of plausible values. Confidence intervals are essential for evaluating ML models, reporting A/B test results, and comparing model performance.


Definition

A $100(1-\alpha)\%$ confidence interval for parameter $\theta$ is a random interval $[L, U]$ such that:

$$P(L \leq \theta \leq U) = 1 - \alpha$$

before observing the data. The standard choice is $\alpha = 0.05$ (a 95% confidence interval).

Interpretation. If we repeated the experiment many times and constructed a 95% CI each time, approximately 95% of those intervals would contain the true $\theta$. A specific observed interval either contains $\theta$ or does not; the probability statement is about the procedure, not the specific interval.

Common misinterpretation. “There is a 95% probability that $\theta$ is in this interval” treats $\theta$ as random, which is a Bayesian (credible interval) interpretation, not a frequentist one.


Construction via Pivotal Quantities

A pivotal quantity is a function of the data and the parameter whose distribution does not depend on $\theta$.

Normal Mean, Known Variance

If $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0, 1)$$

The 95% CI:

$$\bar{x} \pm z_{0.025} \cdot \frac{\sigma}{\sqrt{n}} = \bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}$$
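
A minimal sketch of this interval in Python, using hypothetical simulated data (all names and constants here are illustrative):

```python
import numpy as np

# 95% z-interval for a normal mean with known sigma (hypothetical data).
rng = np.random.default_rng(0)
sigma = 2.0                                   # known population SD
x = rng.normal(loc=5.0, scale=sigma, size=100)

half_width = 1.96 * sigma / np.sqrt(x.size)   # z_{0.025} * sigma / sqrt(n)
ci = (x.mean() - half_width, x.mean() + half_width)
print(ci)  # interval width is 2 * 0.392 = 0.784
```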

Normal Mean, Unknown Variance

Replace $\sigma$ with $S$ (the sample standard deviation). The pivotal quantity changes distribution:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$

The 95% CI:

$$\bar{x} \pm t_{n-1,\,0.025} \cdot \frac{s}{\sqrt{n}}$$

For large $n$, $t_{n-1} \approx \mathcal{N}(0,1)$ and the intervals converge.
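
A sketch of the $t$-interval with SciPy, again on hypothetical data; the final two lines show the $t$ critical value converging to $z_{0.025} = 1.96$:

```python
import numpy as np
from scipy import stats

# 95% t-interval when sigma is unknown (hypothetical data).
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=20)
n = len(x)

t_crit = stats.t.ppf(0.975, df=n - 1)          # t_{n-1, 0.025}
half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
ci = (x.mean() - half_width, x.mean() + half_width)

# The t critical value approaches z_{0.025} = 1.96 as n grows:
print(stats.t.ppf(0.975, df=19))               # ≈ 2.093
print(stats.t.ppf(0.975, df=10_000))           # ≈ 1.960
```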

Proportion

For $\hat{p} = \bar{X}$ with $X_i \sim \text{Bernoulli}(p)$, the Wald interval uses the CLT:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

This performs poorly for small $n$ or $p$ near 0 or 1. The Wilson interval is preferred:

$$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$
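
Both intervals can be sketched directly from the formulas above (the helper names `wald_ci` and `wilson_ci` are illustrative, not a library API). The degenerate case $\hat{p} = 0$ shows why Wilson is preferred:

```python
import numpy as np
from scipy import stats

def wald_ci(k, n, alpha=0.05):
    """Wald interval: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = k / n
    hw = z * np.sqrt(p * (1 - p) / n)
    return p - hw, p + hw

def wilson_ci(k, n, alpha=0.05):
    """Wilson score interval; better coverage for small n or extreme p."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = k / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    hw = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return center - hw, center + hw

# With k = 0 successes the Wald interval collapses to a point at 0;
# Wilson still yields an informative upper bound (~0.161 for n = 20).
print(wald_ci(0, 20))
print(wilson_ci(0, 20))
```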

Asymptotic Intervals (MLE-based)

From the asymptotic normality of the MLE:

$$\hat{\theta} \pm z_{\alpha/2} \cdot \frac{1}{\sqrt{I_n(\hat{\theta})}}$$

where $I_n(\hat{\theta})$ is the observed Fisher information. This is the standard approach for parameters estimated by maximum likelihood when exact distributions are unavailable.
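
As a sketch, consider the rate $\lambda$ of an exponential distribution (a hypothetical example; the MLE is $\hat{\lambda} = 1/\bar{x}$ and the Fisher information is $I_n(\lambda) = n/\lambda^2$):

```python
import numpy as np

# Wald interval for an exponential rate lambda (hypothetical data).
# MLE: lambda_hat = 1 / xbar; Fisher information: I_n = n / lambda**2.
rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 3.0, size=200)   # true rate = 3

n = len(x)
lam_hat = 1 / x.mean()
se = 1 / np.sqrt(n / lam_hat**2)               # equals lam_hat / sqrt(n)
ci = (lam_hat - 1.96 * se, lam_hat + 1.96 * se)
print(ci)
```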


Bootstrap Confidence Intervals

When the sampling distribution is unknown or intractable, the bootstrap provides a nonparametric alternative.

Algorithm (percentile bootstrap):

  1. From the original sample of size $n$, draw $B$ bootstrap samples (sample with replacement)
  2. Compute $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$ (one statistic per bootstrap sample)
  3. The $100(1-\alpha)\%$ CI is $[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}]$
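
The steps above can be sketched in a few lines (the function name is illustrative; the lognormal data are hypothetical):

```python
import numpy as np

def percentile_bootstrap_ci(data, stat, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # One statistic per bootstrap resample (drawn with replacement).
    boot = np.array([stat(rng.choice(data, size=n, replace=True))
                     for _ in range(B)])
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example: 95% CI for the median of skewed (hypothetical) data.
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)
print(percentile_bootstrap_ci(data, np.median))
```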

The tracker cost model uses bootstrap CIs: 1,000 resamples of the 523,624-request test set produce a 95% CI of [3,314, 3,627] for the XGBoost Tweedie MAE. This interval does not overlap with the path LUT’s CI of [3,623, 3,984], confirming statistically significant improvement.

Variants:

  • Percentile bootstrap (above): simple, works well for symmetric distributions
  • BCa bootstrap (bias-corrected and accelerated): adjusts for bias and skewness, more accurate for small samples
  • Studentized bootstrap: bootstraps the $t$-statistic, providing better coverage for asymmetric distributions

Confidence vs Prediction Intervals

| Type | What It Covers | Width as $n \to \infty$ |
| --- | --- | --- |
| Confidence interval | Uncertainty in $\hat{\theta}$ | Shrinks to zero |
| Prediction interval | Uncertainty in a future $Y_{\text{new}}$ | Bounded below (irreducible noise $\sigma^2$) |

For regression: the 95% prediction interval for $Y_{\text{new}} \mid \mathbf{x}$ includes both estimation uncertainty and irreducible noise, so it is always wider than the CI for $E[Y \mid \mathbf{x}]$.


Relationship to Hypothesis Testing

A 95% CI and a two-sided test at $\alpha = 0.05$ are dual:

  • If $\theta_0$ is inside the 95% CI, the test fails to reject $H_0: \theta = \theta_0$
  • If $\theta_0$ is outside the 95% CI, the test rejects $H_0$

CIs are strictly more informative than p-values: they convey both statistical significance (does the CI exclude the null?) and effect size (how far is the estimate from the null?).
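
This duality can be checked numerically with SciPy's one-sample $t$-test (hypothetical data; the null values probed are illustrative):

```python
import numpy as np
from scipy import stats

# Duality of the 95% t-interval and the two-sided one-sample t-test
# at alpha = 0.05 (hypothetical data).
rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=30)

lo, hi = stats.t.interval(0.95, df=len(x) - 1,
                          loc=x.mean(), scale=stats.sem(x))

# Probe one null inside the CI and two well outside it.
for theta0 in (x.mean(), lo - 1.0, hi + 1.0):
    p = stats.ttest_1samp(x, popmean=theta0).pvalue
    inside = lo <= theta0 <= hi
    assert inside == (p > 0.05)   # reject exactly when theta0 is outside
```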


Practical Guidelines

Report CIs, not just point estimates. A model with MAE 3,466 and 95% CI [3,314, 3,627] is more informative than MAE 3,466 alone. The width conveys measurement precision.

Use bootstrap CIs for complex statistics. Metrics like median error, Spearman $\rho$, and aggregation accuracy don’t have simple parametric distributions. Bootstrap CIs are always available and require only weak assumptions (essentially exchangeability of the data).

Sample size determines width. CI width scales as $O(1/\sqrt{n})$. Doubling precision (halving the width) requires quadrupling the sample size.
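
The $O(1/\sqrt{n})$ scaling in one line (illustrative constants):

```python
import numpy as np

# z-interval width = 2 * z * sigma / sqrt(n): quadrupling n halves it.
sigma, z = 2.0, 1.96

def width(n):
    return 2 * z * sigma / np.sqrt(n)

print(width(100))   # → 0.784
print(width(400))   # → 0.392, half as wide with 4x the data
```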

Non-overlapping CIs imply significance, but overlapping CIs don’t imply non-significance. Two CIs can overlap and the difference can still be significant. For comparing two estimates, construct a CI for the difference directly.
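
A sketch of the recommended approach, bootstrapping the difference of two means directly (hypothetical data):

```python
import numpy as np

# Overlapping per-group CIs do not settle significance; bootstrap the
# difference of means directly instead (hypothetical data).
rng = np.random.default_rng(7)
a = rng.normal(loc=10.0, scale=3.0, size=200)
b = rng.normal(loc=11.0, scale=3.0, size=200)

diffs = np.array([
    rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    for _ in range(2000)
])
lo, hi = np.quantile(diffs, [0.025, 0.975])
# If [lo, hi] excludes 0, the difference in means is significant at the
# 5% level, regardless of whether the two individual CIs overlap.
print(lo, hi)
```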


Summary

| Method | Assumptions | Use When |
| --- | --- | --- |
| Normal/$t$-interval | Known parametric family | Simple settings (means, proportions) |
| Wald (MLE-based) | Asymptotic normality of MLE | Large samples, parametric models |
| Bootstrap | Exchangeability | Complex statistics, nonparametric settings |
| Bayesian credible interval | Prior distribution | Want probability statements about $\theta$ |