3: Estimators & Properties

An estimator is a rule for computing an approximate value of a population parameter from sample data. The quality of an estimator is measured by its bias, variance, consistency, and efficiency. These properties determine whether a learning algorithm will converge to the right answer and how much data it needs to get there.


Point Estimation

Given a random sample $X_1, \ldots, X_n$ from a distribution $F_\theta$, a point estimator $\hat{\theta} = g(X_1, \ldots, X_n)$ is any function of the data used to estimate the parameter $\theta$.

Examples:

  • Sample mean $\bar{X} = \frac{1}{n}\sum X_i$ estimates $E[X]$
  • Sample variance $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2$ estimates $\text{Var}(X)$
  • Sample proportion $\hat{p} = \bar{X}$ estimates $P(X = 1)$ for binary data

An estimator is a random variable (it depends on the random sample). Its quality is evaluated over the distribution of possible samples, not on any single realization.
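Because an estimator is a random variable, its behavior is naturally studied by simulating many samples and looking at the resulting distribution of estimates. A minimal sketch (Python with NumPy; the parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 50

# Each row is one sample of size n; each sample yields one realization of X-bar.
samples = rng.normal(mu, sigma, size=(10_000, n))
means = samples.mean(axis=1)  # 10,000 realizations of the estimator

# The estimator's distribution is centered near mu with variance near sigma^2 / n.
print(means.mean())  # close to 5.0
print(means.var())   # close to sigma^2 / n = 0.08
```

The spread of `means` across replications is exactly the "quality over the distribution of possible samples" referred to above.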


Bias

The bias of an estimator $\hat{\theta}$ is:

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

An estimator is unbiased if $\text{Bias}(\hat{\theta}) = 0$, i.e., $E[\hat{\theta}] = \theta$ for all $\theta$.

Examples:

  • $\bar{X}$ is unbiased for $\mu$: $E[\bar{X}] = \mu$
  • $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2$ is unbiased for $\sigma^2$
  • $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$ is biased: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$ (it underestimates; Bessel's correction, dividing by $n-1$, fixes this)
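The underestimation by the $1/n$ estimator is easy to confirm by simulation; a sketch, assuming NumPy and arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10  # small n makes the (n-1)/n factor visible
samples = rng.normal(0.0, 2.0, size=(100_000, n))  # true sigma^2 = 4.0

# Unbiased: divide by n-1 (ddof=1). Biased: divide by n (ddof=0).
s2 = samples.var(axis=1, ddof=1).mean()          # ~ sigma^2 = 4.0
sigma_hat2 = samples.var(axis=1, ddof=0).mean()  # ~ (n-1)/n * sigma^2 = 3.6

print(s2, sigma_hat2)
```

Averaged over many samples, the $1/(n-1)$ version centers on $4.0$ while the $1/n$ version centers on $\frac{9}{10} \cdot 4.0 = 3.6$.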

Unbiasedness is not always desirable. A biased estimator with lower variance can have lower total error (see MSE decomposition below). Ridge regression introduces bias to reduce variance, often improving prediction accuracy. The MLE of the Gaussian variance is biased, but it’s still the standard choice due to its other favorable properties.


Variance and Mean Squared Error

The variance of an estimator measures its spread across samples:

$$\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$$

The mean squared error combines bias and variance:

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$$

This is the bias-variance decomposition for estimators. It shows that total error has two sources:

  • Bias: systematic error from the estimator not centering on $\theta$
  • Variance: random error from sensitivity to the particular sample

For unbiased estimators, $\text{MSE} = \text{Var}$. For biased estimators, MSE accounts for both components.
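The decomposition can be checked numerically for the biased $1/n$ variance estimator discussed above (a sketch, NumPy assumed; the empirical MSE splits exactly into empirical bias$^2$ plus empirical variance):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n = 4.0, 10
samples = rng.normal(0.0, 2.0, size=(200_000, n))
est = samples.var(axis=1, ddof=0)  # biased 1/n variance estimator

mse = np.mean((est - sigma2) ** 2)
bias2 = (est.mean() - sigma2) ** 2
var = est.var()

# Theory for N(0, 4), n = 10: bias = -sigma^2/n = -0.4, so bias^2 = 0.16;
# Var = 2(n-1)sigma^4/n^2 = 2.88; MSE = 0.16 + 2.88 = 3.04.
print(mse, bias2 + var)  # the two numbers agree
```

The agreement of the two printed values is the sample version of the identity $E[(\hat{\theta} - \theta)^2] = \text{Bias}^2 + \text{Var}$.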

Connection to ML. The bias-variance tradeoff in model selection is exactly this decomposition applied to prediction error. A complex model (many parameters) has low bias but high variance; a simple model has high bias but low variance. Regularization (L2, dropout, early stopping) introduces bias to reduce variance.


Consistency

An estimator $\hat{\theta}_n$ is consistent if it converges in probability to the true value as sample size grows:

$$\hat{\theta}_n \xrightarrow{p} \theta \quad \text{as } n \to \infty$$

A sufficient condition: $\text{Bias}(\hat{\theta}_n) \to 0$ and $\text{Var}(\hat{\theta}_n) \to 0$ as $n \to \infty$.

Examples:

  • $\bar{X}$ is consistent for $\mu$ (by the LLN)
  • $S^2$ is consistent for $\sigma^2$
  • MLEs are generally consistent under regularity conditions

Consistency is a minimal requirement: an estimator that doesn’t converge to the truth is useless regardless of its finite-sample properties. However, consistency says nothing about the rate of convergence or finite-sample behavior.
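A quick simulation illustrates convergence in probability: the chance that $\bar{X}$ misses $\mu$ by more than a fixed tolerance shrinks as $n$ grows (a sketch, NumPy assumed, with arbitrary parameter choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 5.0
rates = []

# Estimate P(|X-bar - mu| > 0.1) over 2,000 replications for growing n.
for n in (10, 100, 1000, 10_000):
    means = rng.normal(mu, 2.0, size=(2_000, n)).mean(axis=1)
    rates.append(np.mean(np.abs(means - mu) > 0.1))

print(rates)  # the miss probability shrinks toward 0 as n grows
```

For any fixed tolerance the miss probability goes to zero, which is exactly the definition of $\hat{\theta}_n \xrightarrow{p} \theta$.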


Efficiency and the Cramer-Rao Lower Bound

Among all unbiased estimators, the most desirable is the one with the smallest variance. The Cramer-Rao lower bound (CRLB) provides the theoretical minimum:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I_n(\theta)}$$

where $I_n(\theta) = nI_1(\theta)$ is the Fisher information for $n$ observations:

$$I_1(\theta) = -E\left[\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right] = E\left[\left(\frac{\partial \log f(X; \theta)}{\partial \theta}\right)^2\right]$$

An unbiased estimator that achieves the CRLB is called efficient; it is then a minimum variance unbiased estimator (MVUE).

Example: Gaussian mean. For $X_i \sim \mathcal{N}(\mu, \sigma^2)$:

  • Fisher information: $I_1(\mu) = 1/\sigma^2$, so $I_n(\mu) = n/\sigma^2$
  • CRLB: $\text{Var}(\hat{\mu}) \geq \sigma^2/n$
  • $\bar{X}$ achieves this: $\text{Var}(\bar{X}) = \sigma^2/n$, so it is efficient

Example: Bernoulli parameter. For $X_i \sim \text{Bernoulli}(p)$:

  • Fisher information: $I_1(p) = 1/(p(1-p))$
  • CRLB: $\text{Var}(\hat{p}) \geq p(1-p)/n$
  • $\hat{p} = \bar{X}$ achieves this: $\text{Var}(\bar{X}) = p(1-p)/n$, so it is efficient
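Both examples can be spot-checked by simulation; a sketch for the Bernoulli case (NumPy assumed, arbitrary choices of $p$ and $n$):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 0.3, 50
crlb = p * (1 - p) / n  # theoretical lower bound = 0.0042

# Variance of p-hat = X-bar across many simulated samples of size n.
p_hats = rng.binomial(1, p, size=(100_000, n)).mean(axis=1)
print(p_hats.var(), crlb)  # the empirical variance sits at the bound
```

The empirical variance of $\hat{p}$ matches the CRLB, confirming that no unbiased estimator can do better here.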

Not all parameters have efficient estimators. When no unbiased estimator achieves the CRLB, the bound is still useful as a benchmark for how well any estimator can perform.


Sufficiency

A statistic $T = T(X_1, \ldots, X_n)$ is sufficient for $\theta$ if the conditional distribution of the data given $T$ does not depend on $\theta$:

$$P(X_1, \ldots, X_n \mid T; \theta) = P(X_1, \ldots, X_n \mid T)$$

A sufficient statistic captures all the information in the data about $\theta$. Once you know $T$, the rest of the data provides no additional information.

Fisher-Neyman factorization theorem. $T$ is sufficient iff the likelihood factors as:

$$L(\theta; \mathbf{x}) = g(T(\mathbf{x}), \theta) \cdot h(\mathbf{x})$$

where $g$ depends on the data only through $T$ and $h$ does not depend on $\theta$.

Examples:

  • For $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$: $\bar{X}$ is sufficient for $\mu$
  • For $\mathcal{N}(\mu, \sigma^2)$ with both unknown: $(\bar{X}, S^2)$ is jointly sufficient for $(\mu, \sigma^2)$

Rao-Blackwell theorem. Given any unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$, the estimator $\tilde{\theta} = E[\hat{\theta} \mid T]$ is unbiased and has variance $\leq \text{Var}(\hat{\theta})$. This provides a systematic method for improving estimators.
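A toy illustration of Rao-Blackwellization (this specific setup is an assumed example, not part of the theorem statement): for Bernoulli data, $\hat{p} = X_1$ is unbiased but ignores most of the sample; conditioning on the sufficient statistic $T = \sum X_i$ gives $E[X_1 \mid T] = T/n = \bar{X}$, which is still unbiased but far less variable:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 0.3, 20
samples = rng.binomial(1, p, size=(100_000, n))

crude = samples[:, 0].astype(float)  # unbiased but uses only X_1
rb = samples.mean(axis=1)            # E[X_1 | T] = X-bar: Rao-Blackwellized

# Both center on p, but conditioning on T slashes the variance:
# Var(X_1) = p(1-p) = 0.21 vs Var(X-bar) = p(1-p)/n = 0.0105.
print(crude.mean(), rb.mean())
print(crude.var(), rb.var())
```

The variance drops by a factor of $n$, exactly as Rao-Blackwell guarantees it cannot increase.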


Comparing Estimators

| Property | Definition | Importance |
| --- | --- | --- |
| Unbiasedness | $E[\hat{\theta}] = \theta$ | Correct on average |
| Low variance | $\text{Var}(\hat{\theta})$ small | Stable across samples |
| Consistency | $\hat{\theta}_n \to \theta$ | Converges with more data |
| Efficiency | Achieves CRLB | Best possible precision |
| Sufficiency | Uses all information | No data is wasted |

In practice, MSE (bias$^2$ + variance) is the most useful criterion. A slightly biased estimator with much lower variance (e.g., ridge regression) often outperforms the unbiased MLE in prediction tasks. The James-Stein estimator demonstrates this dramatically: for estimating a multivariate normal mean in $d \geq 3$ dimensions, the MLE ($\bar{X}$) is inadmissible, because there exists a biased estimator with uniformly lower MSE.
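The James-Stein effect can be reproduced in a few lines (a sketch under assumed conditions: unit-variance Gaussian observations, one $d$-dimensional draw per replication, shrinkage toward the origin, and an arbitrary true mean):

```python
import numpy as np

rng = np.random.default_rng(6)
d, reps = 10, 50_000
theta = np.ones(d)  # hypothetical true mean

X = rng.normal(theta, 1.0, size=(reps, d))  # one N(theta, I_d) draw per replication

# James-Stein shrinks the MLE toward the origin: (1 - (d-2)/||X||^2) * X
shrink = 1.0 - (d - 2) / np.sum(X ** 2, axis=1, keepdims=True)
js = shrink * X

mse_mle = np.mean(np.sum((X - theta) ** 2, axis=1))  # ~ d = 10
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))  # uniformly smaller for d >= 3
print(mse_mle, mse_js)
```

Despite being biased, the shrinkage estimator has strictly lower total MSE than the MLE, which is what inadmissibility means here.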


Summary

| Concept | Key Result |
| --- | --- |
| MSE decomposition | $\text{MSE} = \text{Bias}^2 + \text{Variance}$ |
| Cramer-Rao bound | $\text{Var}(\hat{\theta}) \geq 1/I_n(\theta)$ for unbiased $\hat{\theta}$ |
| Consistency | $\hat{\theta}_n \to \theta$ as $n \to \infty$ |
| Sufficiency | $T$ captures all info about $\theta$ |
| Rao-Blackwell | Conditioning on sufficient statistics reduces variance |

These properties form the theoretical foundation for evaluating any estimation procedure, from simple sample statistics to complex ML models. The bias-variance tradeoff, in particular, is the central lens through which model selection decisions are understood.