3: Estimators & Properties

An estimator is a rule for computing an approximate value of a population parameter from sample data. The quality of an estimator is measured by its bias, variance, consistency, and efficiency. These properties determine whether a learning algorithm will converge to the right answer and how much data it needs to get there.


Point Estimation

Given a random sample $X_1, \ldots, X_n$ from a distribution $F_\theta$, a point estimator $\hat{\theta} = g(X_1, \ldots, X_n)$ is any function of the data used to estimate the parameter $\theta$.

Examples:

  • Sample mean $\bar{X} = \frac{1}{n}\sum X_i$ estimates $E[X]$
  • Sample variance $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2$ estimates $\text{Var}(X)$
  • Sample proportion $\hat{p} = \bar{X}$ estimates $P(X = 1)$ for binary data

An estimator is a random variable (it depends on the random sample). Its quality is evaluated over the distribution of possible samples, not on any single realization.
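Because an estimator is a random variable, its behavior is naturally studied by simulating many samples and looking at the resulting distribution of estimates. A minimal sketch (Python with NumPy; the parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 50

# Each row is one sample of size n; each sample yields one realization of X-bar.
samples = rng.normal(mu, sigma, size=(10_000, n))
means = samples.mean(axis=1)  # 10,000 realizations of the estimator

# The estimator's distribution is centered near mu with variance near sigma^2 / n.
print(means.mean())  # close to 5.0
print(means.var())   # close to sigma^2 / n = 0.08
```

The spread of `means` across replications is exactly the "quality over the distribution of possible samples" referred to above.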


Bias

The bias of an estimator $\hat{\theta}$ is:

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

An estimator is unbiased if $\text{Bias}(\hat{\theta}) = 0$, i.e., $E[\hat{\theta}] = \theta$ for all $\theta$.

Examples:

  • $\bar{X}$ is unbiased for $\mu$: $E[\bar{X}] = \mu$
  • $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2$ is unbiased for $\sigma^2$
  • $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$ is biased: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$ (it underestimates; Bessel's correction, dividing by $n-1$, fixes this)
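The underestimation by the $1/n$ estimator is easy to confirm by simulation; a sketch, assuming NumPy and arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10  # small n makes the (n-1)/n factor visible
samples = rng.normal(0.0, 2.0, size=(100_000, n))  # true sigma^2 = 4.0

# Unbiased: divide by n-1 (ddof=1). Biased: divide by n (ddof=0).
s2 = samples.var(axis=1, ddof=1).mean()          # ~ sigma^2 = 4.0
sigma_hat2 = samples.var(axis=1, ddof=0).mean()  # ~ (n-1)/n * sigma^2 = 3.6

print(s2, sigma_hat2)
```

Averaged over many samples, the $1/(n-1)$ version centers on $4.0$ while the $1/n$ version centers on $\frac{9}{10} \cdot 4.0 = 3.6$.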

Unbiasedness is not always desirable. A biased estimator with lower variance can have lower total error (see MSE decomposition below). Ridge regression introduces bias to reduce variance, often improving prediction accuracy. The MLE of the Gaussian variance is biased, but it’s still the standard choice due to its other favorable properties.


Variance and Mean Squared Error

The variance of an estimator measures its spread across samples:

$$\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$$

The mean squared error combines bias and variance:

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$$

This is the bias-variance decomposition for estimators. It shows that total error has two sources:

  • Bias: systematic error from the estimator not centering on $\theta$
  • Variance: random error from sensitivity to the particular sample

For unbiased estimators, $\text{MSE} = \text{Var}$. For biased estimators, MSE accounts for both components.
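The decomposition can be checked numerically for the biased $1/n$ variance estimator discussed above (a sketch, NumPy assumed; the empirical MSE splits exactly into empirical bias$^2$ plus empirical variance):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n = 4.0, 10
samples = rng.normal(0.0, 2.0, size=(200_000, n))
est = samples.var(axis=1, ddof=0)  # biased 1/n variance estimator

mse = np.mean((est - sigma2) ** 2)
bias2 = (est.mean() - sigma2) ** 2
var = est.var()

# Theory for N(0, 4), n = 10: bias = -sigma^2/n = -0.4, so bias^2 = 0.16;
# Var = 2(n-1)sigma^4/n^2 = 2.88; MSE = 0.16 + 2.88 = 3.04.
print(mse, bias2 + var)  # the two numbers agree
```

The agreement of the two printed values is the sample version of the identity $E[(\hat{\theta} - \theta)^2] = \text{Bias}^2 + \text{Var}$.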

Connection to ML. The bias-variance tradeoff in model selection is exactly this decomposition applied to prediction error. A complex model (many parameters) has low bias but high variance; a simple model has high bias but low variance. Regularization (L2, dropout, early stopping) introduces bias to reduce variance.


Consistency

An estimator $\hat{\theta}_n$ is consistent if it converges in probability to the true value as sample size grows:

$$\hat{\theta}_n \xrightarrow{p} \theta \quad \text{as } n \to \infty$$

A sufficient condition: $\text{Bias}(\hat{\theta}_n) \to 0$ and $\text{Var}(\hat{\theta}_n) \to 0$ as $n \to \infty$.

Examples:

  • $\bar{X}$ is consistent for $\mu$ (by the LLN)
  • $S^2$ is consistent for $\sigma^2$
  • MLEs are generally consistent under regularity conditions

Consistency is a minimal requirement: an estimator that doesn’t converge to the truth is useless regardless of its finite-sample properties. However, consistency says nothing about the rate of convergence or finite-sample behavior.
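A quick simulation illustrates convergence in probability: the chance that $\bar{X}$ misses $\mu$ by more than a fixed tolerance shrinks as $n$ grows (a sketch, NumPy assumed, with arbitrary parameter choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 5.0
rates = []

# Estimate P(|X-bar - mu| > 0.1) over 2,000 replications for growing n.
for n in (10, 100, 1000, 10_000):
    means = rng.normal(mu, 2.0, size=(2_000, n)).mean(axis=1)
    rates.append(np.mean(np.abs(means - mu) > 0.1))

print(rates)  # the miss probability shrinks toward 0 as n grows
```

For any fixed tolerance the miss probability goes to zero, which is exactly the definition of $\hat{\theta}_n \xrightarrow{p} \theta$.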


Efficiency and the Cramer-Rao Lower Bound

Among all unbiased estimators, the most desirable is the one with the smallest variance. The Cramer-Rao lower bound (CRLB) provides the theoretical minimum:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I_n(\theta)}$$

where $I_n(\theta) = nI_1(\theta)$ is the Fisher information for $n$ observations:

$$I_1(\theta) = -E\left[\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right] = E\left[\left(\frac{\partial \log f(X; \theta)}{\partial \theta}\right)^2\right]$$

An unbiased estimator that achieves the CRLB is called efficient; it is then a minimum variance unbiased estimator (MVUE).

Example: Gaussian mean. For $X_i \sim \mathcal{N}(\mu, \sigma^2)$:

  • Fisher information: $I_1(\mu) = 1/\sigma^2$, so $I_n(\mu) = n/\sigma^2$
  • CRLB: $\text{Var}(\hat{\mu}) \geq \sigma^2/n$
  • $\bar{X}$ achieves this: $\text{Var}(\bar{X}) = \sigma^2/n$, so it is efficient

Example: Bernoulli parameter. For $X_i \sim \text{Bernoulli}(p)$:

  • Fisher information: $I_1(p) = 1/(p(1-p))$
  • CRLB: $\text{Var}(\hat{p}) \geq p(1-p)/n$
  • $\hat{p} = \bar{X}$ achieves this: $\text{Var}(\bar{X}) = p(1-p)/n$, so it is efficient
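Both examples can be spot-checked by simulation; a sketch for the Bernoulli case (NumPy assumed, arbitrary choices of $p$ and $n$):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 0.3, 50
crlb = p * (1 - p) / n  # theoretical lower bound = 0.0042

# Variance of p-hat = X-bar across many simulated samples of size n.
p_hats = rng.binomial(1, p, size=(100_000, n)).mean(axis=1)
print(p_hats.var(), crlb)  # the empirical variance sits at the bound
```

The empirical variance of $\hat{p}$ matches the CRLB, confirming that no unbiased estimator can do better here.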

Not all parameters have efficient estimators. When no unbiased estimator achieves the CRLB, the bound is still useful as a benchmark for how well any estimator can perform.


Sufficiency

A statistic $T = T(X_1, \ldots, X_n)$ is sufficient for $\theta$ if the conditional distribution of the data given $T$ does not depend on $\theta$:

$$P(X_1, \ldots, X_n \mid T; \theta) = P(X_1, \ldots, X_n \mid T)$$

A sufficient statistic captures all the information in the data about $\theta$. Once you know $T$, the rest of the data provides no additional information.

Fisher-Neyman factorization theorem. $T$ is sufficient iff the likelihood factors as:

$$L(\theta; \mathbf{x}) = g(T(\mathbf{x}), \theta) \cdot h(\mathbf{x})$$

where $g$ depends on the data only through $T$ and $h$ does not depend on $\theta$.

Examples:

  • For $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$: $\bar{X}$ is sufficient for $\mu$
  • For $\mathcal{N}(\mu, \sigma^2)$ with both unknown: $(\bar{X}, S^2)$ is jointly sufficient for $(\mu, \sigma^2)$

Rao-Blackwell theorem. Given any unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$, the estimator $\tilde{\theta} = E[\hat{\theta} \mid T]$ is unbiased and has variance $\leq \text{Var}(\hat{\theta})$. This provides a systematic method for improving estimators.
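A toy illustration of Rao-Blackwellization (this specific setup is an assumed example, not part of the theorem statement): for Bernoulli data, $\hat{p} = X_1$ is unbiased but ignores most of the sample; conditioning on the sufficient statistic $T = \sum X_i$ gives $E[X_1 \mid T] = T/n = \bar{X}$, which is still unbiased but far less variable:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 0.3, 20
samples = rng.binomial(1, p, size=(100_000, n))

crude = samples[:, 0].astype(float)  # unbiased but uses only X_1
rb = samples.mean(axis=1)            # E[X_1 | T] = X-bar: Rao-Blackwellized

# Both center on p, but conditioning on T slashes the variance:
# Var(X_1) = p(1-p) = 0.21 vs Var(X-bar) = p(1-p)/n = 0.0105.
print(crude.mean(), rb.mean())
print(crude.var(), rb.var())
```

The variance drops by a factor of $n$, exactly as Rao-Blackwell guarantees it cannot increase.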


Comparing Estimators

| Property | Definition | Importance |
| --- | --- | --- |
| Unbiasedness | $E[\hat{\theta}] = \theta$ | Correct on average |
| Low variance | $\text{Var}(\hat{\theta})$ small | Stable across samples |
| Consistency | $\hat{\theta}_n \to \theta$ | Converges with more data |
| Efficiency | Achieves CRLB | Best possible precision |
| Sufficiency | Uses all information | No data is wasted |

In practice, MSE (bias$^2$ + variance) is the most useful criterion. A slightly biased estimator with much lower variance (e.g., ridge regression) often outperforms the unbiased MLE in prediction tasks. The James-Stein estimator demonstrates this dramatically: for estimating a multivariate normal mean in $d \geq 3$ dimensions, the MLE ($\bar{X}$) is inadmissible, because there exists a biased estimator with uniformly lower MSE.
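The James-Stein effect can be reproduced in a few lines (a sketch under assumed conditions: unit-variance Gaussian observations, one $d$-dimensional draw per replication, shrinkage toward the origin, and an arbitrary true mean):

```python
import numpy as np

rng = np.random.default_rng(6)
d, reps = 10, 50_000
theta = np.ones(d)  # hypothetical true mean

X = rng.normal(theta, 1.0, size=(reps, d))  # one N(theta, I_d) draw per replication

# James-Stein shrinks the MLE toward the origin: (1 - (d-2)/||X||^2) * X
shrink = 1.0 - (d - 2) / np.sum(X ** 2, axis=1, keepdims=True)
js = shrink * X

mse_mle = np.mean(np.sum((X - theta) ** 2, axis=1))  # ~ d = 10
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))  # uniformly smaller for d >= 3
print(mse_mle, mse_js)
```

Despite being biased, the shrinkage estimator has strictly lower total MSE than the MLE, which is what inadmissibility means here.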


Summary

| Concept | Key Result |
| --- | --- |
| MSE decomposition | $\text{MSE} = \text{Bias}^2 + \text{Variance}$ |
| Cramer-Rao bound | $\text{Var}(\hat{\theta}) \geq 1/I_n(\theta)$ for unbiased $\hat{\theta}$ |
| Consistency | $\hat{\theta}_n \to \theta$ as $n \to \infty$ |
| Sufficiency | $T$ captures all info about $\theta$ |
| Rao-Blackwell | Conditioning on sufficient statistics reduces variance |

These properties form the theoretical foundation for evaluating any estimation procedure, from simple sample statistics to complex ML models. The bias-variance tradeoff, in particular, is the central lens through which model selection decisions are understood.