3: Estimators & Properties
An estimator is a rule for computing an approximate value of a population parameter from sample data. The quality of an estimator is measured by its bias, variance, consistency, and efficiency. These properties determine whether a learning algorithm will converge to the right answer and how much data it needs to get there.
Point Estimation
Given a random sample $X_1, \dots, X_n$ from a distribution $p(x; \theta)$, a point estimator $\hat{\theta} = g(X_1, \dots, X_n)$ is any function of the data used to estimate the parameter $\theta$.
Examples:
- Sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ estimates $\mu$
- Sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$ estimates $\sigma^2$
- Sample proportion $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i$ (binary $X_i$) estimates $p$
An estimator is a random variable (it depends on the random sample). Its quality is evaluated over the distribution of possible samples, not on any single realization.
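These estimators are one-liners in practice; a minimal Python sketch (the simulated Bernoulli population with true $p = 0.6$ is an illustrative choice, not from the text):

```python
import random

random.seed(0)

# Simulated population: Bernoulli coin flips with true p = 0.6 (illustrative),
# so the population mean is p and the variance is p * (1 - p) = 0.24.
n = 10_000
sample = [1 if random.random() < 0.6 else 0 for _ in range(n)]

# Sample mean: estimates the population mean mu.
x_bar = sum(sample) / n

# Sample variance (n - 1 denominator): estimates sigma^2.
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Sample proportion: for binary data it coincides with the sample mean.
p_hat = x_bar

print(x_bar, s2, p_hat)
```

Rerunning with a different seed gives slightly different values — the estimators are themselves random variables, as noted above.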
Bias
The bias of an estimator $\hat{\theta}$ is:
$$\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$$
An estimator is unbiased if $\operatorname{Bias}(\hat{\theta}) = 0$, i.e., $\mathbb{E}[\hat{\theta}] = \theta$ for all $\theta$.
Examples:
- $\bar{X}$ is unbiased for $\mu$: $\mathbb{E}[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[X_i] = \mu$
- $S^2$ (with $n-1$ in the denominator) is unbiased for $\sigma^2$
- $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$ is biased: $\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$ (it underestimates; Bessel's correction fixes this)
Unbiasedness is not always desirable. A biased estimator with lower variance can have lower total error (see MSE decomposition below). Ridge regression introduces bias to reduce variance, often improving prediction accuracy. The MLE of the Gaussian variance is biased, but it’s still the standard choice due to its other favorable properties.
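The bias of the $1/n$ variance estimator is easy to see empirically; a small simulation sketch (sample size $n = 5$ and $\sigma^2 = 1$ are illustrative choices):

```python
import random

random.seed(1)

# Empirically check E[sigma_hat^2] = (n-1)/n * sigma^2 for the MLE variance
# estimator (1/n denominator) on Gaussian samples with sigma^2 = 1.
n, trials = 5, 20_000
mle_vals, s2_vals = [], []
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    x_bar = sum(xs) / n
    ss = sum((x - x_bar) ** 2 for x in xs)
    mle_vals.append(ss / n)        # biased: expectation (n-1)/n * sigma^2 = 0.8
    s2_vals.append(ss / (n - 1))   # unbiased: expectation sigma^2 = 1.0

print(sum(mle_vals) / trials)  # close to 0.8
print(sum(s2_vals) / trials)   # close to 1.0
```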
Variance and Mean Squared Error
The variance of an estimator measures its spread across samples:
$$\operatorname{Var}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]$$
The mean squared error combines bias and variance:
$$\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right] = \operatorname{Bias}(\hat{\theta})^2 + \operatorname{Var}(\hat{\theta})$$
This is the bias-variance decomposition for estimators. It shows that total error has two sources:
- Bias: systematic error from the estimator not centering on $\theta$
- Variance: random error from sensitivity to the particular sample
For unbiased estimators, $\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta})$. For biased estimators, MSE accounts for both components.
Connection to ML. The bias-variance tradeoff in model selection is exactly this decomposition applied to prediction error. A complex model (many parameters) has low bias but high variance; a simple model has high bias but low variance. Regularization (L2, dropout, early stopping) introduces bias to reduce variance.
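A concrete illustration of the tradeoff: for Gaussian data, dividing the sum of squared deviations by $n+1$ instead of $n-1$ adds bias but yields the lowest MSE among estimators of this form. A simulation sketch (the constants are illustrative choices):

```python
import random

random.seed(2)

# Estimate sigma^2 = 1 from n = 5 Gaussian observations, dividing the sum of
# squared deviations by n-1 (unbiased), n (MLE), and n+1 (minimum MSE), and
# compare empirical MSE = average squared error over many repeated samples.
n, trials, sigma2 = 5, 100_000, 1.0

def empirical_mse(denominator):
    total = 0.0
    for _ in range(trials):
        xs = [random.gauss(0.0, 1.0) for _ in range(n)]
        x_bar = sum(xs) / n
        est = sum((x - x_bar) ** 2 for x in xs) / denominator
        total += (est - sigma2) ** 2
    return total / trials

mse_unbiased = empirical_mse(n - 1)  # theoretical MSE = 2/(n-1) = 0.5
mse_mle = empirical_mse(n)           # theoretical MSE = (2n-1)/n^2 = 0.36
mse_shrunk = empirical_mse(n + 1)    # theoretical MSE = 2/(n+1) = 0.333...
print(mse_unbiased, mse_mle, mse_shrunk)
```

More bias, less variance, lower total error — the unbiased choice loses on MSE.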
Consistency
An estimator $\hat{\theta}_n$ is consistent if it converges in probability to the true value as sample size grows:
$$\hat{\theta}_n \xrightarrow{p} \theta, \quad \text{i.e.,} \quad \lim_{n \to \infty} P\left(|\hat{\theta}_n - \theta| > \epsilon\right) = 0 \text{ for all } \epsilon > 0$$
A sufficient condition: $\operatorname{Bias}(\hat{\theta}_n) \to 0$ and $\operatorname{Var}(\hat{\theta}_n) \to 0$ as $n \to \infty$.
Examples:
- $\bar{X}$ is consistent for $\mu$ (by the LLN)
- $S^2$ is consistent for $\sigma^2$
- MLEs are generally consistent under regularity conditions
Consistency is a minimal requirement: an estimator that doesn’t converge to the truth is useless regardless of its finite-sample properties. However, consistency says nothing about the rate of convergence or finite-sample behavior.
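Consistency of the sample mean is visible in simulation: the estimation error shrinks as $n$ grows (the Uniform(0,1) population with $\mu = 0.5$ is an illustrative choice):

```python
import random

random.seed(3)

# Sample mean of Uniform(0, 1) draws (true mean mu = 0.5): the absolute
# error |x_bar - mu| shrinks toward 0 as the sample size n grows.
errors = {}
for n in (10, 1_000, 100_000):
    xs = [random.random() for _ in range(n)]
    errors[n] = abs(sum(xs) / n - 0.5)
print(errors)
```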
Efficiency and the Cramer-Rao Lower Bound
Among all unbiased estimators, the most desirable is the one with the smallest variance. The Cramer-Rao lower bound (CRLB) provides the theoretical minimum:
$$\operatorname{Var}(\hat{\theta}) \geq \frac{1}{I_n(\theta)}$$
where $I_n(\theta) = n I(\theta)$ is the Fisher information for $n$ observations:
$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log p(X; \theta)\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log p(X; \theta)\right]$$
An estimator that achieves the CRLB is called efficient or a minimum variance unbiased estimator (MVUE).
Example: Gaussian mean. For $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known:
- Fisher information: $I(\mu) = 1/\sigma^2$, so $I_n(\mu) = n/\sigma^2$
- CRLB: $\operatorname{Var}(\hat{\mu}) \geq \sigma^2/n$
- $\bar{X}$ achieves this: $\operatorname{Var}(\bar{X}) = \sigma^2/n$ — it is efficient
Example: Bernoulli parameter. For $X_1, \dots, X_n \sim \operatorname{Bernoulli}(p)$:
- Fisher information: $I(p) = \frac{1}{p(1-p)}$, so $I_n(p) = \frac{n}{p(1-p)}$
- CRLB: $\operatorname{Var}(\hat{p}) \geq \frac{p(1-p)}{n}$
- $\hat{p} = \bar{X}$ achieves this: $\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}$ — it is efficient
Not all parameters have efficient estimators. When no unbiased estimator achieves the CRLB, the bound is still useful as a benchmark for how well any estimator can perform.
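The Bernoulli case can be checked by simulation: the empirical variance of $\hat{p}$ across repeated samples matches the CRLB (the constants are illustrative choices):

```python
import random

random.seed(4)

# Repeatedly draw Bernoulli(p) samples of size n and measure the variance of
# p_hat = X_bar across draws; it should match the CRLB p * (1 - p) / n.
p, n, trials = 0.3, 50, 20_000
estimates = []
for _ in range(trials):
    hits = sum(1 for _ in range(n) if random.random() < p)
    estimates.append(hits / n)

mean_hat = sum(estimates) / trials
var_hat = sum((e - mean_hat) ** 2 for e in estimates) / (trials - 1)
crlb = p * (1 - p) / n
print(var_hat, crlb)  # both close to 0.0042
```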
Sufficiency
A statistic $T = T(X_1, \dots, X_n)$ is sufficient for $\theta$ if the conditional distribution of the data given $T$ does not depend on $\theta$:
$$P(X_1, \dots, X_n \mid T = t; \theta) = P(X_1, \dots, X_n \mid T = t)$$
A sufficient statistic captures all the information in the data about $\theta$. Once you know $T$, the rest of the data provides no additional information.
Fisher-Neyman factorization theorem. $T$ is sufficient iff the likelihood factors as:
$$p(x_1, \dots, x_n; \theta) = g\left(T(x_1, \dots, x_n); \theta\right) \, h(x_1, \dots, x_n)$$
where $g$ depends on the data only through $T$ and $h$ does not depend on $\theta$.
Examples:
- For $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$: $\bar{X}$ is sufficient for $\mu$
- For $\mathcal{N}(\mu, \sigma^2)$ with both unknown: $\left(\sum_i X_i, \sum_i X_i^2\right)$ is jointly sufficient for $(\mu, \sigma^2)$
Rao-Blackwell theorem. Given any unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$, the estimator $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$ is unbiased and has variance $\operatorname{Var}(\tilde{\theta}) \leq \operatorname{Var}(\hat{\theta})$. This provides a systematic method for improving estimators.
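A standard textbook illustration of Rao-Blackwell (not from the text above): for $X_i \sim \operatorname{Poisson}(\lambda)$, a crude unbiased estimator of $\theta = P(X = 0) = e^{-\lambda}$ is the indicator $\mathbf{1}\{X_1 = 0\}$; conditioning on the sufficient statistic $T = \sum_i X_i$ gives $\mathbb{E}[\mathbf{1}\{X_1 = 0\} \mid T] = \left(\frac{n-1}{n}\right)^T$, which has far lower variance:

```python
import math
import random

random.seed(5)

def poisson(lam):
    # Knuth's method: multiply uniforms until the product drops below e^-lam.
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

n, trials, lam = 10, 20_000, 1.0
theta = math.exp(-lam)  # target: P(X = 0) = e^-1

crude, rb = [], []
for _ in range(trials):
    xs = [poisson(lam) for _ in range(n)]
    crude.append(1.0 if xs[0] == 0 else 0.0)  # unbiased but ignores x_2..x_n
    rb.append(((n - 1) / n) ** sum(xs))       # E[crude | T] with T = sum(xs)

def mean_var(vals):
    m = sum(vals) / len(vals)
    return m, sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

print(mean_var(crude))  # both means are near theta = 0.368...
print(mean_var(rb))     # ...but the Rao-Blackwellized variance is far smaller
```

Both estimators are unbiased; conditioning on $T$ removes the variance due to everything in the sample except the sufficient statistic.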
Comparing Estimators
| Property | Definition | Importance |
|---|---|---|
| Unbiasedness | $\mathbb{E}[\hat{\theta}] = \theta$ | Correct on average |
| Low variance | $\operatorname{Var}(\hat{\theta})$ small | Stable across samples |
| Consistency | $\hat{\theta}_n \xrightarrow{p} \theta$ | Converges with more data |
| Efficiency | Achieves the CRLB | Best possible precision |
| Sufficiency | Uses all information in the data | No data is wasted |
In practice, MSE (bias² + variance) is the most useful criterion. A slightly biased estimator with much lower variance (e.g., ridge regression) often outperforms the unbiased MLE in prediction tasks. The James-Stein estimator demonstrates this dramatically: for estimating a multivariate normal mean in $d \geq 3$ dimensions, the MLE ($\hat{\mu} = \bar{X}$) is inadmissible — there exists a biased estimator with uniformly lower MSE.
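The James-Stein effect is easy to reproduce in simulation; a sketch for a single observation $x \sim \mathcal{N}(\theta, I_d)$ using the basic (not positive-part) shrinkage estimator ($d = 10$ and the true mean are illustrative choices):

```python
import random

random.seed(6)

# One observation x ~ N(theta, I_d) in d = 10 dimensions (unit variance).
# MLE: theta_hat = x, with total risk E||x - theta||^2 = d.
# James-Stein: shrink x toward 0 by 1 - (d-2)/||x||^2; uniformly lower risk.
d, trials = 10, 20_000
theta = [0.5] * d  # true mean (illustrative; the improvement holds for any theta)

mse_mle = mse_js = 0.0
for _ in range(trials):
    x = [t + random.gauss(0.0, 1.0) for t in theta]
    shrink = 1.0 - (d - 2) / sum(v * v for v in x)
    mse_mle += sum((v - t) ** 2 for v, t in zip(x, theta))
    mse_js += sum((shrink * v - t) ** 2 for v, t in zip(x, theta))

print(mse_mle / trials, mse_js / trials)  # MLE risk ~ d = 10; JS risk is lower
```

The positive-part variant, which clips the shrink factor at 0, does even better; the plain version shown here already beats the MLE.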
Summary
| Concept | Key Result |
|---|---|
| MSE decomposition | $\operatorname{MSE} = \operatorname{Bias}^2 + \operatorname{Var}$ |
| Cramer-Rao bound | $\operatorname{Var}(\hat{\theta}) \geq 1/I_n(\theta)$ for unbiased $\hat{\theta}$ |
| Consistency | $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$ |
| Sufficiency | $T$ captures all info about $\theta$ |
| Rao-Blackwell | Conditioning on sufficient statistics reduces variance |
These properties form the theoretical foundation for evaluating any estimation procedure, from simple sample statistics to complex ML models. The bias-variance tradeoff, in particular, is the central lens through which model selection decisions are understood.