5: Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is the most widely used parameter estimation method in statistics and the foundation of most supervised learning loss functions. Given a parametric model and observed data, MLE finds the parameter values that make the observed data most probable.
The Likelihood Function
Given observations $x_1, \ldots, x_n$ drawn independently from a distribution with density $p(x \mid \theta)$, the likelihood function is:

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
The likelihood treats the data as fixed and $\theta$ as variable. It is not a probability distribution over $\theta$; it measures how well each parameter value explains the observed data.
The log-likelihood is more convenient for computation:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$

The product becomes a sum, which is numerically stable and easier to differentiate. The log transform is monotonic, so maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta)$.
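The numerical-stability point can be seen directly: the raw likelihood of even a moderate sample underflows in double precision, while the log-likelihood stays finite. A minimal sketch (assumes NumPy; the standard-normal sample is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # 1,000 draws from a standard normal

# Per-observation densities under the standard normal model.
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

likelihood = np.prod(pdf)             # product of ~1,000 values < 0.4: underflows to 0.0
log_likelihood = np.sum(np.log(pdf))  # sum of logs: a finite negative number

print(likelihood, log_likelihood)
```

Each factor is at most $1/\sqrt{2\pi} \approx 0.399$, so the product drops below the smallest representable double after a few hundred terms; the sum of logs has no such problem.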
Finding the MLE
The maximum likelihood estimator is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \ell(\theta)$$

For many common distributions, the MLE has a closed-form solution obtained by solving the score equation:

$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$
Example: Gaussian Mean
For $x_i \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$:

$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Setting $\partial \ell / \partial \mu = 0$ gives:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$$

The MLE for the Gaussian mean is the sample mean.
Example: Bernoulli Parameter
For $x_i \sim \text{Bernoulli}(p)$:

$$\ell(p) = \sum_{i=1}^{n}\left[x_i \log p + (1 - x_i)\log(1 - p)\right]$$

Setting $\partial \ell / \partial p = 0$ gives:

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The MLE for the Bernoulli parameter is the sample proportion.
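As a sanity check, the closed form can be compared against a brute-force maximization of the Bernoulli log-likelihood over a grid of candidate values (a sketch with simulated data, not how one would estimate in practice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=10_000)  # Bernoulli(0.3) sample
n, s = len(x), x.sum()

# Log-likelihood evaluated on a grid of candidate parameter values.
p_grid = np.linspace(0.001, 0.999, 999)
loglik = s * np.log(p_grid) + (n - s) * np.log(1 - p_grid)

p_hat_grid = p_grid[np.argmax(loglik)]
p_hat_closed = x.mean()  # closed-form MLE: the sample proportion

print(p_hat_grid, p_hat_closed)  # agree to grid resolution
```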
Example: Exponential Rate
For $x_i \sim \text{Exponential}(\lambda)$ with density $p(x \mid \lambda) = \lambda e^{-\lambda x}$:

$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i \quad\Longrightarrow\quad \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}$$

The MLE for the exponential rate is the reciprocal of the sample mean.
Connection to Machine Learning Loss Functions
Most supervised learning loss functions are negative log-likelihoods under specific distributional assumptions:
| Loss Function | Distributional Assumption | Model Output |
|---|---|---|
| MSE | Gaussian: $y \sim \mathcal{N}(f(x), \sigma^2)$ | Conditional mean $\mathbb{E}[y \mid x]$ |
| Binary cross-entropy | Bernoulli: $y \sim \text{Bernoulli}(f(x))$ | Class probability $P(y = 1 \mid x)$ |
| Categorical cross-entropy | Multinomial: $y \sim \text{Multinomial}(f(x))$ | Class probabilities $P(y = k \mid x)$ |
| Tweedie loss | Compound Poisson-gamma | Conditional mean $\mathbb{E}[y \mid x]$ |
| Poisson loss | Poisson: $y \sim \text{Poisson}(f(x))$ | Conditional rate $\mathbb{E}[y \mid x]$ |
Minimizing MSE is equivalent to maximizing the Gaussian log-likelihood. Minimizing cross-entropy is equivalent to maximizing the Bernoulli/multinomial log-likelihood. This connection explains why these loss functions are “natural” for their respective tasks: they are the statistically optimal estimators under the assumed noise model.
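The MSE/Gaussian correspondence is easy to verify numerically: for fixed noise variance, the negative Gaussian log-likelihood is an affine function of the sum of squared errors, so the two objectives share a minimizer. A small sketch fitting a constant mean to simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(5.0, 2.0, size=500)
mu_grid = np.linspace(3.0, 7.0, 401)

# Mean squared error of each candidate mean.
mse = ((y[:, None] - mu_grid) ** 2).mean(axis=0)

# Negative Gaussian log-likelihood with fixed sigma^2 = 4.
sigma2 = 4.0
sse = ((y[:, None] - mu_grid) ** 2).sum(axis=0)
nll = 0.5 * len(y) * np.log(2 * np.pi * sigma2) + sse / (2 * sigma2)

# Same grid point minimizes both objectives.
print(mu_grid[np.argmin(mse)], mu_grid[np.argmin(nll)])
```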
The tracker cost model uses Tweedie loss, which corresponds to MLE under the compound Poisson-gamma distribution. For zero-inflated, right-skewed transfer size data, this distributional assumption is a better match than Gaussian (MSE), producing a 23% improvement on identical features.
Fisher Information
The Fisher information quantifies how much information the data carries about $\theta$:

$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial \log p(x \mid \theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta^2}\right]$$

The two expressions are equal under regularity conditions. Higher Fisher information means the likelihood is more sharply peaked around $\theta$, making estimation easier.

For $n$ i.i.d. observations, $I_n(\theta) = n\,I(\theta)$: information scales linearly with sample size.
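For the Bernoulli model the Fisher information is $I(p) = 1/\bigl(p(1-p)\bigr)$, and the definition as the expected squared score can be checked by Monte Carlo (a sketch with simulated data):

```python
import numpy as np

p = 0.3
rng = np.random.default_rng(3)
x = rng.binomial(1, p, size=1_000_000)

# Score of a single Bernoulli observation: d/dp log p(x | p).
score = x / p - (1 - x) / (1 - p)

fisher_mc = np.mean(score**2)       # Monte Carlo estimate of E[score^2]
fisher_exact = 1.0 / (p * (1 - p))  # analytic Fisher information

print(fisher_mc, fisher_exact)
```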
The Cramér-Rao Lower Bound
For any unbiased estimator $\hat{\theta}$:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{n\,I(\theta)}$$

No unbiased estimator can have lower variance than $1/(n\,I(\theta))$. An estimator that achieves this bound is called efficient.
Asymptotic Properties of the MLE
Under regularity conditions, the MLE has three key properties as $n \to \infty$:

Consistency. $\hat{\theta}_n \xrightarrow{p} \theta_0$ (converges in probability to the true parameter).

Asymptotic normality.

$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\!\left(0,\, I(\theta_0)^{-1}\right)$$

The MLE is approximately Gaussian with variance $1/(n\,I(\theta_0))$.

Asymptotic efficiency. The MLE achieves the Cramér-Rao lower bound asymptotically. No other consistent estimator has lower asymptotic variance.
These properties justify the widespread use of MLE: it is guaranteed to converge to the right answer, provides approximately Gaussian estimates useful for confidence intervals, and extracts the maximum possible information from the data.
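These asymptotics can be checked by simulation. For the exponential model, $I(\lambda) = 1/\lambda^2$, so the MLE $\hat{\lambda} = 1/\bar{x}$ should have standard deviation close to $\lambda/\sqrt{n}$. A sketch (sample size and repetition count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 2.0, 500, 2000

# Draw `reps` independent samples of size n and compute the MLE for each.
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)

# Asymptotic-normality prediction: sd(lam_hat) ~ lam / sqrt(n).
predicted_sd = lam / np.sqrt(n)
print(lam_hat.mean(), lam_hat.std(), predicted_sd)
```

The empirical spread of the estimates matches the Cramér-Rao prediction, and their average sits close to the true rate.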
The Invariance Property
If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$. This means we can freely reparameterize: if $\hat{\lambda}$ is the MLE of an exponential rate, then $1/\hat{\lambda}$ is the MLE of the mean.
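A quick numeric illustration with the exponential model: maximizing the likelihood directly in the mean parameterization $\theta = 1/\lambda$, where $\ell(\theta) = -n\log\theta - \sum_i x_i/\theta$, recovers the transform of the rate MLE (a sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=0.5, size=100_000)  # rate lambda = 2, mean 0.5

lam_hat = 1.0 / x.mean()  # MLE of the rate

# Maximize the likelihood directly in the mean parameterization theta = 1/lambda.
theta_grid = np.linspace(0.3, 0.8, 5001)
loglik = -len(x) * np.log(theta_grid) - x.sum() / theta_grid
theta_hat = theta_grid[np.argmax(loglik)]

print(theta_hat, 1.0 / lam_hat)  # agree: the MLE transforms with the parameter
```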
Multivariate MLE
For a parameter vector $\theta \in \mathbb{R}^d$, the score equation becomes a system:

$$\nabla_\theta \ell(\theta) = 0$$

The Fisher information becomes a matrix:

$$I(\theta)_{jk} = -\mathbb{E}\left[\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_j \, \partial \theta_k}\right]$$

The Cramér-Rao bound becomes $\text{Cov}(\hat{\theta}) \succeq I(\theta)^{-1}$ (a matrix inequality: the difference is positive semidefinite).
When no closed-form solution exists, the MLE is found via iterative optimization: gradient descent, Newton-Raphson, or Fisher scoring. Logistic regression is a canonical example: the MLE of the weight vector requires iterative optimization because the sigmoid nonlinearity prevents a closed-form solution.
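A minimal sketch of Newton-Raphson for the logistic regression MLE, using hypothetical synthetic data (for the logistic model the observed and expected information coincide, so this is also Fisher scoring):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, iters=25):
    """Newton-Raphson iterations for the logistic regression MLE."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p)        # score (gradient of the log-likelihood)
        W = p * (1 - p)             # per-observation Bernoulli variances
        H = X.T @ (X * W[:, None])  # Fisher information matrix
        w = w + np.linalg.solve(H, grad)
    return w

# Hypothetical data: intercept plus one standard-normal feature.
rng = np.random.default_rng(7)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w_true = np.array([-0.5, 1.5])
y = rng.binomial(1, sigmoid(X @ w_true))

w_hat = logistic_mle(X, y)
print(w_hat)  # close to w_true
```

Newton's method converges in a handful of iterations here because the logistic log-likelihood is concave in $w$.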
Limitations
Overfitting. MLE maximizes the probability of the training data, which can overfit with limited data. Regularization (L1/L2 penalties) corresponds to MAP (Maximum A Posteriori) estimation with Laplace/Gaussian priors.
Model misspecification. If the true distribution is not in the parametric family, MLE converges to the parameter value that minimizes KL divergence from the true distribution to the model. This is the best approximation within the model class, but may be misleading.
Finite-sample bias. The MLE can be biased in finite samples. For example, the MLE of $\sigma^2$ in a Gaussian is $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, which underestimates the true variance: its expectation is $\frac{n-1}{n}\sigma^2$ (Bessel's correction, dividing by $n-1$ instead of $n$, gives the unbiased version). The bias vanishes asymptotically.
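The variance bias is visible in a quick simulation: with $n = 5$, the MLE averages close to $\frac{n-1}{n}\sigma^2 = 0.8\,\sigma^2$ rather than $\sigma^2$ (a sketch; the sample size and variance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps, sigma2 = 5, 200_000, 4.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1, keepdims=True)

var_mle = ((x - xbar) ** 2).mean(axis=1)                # divides by n: biased
var_unbiased = ((x - xbar) ** 2).sum(axis=1) / (n - 1)  # Bessel's correction

print(var_mle.mean())       # about (n-1)/n * sigma2 = 3.2
print(var_unbiased.mean())  # about sigma2 = 4.0
```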
Summary
| Property | Statement |
|---|---|
| Definition | $\hat{\theta}_{\text{MLE}} = \arg\max_\theta \ell(\theta)$ |
| Consistency | Converges to the true parameter |
| Asymptotic normality | $\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$ |
| Efficiency | Achieves the Cramér-Rao bound asymptotically |
| Invariance | $g(\hat{\theta})$ is the MLE of $g(\theta)$ |
| Connection to ML | Most loss functions are negative log-likelihoods |
MLE is the bridge between statistical theory and machine learning practice. Understanding that MSE is Gaussian MLE and cross-entropy is Bernoulli MLE clarifies why these loss functions work and when to choose alternatives.