5: Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is the most widely used parameter estimation method in statistics and the foundation of most supervised learning loss functions. Given a parametric model and observed data, MLE finds the parameter values that make the observed data most probable.


The Likelihood Function

Given observations $x_1, \ldots, x_n$ drawn independently from a distribution with density $f(x; \theta)$, the likelihood function is:

$$L(\theta) = \prod_{i=1}^n f(x_i; \theta)$$

The likelihood treats the data as fixed and $\theta$ as variable. It is not a probability distribution over $\theta$; it measures how well each parameter value explains the observed data.

The log-likelihood is more convenient for computation:

$$\ell(\theta) = \sum_{i=1}^n \log f(x_i; \theta)$$

The product becomes a sum, which is numerically stable and easier to differentiate. The log transform is monotonic, so maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta)$.
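To see the numerical point concretely, here is a small NumPy sketch (synthetic data, illustrative only): the raw likelihood underflows to zero for even a moderate sample, while the log-likelihood stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)  # hypothetical sample

# Gaussian density values at the true parameters
dens = np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi)

print(np.prod(dens))         # underflows to 0.0: each factor is < 1
print(np.sum(np.log(dens)))  # finite log-likelihood (roughly -1400 here)
```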


Finding the MLE

The maximum likelihood estimator is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \ell(\theta)$$

For many common distributions, the MLE has a closed-form solution obtained by solving the score equation:

$$\frac{\partial \ell}{\partial \theta} = 0$$

Example: Gaussian Mean

For $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$:

$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \implies \hat{\mu} = \bar{x}$$

The MLE for the Gaussian mean is the sample mean.
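As a sanity check, direct numerical maximization recovers the same answer. A minimal sketch with synthetic data (assuming NumPy and SciPy; all values are made up):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=500)  # synthetic data, sigma^2 = 4 known

def neg_log_lik(mu, sigma2=4.0):
    # negative Gaussian log-likelihood in mu (constants kept for clarity)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

result = minimize_scalar(neg_log_lik)
print(result.x, x.mean())  # numerical MLE matches the sample mean
```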

Example: Bernoulli Parameter

For $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$:

$$\ell(p) = \sum_{i=1}^n \left[ x_i \log p + (1 - x_i) \log(1 - p) \right]$$

$$\frac{\partial \ell}{\partial p} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0 \implies \hat{p} = \bar{x}$$

The MLE for the Bernoulli parameter is the sample proportion.
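Note that this log-likelihood is, up to sign and an averaging factor, exactly binary cross-entropy, a connection developed below. A quick check against scikit-learn's `log_loss` (toy data, illustrative only):

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0, 1])   # hypothetical binary outcomes
p = 0.7                            # candidate Bernoulli parameter

# mean negative log-likelihood, written out from the formula above
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# identical to binary cross-entropy as computed by scikit-learn
bce = log_loss(y, np.full_like(y, p, dtype=float))
print(nll, bce)  # same value
```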

Example: Exponential Rate

For $X_1, \ldots, X_n \sim \text{Exp}(\lambda)$ with density $f(x; \lambda) = \lambda e^{-\lambda x}$:

$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i$$

$$\frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum x_i = 0 \implies \hat{\lambda} = \frac{1}{\bar{x}}$$
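The MLE for the exponential rate is the reciprocal of the sample mean. SciPy's built-in fitter reproduces this closed form; SciPy parameterizes the exponential by its scale $1/\lambda$, so the location must be pinned at zero (synthetic data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 0.5, size=2000)  # true rate lambda = 0.5

loc, scale = stats.expon.fit(x, floc=0)  # scale is the MLE of the mean 1/lambda
print(1 / scale, 1 / x.mean())           # both equal the rate MLE lambda-hat
```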

Connection to Machine Learning Loss Functions

Most supervised learning loss functions are negative log-likelihoods under specific distributional assumptions:

| Loss Function | Distributional Assumption | Model Output |
|---|---|---|
| MSE $(y - \hat{y})^2$ | $Y \sim \mathcal{N}(\hat{y}, \sigma^2)$ | Conditional mean |
| Binary cross-entropy | $Y \sim \text{Bernoulli}(\hat{p})$ | Class probability |
| Categorical cross-entropy | $Y \sim \text{Categorical}(\hat{\mathbf{p}})$ | Class probabilities |
| Tweedie loss | $Y \sim \text{Tweedie}(\hat{\mu}, p)$ | Conditional mean |
| Poisson loss | $Y \sim \text{Poisson}(\hat{\lambda})$ | Conditional rate |

Minimizing MSE is equivalent to maximizing the Gaussian log-likelihood. Minimizing cross-entropy is equivalent to maximizing the Bernoulli/multinomial log-likelihood. This connection explains why these loss functions are “natural” for their respective tasks: they are the statistically optimal estimators under the assumed noise model.
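A small grid search makes the MSE/Gaussian equivalence tangible: the negative log-likelihood differs from MSE only by terms constant in $\hat{y}$, so both are minimized at the same point (toy numbers):

```python
import numpy as np

y = np.array([1.2, 0.7, 2.4, 1.9])          # toy targets
grid = np.linspace(0.0, 3.0, 301)           # candidate constant predictions

mse = [np.mean((y - yhat) ** 2) for yhat in grid]
# Gaussian negative log-likelihood with sigma^2 = 1 (constants kept)
nll = [0.5 * len(y) * np.log(2 * np.pi) + 0.5 * np.sum((y - yhat) ** 2) for yhat in grid]

print(grid[np.argmin(mse)], grid[np.argmin(nll)], y.mean())  # same minimizer
```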

The tracker cost model uses Tweedie loss, which corresponds to MLE under the compound Poisson-gamma distribution. For zero-inflated, right-skewed transfer size data, this distributional assumption is a better match than Gaussian (MSE), producing a 23% improvement on identical features.
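For illustration only (this is not the tracker's actual code), here is a minimal sketch of fitting a Tweedie GLM on synthetic zero-inflated data with scikit-learn's `TweedieRegressor`; a power between 1 and 2 selects the compound Poisson-gamma family:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
# synthetic zero-inflated, right-skewed target: many exact zeros, gamma-like tail
mean = np.exp(0.5 * X[:, 0])
y = np.where(rng.random(1000) < 0.6, 0.0, rng.gamma(shape=2.0, scale=mean / 2.0))

# power in (1, 2) corresponds to the compound Poisson-gamma distribution
model = TweedieRegressor(power=1.5, alpha=0.0, link="log", max_iter=1000)
model.fit(X, y)
print(model.coef_)
```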


Fisher Information

The Fisher information quantifies how much information the data carries about $\theta$:

$$I(\theta) = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial \theta^2}\right] = \mathbb{E}\left[\left(\frac{\partial \ell}{\partial \theta}\right)^2\right]$$

The two expressions are equal under regularity conditions. Higher Fisher information means the likelihood is more sharply peaked around $\theta$, making estimation easier.

For $n$ i.i.d. observations, $I_n(\theta) = n \cdot I_1(\theta)$: information scales linearly with sample size.
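A Monte Carlo sketch verifies the information identity for the Bernoulli, where $I_1(p) = 1/(p(1-p))$ analytically (NumPy, synthetic draws):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.3
x = rng.binomial(1, p, size=200000).astype(float)

score = x / p - (1 - x) / (1 - p)           # d ell / dp for one observation
hess = -x / p**2 - (1 - x) / (1 - p) ** 2   # d^2 ell / dp^2

print(np.mean(score**2))   # ~ 1 / (p (1 - p)) = 4.76
print(-np.mean(hess))      # same value, by the information identity
```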

The Cramér-Rao Lower Bound

For any unbiased estimator $\hat{\theta}$:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I_n(\theta)}$$

No unbiased estimator can have lower variance than $1/I_n(\theta)$. An estimator that achieves this bound is called efficient.
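For the Gaussian mean with known $\sigma^2$, $I_1(\mu) = 1/\sigma^2$, so the bound is $\sigma^2/n$, and the sample mean attains it exactly. A quick simulation check (made-up parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, n, reps = 4.0, 50, 100000

means = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n)).mean(axis=1)
print(means.var())   # ~ sigma^2 / n = 0.08
print(sigma2 / n)    # Cramer-Rao lower bound 1 / (n I_1(mu))
```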


Asymptotic Properties of the MLE

Under regularity conditions, the MLE has three key properties as $n \to \infty$:

Consistency. $\hat{\theta}_{\text{MLE}} \xrightarrow{p} \theta_0$ (converges in probability to the true parameter).

Asymptotic normality.

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I_1(\theta_0)^{-1})$$

The MLE is approximately Gaussian with variance $1/(n I_1(\theta_0))$.

Asymptotic efficiency. The MLE achieves the Cramér-Rao lower bound asymptotically. No other consistent estimator has lower asymptotic variance.
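Both normality and efficiency are easy to see by simulation. For the exponential, $I_1(\lambda) = 1/\lambda^2$, so $\sqrt{n}(\hat{\lambda} - \lambda_0)$ should be approximately $\mathcal{N}(0, \lambda_0^2)$ (synthetic sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
lam, n, reps = 2.0, 500, 20000

x = rng.exponential(scale=1 / lam, size=(reps, n))
lam_hat = 1 / x.mean(axis=1)        # MLE per replication

z = np.sqrt(n) * (lam_hat - lam)
print(z.mean(), z.var())  # approximately 0 and lambda^2 = 4 = I_1(lambda)^{-1}
```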

These properties justify the widespread use of MLE: under a correctly specified model it converges to the true parameter, provides approximately Gaussian estimates useful for confidence intervals, and asymptotically extracts the maximum possible information from the data.


The Invariance Property

If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$. This means we can freely reparameterize: if $\hat{\lambda}$ is the MLE of an exponential rate, then $1/\hat{\lambda}$ is the MLE of the mean.
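Invariance is handy in practice. For example, the MLE of the exponential survival probability $g(\lambda) = P(X > t) = e^{-\lambda t}$ is simply $e^{-\hat{\lambda} t}$; a toy check against the empirical survival fraction:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1 / 1.5, size=5000)  # true rate 1.5

lam_hat = 1 / x.mean()           # MLE of the rate
t = 1.0
surv_mle = np.exp(-lam_hat * t)  # MLE of g(lambda) = P(X > t), by invariance
print(surv_mle, np.mean(x > t))  # close to the empirical survival fraction
```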


Multivariate MLE

For a parameter vector $\boldsymbol{\theta} \in \mathbb{R}^p$, the score equation becomes a system:

$$\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) = \mathbf{0}$$

The Fisher information becomes a $p \times p$ matrix:

$$\mathbf{I}(\boldsymbol{\theta}) = -\mathbb{E}\left[\nabla^2_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})\right]$$

The Cramér-Rao bound becomes $\text{Cov}(\hat{\boldsymbol{\theta}}) \succeq \mathbf{I}(\boldsymbol{\theta})^{-1}$, a matrix inequality meaning the difference is positive semidefinite.

When no closed-form solution exists, the MLE is found via iterative optimization: gradient descent, Newton-Raphson, or Fisher scoring. Logistic regression is a canonical example: the MLE of the weight vector requires iterative optimization because the sigmoid nonlinearity prevents a closed-form solution.
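A minimal Newton-Raphson sketch for the logistic-regression MLE on synthetic data (with the canonical logit link, Fisher scoring produces the same updates, since the observed and expected information coincide):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 500, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # include intercept
beta_true = np.array([-0.5, 1.0, -2.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(p + 1)
for _ in range(25):                      # Newton-Raphson iterations
    mu = 1 / (1 + np.exp(-X @ beta))     # predicted probabilities
    grad = X.T @ (y - mu)                # score: gradient of log-likelihood
    W = mu * (1 - mu)                    # Bernoulli variances
    hess = -(X * W[:, None]).T @ X       # Hessian of log-likelihood
    step = np.linalg.solve(hess, -grad)  # Newton step
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)  # close to beta_true at this sample size
```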


Limitations

Overfitting. MLE maximizes the probability of the training data, which can overfit with limited data. Regularization (L1/L2 penalties) corresponds to MAP (Maximum A Posteriori) estimation with Laplace/Gaussian priors.

Model misspecification. If the true distribution is not in the parametric family, MLE converges to the parameter value that minimizes KL divergence from the true distribution to the model. This is the best approximation within the model class, but may be misleading.

Finite-sample bias. The MLE can be biased in finite samples. For example, the MLE of $\sigma^2$ in a Gaussian is $\frac{1}{n}\sum(x_i - \bar{x})^2$, which underestimates the true variance (Bessel's correction gives the unbiased $\frac{1}{n-1}$ version). The bias vanishes asymptotically.
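The bias is easy to demonstrate: NumPy's `np.var` uses the MLE denominator $n$ by default, while `ddof=1` applies Bessel's correction (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(9)
true_var, n, reps = 4.0, 5, 200000

samples = rng.normal(scale=2.0, size=(reps, n))
print(np.var(samples, axis=1).mean())          # ~ (n-1)/n * 4 = 3.2, biased low
print(np.var(samples, axis=1, ddof=1).mean())  # ~ 4.0, unbiased
```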


Summary

| Property | Statement |
|---|---|
| Definition | $\hat{\theta} = \arg\max \sum \log f(x_i; \theta)$ |
| Consistency | Converges to the true parameter |
| Asymptotic normality | $\sqrt{n}(\hat{\theta} - \theta_0) \to \mathcal{N}(0, I^{-1})$ |
| Efficiency | Achieves the Cramér-Rao bound asymptotically |
| Invariance | $g(\hat{\theta})$ is the MLE of $g(\theta)$ |
| Connection to ML | Most loss functions are negative log-likelihoods |

MLE is the bridge between statistical theory and machine learning practice. Understanding that MSE is Gaussian MLE and cross-entropy is Bernoulli MLE clarifies why these loss functions work and when to choose alternatives.