4: Method of Moments

The method of moments (MoM) is the oldest systematic approach to parameter estimation. The idea is simple: equate population moments to sample moments and solve for the parameters. It produces consistent estimators with minimal computational effort, though it generally sacrifices efficiency compared to maximum likelihood. For models where the MLE is difficult to compute, MoM estimators serve as useful starting points or standalone alternatives.


Population and Sample Moments

The k-th population moment of a random variable X is:

\mu_k = E[X^k]

The k-th central moment is E[(X - \mu_1)^k]. The first moment is the mean, and the second central moment is the variance.

The corresponding k-th sample moment from observations X_1, \ldots, X_n is:

\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^n X_i^k

By the law of large numbers, \hat{\mu}_k \xrightarrow{p} \mu_k for each k for which the moment exists. This convergence is the foundation of the method.
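This convergence is easy to check numerically. A minimal sketch (assuming NumPy; the Exponential distribution with mean 2, for which \mu_1 = 2 and \mu_2 = 8, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # E[X] = 2, E[X^2] = 2 * 2^2 = 8

def sample_moment(x, k):
    """k-th raw sample moment: (1/n) * sum of x_i^k."""
    return np.mean(x ** k)

m1 = sample_moment(x, 1)  # close to mu_1 = 2
m2 = sample_moment(x, 2)  # close to mu_2 = 8
```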


The Method

Suppose the distribution of X depends on d parameters \theta = (\theta_1, \ldots, \theta_d). The first d population moments are functions of these parameters:

\mu_k(\theta) = E_\theta[X^k], \quad k = 1, \ldots, d

The method of moments estimator \hat{\theta}_{\text{MoM}} solves the system of equations:

\hat{\mu}_k = \mu_k(\theta), \quad k = 1, \ldots, d

That is, we set each population moment equal to its sample counterpart and solve for \theta.

Algorithm:

  1. Express the first d population moments as functions of \theta
  2. Replace each population moment with its sample moment
  3. Solve the resulting system of d equations in d unknowns

Examples

Normal distribution. For X \sim \mathcal{N}(\mu, \sigma^2), we have two parameters and need two moments:

\mu_1 = \mu, \quad \mu_2 = E[X^2] = \sigma^2 + \mu^2

Setting \hat{\mu}_1 = \mu and \hat{\mu}_2 = \sigma^2 + \mu^2:

\hat{\mu}_{\text{MoM}} = \bar{X}, \quad \hat{\sigma}^2_{\text{MoM}} = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2

Note this gives the biased variance estimator (dividing by n, not n-1). The MoM and MLE coincide here.
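The two-step recipe for the normal case can be sketched as follows (assuming NumPy; the true values \mu = 3, \sigma = 2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=50_000)

mu_hat = np.mean(x)                     # solves  mu_1_hat = mu
sigma2_hat = np.mean(x**2) - mu_hat**2  # solves  mu_2_hat = sigma^2 + mu^2
```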

Gamma distribution. For X \sim \text{Gamma}(\alpha, \beta) with shape \alpha and rate \beta, so that E[X] = \alpha/\beta and E[X^2] = \alpha(\alpha+1)/\beta^2:

\bar{X} = \frac{\alpha}{\beta}, \quad \frac{1}{n}\sum X_i^2 = \frac{\alpha(\alpha+1)}{\beta^2}

Solving: \hat{\beta} = \bar{X}/S^2_n and \hat{\alpha} = \bar{X}\hat{\beta} = \bar{X}^2/S^2_n, where S^2_n = \hat{\mu}_2 - \hat{\mu}_1^2 is the (biased) sample variance.

The Gamma MLE has no closed form and requires numerical optimization, so MoM provides a convenient analytical alternative.
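A sketch of the closed-form Gamma estimator under the shape/rate parametrization used above (assuming NumPy, whose gamma sampler takes a scale parameter equal to 1/\beta; the true values \alpha = 3, \beta = 2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 3.0, 2.0  # shape, rate
x = rng.gamma(shape=alpha, scale=1.0 / beta, size=100_000)

xbar = np.mean(x)
s2 = np.mean(x**2) - xbar**2   # biased sample variance S_n^2
beta_hat = xbar / s2           # beta_hat = xbar / S_n^2
alpha_hat = xbar**2 / s2       # alpha_hat = xbar^2 / S_n^2
```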

Uniform distribution. For X \sim \text{Uniform}(a, b), with E[X] = (a+b)/2 and \text{Var}(X) = (b-a)^2/12:

\hat{a} = \bar{X} - \sqrt{3 S_n^2}, \quad \hat{b} = \bar{X} + \sqrt{3 S_n^2}

A known deficiency: \hat{a} can exceed \min(X_i) and \hat{b} can fall below \max(X_i), producing estimates inconsistent with the observed data. The MLE (\hat{a} = X_{(1)}, \hat{b} = X_{(n)}) avoids this problem.
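The deficiency shows up most readily on small samples, where the moment estimates are noisy. A sketch comparing the two interval estimates (assuming NumPy; the sample size of 20 is an arbitrary small choice):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=20)

xbar = np.mean(x)
s2 = np.mean(x**2) - xbar**2
a_hat = xbar - np.sqrt(3 * s2)  # MoM endpoints; may fail to cover the data
b_hat = xbar + np.sqrt(3 * s2)

a_mle, b_mle = x.min(), x.max()  # MLE endpoints; always cover the data
```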

Beta distribution. For X \sim \text{Beta}(\alpha, \beta):

\hat{\alpha} = \bar{X}\left(\frac{\bar{X}(1-\bar{X})}{S_n^2} - 1\right), \quad \hat{\beta} = (1-\bar{X})\left(\frac{\bar{X}(1-\bar{X})}{S_n^2} - 1\right)

Again, the MLE requires iterative methods, while MoM gives a closed-form initializer.
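The Beta formulas translate directly into code (a sketch assuming NumPy; the true values \alpha = 2, \beta = 5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.beta(2.0, 5.0, size=100_000)

xbar = np.mean(x)
s2 = np.mean(x**2) - xbar**2
t = xbar * (1 - xbar) / s2 - 1  # common factor in both formulas
alpha_hat = xbar * t
beta_hat = (1 - xbar) * t
```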


Properties

Consistency. MoM estimators are consistent under mild conditions. Since \hat{\mu}_k \xrightarrow{p} \mu_k by the LLN, and the mapping from moments to parameters is continuous, the continuous mapping theorem gives \hat{\theta}_{\text{MoM}} \xrightarrow{p} \theta.

Asymptotic normality. By the CLT and the delta method, MoM estimators are asymptotically normal:

\sqrt{n}(\hat{\theta}_{\text{MoM}} - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\text{MoM}})

where \Sigma_{\text{MoM}} depends on the moments of X up to order 2d and the Jacobian of the moment-to-parameter mapping.

Not generally efficient. The asymptotic variance \Sigma_{\text{MoM}} is typically larger than the Cramér-Rao lower bound. MoM estimators use only the first d moments, discarding information present in the full likelihood. The efficiency loss can be substantial.
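As an illustration of the delta method calculation, take X \sim \text{Exponential}(\lambda) with d = 1: the MoM estimator is \hat{\lambda} = 1/\bar{X}, and with g(m) = 1/m, g'(1/\lambda) = -\lambda^2, and \text{Var}(X) = 1/\lambda^2, the asymptotic variance is \lambda^4 \cdot \lambda^{-2} = \lambda^2. A Monte Carlo sketch (assuming NumPy; \lambda = 1 and the sample/replication sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, n, reps = 1.0, 2_000, 2_000

# reps independent samples of size n; MoM estimator is lambda_hat = 1 / xbar
xbar = rng.exponential(scale=1.0 / lam, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (1.0 / xbar - lam)  # sqrt(n) * (lambda_hat - lambda)

emp_var = z.var()  # delta method predicts lambda^2 = 1 here
```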


MoM vs. MLE

| | Method of Moments | Maximum Likelihood |
| --- | --- | --- |
| Computation | Solve moment equations (often closed-form) | Optimize likelihood (often iterative) |
| Efficiency | Generally inefficient | Asymptotically efficient |
| Robustness | Less sensitive to model misspecification | Can be sensitive to distributional assumptions |
| Existence | Always exists if moments exist | May not exist or may not be unique |
| Invariance | Not invariant to reparametrization | Invariant: MLE of g(\theta) is g(\hat{\theta}_{\text{MLE}}) |

For exponential family distributions whose sufficient statistics are powers of X (such as the normal), MoM and MLE produce the same estimator, since the MLE equates the expected sufficient statistics to their sample averages. When the sufficient statistics involve other functions of X (e.g., \log X for the Gamma), or outside the exponential family entirely, the two methods diverge.

In practice, MoM is most useful when:

  • The MLE has no closed form (Gamma, Beta, mixture models)
  • A quick initial estimate is needed for iterative MLE algorithms
  • Robustness to misspecification is more important than efficiency

Generalized Method of Moments

The generalized method of moments (GMM) extends MoM to settings with more moment conditions than parameters. Suppose we have m > d moment conditions:

E[g_j(X; \theta)] = 0, \quad j = 1, \ldots, m

With more equations than unknowns, exact solutions generally do not exist. GMM minimizes a quadratic form:

\hat{\theta}_{\text{GMM}} = \arg\min_\theta \left(\frac{1}{n}\sum_{i=1}^n g(X_i; \theta)\right)^T W \left(\frac{1}{n}\sum_{i=1}^n g(X_i; \theta)\right)

where W is a positive definite weighting matrix. The choice W = \hat{\Sigma}_g^{-1} (the inverse of the estimated covariance of the moment conditions) yields the efficient GMM estimator, which achieves the smallest asymptotic variance among all GMM estimators.
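As an illustration, consider an overidentified GMM estimate of an Exponential rate \lambda from the two conditions E[X] - 1/\lambda = 0 and E[X^2] - 2/\lambda^2 = 0 (m = 2 > d = 1). A sketch with identity weighting and a grid search standing in for a numerical optimizer (assuming NumPy; the true rate \lambda = 2 and the grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
lam_true = 2.0
x = rng.exponential(scale=1.0 / lam_true, size=50_000)
m1, m2 = np.mean(x), np.mean(x**2)  # sample averages used by both conditions

def gbar(lam):
    """Averaged moment conditions: E[X] - 1/lam and E[X^2] - 2/lam^2."""
    return np.array([m1 - 1.0 / lam, m2 - 2.0 / lam**2])

def Q(lam, W):
    """GMM objective: the quadratic form gbar' W gbar."""
    g = gbar(lam)
    return g @ W @ g

W = np.eye(2)  # identity weighting (simple, not the efficient choice)
grid = np.linspace(0.5, 5.0, 2_000)
lam_hat = grid[np.argmin([Q(lam, W) for lam in grid])]
```

With m = d the objective can be driven to zero and GMM reduces to ordinary MoM; here the two conditions cannot both hold exactly, and W arbitrates the compromise.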

GMM is foundational in econometrics, where moment conditions arise naturally from economic theory (e.g., Euler equations, instrumental variables). The instrumental variables (IV) estimator is a special case of GMM.

Connection to ML. The idea of matching empirical expectations to model expectations appears throughout machine learning. Contrastive divergence training in restricted Boltzmann machines matches the data’s expected sufficient statistics to the model’s. Moment matching is also central to generative adversarial networks (the discriminator implicitly enforces moment conditions) and to kernel methods through maximum mean discrepancy (MMD), which compares all moments simultaneously in a reproducing kernel Hilbert space.


Summary

| Concept | Key Result |
| --- | --- |
| MoM estimator | Solve \hat{\mu}_k = \mu_k(\theta) for k = 1, \ldots, d |
| Consistency | Follows from LLN + continuous mapping theorem |
| Efficiency | Generally less efficient than MLE |
| Best use case | Closed-form estimates when MLE is intractable |
| GMM | Handles overidentified models with m > d moment conditions |