4: Method of Moments

The method of moments (MoM) is the oldest systematic approach to parameter estimation. The idea is simple: equate population moments to sample moments and solve for the parameters. It produces consistent estimators with minimal computational effort, though it generally sacrifices efficiency compared to maximum likelihood. For models where the MLE is difficult to compute, MoM estimators serve as useful starting points or standalone alternatives.


Population and Sample Moments

The k-th population moment of a random variable X is:

\mu_k = E[X^k]

The k-th central moment is E[(X - \mu_1)^k]. The first moment is the mean, and the second central moment is the variance.

The corresponding k-th sample moment from observations X_1, \ldots, X_n is:

\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^n X_i^k

By the law of large numbers, \hat{\mu}_k \xrightarrow{p} \mu_k for each k for which the moment exists. This convergence is the foundation of the method.
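This convergence is easy to check numerically. A minimal sketch (assuming NumPy; the Exponential distribution with mean 2, for which \mu_1 = 2 and \mu_2 = 8, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # E[X] = 2, E[X^2] = 2 * 2^2 = 8

def sample_moment(x, k):
    """k-th raw sample moment: (1/n) * sum of x_i^k."""
    return np.mean(x ** k)

m1 = sample_moment(x, 1)  # close to mu_1 = 2
m2 = sample_moment(x, 2)  # close to mu_2 = 8
```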


The Method

Suppose the distribution of X depends on d parameters \theta = (\theta_1, \ldots, \theta_d). The first d population moments are functions of these parameters:

\mu_k(\theta) = E_\theta[X^k], \quad k = 1, \ldots, d

The method of moments estimator \hat{\theta}_{\text{MoM}} solves the system of equations:

\hat{\mu}_k = \mu_k(\theta), \quad k = 1, \ldots, d

That is, we set each population moment equal to its sample counterpart and solve for \theta.

Algorithm:

  1. Express the first d population moments as functions of \theta
  2. Replace each population moment with its sample moment
  3. Solve the resulting system of d equations in d unknowns

Examples

Normal distribution. For X \sim \mathcal{N}(\mu, \sigma^2), we have two parameters and need two moments:

\mu_1 = \mu, \quad \mu_2 = E[X^2] = \sigma^2 + \mu^2

Setting \hat{\mu}_1 = \mu and \hat{\mu}_2 = \sigma^2 + \mu^2:

\hat{\mu}_{\text{MoM}} = \bar{X}, \quad \hat{\sigma}^2_{\text{MoM}} = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2

Note this gives the biased variance estimator (dividing by n, not n-1). The MoM and MLE coincide here.
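The two-step recipe for the normal case can be sketched as follows (assuming NumPy; the true values \mu = 3, \sigma = 2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=50_000)

mu_hat = np.mean(x)                     # solves  mu_1_hat = mu
sigma2_hat = np.mean(x**2) - mu_hat**2  # solves  mu_2_hat = sigma^2 + mu^2
```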

Gamma distribution. For X \sim \text{Gamma}(\alpha, \beta) with shape \alpha and rate \beta, so that E[X] = \alpha/\beta and E[X^2] = \alpha(\alpha+1)/\beta^2:

\bar{X} = \frac{\alpha}{\beta}, \quad \frac{1}{n}\sum X_i^2 = \frac{\alpha(\alpha+1)}{\beta^2}

Solving: \hat{\beta} = \bar{X}/S^2_n and \hat{\alpha} = \bar{X}\hat{\beta} = \bar{X}^2/S^2_n, where S^2_n = \hat{\mu}_2 - \hat{\mu}_1^2 is the (biased) sample variance.

The Gamma MLE has no closed form and requires numerical optimization, so MoM provides a convenient analytical alternative.
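A sketch of the closed-form Gamma estimator under the shape/rate parametrization used above (assuming NumPy, whose gamma sampler takes a scale parameter equal to 1/\beta; the true values \alpha = 3, \beta = 2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 3.0, 2.0  # shape, rate
x = rng.gamma(shape=alpha, scale=1.0 / beta, size=100_000)

xbar = np.mean(x)
s2 = np.mean(x**2) - xbar**2   # biased sample variance S_n^2
beta_hat = xbar / s2           # beta_hat = xbar / S_n^2
alpha_hat = xbar**2 / s2       # alpha_hat = xbar^2 / S_n^2
```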

Uniform distribution. For X \sim \text{Uniform}(a, b), with E[X] = (a+b)/2 and \text{Var}(X) = (b-a)^2/12:

\hat{a} = \bar{X} - \sqrt{3 S_n^2}, \quad \hat{b} = \bar{X} + \sqrt{3 S_n^2}

A known deficiency: \hat{a} can exceed \min(X_i) and \hat{b} can fall below \max(X_i), producing estimates inconsistent with the observed data. The MLE (\hat{a} = X_{(1)}, \hat{b} = X_{(n)}) avoids this problem.
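The deficiency shows up most readily on small samples, where the moment estimates are noisy. A sketch comparing the two interval estimates (assuming NumPy; the sample size of 20 is an arbitrary small choice):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=20)

xbar = np.mean(x)
s2 = np.mean(x**2) - xbar**2
a_hat = xbar - np.sqrt(3 * s2)  # MoM endpoints; may fail to cover the data
b_hat = xbar + np.sqrt(3 * s2)

a_mle, b_mle = x.min(), x.max()  # MLE endpoints; always cover the data
```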

Beta distribution. For X \sim \text{Beta}(\alpha, \beta):

\hat{\alpha} = \bar{X}\left(\frac{\bar{X}(1-\bar{X})}{S_n^2} - 1\right), \quad \hat{\beta} = (1-\bar{X})\left(\frac{\bar{X}(1-\bar{X})}{S_n^2} - 1\right)

Again, the MLE requires iterative methods, while MoM gives a closed-form initializer.
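The Beta formulas translate directly into code (a sketch assuming NumPy; the true values \alpha = 2, \beta = 5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.beta(2.0, 5.0, size=100_000)

xbar = np.mean(x)
s2 = np.mean(x**2) - xbar**2
t = xbar * (1 - xbar) / s2 - 1  # common factor in both formulas
alpha_hat = xbar * t
beta_hat = (1 - xbar) * t
```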


Properties

Consistency. MoM estimators are consistent under mild conditions. Since \hat{\mu}_k \xrightarrow{p} \mu_k by the LLN, and the mapping from moments to parameters is continuous, the continuous mapping theorem gives \hat{\theta}_{\text{MoM}} \xrightarrow{p} \theta.

Asymptotic normality. By the CLT and the delta method, MoM estimators are asymptotically normal:

\sqrt{n}(\hat{\theta}_{\text{MoM}} - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\text{MoM}})

where \Sigma_{\text{MoM}} depends on the moments of X up to order 2d and the Jacobian of the moment-to-parameter mapping.

Not generally efficient. The asymptotic variance \Sigma_{\text{MoM}} is typically larger than the Cramér-Rao lower bound. MoM estimators use only the first d moments, discarding information present in the full likelihood. The efficiency loss can be substantial.
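As an illustration of the delta method calculation, take X \sim \text{Exponential}(\lambda) with d = 1: the MoM estimator is \hat{\lambda} = 1/\bar{X}, and with g(m) = 1/m, g'(1/\lambda) = -\lambda^2, and \text{Var}(X) = 1/\lambda^2, the asymptotic variance is \lambda^4 \cdot \lambda^{-2} = \lambda^2. A Monte Carlo sketch (assuming NumPy; \lambda = 1 and the sample/replication sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, n, reps = 1.0, 2_000, 2_000

# reps independent samples of size n; MoM estimator is lambda_hat = 1 / xbar
xbar = rng.exponential(scale=1.0 / lam, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (1.0 / xbar - lam)  # sqrt(n) * (lambda_hat - lambda)

emp_var = z.var()  # delta method predicts lambda^2 = 1 here
```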


MoM vs. MLE

| | Method of Moments | Maximum Likelihood |
| --- | --- | --- |
| Computation | Solve moment equations (often closed-form) | Optimize likelihood (often iterative) |
| Efficiency | Generally inefficient | Asymptotically efficient |
| Robustness | Less sensitive to model misspecification | Can be sensitive to distributional assumptions |
| Existence | Always exists if moments exist | May not exist or may not be unique |
| Invariance | Not invariant to reparametrization | Invariant: MLE of g(\theta) is g(\hat{\theta}_{\text{MLE}}) |

For exponential family distributions whose sufficient statistics are powers of X (such as the normal), MoM and MLE produce the same estimator, since the MLE equates the expected sufficient statistics to their sample averages. When the sufficient statistics involve other functions of X (e.g., \log X for the Gamma), or outside the exponential family entirely, the two methods diverge.

In practice, MoM is most useful when:

  • The MLE has no closed form (Gamma, Beta, mixture models)
  • A quick initial estimate is needed for iterative MLE algorithms
  • Robustness to misspecification is more important than efficiency

Generalized Method of Moments

The generalized method of moments (GMM) extends MoM to settings with more moment conditions than parameters. Suppose we have m > d moment conditions:

E[g_j(X; \theta)] = 0, \quad j = 1, \ldots, m

With more equations than unknowns, exact solutions generally do not exist. GMM minimizes a quadratic form:

\hat{\theta}_{\text{GMM}} = \arg\min_\theta \left(\frac{1}{n}\sum_{i=1}^n g(X_i; \theta)\right)^T W \left(\frac{1}{n}\sum_{i=1}^n g(X_i; \theta)\right)

where W is a positive definite weighting matrix. The choice W = \hat{\Sigma}_g^{-1} (the inverse of the estimated covariance of the moment conditions) yields the efficient GMM estimator, which achieves the smallest asymptotic variance among all GMM estimators.
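As an illustration, consider an overidentified GMM estimate of an Exponential rate \lambda from the two conditions E[X] - 1/\lambda = 0 and E[X^2] - 2/\lambda^2 = 0 (m = 2 > d = 1). A sketch with identity weighting and a grid search standing in for a numerical optimizer (assuming NumPy; the true rate \lambda = 2 and the grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
lam_true = 2.0
x = rng.exponential(scale=1.0 / lam_true, size=50_000)
m1, m2 = np.mean(x), np.mean(x**2)  # sample averages used by both conditions

def gbar(lam):
    """Averaged moment conditions: E[X] - 1/lam and E[X^2] - 2/lam^2."""
    return np.array([m1 - 1.0 / lam, m2 - 2.0 / lam**2])

def Q(lam, W):
    """GMM objective: the quadratic form gbar' W gbar."""
    g = gbar(lam)
    return g @ W @ g

W = np.eye(2)  # identity weighting (simple, not the efficient choice)
grid = np.linspace(0.5, 5.0, 2_000)
lam_hat = grid[np.argmin([Q(lam, W) for lam in grid])]
```

With m = d the objective can be driven to zero and GMM reduces to ordinary MoM; here the two conditions cannot both hold exactly, and W arbitrates the compromise.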

GMM is foundational in econometrics, where moment conditions arise naturally from economic theory (e.g., Euler equations, instrumental variables). The instrumental variables (IV) estimator is a special case of GMM.

Connection to ML. The idea of matching empirical expectations to model expectations appears throughout machine learning. Contrastive divergence training in restricted Boltzmann machines matches the data’s expected sufficient statistics to the model’s. Moment matching is also central to generative adversarial networks (the discriminator implicitly enforces moment conditions) and to kernel methods through maximum mean discrepancy (MMD), which compares all moments simultaneously in a reproducing kernel Hilbert space.


Summary

| Concept | Key Result |
| --- | --- |
| MoM estimator | Solve \hat{\mu}_k = \mu_k(\theta) for k = 1, \ldots, d |
| Consistency | Follows from LLN + continuous mapping theorem |
| Efficiency | Generally less efficient than MLE |
| Best use case | Closed-form estimates when MLE is intractable |
| GMM | Handles overidentified models with m > d moment conditions |