2: Statistical Models

A statistical model is a mathematical description of how data are generated. It specifies a family of probability distributions indexed by parameters, and the goal of inference is to determine which member of that family best explains the observed data. The choice of model — parametric versus nonparametric, exponential family or not — determines which estimation methods apply and what theoretical guarantees are available.


What Is a Statistical Model?

A statistical model is a set of probability distributions $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\Theta$ is the parameter space. Given data $X_1, \ldots, X_n$, we assume the data were generated from some $P_\theta \in \mathcal{P}$ and aim to identify $\theta$.

The model is correctly specified if the true data-generating distribution $P^*$ belongs to $\mathcal{P}$. If not, the model is misspecified, and inference targets the member of $\mathcal{P}$ closest to $P^*$ (typically in KL divergence).

Model specification requires two choices:

  1. The distributional family (e.g., Gaussian, Poisson, nonparametric)
  2. The parameter space $\Theta$ (constraints on the parameters)

Parametric vs. Nonparametric Models

Parametric models assume the data distribution belongs to a family indexed by a finite-dimensional parameter $\theta \in \Theta \subseteq \mathbb{R}^d$. The number of parameters $d$ is fixed regardless of sample size.

Examples:

  • $X_i \sim \mathcal{N}(\mu, \sigma^2)$: two parameters $(\mu, \sigma^2)$
  • Logistic regression: $P(Y = 1 \mid X) = \sigma(X^T\beta)$ with $\beta \in \mathbb{R}^p$
  • Poisson regression: $Y_i \sim \text{Poisson}(\exp(X_i^T\beta))$

Nonparametric models make minimal assumptions about the distribution. The “parameter” is infinite-dimensional — typically an entire function.

Examples:

  • $\mathcal{P} = \{F : F \text{ is a CDF}\}$: the set of all distributions
  • Kernel density estimation: $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)$
  • The empirical CDF: $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \leq x)$
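Both nonparametric estimators above fit in a few lines. A minimal NumPy sketch, using a Gaussian kernel for $K$; the sample, seed, and bandwidth `h = 0.3` are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=500)  # sample from N(0, 1)

def kde(x, data, h=0.3):
    """Kernel density estimate at x with a Gaussian kernel."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def ecdf(x, data):
    """Empirical CDF at x: fraction of observations <= x."""
    return np.mean(data <= x)

f_hat = kde(0.0, X)   # density estimate at 0 (true value ~ 0.399)
F_hat = ecdf(0.0, X)  # CDF estimate at 0 (true value 0.5)
```

Note that both estimators retain the entire sample: the "parameter" is the data itself, which is exactly what makes them nonparametric.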

Semiparametric models combine both: a finite-dimensional parameter of interest and an infinite-dimensional nuisance component. The Cox proportional hazards model $\lambda(t \mid X) = \lambda_0(t)\exp(X^T\beta)$ has parametric regression coefficients $\beta$ but a nonparametric baseline hazard $\lambda_0(t)$.

Connection to ML. Parametric models correspond to fixed-architecture models: a neural network with a specified number of layers and units has a fixed number of weights regardless of dataset size. Nonparametric models correspond to methods whose complexity grows with $n$: KNN (stores all training points), kernel SVMs (the number of support vectors can grow with $n$), and random forests (tree depth adapts to data). Gaussian processes are a canonical nonparametric model: the posterior is a distribution over functions, not a finite parameter vector.


Sufficient Statistics

A statistic $T = T(X_1, \ldots, X_n)$ is sufficient for $\theta$ if the conditional distribution of the data given $T$ does not depend on $\theta$. Informally, $T$ captures everything the data can tell us about $\theta$.

The Fisher-Neyman factorization theorem provides a practical test: $T$ is sufficient for $\theta$ if and only if the likelihood factors as

$$L(\theta; \mathbf{x}) = g(T(\mathbf{x}), \theta) \cdot h(\mathbf{x})$$

where $g$ depends on the data only through $T$ and $h$ does not depend on $\theta$.

Example. For $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$:

$$L(\lambda; \mathbf{x}) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \lambda^{\sum x_i} e^{-n\lambda} \cdot \frac{1}{\prod x_i!}$$

So $T = \sum X_i$ is sufficient for $\lambda$. The individual data points are irrelevant once we know their sum.
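Sufficiency can be checked numerically: two Poisson samples with the same sum (and same $n$) produce log-likelihoods that differ only by a constant in $\lambda$, coming entirely from the $h(\mathbf{x}) = 1/\prod x_i!$ factor, so they support identical inference about $\lambda$. The datasets below are made up for illustration:

```python
import math

def poisson_loglik(lam, data):
    # log L(lambda; x) = sum(x) log(lambda) - n*lambda - sum(log(x_i!))
    return (sum(data) * math.log(lam) - len(data) * lam
            - sum(math.lgamma(x + 1) for x in data))

x1 = [2, 0, 5, 3]  # sum = 10, n = 4
x2 = [1, 4, 1, 4]  # same sum and n, different individual values

# The log-likelihood difference is the same for every lambda:
# only the lambda-free h(x) term distinguishes the two datasets.
diffs = [poisson_loglik(lam, x1) - poisson_loglik(lam, x2)
         for lam in (0.5, 1.0, 2.5, 7.0)]
```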

A sufficient statistic $T$ is minimal sufficient if it is a function of every other sufficient statistic. Minimal sufficient statistics achieve the greatest possible data reduction without losing information about $\theta$.


The Exponential Family

A distribution belongs to the exponential family if its density can be written as:

$$f(x; \theta) = h(x) \exp\left(\eta(\theta)^T T(x) - A(\theta)\right)$$

where:

  • $T(x)$ is the sufficient statistic (vector-valued in general)
  • $\eta(\theta)$ is the natural parameter
  • $A(\theta)$ is the log-partition function (ensures the density integrates to 1)
  • $h(x)$ is the base measure

In the canonical form, we parametrize directly by $\eta$:

$$f(x; \eta) = h(x) \exp\left(\eta^T T(x) - A(\eta)\right)$$

Key examples:

| Distribution | $\eta$ | $T(x)$ | $A(\eta)$ |
|---|---|---|---|
| $\mathcal{N}(\mu, \sigma^2)$ (known $\sigma^2$) | $\mu/\sigma^2$ | $x$ | $\eta^2 \sigma^2 / 2$ |
| Bernoulli($p$) | $\log(p/(1-p))$ | $x$ | $\log(1 + e^\eta)$ |
| Poisson($\lambda$) | $\log \lambda$ | $x$ | $e^\eta$ |
| Exponential($\lambda$) | $-\lambda$ | $x$ | $-\log(-\eta)$ |
| Gamma($\alpha, \beta$) | $(\alpha - 1, -\beta)$ | $(\log x, x)$ | $\log\Gamma(\alpha) - \alpha\log\beta$ |
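As a sanity check on the table, the Bernoulli row can be verified numerically: substituting $\eta = \log(p/(1-p))$ and $A(\eta) = \log(1+e^\eta)$ into $h(x)\exp(\eta T(x) - A(\eta))$ with $h(x) = 1$ recovers the usual pmf. A quick illustration:

```python
import math

def bernoulli_pmf(x, p):
    """Direct pmf: p^x (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

def bernoulli_expfam(x, p):
    """Same pmf via the exponential family form with h(x) = 1."""
    eta = math.log(p / (1 - p))      # natural parameter (logit)
    A = math.log(1 + math.exp(eta))  # log-partition function
    return math.exp(eta * x - A)

vals_match = all(
    abs(bernoulli_pmf(x, p) - bernoulli_expfam(x, p)) < 1e-12
    for x in (0, 1) for p in (0.1, 0.5, 0.9)
)
```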

Properties of the Exponential Family

The exponential family has remarkable mathematical properties that make it the backbone of classical and modern statistical modeling.

Moment generation from the log-partition function. The derivatives of $A(\eta)$ yield the moments of $T(X)$:

$$E[T(X)] = \nabla A(\eta), \qquad \text{Var}(T(X)) = \nabla^2 A(\eta)$$

Since $\nabla^2 A(\eta)$ is a covariance matrix, it is positive semidefinite, which means $A(\eta)$ is convex. This convexity guarantees that the log-likelihood is concave in $\eta$, so maximum likelihood estimation has no local optima.
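For the Poisson row of the table above, $A(\eta) = e^\eta$, so $A'(\eta) = A''(\eta) = e^\eta = \lambda$: the derivatives of the log-partition function reproduce the Poisson mean and variance. A finite-difference check (the value $\lambda = 3$ and step size are arbitrary):

```python
import math

lam = 3.0
eta = math.log(lam)        # natural parameter for Poisson
A = lambda e: math.exp(e)  # log-partition function

h = 1e-4
# Central differences: A'(eta) ~ E[X] = lambda, A''(eta) ~ Var(X) = lambda.
A_prime = (A(eta + h) - A(eta - h)) / (2 * h)
A_second = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
```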

Sufficient statistics. For $n$ i.i.d. observations from an exponential family, the joint density is:

$$\prod_{i=1}^n f(x_i; \eta) = \left(\prod_{i=1}^n h(x_i)\right) \exp\left(\eta^T \sum_{i=1}^n T(x_i) - nA(\eta)\right)$$

The sufficient statistic is $\sum_{i=1}^n T(x_i)$, which has fixed dimension regardless of $n$. This is the defining computational advantage: no matter how large the dataset, the sufficient statistic summarizes everything relevant about $\theta$.

MLE has a clean form. Setting the score to zero:

$$\nabla A(\hat{\eta}) = \frac{1}{n}\sum_{i=1}^n T(x_i)$$

The MLE equates the model’s expected sufficient statistic to the sample average of the sufficient statistic. This is also the moment equation, so for exponential families, MLE and the method of moments coincide.
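For the Poisson model the moment equation reads $e^{\hat\eta} = \bar{x}$, so $\hat\lambda = \bar{x}$. The sketch below solves $\nabla A(\hat\eta) = \bar{T}$ numerically (simple bisection, made-up data) and confirms it lands on the sample mean:

```python
import math

data = [4, 2, 7, 3, 5, 4]
t_bar = sum(data) / len(data)  # average sufficient statistic: here x-bar

# Solve A'(eta) = exp(eta) = t_bar for eta by bisection.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if math.exp(mid) < t_bar:
        lo = mid
    else:
        hi = mid

lam_hat = math.exp((lo + hi) / 2)  # MLE of lambda; equals the sample mean
```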


Why the Exponential Family Matters for GLMs

Generalized linear models (GLMs) extend linear regression by allowing the response variable to follow any distribution in the exponential family. A GLM has three components:

  1. Random component: $Y_i$ follows an exponential family distribution
  2. Systematic component: $\eta_i = X_i^T\beta$ (linear predictor)
  3. Link function: $g(\mu_i) = \eta_i$, connecting the mean $\mu_i = E[Y_i]$ to the linear predictor

The canonical link sets $g = (\nabla A)^{-1}$, so the natural parameter equals the linear predictor directly:

| GLM | Distribution | Canonical Link | Use case |
|---|---|---|---|
| Linear regression | $\mathcal{N}(\mu, \sigma^2)$ | Identity: $\mu = X^T\beta$ | Continuous response |
| Logistic regression | Bernoulli($p$) | Logit: $\log(p/(1-p)) = X^T\beta$ | Binary classification |
| Poisson regression | Poisson($\lambda$) | Log: $\log\lambda = X^T\beta$ | Count data |

The exponential family structure guarantees that the log-likelihood of a GLM is concave in $\beta$ (when using the canonical link), so fitting is a well-posed convex optimization problem solvable by iteratively reweighted least squares (IRLS).
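A minimal IRLS loop for logistic regression under the canonical logit link; the synthetic data, seed, coefficients, and iteration count here are arbitrary choices for the sketch, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + features
beta_true = np.array([0.5, 1.0, -2.0])                       # assumed "true" coefficients
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(p + 1)
for _ in range(25):  # Newton-Raphson = IRLS for the canonical link
    mu = 1 / (1 + np.exp(-X @ beta))  # fitted means
    W = mu * (1 - mu)                 # IRLS weights = Var(Y_i)
    z = X @ beta + (y - mu) / W       # working response
    # Weighted least squares step: solve (X^T W X) beta = X^T W z
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
```

Because the log-likelihood is concave in $\beta$, the Newton steps converge to the global maximum; at the solution the score $X^T(y - \mu)$ vanishes.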


Model Specification in Practice

Choosing a model involves balancing assumptions against flexibility:

Identifiability. A model is identifiable if distinct parameters give distinct distributions: $\theta_1 \neq \theta_2 \implies P_{\theta_1} \neq P_{\theta_2}$. Non-identifiable models create ambiguity: the data cannot distinguish between parameter values. Mixture models, for instance, are identifiable only up to label permutation.
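Label-permutation non-identifiability is easy to exhibit: swapping the two components of a Gaussian mixture (and their weights) gives a different parameter vector but exactly the same density. An illustrative check with made-up parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, w, mu1, mu2):
    """Two-component Gaussian mixture with unit variances."""
    return w * normal_pdf(x, mu1, 1.0) + (1 - w) * normal_pdf(x, mu2, 1.0)

# theta_1 = (0.3, 0, 3) and theta_2 = (0.7, 3, 0): distinct parameter
# vectors, identical distribution -> not identifiable.
same_density = all(
    abs(mixture_pdf(x, 0.3, 0.0, 3.0) - mixture_pdf(x, 0.7, 3.0, 0.0)) < 1e-12
    for x in (-2.0, 0.0, 1.5, 3.0, 5.0)
)
```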

The likelihood principle. All evidence about $\theta$ from the data is contained in the likelihood function $L(\theta; \mathbf{x}) = f(\mathbf{x}; \theta)$. Two datasets yielding the same likelihood function carry the same information about $\theta$.

Model complexity and overfitting. A model with more parameters fits the training data better but may generalize poorly. The parametric-nonparametric spectrum is a bias-variance tradeoff: parametric models impose strong assumptions (high bias, low variance), while nonparametric models are more flexible (low bias, high variance). Model selection criteria like AIC and BIC formalize this tradeoff by penalizing model complexity.
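Both criteria penalize the maximized log-likelihood $\hat\ell$ by the number of free parameters $k$: $\mathrm{AIC} = 2k - 2\hat\ell$ and $\mathrm{BIC} = k\log n - 2\hat\ell$ (smaller is better). A sketch comparing a one-parameter Gaussian fit $\mathcal{N}(\mu, 1)$ against a two-parameter fit $\mathcal{N}(\mu, \sigma^2)$ on made-up data:

```python
import math

data = [2.1, 1.9, 3.2, 2.8, 2.4, 1.7, 2.9, 2.6]
n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # MLE of the variance

def gauss_loglik(data, mu, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

# Model 1: N(mu, 1), one free parameter.  Model 2: N(mu, sigma^2), two.
ll1, k1 = gauss_loglik(data, mu_hat, 1.0), 1
ll2, k2 = gauss_loglik(data, mu_hat, var_hat), 2

aic1, aic2 = 2 * k1 - 2 * ll1, 2 * k2 - 2 * ll2
bic1, bic2 = k1 * math.log(n) - 2 * ll1, k2 * math.log(n) - 2 * ll2
```

The richer model always achieves at least as high a likelihood; the criteria decide whether the improvement justifies the extra parameter (here the sample variance is far from 1, so it does).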


Summary

| Concept | Key Idea | ML Connection |
|---|---|---|
| Parametric model | Fixed number of parameters | Fixed-architecture networks |
| Nonparametric model | Complexity grows with $n$ | KNN, kernel methods, GPs |
| Sufficient statistic | Lossless data compression for $\theta$ | Feature extraction |
| Exponential family | $f = h(x)\exp(\eta^T T(x) - A(\eta))$ | Foundation of GLMs |
| Log-partition function | Convex; derivatives give moments | Convex loss guarantees |
| GLMs | Exponential family + linear predictor | Logistic/Poisson regression |