2: Statistical Models

A statistical model is a mathematical description of how data are generated. It specifies a family of probability distributions indexed by parameters, and the goal of inference is to determine which member of that family best explains the observed data. The choice of model — parametric versus nonparametric, exponential family or not — determines which estimation methods apply and what theoretical guarantees are available.


What Is a Statistical Model?

A statistical model is a set of probability distributions $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\Theta$ is the parameter space. Given data $X_1, \ldots, X_n$, we assume the data were generated from some $P_\theta \in \mathcal{P}$ and aim to identify $\theta$.

The model is correctly specified if the true data-generating distribution $P^*$ belongs to $\mathcal{P}$. If not, the model is misspecified, and inference targets the member of $\mathcal{P}$ closest to $P^*$ (typically in KL divergence).

Model specification requires two choices:

  1. The distributional family (e.g., Gaussian, Poisson, nonparametric)
  2. The parameter space $\Theta$ (constraints on the parameters)

Parametric vs. Nonparametric Models

Parametric models assume the data distribution belongs to a family indexed by a finite-dimensional parameter $\theta \in \Theta \subseteq \mathbb{R}^d$. The number of parameters $d$ is fixed regardless of sample size.

Examples:

  • $X_i \sim \mathcal{N}(\mu, \sigma^2)$: two parameters $(\mu, \sigma^2)$
  • Logistic regression: $P(Y = 1 \mid X) = \sigma(X^T\beta)$ with $\beta \in \mathbb{R}^p$
  • Poisson regression: $Y_i \sim \text{Poisson}(\exp(X_i^T\beta))$

Nonparametric models make minimal assumptions about the distribution. The “parameter” is infinite-dimensional — typically an entire function.

Examples:

  • $\mathcal{P} = \{F : F \text{ is a CDF}\}$: the set of all distributions
  • Kernel density estimation: $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)$
  • The empirical CDF: $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \leq x)$
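Both nonparametric estimators above fit in a few lines. A minimal NumPy sketch, using a Gaussian kernel for $K$; the sample, seed, and bandwidth `h = 0.3` are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=500)  # sample from N(0, 1)

def kde(x, data, h=0.3):
    """Kernel density estimate at x with a Gaussian kernel."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def ecdf(x, data):
    """Empirical CDF at x: fraction of observations <= x."""
    return np.mean(data <= x)

f_hat = kde(0.0, X)   # density estimate at 0 (true value ~ 0.399)
F_hat = ecdf(0.0, X)  # CDF estimate at 0 (true value 0.5)
```

Note that both estimators retain the entire sample: the "parameter" is the data itself, which is exactly what makes them nonparametric.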

Semiparametric models combine both: a finite-dimensional parameter of interest and an infinite-dimensional nuisance component. The Cox proportional hazards model $\lambda(t \mid X) = \lambda_0(t)\exp(X^T\beta)$ has parametric regression coefficients $\beta$ but a nonparametric baseline hazard $\lambda_0(t)$.

Connection to ML. Parametric models correspond to fixed-architecture models: a neural network with a specified number of layers and units has a fixed number of weights regardless of dataset size. Nonparametric models correspond to methods whose complexity grows with $n$: KNN (stores all training points), kernel SVMs (the number of support vectors can grow with $n$), and random forests (tree depth adapts to data). Gaussian processes are a canonical nonparametric model: the posterior is a distribution over functions, not a finite parameter vector.


Sufficient Statistics

A statistic $T = T(X_1, \ldots, X_n)$ is sufficient for $\theta$ if the conditional distribution of the data given $T$ does not depend on $\theta$. Informally, $T$ captures everything the data can tell us about $\theta$.

The Fisher-Neyman factorization theorem provides a practical test: $T$ is sufficient for $\theta$ if and only if the likelihood factors as

$$L(\theta; \mathbf{x}) = g(T(\mathbf{x}), \theta) \cdot h(\mathbf{x})$$

where $g$ depends on the data only through $T$ and $h$ does not depend on $\theta$.

Example. For $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$:

$$L(\lambda; \mathbf{x}) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \lambda^{\sum x_i} e^{-n\lambda} \cdot \frac{1}{\prod x_i!}$$

So $T = \sum X_i$ is sufficient for $\lambda$. The individual data points are irrelevant once we know their sum.
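Sufficiency can be checked numerically: two Poisson samples with the same sum (and same $n$) produce log-likelihoods that differ only by a constant in $\lambda$, coming entirely from the $h(\mathbf{x}) = 1/\prod x_i!$ factor, so they support identical inference about $\lambda$. The datasets below are made up for illustration:

```python
import math

def poisson_loglik(lam, data):
    # log L(lambda; x) = sum(x) log(lambda) - n*lambda - sum(log(x_i!))
    return (sum(data) * math.log(lam) - len(data) * lam
            - sum(math.lgamma(x + 1) for x in data))

x1 = [2, 0, 5, 3]  # sum = 10, n = 4
x2 = [1, 4, 1, 4]  # same sum and n, different individual values

# The log-likelihood difference is the same for every lambda:
# only the lambda-free h(x) term distinguishes the two datasets.
diffs = [poisson_loglik(lam, x1) - poisson_loglik(lam, x2)
         for lam in (0.5, 1.0, 2.5, 7.0)]
```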

A sufficient statistic $T$ is minimal sufficient if it is a function of every other sufficient statistic. Minimal sufficient statistics achieve the greatest possible data reduction without losing information about $\theta$.


The Exponential Family

A distribution belongs to the exponential family if its density can be written as:

$$f(x; \theta) = h(x) \exp\left(\eta(\theta)^T T(x) - A(\theta)\right)$$

where:

  • $T(x)$ is the sufficient statistic (vector-valued in general)
  • $\eta(\theta)$ is the natural parameter
  • $A(\theta)$ is the log-partition function (ensures the density integrates to 1)
  • $h(x)$ is the base measure

In the canonical form, we parametrize directly by $\eta$:

$$f(x; \eta) = h(x) \exp\left(\eta^T T(x) - A(\eta)\right)$$

Key examples:

| Distribution | $\eta$ | $T(x)$ | $A(\eta)$ |
|---|---|---|---|
| $\mathcal{N}(\mu, \sigma^2)$ (known $\sigma^2$) | $\mu/\sigma^2$ | $x$ | $\eta^2 \sigma^2 / 2$ |
| Bernoulli($p$) | $\log(p/(1-p))$ | $x$ | $\log(1 + e^\eta)$ |
| Poisson($\lambda$) | $\log \lambda$ | $x$ | $e^\eta$ |
| Exponential($\lambda$) | $-\lambda$ | $x$ | $-\log(-\eta)$ |
| Gamma($\alpha, \beta$) | $(\alpha - 1, -\beta)$ | $(\log x, x)$ | $\log\Gamma(\alpha) - \alpha\log\beta$ |
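As a sanity check on the table, the Bernoulli row can be verified numerically: substituting $\eta = \log(p/(1-p))$ and $A(\eta) = \log(1+e^\eta)$ into $h(x)\exp(\eta T(x) - A(\eta))$ with $h(x) = 1$ recovers the usual pmf. A quick illustration:

```python
import math

def bernoulli_pmf(x, p):
    """Direct pmf: p^x (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

def bernoulli_expfam(x, p):
    """Same pmf via the exponential family form with h(x) = 1."""
    eta = math.log(p / (1 - p))      # natural parameter (logit)
    A = math.log(1 + math.exp(eta))  # log-partition function
    return math.exp(eta * x - A)

vals_match = all(
    abs(bernoulli_pmf(x, p) - bernoulli_expfam(x, p)) < 1e-12
    for x in (0, 1) for p in (0.1, 0.5, 0.9)
)
```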

Properties of the Exponential Family

The exponential family has remarkable mathematical properties that make it the backbone of classical and modern statistical modeling.

Moment generation from the log-partition function. The derivatives of $A(\eta)$ yield the moments of $T(X)$:

$$E[T(X)] = \nabla A(\eta), \qquad \text{Var}(T(X)) = \nabla^2 A(\eta)$$

Since $\nabla^2 A(\eta)$ is a covariance matrix, it is positive semidefinite, which means $A(\eta)$ is convex. This convexity guarantees that the log-likelihood is concave in $\eta$, so maximum likelihood estimation has no local optima.
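For the Poisson row of the table above, $A(\eta) = e^\eta$, so $A'(\eta) = A''(\eta) = e^\eta = \lambda$: the derivatives of the log-partition function reproduce the Poisson mean and variance. A finite-difference check (the value $\lambda = 3$ and step size are arbitrary):

```python
import math

lam = 3.0
eta = math.log(lam)        # natural parameter for Poisson
A = lambda e: math.exp(e)  # log-partition function

h = 1e-4
# Central differences: A'(eta) ~ E[X] = lambda, A''(eta) ~ Var(X) = lambda.
A_prime = (A(eta + h) - A(eta - h)) / (2 * h)
A_second = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
```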

Sufficient statistics. For $n$ i.i.d. observations from an exponential family, the joint density is:

$$\prod_{i=1}^n f(x_i; \eta) = \left(\prod_{i=1}^n h(x_i)\right) \exp\left(\eta^T \sum_{i=1}^n T(x_i) - nA(\eta)\right)$$

The sufficient statistic is $\sum_{i=1}^n T(x_i)$, which has fixed dimension regardless of $n$. This is the defining computational advantage: no matter how large the dataset, the sufficient statistic summarizes everything relevant about $\theta$.

MLE has a clean form. Setting the score to zero:

$$\nabla A(\hat{\eta}) = \frac{1}{n}\sum_{i=1}^n T(x_i)$$

The MLE equates the model’s expected sufficient statistic to the sample average of the sufficient statistic. This is also the moment equation, so for exponential families, MLE and the method of moments coincide.
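For the Poisson model the moment equation reads $e^{\hat\eta} = \bar{x}$, so $\hat\lambda = \bar{x}$. The sketch below solves $\nabla A(\hat\eta) = \bar{T}$ numerically (simple bisection, made-up data) and confirms it lands on the sample mean:

```python
import math

data = [4, 2, 7, 3, 5, 4]
t_bar = sum(data) / len(data)  # average sufficient statistic: here x-bar

# Solve A'(eta) = exp(eta) = t_bar for eta by bisection.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if math.exp(mid) < t_bar:
        lo = mid
    else:
        hi = mid

lam_hat = math.exp((lo + hi) / 2)  # MLE of lambda; equals the sample mean
```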


Why the Exponential Family Matters for GLMs

Generalized linear models (GLMs) extend linear regression by allowing the response variable to follow any distribution in the exponential family. A GLM has three components:

  1. Random component: $Y_i$ follows an exponential family distribution
  2. Systematic component: $\eta_i = X_i^T\beta$ (linear predictor)
  3. Link function: $g(\mu_i) = \eta_i$, connecting the mean $\mu_i = E[Y_i]$ to the linear predictor

The canonical link sets $g = (\nabla A)^{-1}$, so the natural parameter equals the linear predictor directly:

| GLM | Distribution | Canonical Link | Use case |
|---|---|---|---|
| Linear regression | $\mathcal{N}(\mu, \sigma^2)$ | Identity: $\mu = X^T\beta$ | Continuous response |
| Logistic regression | Bernoulli($p$) | Logit: $\log(p/(1-p)) = X^T\beta$ | Binary classification |
| Poisson regression | Poisson($\lambda$) | Log: $\log\lambda = X^T\beta$ | Count data |

The exponential family structure guarantees that the log-likelihood of a GLM is concave in $\beta$ (when using the canonical link), so fitting is a well-posed convex optimization problem solvable by iteratively reweighted least squares (IRLS).
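A minimal IRLS loop for logistic regression under the canonical logit link; the synthetic data, seed, coefficients, and iteration count here are arbitrary choices for the sketch, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + features
beta_true = np.array([0.5, 1.0, -2.0])                       # assumed "true" coefficients
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(p + 1)
for _ in range(25):  # Newton-Raphson = IRLS for the canonical link
    mu = 1 / (1 + np.exp(-X @ beta))  # fitted means
    W = mu * (1 - mu)                 # IRLS weights = Var(Y_i)
    z = X @ beta + (y - mu) / W       # working response
    # Weighted least squares step: solve (X^T W X) beta = X^T W z
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
```

Because the log-likelihood is concave in $\beta$, the Newton steps converge to the global maximum; at the solution the score $X^T(y - \mu)$ vanishes.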


Model Specification in Practice

Choosing a model involves balancing assumptions against flexibility:

Identifiability. A model is identifiable if distinct parameters give distinct distributions: $\theta_1 \neq \theta_2 \implies P_{\theta_1} \neq P_{\theta_2}$. Non-identifiable models create ambiguity: the data cannot distinguish between parameter values. Mixture models, for instance, are identifiable only up to label permutation.
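Label-permutation non-identifiability is easy to exhibit: swapping the two components of a Gaussian mixture (and their weights) gives a different parameter vector but exactly the same density. An illustrative check with made-up parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, w, mu1, mu2):
    """Two-component Gaussian mixture with unit variances."""
    return w * normal_pdf(x, mu1, 1.0) + (1 - w) * normal_pdf(x, mu2, 1.0)

# theta_1 = (0.3, 0, 3) and theta_2 = (0.7, 3, 0): distinct parameter
# vectors, identical distribution -> not identifiable.
same_density = all(
    abs(mixture_pdf(x, 0.3, 0.0, 3.0) - mixture_pdf(x, 0.7, 3.0, 0.0)) < 1e-12
    for x in (-2.0, 0.0, 1.5, 3.0, 5.0)
)
```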

The likelihood principle. All evidence about $\theta$ from the data is contained in the likelihood function $L(\theta; \mathbf{x}) = f(\mathbf{x}; \theta)$. Two datasets yielding the same likelihood function carry the same information about $\theta$.

Model complexity and overfitting. A model with more parameters fits the training data better but may generalize poorly. The parametric-nonparametric spectrum is a bias-variance tradeoff: parametric models impose strong assumptions (high bias, low variance), while nonparametric models are more flexible (low bias, high variance). Model selection criteria like AIC and BIC formalize this tradeoff by penalizing model complexity.
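Both criteria penalize the maximized log-likelihood $\hat\ell$ by the number of free parameters $k$: $\mathrm{AIC} = 2k - 2\hat\ell$ and $\mathrm{BIC} = k\log n - 2\hat\ell$ (smaller is better). A sketch comparing a one-parameter Gaussian fit $\mathcal{N}(\mu, 1)$ against a two-parameter fit $\mathcal{N}(\mu, \sigma^2)$ on made-up data:

```python
import math

data = [2.1, 1.9, 3.2, 2.8, 2.4, 1.7, 2.9, 2.6]
n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # MLE of the variance

def gauss_loglik(data, mu, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

# Model 1: N(mu, 1), one free parameter.  Model 2: N(mu, sigma^2), two.
ll1, k1 = gauss_loglik(data, mu_hat, 1.0), 1
ll2, k2 = gauss_loglik(data, mu_hat, var_hat), 2

aic1, aic2 = 2 * k1 - 2 * ll1, 2 * k2 - 2 * ll2
bic1, bic2 = k1 * math.log(n) - 2 * ll1, k2 * math.log(n) - 2 * ll2
```

The richer model always achieves at least as high a likelihood; the criteria decide whether the improvement justifies the extra parameter (here the sample variance is far from 1, so it does).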


Summary

| Concept | Key Idea | ML Connection |
|---|---|---|
| Parametric model | Fixed number of parameters | Fixed-architecture networks |
| Nonparametric model | Complexity grows with $n$ | KNN, kernel methods, GPs |
| Sufficient statistic | Lossless data compression for $\theta$ | Feature extraction |
| Exponential family | $f = h(x)\exp(\eta^T T(x) - A(\eta))$ | Foundation of GLMs |
| Log-partition function | Convex; derivatives give moments | Convex loss guarantees |
| GLMs | Exponential family + linear predictor | Logistic/Poisson regression |