2: Statistical Models
A statistical model is a mathematical description of how data are generated. It specifies a family of probability distributions indexed by parameters, and the goal of inference is to determine which member of that family best explains the observed data. The choice of model — parametric versus nonparametric, exponential family or not — determines which estimation methods apply and what theoretical guarantees are available.
What Is a Statistical Model?
A statistical model is a set of probability distributions $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\Theta$ is the parameter space. Given data $X_1, \dots, X_n$, we assume the data were generated from some $P_{\theta_0}$ with $\theta_0 \in \Theta$ and aim to identify $\theta_0$.
The model is correctly specified if the true data-generating distribution $P^*$ belongs to $\mathcal{P}$. If not, the model is misspecified, and inference targets the member of $\mathcal{P}$ closest to $P^*$ (typically in KL divergence).
Model specification requires two choices:
- The distributional family (e.g., Gaussian, Poisson, nonparametric)
- The parameter space $\Theta$ (constraints on the parameters)
Parametric vs. Nonparametric Models
Parametric models assume the data distribution belongs to a family indexed by a finite-dimensional parameter $\theta \in \Theta \subseteq \mathbb{R}^d$. The number of parameters is fixed regardless of sample size.
Examples:
- $\mathcal{N}(\mu, \sigma^2)$: two parameters, $\theta = (\mu, \sigma^2)$
- Logistic regression: $P(Y = 1 \mid X = x) = \sigma(\beta^\top x)$ with $\beta \in \mathbb{R}^d$
- Poisson regression: $Y \mid X = x \sim \mathrm{Poisson}(e^{\beta^\top x})$
Nonparametric models make minimal assumptions about the distribution. The “parameter” is infinite-dimensional — typically an entire function.
Examples:
- $\mathcal{P} = \{\text{all probability distributions on } \mathbb{R}\}$: the set of all distributions
- Kernel density estimation: $\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right)$
- The empirical CDF: $\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le x\}$
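To make the two estimators above concrete, here is a minimal sketch (hypothetical data drawn from a standard normal) of the empirical CDF and a Gaussian-kernel density estimate, each evaluated at a single point:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # hypothetical sample from N(0, 1)

def ecdf(data, t):
    """Empirical CDF: fraction of observations <= t."""
    return np.mean(data <= t)

def kde(data, t, h=0.3):
    """Gaussian-kernel density estimate at t with bandwidth h."""
    z = (t - data) / h
    return np.mean(np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)) / h

print(ecdf(x, 0.0))  # close to Phi(0) = 0.5
print(kde(x, 0.0))   # close to the N(0,1) density at 0
```

Note that both estimators must store all $n$ data points to evaluate: the "parameter" is the sample itself, which is exactly the nonparametric trait described above.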
Semiparametric models combine both: a finite-dimensional parameter of interest and an infinite-dimensional nuisance component. The Cox proportional hazards model $h(t \mid x) = h_0(t)\, e^{\beta^\top x}$ has parametric regression coefficients $\beta$ but a nonparametric baseline hazard $h_0(t)$.
Connection to ML. Parametric models correspond to fixed-architecture models: a neural network with a specified number of layers and units has a fixed number of weights regardless of dataset size. Nonparametric models correspond to methods whose complexity grows with $n$: KNN (stores all training points), kernel SVMs (support vectors grow with $n$), and random forests (tree depth adapts to data). Gaussian processes are a canonical nonparametric model — the posterior is a distribution over functions, not a finite parameter vector.
Sufficient Statistics
A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of the data given $T(X)$ does not depend on $\theta$. Informally, $T(X)$ captures everything the data can tell us about $\theta$.
The Fisher-Neyman factorization theorem provides a practical test: $T$ is sufficient for $\theta$ iff the likelihood factors as
$$p(x_1, \dots, x_n; \theta) = g(T(x_1, \dots, x_n), \theta)\, h(x_1, \dots, x_n),$$
where $g$ depends on the data only through $T$ and $h$ does not depend on $\theta$.
Example. For $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Poisson}(\lambda)$:
$$p(x; \lambda) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \underbrace{\lambda^{\sum_i x_i}\, e^{-n\lambda}}_{g(T(x),\, \lambda)} \cdot \underbrace{\prod_{i=1}^n \frac{1}{x_i!}}_{h(x)}$$
So $T(X) = \sum_{i=1}^n X_i$ is sufficient for $\lambda$. The individual data points are irrelevant once we know their sum.
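A numerical sanity check of the factorization (with made-up counts): two Poisson datasets with the same sum have log-likelihoods that differ only by a constant in $\lambda$, i.e. only through the $h(x)$ factor.

```python
import numpy as np
from math import lgamma

def poisson_loglik(data, lam):
    """Log-likelihood of i.i.d. Poisson(lam) observations."""
    return sum(k * np.log(lam) - lam - lgamma(k + 1) for k in data)

a = [2, 3, 1, 4]   # hypothetical counts, sum = 10
b = [5, 5, 0, 0]   # different points, same sufficient statistic
lams = np.linspace(0.5, 5.0, 20)

# The log-likelihood difference is constant across lambda: the data enter
# the lambda-dependent part only through their sum.
diffs = [poisson_loglik(a, l) - poisson_loglik(b, l) for l in lams]
print(max(diffs) - min(diffs))  # ~0
```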
A sufficient statistic is minimal sufficient if it is a function of every other sufficient statistic. Minimal sufficient statistics achieve the greatest possible data reduction without losing information about $\theta$.
The Exponential Family
A distribution belongs to the exponential family if its density can be written as:
$$p(x; \theta) = h(x) \exp\{\eta(\theta)^\top T(x) - A(\theta)\}$$
where:
- $T(x)$ is the sufficient statistic (vector-valued in general)
- $\eta(\theta)$ is the natural parameter
- $A(\theta)$ is the log-partition function (ensures integration to 1)
- $h(x)$ is the base measure
In the canonical form, we parametrize directly by $\eta$:
$$p(x; \eta) = h(x) \exp\{\eta^\top T(x) - A(\eta)\}$$
Key examples:
| Distribution | $T(x)$ | $\eta$ | $A(\eta)$ |
|---|---|---|---|
| $\mathcal{N}(\mu, \sigma^2)$ (known $\sigma^2$) | $x$ | $\mu / \sigma^2$ | $\sigma^2 \eta^2 / 2$ |
| Bernoulli($p$) | $x$ | $\log\frac{p}{1-p}$ | $\log(1 + e^\eta)$ |
| Poisson($\lambda$) | $x$ | $\log \lambda$ | $e^\eta$ |
| Exponential($\lambda$) | $x$ | $-\lambda$ | $-\log(-\eta)$ |
| Gamma($\alpha, \beta$) | $(\log x,\, x)$ | $(\alpha - 1,\, -\beta)$ | $\log \Gamma(\eta_1 + 1) - (\eta_1 + 1)\log(-\eta_2)$ |
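As a sanity check on the table, a short sketch verifying that the Bernoulli row reproduces the familiar pmf $p^x (1-p)^{1-x}$:

```python
import numpy as np

def bernoulli_expfam(x, p):
    """Bernoulli pmf written in canonical exponential-family form."""
    eta = np.log(p / (1 - p))      # natural parameter: the log-odds
    A = np.log(1 + np.exp(eta))    # log-partition function
    return np.exp(eta * x - A)     # h(x) = 1, T(x) = x

for x in (0, 1):
    assert np.isclose(bernoulli_expfam(x, 0.3), 0.3**x * 0.7**(1 - x))
```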
Properties of the Exponential Family
The exponential family has remarkable mathematical properties that make it the backbone of classical and modern statistical modeling.
Moment generation from the log-partition function. The derivatives of $A$ yield the moments of $T(X)$:
$$\nabla A(\eta) = \mathbb{E}_\eta[T(X)], \qquad \nabla^2 A(\eta) = \mathrm{Cov}_\eta(T(X))$$
Since $\nabla^2 A(\eta)$ is a covariance matrix, it is positive semidefinite, which means $A$ is convex. This convexity guarantees that the log-likelihood is concave in $\eta$, so maximum likelihood estimation has no local optima.
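This can be checked numerically for the Poisson row of the table, where $A(\eta) = e^\eta$: finite differences of $A$ should recover the mean and variance, both equal to $\lambda$ (a sketch with an arbitrary $\lambda = 2.5$).

```python
import numpy as np

lam = 2.5
eta = np.log(lam)   # natural parameter eta = log(lambda)
A = np.exp          # Poisson log-partition function: A(eta) = e^eta
h = 1e-4

mean = (A(eta + h) - A(eta - h)) / (2 * h)             # A'(eta)  = E[X]
var = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2    # A''(eta) = Var(X)
# Both recover lam = 2.5, since E[X] = Var(X) = lambda for the Poisson.
```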
Sufficient statistics. For $n$ i.i.d. observations from an exponential family, the joint density is:
$$p(x_1, \dots, x_n; \eta) = \left[\prod_{i=1}^n h(x_i)\right] \exp\left\{\eta^\top \sum_{i=1}^n T(x_i) - n A(\eta)\right\}$$
The sufficient statistic is $\sum_{i=1}^n T(x_i)$, which has fixed dimension regardless of $n$. This is the defining computational advantage: no matter how large the dataset, the sufficient statistic summarizes everything relevant about $\eta$.
MLE has a clean form. Setting the score to zero:
$$\nabla_\eta \ell(\eta) = \sum_{i=1}^n T(x_i) - n \nabla A(\eta) = 0 \quad\Longrightarrow\quad \mathbb{E}_{\hat{\eta}}[T(X)] = \frac{1}{n} \sum_{i=1}^n T(x_i)$$
The MLE equates the model's expected sufficient statistic to the sample average of the sufficient statistic. This is also a moment equation, so for exponential families, maximum likelihood and the method of moments coincide.
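For the Poisson family the moment equation can be solved in closed form (a sketch with made-up counts): $A'(\hat{\eta}) = e^{\hat{\eta}} = \bar{x}$ gives $\hat{\lambda} = \bar{x}$, the sample mean.

```python
import numpy as np

x = np.array([3, 0, 2, 5, 1, 4, 2, 3])  # hypothetical Poisson counts

# Moment equation: A'(eta) = mean of T(x). With A(eta) = e^eta this is
# e^eta = xbar, so eta_hat = log(xbar) and lambda_hat = xbar.
eta_hat = np.log(x.mean())
lam_hat = np.exp(eta_hat)
assert np.isclose(lam_hat, x.mean())
```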
Why the Exponential Family Matters for GLMs
Generalized linear models (GLMs) extend linear regression by allowing the response variable to follow any distribution in the exponential family. A GLM has three components:
- Random component: $Y_i$ follows an exponential family distribution with mean $\mu_i$
- Systematic component: $\eta_i = \beta^\top x_i$ (linear predictor)
- Link function: $g(\mu_i) = \eta_i$, connecting the mean to the linear predictor
The canonical link sets $g = (A')^{-1}$, so the natural parameter equals the linear predictor directly: $\theta_i = \eta_i = \beta^\top x_i$.
| GLM | Distribution | Canonical Link | Model |
|---|---|---|---|
| Linear regression | $\mathcal{N}(\mu, \sigma^2)$ | Identity: $g(\mu) = \mu$ | Continuous response |
| Logistic regression | Bernoulli($p$) | Logit: $g(\mu) = \log\frac{\mu}{1 - \mu}$ | Binary classification |
| Poisson regression | Poisson($\lambda$) | Log: $g(\mu) = \log \mu$ | Count data |
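For the Bernoulli row, for instance, the canonical logit link can be read off directly from the log-partition function:
$$A(\eta) = \log(1 + e^\eta), \qquad \mu = A'(\eta) = \frac{e^\eta}{1 + e^\eta} = \sigma(\eta), \qquad g(\mu) = (A')^{-1}(\mu) = \log\frac{\mu}{1 - \mu}.$$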
The exponential family structure guarantees that the log-likelihood of a GLM is concave in $\beta$ (when using the canonical link), so fitting is a well-posed convex optimization problem solvable by iteratively reweighted least squares (IRLS).
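A bare-bones IRLS for logistic regression can be sketched as follows (hypothetical synthetic data; a production implementation would add a convergence check and, typically, regularization):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit logistic regression by IRLS (Newton's method).

    Update: beta += (X^T W X)^{-1} X^T (y - mu), where mu = sigmoid(X beta)
    and W = diag(mu (1 - mu)) is the Bernoulli variance function.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)
        beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))
    return beta

# Hypothetical data generated with intercept -0.5 and slope 1.5
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
p = 1.0 / (1.0 + np.exp(-X @ np.array([-0.5, 1.5])))
y = (rng.random(500) < p).astype(float)
beta_hat = irls_logistic(X, y)  # should land near (-0.5, 1.5)
```

Each iteration is a weighted least-squares solve, which is why the algorithm is "iteratively reweighted": the weights $w_i = \mu_i(1 - \mu_i)$ change as the fitted means change.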
Model Specification in Practice
Choosing a model involves balancing assumptions against flexibility:
Identifiability. A model is identifiable if distinct parameters give distinct distributions: $\theta_1 \ne \theta_2 \implies P_{\theta_1} \ne P_{\theta_2}$. Non-identifiable models create ambiguity — the data cannot distinguish between parameter values. Mixture models, for instance, are non-identifiable up to label permutation.
The likelihood principle. All evidence about $\theta$ from the data is contained in the likelihood function $L(\theta) = p(x; \theta)$. Two datasets yielding the same likelihood function (up to a constant factor) carry the same information about $\theta$.
Model complexity and overfitting. A model with more parameters fits the training data better but may generalize poorly. The parametric-nonparametric spectrum is a spectrum of bias-variance tradeoff: parametric models impose strong assumptions (high bias, low variance), while nonparametric models are more flexible (low bias, high variance). Model selection criteria like AIC and BIC formalize this tradeoff by penalizing model complexity.
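A minimal sketch of this tradeoff on hypothetical polynomial-regression data, using the standard definitions AIC $= 2k - 2\log L$ and BIC $= k \log n - 2\log L$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = np.linspace(-1, 1, n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)  # true model is degree 1

def aic_bic(d):
    """Least-squares polynomial fit of degree d, scored by AIC and BIC."""
    resid = y - np.polyval(np.polyfit(x, y, d), x)
    sigma2 = np.mean(resid**2)  # Gaussian MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = d + 2                   # d + 1 coefficients plus the variance
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

scores = {d: aic_bic(d) for d in range(6)}
# BIC penalizes both the underfit degree-0 model and overfit high degrees,
# so the true degree-1 model scores well.
```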
Summary
| Concept | Key Idea | ML Connection |
|---|---|---|
| Parametric model | Fixed number of parameters | Fixed-architecture networks |
| Nonparametric model | Complexity grows with $n$ | KNN, kernel methods, GPs |
| Sufficient statistic | Lossless data compression for $\theta$ | Feature extraction |
| Exponential family | $p(x; \eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\}$ | Foundation of GLMs |
| Log-partition function | Convex; derivatives give moments | Convex loss guarantees |
| GLMs | Exponential family + linear predictor | Logistic/Poisson regression |