Gaussian Discriminant Analysis
Gaussian Discriminant Analysis (GDA) is a generative classifier that models each class as a multivariate Gaussian distribution. It provides a principled probabilistic framework where classification reduces to comparing Gaussian densities, and its relationship to logistic regression reveals when generative modeling is advantageous.
Model Specification
GDA assumes the following generative process:
- The class label is drawn from a categorical prior: $y \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K)$, where $\pi_k = P(y = k)$ and $\sum_k \pi_k = 1$
- Given the class, the features are drawn from a class-specific Gaussian: $x \mid y = k \sim \mathcal{N}(\mu_k, \Sigma_k)$
The class-conditional density is:

$$p(x \mid y = k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\right)$$

Classification uses Bayes’ rule:

$$P(y = k \mid x) = \frac{p(x \mid y = k)\,\pi_k}{\sum_{j=1}^{K} p(x \mid y = j)\,\pi_j}$$
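As a minimal sketch of this pipeline (assuming NumPy and already-estimated parameters; the helper names are illustrative, not from any particular library), the posterior can be evaluated in log space for numerical stability:

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """Log-density of a multivariate Gaussian N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def gda_posterior(x, priors, means, covs):
    """P(y = k | x) via Bayes' rule, given per-class priors, means, and covariances."""
    log_joint = np.array([
        np.log(pi) + log_gaussian(x, mu, Sigma)
        for pi, mu, Sigma in zip(priors, means, covs)
    ])
    log_joint -= log_joint.max()                 # stabilize before exponentiating
    unnorm = np.exp(log_joint)
    return unnorm / unnorm.sum()
```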
Linear Discriminant Analysis (LDA)
LDA constrains all classes to share a common covariance matrix: $\Sigma_k = \Sigma$ for all $k$.
Decision Boundary
The log-posterior difference between classes $k$ and $\ell$ is:

$$\log \frac{P(y = k \mid x)}{P(y = \ell \mid x)} = \log \frac{\pi_k}{\pi_\ell} - \tfrac{1}{2}(x - \mu_k)^\top \Sigma^{-1}(x - \mu_k) + \tfrac{1}{2}(x - \mu_\ell)^\top \Sigma^{-1}(x - \mu_\ell)$$

With shared covariance, the quadratic terms cancel, yielding:

$$\log \frac{P(y = k \mid x)}{P(y = \ell \mid x)} = (\mu_k - \mu_\ell)^\top \Sigma^{-1} x - \tfrac{1}{2}\left(\mu_k^\top \Sigma^{-1} \mu_k - \mu_\ell^\top \Sigma^{-1} \mu_\ell\right) + \log \frac{\pi_k}{\pi_\ell}$$

This is linear in $x$: the decision boundary is a hyperplane. Hence the name “linear” discriminant analysis.
Parameter Estimation (MLE)
Given training data $\{(x_i, y_i)\}_{i=1}^{n}$ with $n_k$ examples in class $k$:

Prior: $\hat{\pi}_k = \dfrac{n_k}{n}$

Class means: $\hat{\mu}_k = \dfrac{1}{n_k} \sum_{i:\, y_i = k} x_i$

Shared covariance: $\hat{\Sigma} = \dfrac{1}{n} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top$

All estimates are closed-form. Training is $O(nd^2)$ (dominated by computing the covariance matrix).
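A minimal NumPy sketch of these estimators, assuming integer labels $0, \dots, K-1$ (the function name is illustrative):

```python
import numpy as np

def fit_lda(X, y, n_classes):
    """Closed-form MLE for LDA: priors, class means, and the pooled covariance.

    X is an (n, d) feature matrix; y holds integer labels in {0, ..., K-1}.
    """
    n, d = X.shape
    priors = np.zeros(n_classes)
    means = np.zeros((n_classes, d))
    Sigma = np.zeros((d, d))
    for k in range(n_classes):
        Xk = X[y == k]
        priors[k] = len(Xk) / n           # pi_k = n_k / n
        means[k] = Xk.mean(axis=0)        # mu_k = average of class-k examples
        centered = Xk - means[k]
        Sigma += centered.T @ centered    # accumulate within-class scatter
    Sigma /= n                            # pooled MLE covariance
    return priors, means, Sigma
```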
Quadratic Discriminant Analysis (QDA)
QDA allows each class to have its own covariance matrix $\Sigma_k$. The quadratic terms no longer cancel between classes, producing a quadratic decision boundary.
The discriminant function for class $k$ is:

$$\delta_k(x) = -\tfrac{1}{2} \log |\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) + \log \pi_k$$
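A short NumPy sketch of this discriminant, assuming the per-class parameters have already been estimated (names are illustrative):

```python
import numpy as np

def qda_discriminant(x, pi_k, mu_k, Sigma_k):
    """delta_k(x): compute for every class and predict the argmax."""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    maha = diff @ np.linalg.solve(Sigma_k, diff)
    return -0.5 * logdet - 0.5 * maha + np.log(pi_k)

# Prediction: y_hat = argmax_k delta_k(x), e.g.
# y_hat = max(range(K), key=lambda k: qda_discriminant(x, priors[k], means[k], covs[k]))
```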
Parameter count comparison:
| Model | Parameters | Decision Boundary |
|---|---|---|
| LDA | $Kd + d(d+1)/2$ | Linear (hyperplane) |
| QDA | $Kd + K\,d(d+1)/2$ | Quadratic (conic section) |
| Naive Bayes (Gaussian) | $2Kd$ | Linear (with diagonal $\Sigma$) |
QDA has $K$ times as many covariance parameters as LDA. When $d$ is large relative to $n$, QDA overfits; LDA’s shared covariance provides regularization.
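To make the gap concrete, here is a small sketch of the covariance parameter count (using the $d(d+1)/2$ count per symmetric matrix from the table above):

```python
def covariance_params(K, d, shared):
    """Free covariance parameters: one symmetric d x d matrix (LDA) or K of them (QDA)."""
    per_matrix = d * (d + 1) // 2
    return per_matrix if shared else K * per_matrix

# e.g. K = 10 classes, d = 100 features:
# covariance_params(10, 100, shared=True)  ->  5_050   (LDA)
# covariance_params(10, 100, shared=False) -> 50_500   (QDA)
```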
Connection to Logistic Regression
A fundamental result: LDA’s posterior is logistic regression.
For binary classification with shared covariance:

$$P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(-(w^\top x + b)\right)}$$

where $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $b$ absorbs the constant terms. This is exactly the logistic regression model. LDA and logistic regression produce the same functional form for the posterior, but arrive at it differently (see the sketch after this list):
- LDA estimates $\pi, \mu_0, \mu_1, \Sigma$ by maximum likelihood of the joint $p(x, y)$, then derives $w$ and $b$
- Logistic regression directly estimates $w$ and $b$ by maximum likelihood of the conditional $p(y \mid x)$
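The sketch below, under the same assumptions as earlier (NumPy, fitted binary-class parameters; function names are illustrative), recovers the logistic form $(w, b)$ from the generative parameters:

```python
import numpy as np

def lda_as_logistic(pi1, mu0, mu1, Sigma):
    """Recover (w, b) such that the binary LDA posterior is sigmoid(w @ x + b)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = (-0.5 * mu1 @ Sigma_inv @ mu1
         + 0.5 * mu0 @ Sigma_inv @ mu0
         + np.log(pi1 / (1.0 - pi1)))
    return w, b

def posterior_class1(x, w, b):
    """P(y = 1 | x) as the sigmoid of the linear score."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))
```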
When Does Each Win?
LDA is better when:
- The Gaussian assumption approximately holds
- Training data is limited (LDA’s stronger assumptions provide regularization)
- The number of features is large relative to sample size
Logistic regression is better when:
- The Gaussian assumption is violated (e.g., discrete features, multimodal distributions)
- Training data is abundant (logistic regression’s weaker assumptions become advantageous)
- The goal is to model $p(y \mid x)$ without assumptions about $p(x)$
Ng and Jordan (2001) showed that LDA reaches its asymptotic error rate with $O(\log d)$ samples while logistic regression requires $O(d)$ samples, but logistic regression’s asymptotic error can be lower because it makes fewer assumptions. This is the generative-discriminative tradeoff in action.
Fisher’s Linear Discriminant
Fisher’s formulation of LDA approaches the problem from a dimensionality reduction perspective rather than a probabilistic one. The goal: find the projection direction that maximizes class separation relative to within-class spread.
Between-class scatter: $S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^\top$ (two classes; more generally $S_B = \sum_k n_k (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^\top$)

Within-class scatter: $S_W = \sum_{k} \sum_{i:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\top$

Fisher’s criterion: Maximize the ratio of between-class to within-class variance after projection:

$$J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}$$

The optimal direction is:

$$w^* \propto S_W^{-1}(\mu_1 - \mu_0)$$

This is identical to the LDA weight vector, connecting the probabilistic and geometric viewpoints. For $K$ classes, Fisher’s LDA produces up to $K - 1$ discriminant directions (the leading eigenvectors of $S_W^{-1} S_B$), providing a principled method for dimensionality reduction that preserves class structure.
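A sketch of the multiclass computation via the eigenvectors of $S_W^{-1} S_B$ (NumPy assumed; the function name and `n_components` argument are illustrative):

```python
import numpy as np

def fisher_directions(X, y, n_classes, n_components=None):
    """Leading eigenvectors of S_W^{-1} S_B: up to K-1 discriminant directions."""
    n, d = X.shape
    grand_mean = X.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for k in range(n_classes):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        centered = Xk - mu_k
        S_W += centered.T @ centered              # within-class scatter
        diff = (mu_k - grand_mean)[:, None]
        S_B += len(Xk) * (diff @ diff.T)          # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]        # largest Fisher ratio first
    n_components = n_components or (n_classes - 1)
    return eigvecs.real[:, order[:n_components]]
```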
Regularized Discriminant Analysis
When $d$ is large or $n$ is small, the sample covariance matrix may be singular or poorly conditioned. Regularized Discriminant Analysis (RDA) interpolates between LDA and QDA:

$$\hat{\Sigma}_k(\alpha) = \alpha\, \hat{\Sigma}_k + (1 - \alpha)\, \hat{\Sigma}$$

where $\alpha \in [0, 1]$ controls the interpolation. $\alpha = 0$ gives LDA, $\alpha = 1$ gives QDA. Additionally, shrinkage toward the identity:

$$\hat{\Sigma}(\gamma) = (1 - \gamma)\, \hat{\Sigma} + \gamma\, \frac{\operatorname{tr}(\hat{\Sigma})}{d}\, I$$

Both $\alpha$ and $\gamma$ are tuned via cross-validation. This is closely related to the Ledoit-Wolf shrinkage estimator used in portfolio optimization and other high-dimensional covariance estimation problems.
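A minimal sketch of the regularized covariance, using the $\alpha$ and $\gamma$ conventions above (the function name is illustrative):

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha, gamma):
    """Class-k covariance blended toward the pooled estimate, then toward the identity.

    alpha = 0 recovers LDA, alpha = 1 recovers QDA; gamma shrinks toward sigma^2 * I.
    """
    d = Sigma_pooled.shape[0]
    blended = alpha * Sigma_k + (1.0 - alpha) * Sigma_pooled
    sigma2 = np.trace(blended) / d    # average variance sets the identity scale
    return (1.0 - gamma) * blended + gamma * sigma2 * np.eye(d)
```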
Practical Considerations
Feature scaling. Unlike logistic regression, GDA is not invariant to feature scaling because the covariance matrix explicitly models feature variances. However, the MLE automatically adapts to different scales through $\hat{\Sigma}$, so scaling is less critical than for distance-based methods like KNN.
Computational cost. Training: $O(nd^2 + d^3)$ (covariance estimation + inversion). Inference: $O(d^2)$ per example (matrix-vector product with $\Sigma^{-1}$). For high-dimensional data, the inversion cost dominates; use regularization or dimensionality reduction first.
When classes are not Gaussian. GDA degrades when the Gaussian assumption is strongly violated. Transformations (log, Box-Cox) can improve normality. For fundamentally non-Gaussian data, discriminative models (logistic regression, random forests) are preferred.
Summary
| Model | Covariance | Boundary | Parameters | Best When |
|---|---|---|---|---|
| LDA | Shared | Linear | $Kd + d(d+1)/2$ | Limited data, approximately Gaussian |
| QDA | Per-class | Quadratic | $Kd + K\,d(d+1)/2$ | Abundant data, different class shapes |
| Gaussian NB | Diagonal | Linear | $2Kd$ | High-dimensional, independent features |
| Logistic Reg | N/A (discriminative) | Linear | $d + 1$ | Non-Gaussian, abundant data |
GDA provides a complete generative model of the data, enabling density estimation, anomaly detection (via Mahalanobis distance), and synthetic data generation in addition to classification. Its connection to logistic regression through the exponential family illuminates a fundamental principle: generative and discriminative approaches can produce the same classifier, but arrive at it through different assumptions and with different sample complexity guarantees.
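For example, anomaly detection with a fitted model amounts to thresholding the Mahalanobis distance to every class; a minimal sketch (the threshold and names are illustrative):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance of x from a fitted Gaussian N(mu, Sigma)."""
    diff = x - mu
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# Flag x as anomalous if it lies far from every class, e.g.
# is_outlier = all(mahalanobis(x, means[k], covs[k]) > threshold for k in range(K))
```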