Gaussian Discriminant Analysis

Gaussian Discriminant Analysis (GDA) is a generative classifier that models each class as a multivariate Gaussian distribution. It provides a principled probabilistic framework where classification reduces to comparing Gaussian densities, and its relationship to logistic regression reveals when generative modeling is advantageous.


Model Specification

GDA assumes the following generative process:

  1. The class label is drawn from a categorical prior: $Y \sim \text{Categorical}(\boldsymbol{\pi})$, where $\pi_k = P(Y = k)$
  2. Given the class, the features are drawn from a class-specific Gaussian: $\mathbf{X} \mid Y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

The class-conditional density is:

$$P(\mathbf{x} \mid Y = k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)$$

Classification uses Bayes’ rule:

$$\hat{y} = \arg\max_k P(Y = k \mid \mathbf{x}) = \arg\max_k \left[\log P(\mathbf{x} \mid Y = k) + \log P(Y = k)\right]$$
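
As a concrete illustration, the rule above can be evaluated directly from per-class parameters. The sketch below assumes NumPy and SciPy are available; the names `means`, `covs`, and `priors` are illustrative placeholders for fitted parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_predict(x, means, covs, priors):
    """Pick argmax_k [log P(x | Y=k) + log P(Y=k)] for a single input x."""
    scores = [
        multivariate_normal.logpdf(x, mean=mu, cov=Sigma) + np.log(pi)
        for mu, Sigma, pi in zip(means, covs, priors)
    ]
    return int(np.argmax(scores))
```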

Linear Discriminant Analysis (LDA)

LDA constrains all classes to share a common covariance matrix: $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ for all $k$.

Decision Boundary

The log-posterior difference between classes $k$ and $l$ is:

$$\log \frac{P(Y = k \mid \mathbf{x})}{P(Y = l \mid \mathbf{x})} = \log \frac{P(\mathbf{x} \mid Y = k)}{P(\mathbf{x} \mid Y = l)} + \log \frac{\pi_k}{\pi_l}$$

With shared covariance, the quadratic terms $\mathbf{x}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}$ cancel, yielding:

$$= (\boldsymbol{\mu}_k - \boldsymbol{\mu}_l)^\top \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2}\left(\boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \boldsymbol{\mu}_l^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_l\right) + \log \frac{\pi_k}{\pi_l}$$

This is linear in $\mathbf{x}$. The decision boundary $\{\mathbf{x} : P(Y = k \mid \mathbf{x}) = P(Y = l \mid \mathbf{x})\}$ is therefore a hyperplane. Hence the name “linear” discriminant analysis.

Parameter Estimation (MLE)

Given training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$ with $N_k$ examples in class $k$:

Prior: $\hat{\pi}_k = \frac{N_k}{N}$

Class means: $\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k} \sum_{i: y^{(i)} = k} \mathbf{x}^{(i)}$

Shared covariance: $\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i: y^{(i)} = k} (\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_k)^\top$

All estimates are closed-form. Training is $O(ND^2)$, dominated by computing the $D \times D$ covariance matrix.
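
A minimal sketch of these closed-form estimates, assuming NumPy and integer labels 0..K-1 (the function name is illustrative):

```python
import numpy as np

def fit_lda(X, y):
    """MLE for LDA: class priors, class means, and the pooled (shared) covariance."""
    N, D = X.shape
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])        # pi_k = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((D, D))
    for k, mu in zip(classes, means):
        Xc = X[y == k] - mu                                      # center within class
        Sigma += Xc.T @ Xc                                       # accumulate scatter
    return priors, means, Sigma / N                              # pooled covariance
```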


Quadratic Discriminant Analysis (QDA)

QDA allows each class to have its own covariance matrix $\boldsymbol{\Sigma}_k$. The quadratic terms $\mathbf{x}^\top \boldsymbol{\Sigma}_k^{-1} \mathbf{x}$ no longer cancel between classes, producing a quadratic decision boundary.

The discriminant function for class $k$ is:

$$\delta_k(\mathbf{x}) = -\frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}(\mathbf{x} - \boldsymbol{\mu}_k) + \log \pi_k$$
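
A sketch of the discriminant, assuming NumPy; `slogdet` and `solve` avoid forming an explicit inverse:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log pi_k"""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)         # numerically stable log-determinant
    maha = diff @ np.linalg.solve(Sigma_k, diff)   # squared Mahalanobis distance
    return -0.5 * logdet - 0.5 * maha + np.log(pi_k)
```

Predicting a label amounts to evaluating $\delta_k(\mathbf{x})$ for each class and taking the argmax.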

Parameter count comparison:

| Model | Parameters | Decision Boundary |
|---|---|---|
| LDA | $KD + D(D+1)/2 + K - 1$ | Linear (hyperplane) |
| QDA | $KD + KD(D+1)/2 + K - 1$ | Quadratic (conic section) |
| Naive Bayes (Gaussian) | $2KD + K - 1$ | Quadratic (linear if variances are shared across classes) |

QDA has $K$ times as many covariance parameters as LDA. When $D$ is large relative to $N$, QDA overfits; LDA’s shared covariance provides regularization.


Connection to Logistic Regression

A fundamental result: LDA’s posterior is logistic regression.

For binary classification with shared covariance:

$$\log \frac{P(Y = 1 \mid \mathbf{x})}{P(Y = 0 \mid \mathbf{x})} = \mathbf{w}^\top \mathbf{x} + b$$

where $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ and $b$ absorbs the constant terms. This is exactly the logistic regression model. LDA and logistic regression produce the same functional form for the posterior, but arrive at it differently:

  • LDA estimates $\boldsymbol{\mu}_0, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}$ by maximum likelihood of the joint $P(\mathbf{x}, y)$, then derives $\mathbf{w}$ and $b$ (see the sketch below)
  • Logistic regression directly estimates $\mathbf{w}$ and $b$ by maximum likelihood of the conditional $P(y \mid \mathbf{x})$
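
The following sketch makes the correspondence explicit: it converts fitted LDA parameters into the $(\mathbf{w}, b)$ of the induced logistic posterior (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def lda_to_logistic(mu0, mu1, Sigma, pi0, pi1):
    """Return (w, b) such that log P(Y=1|x)/P(Y=0|x) = w @ x + b under the LDA model."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + np.log(pi1 / pi0)
    return w, b  # P(Y=1|x) is then sigmoid(w @ x + b)
```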

When Does Each Win?

LDA is better when:

  • The Gaussian assumption approximately holds
  • Training data is limited (LDA’s stronger assumptions provide regularization)
  • The number of features $D$ is large relative to the sample size $N$

Logistic regression is better when:

  • The Gaussian assumption is violated (e.g., discrete features, multimodal distributions)
  • Training data is abundant (logistic regression’s weaker assumptions become advantageous)
  • The goal is to model $P(Y \mid \mathbf{x})$ without assumptions about $P(\mathbf{x})$

Ng and Jordan (2001) showed that LDA reaches its asymptotic error rate with $O(\log N)$ samples while logistic regression requires $O(N)$, but logistic regression’s asymptotic error can be lower because it makes fewer assumptions. This is the generative-discriminative tradeoff in action.


Fisher’s Linear Discriminant

Fisher’s formulation of LDA approaches the problem from a dimensionality reduction perspective rather than a probabilistic one. The goal: find the projection direction w\mathbf{w} that maximizes class separation relative to within-class spread.

Between-class scatter: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^\top$

Within-class scatter: $\mathbf{S}_W = \sum_{k=0}^{1} \sum_{i: y^{(i)} = k} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^\top$

Fisher’s criterion: Maximize the ratio of between-class to within-class variance after projection:

$$J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}}$$

The optimal direction is: $\mathbf{w}^* = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$

Since $\mathbf{S}_W$ is proportional to the pooled covariance estimate $\hat{\boldsymbol{\Sigma}}$, this coincides with the LDA weight vector $\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ up to scale, connecting the probabilistic and geometric viewpoints. For $K > 2$ classes, Fisher’s LDA produces up to $K - 1$ discriminant directions (the leading eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$), providing a principled method for dimensionality reduction that preserves class structure.
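
A binary-case sketch of the Fisher direction, assuming NumPy and labels in {0, 1} (the function name is illustrative):

```python
import numpy as np

def fisher_direction(X, y):
    """Compute w* = S_W^{-1} (mu_1 - mu_0), returned as a unit vector (scale does not affect J)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    S_W = np.zeros((X.shape[1], X.shape[1]))
    for k, mu in ((0, mu0), (1, mu1)):
        Xc = X[y == k] - mu                   # center each class
        S_W += Xc.T @ Xc                      # within-class scatter
    w = np.linalg.solve(S_W, mu1 - mu0)       # solve S_W w = mu_1 - mu_0
    return w / np.linalg.norm(w)
```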


Regularized Discriminant Analysis

When $D$ is large or $N$ is small, the sample covariance matrix $\hat{\boldsymbol{\Sigma}}$ may be singular or poorly conditioned. Regularized Discriminant Analysis (RDA) interpolates between LDA and QDA:

$$\hat{\boldsymbol{\Sigma}}_k(\alpha) = \alpha \hat{\boldsymbol{\Sigma}}_k + (1 - \alpha) \hat{\boldsymbol{\Sigma}}$$

where $\alpha \in [0, 1]$ controls the interpolation: $\alpha = 0$ gives LDA and $\alpha = 1$ gives QDA. Additionally, the covariance estimate can be shrunk toward a scaled identity:

$$\hat{\boldsymbol{\Sigma}}(\gamma) = \gamma \hat{\boldsymbol{\Sigma}} + (1 - \gamma) \frac{\text{tr}(\hat{\boldsymbol{\Sigma}})}{D} \mathbf{I}$$

Both $\alpha$ and $\gamma$ are tuned via cross-validation. This is closely related to the Ledoit-Wolf shrinkage estimator used in portfolio optimization and other high-dimensional covariance estimation problems.
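
A sketch of the regularized covariance for one class, assuming NumPy (the function name is illustrative, and the shrinkage step is applied after the LDA/QDA interpolation):

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha, gamma):
    """Interpolate between pooled and per-class covariance, then shrink toward a scaled identity."""
    Sigma = alpha * Sigma_k + (1 - alpha) * Sigma_pooled   # alpha=0 -> LDA, alpha=1 -> QDA
    D = Sigma.shape[0]
    return gamma * Sigma + (1 - gamma) * (np.trace(Sigma) / D) * np.eye(D)
```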


Practical Considerations

Feature scaling. Unlike logistic regression, GDA is not invariant to feature scaling because the covariance matrix explicitly models feature variances. However, the MLE automatically adapts to different scales through $\hat{\boldsymbol{\Sigma}}$, so scaling is less critical than for distance-based methods like KNN.

Computational cost. Training: $O(ND^2 + D^3)$ (covariance estimation plus inversion). Inference: $O(D^2)$ per example (a matrix-vector product with $\boldsymbol{\Sigma}^{-1}$). For high-dimensional data, the $D^3$ inversion cost dominates; use regularization or dimensionality reduction first.

When classes are not Gaussian. GDA degrades when the Gaussian assumption is strongly violated. Transformations (log, Box-Cox) can improve normality. For fundamentally non-Gaussian data, discriminative models (logistic regression, random forests) are preferred.


Summary

| Model | Covariance | Boundary | Parameters | Best When |
|---|---|---|---|---|
| LDA | Shared $\boldsymbol{\Sigma}$ | Linear | $O(D^2)$ | Limited data, approximately Gaussian |
| QDA | Per-class $\boldsymbol{\Sigma}_k$ | Quadratic | $O(KD^2)$ | Abundant data, different class shapes |
| Gaussian NB | Diagonal $\boldsymbol{\Sigma}_k$ | Quadratic (linear if variances are shared) | $O(KD)$ | High-dimensional, independent features |
| Logistic Reg | N/A (discriminative) | Linear | $O(D)$ | Non-Gaussian, abundant data |

GDA provides a complete generative model of the data, enabling density estimation, anomaly detection (via Mahalanobis distance), and synthetic data generation in addition to classification. Its connection to logistic regression through the exponential family illuminates a fundamental principle: generative and discriminative approaches can produce the same classifier, but arrive at it through different assumptions and with different sample complexity guarantees.