Gaussian Discriminant Analysis

Gaussian Discriminant Analysis (GDA) is a generative classifier that models each class as a multivariate Gaussian distribution. It provides a principled probabilistic framework where classification reduces to comparing Gaussian densities, and its relationship to logistic regression reveals when generative modeling is advantageous.


Model Specification

GDA assumes the following generative process:

  1. The class label is drawn from a categorical prior: $Y \sim \text{Categorical}(\boldsymbol{\pi})$, where $\pi_k = P(Y = k)$
  2. Given the class, the features are drawn from a class-specific Gaussian: $\mathbf{X} \mid Y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

The class-conditional density is:

$$P(\mathbf{x} \mid Y = k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)$$

Classification uses Bayes’ rule:

$$\hat{y} = \arg\max_k P(Y = k \mid \mathbf{x}) = \arg\max_k \left[\log P(\mathbf{x} \mid Y = k) + \log P(Y = k)\right]$$
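
As a concrete illustration, the rule above can be evaluated directly from per-class parameters. The sketch below assumes NumPy and SciPy are available; the names `means`, `covs`, and `priors` are illustrative placeholders for fitted parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_predict(x, means, covs, priors):
    """Pick argmax_k [log P(x | Y=k) + log P(Y=k)] for a single input x."""
    scores = [
        multivariate_normal.logpdf(x, mean=mu, cov=Sigma) + np.log(pi)
        for mu, Sigma, pi in zip(means, covs, priors)
    ]
    return int(np.argmax(scores))
```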

Linear Discriminant Analysis (LDA)

LDA constrains all classes to share a common covariance matrix: $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ for all $k$.

Decision Boundary

The log-posterior difference between classes $k$ and $l$ is:

$$\log \frac{P(Y = k \mid \mathbf{x})}{P(Y = l \mid \mathbf{x})} = \log \frac{P(\mathbf{x} \mid Y = k)}{P(\mathbf{x} \mid Y = l)} + \log \frac{\pi_k}{\pi_l}$$

With shared covariance, the quadratic terms $\mathbf{x}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}$ cancel, yielding:

$$= (\boldsymbol{\mu}_k - \boldsymbol{\mu}_l)^\top \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2}\left(\boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \boldsymbol{\mu}_l^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_l\right) + \log \frac{\pi_k}{\pi_l}$$

This is linear in $\mathbf{x}$. The decision boundary $\{\mathbf{x} : P(Y = k \mid \mathbf{x}) = P(Y = l \mid \mathbf{x})\}$ is therefore a hyperplane. Hence the name “linear” discriminant analysis.

Parameter Estimation (MLE)

Given training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$ with $N_k$ examples in class $k$:

Prior: $\hat{\pi}_k = \frac{N_k}{N}$

Class means: $\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k} \sum_{i: y^{(i)} = k} \mathbf{x}^{(i)}$

Shared covariance: $\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i: y^{(i)} = k} (\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_k)^\top$

All estimates are closed-form. Training is $O(ND^2)$, dominated by computing the $D \times D$ covariance matrix.
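
A minimal sketch of these closed-form estimates, assuming NumPy and integer labels 0..K-1 (the function name is illustrative):

```python
import numpy as np

def fit_lda(X, y):
    """MLE for LDA: class priors, class means, and the pooled (shared) covariance."""
    N, D = X.shape
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])        # pi_k = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((D, D))
    for k, mu in zip(classes, means):
        Xc = X[y == k] - mu                                      # center within class
        Sigma += Xc.T @ Xc                                       # accumulate scatter
    return priors, means, Sigma / N                              # pooled covariance
```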


Quadratic Discriminant Analysis (QDA)

QDA allows each class to have its own covariance matrix $\boldsymbol{\Sigma}_k$. The quadratic terms $\mathbf{x}^\top \boldsymbol{\Sigma}_k^{-1} \mathbf{x}$ no longer cancel between classes, producing a quadratic decision boundary.

The discriminant function for class $k$ is:

$$\delta_k(\mathbf{x}) = -\frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}(\mathbf{x} - \boldsymbol{\mu}_k) + \log \pi_k$$
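
A sketch of the discriminant, assuming NumPy; `slogdet` and `solve` avoid forming an explicit inverse:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log pi_k"""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)         # numerically stable log-determinant
    maha = diff @ np.linalg.solve(Sigma_k, diff)   # squared Mahalanobis distance
    return -0.5 * logdet - 0.5 * maha + np.log(pi_k)
```

Predicting a label amounts to evaluating $\delta_k(\mathbf{x})$ for each class and taking the argmax.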

Parameter count comparison:

| Model | Parameters | Decision Boundary |
|---|---|---|
| LDA | $KD + D(D+1)/2 + K - 1$ | Linear (hyperplane) |
| QDA | $KD + KD(D+1)/2 + K - 1$ | Quadratic (conic section) |
| Naive Bayes (Gaussian) | $2KD + K - 1$ | Quadratic (linear if variances are shared across classes) |

QDA has $K$ times as many covariance parameters as LDA. When $D$ is large relative to $N$, QDA overfits; LDA’s shared covariance provides regularization.


Connection to Logistic Regression

A fundamental result: LDA’s posterior is logistic regression.

For binary classification with shared covariance:

$$\log \frac{P(Y = 1 \mid \mathbf{x})}{P(Y = 0 \mid \mathbf{x})} = \mathbf{w}^\top \mathbf{x} + b$$

where $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ and $b$ absorbs the constant terms. This is exactly the logistic regression model. LDA and logistic regression produce the same functional form for the posterior, but arrive at it differently:

  • LDA estimates $\boldsymbol{\mu}_0, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}$ by maximum likelihood of the joint $P(\mathbf{x}, y)$, then derives $\mathbf{w}$ and $b$ (see the sketch below)
  • Logistic regression directly estimates $\mathbf{w}$ and $b$ by maximum likelihood of the conditional $P(y \mid \mathbf{x})$
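
The following sketch makes the correspondence explicit: it converts fitted LDA parameters into the $(\mathbf{w}, b)$ of the induced logistic posterior (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def lda_to_logistic(mu0, mu1, Sigma, pi0, pi1):
    """Return (w, b) such that log P(Y=1|x)/P(Y=0|x) = w @ x + b under the LDA model."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + np.log(pi1 / pi0)
    return w, b  # P(Y=1|x) is then sigmoid(w @ x + b)
```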

When Does Each Win?

LDA is better when:

  • The Gaussian assumption approximately holds
  • Training data is limited (LDA’s stronger assumptions provide regularization)
  • The number of features $D$ is large relative to the sample size $N$

Logistic regression is better when:

  • The Gaussian assumption is violated (e.g., discrete features, multimodal distributions)
  • Training data is abundant (logistic regression’s weaker assumptions become advantageous)
  • The goal is to model $P(Y \mid \mathbf{x})$ without assumptions about $P(\mathbf{x})$

Ng and Jordan (2001) showed that LDA reaches its asymptotic error rate with $O(\log N)$ samples while logistic regression requires $O(N)$, but logistic regression’s asymptotic error can be lower because it makes fewer assumptions. This is the generative-discriminative tradeoff in action.


Fisher’s Linear Discriminant

Fisher’s formulation of LDA approaches the problem from a dimensionality reduction perspective rather than a probabilistic one. The goal: find the projection direction w\mathbf{w} that maximizes class separation relative to within-class spread.

Between-class scatter: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^\top$

Within-class scatter: $\mathbf{S}_W = \sum_{k=0}^{1} \sum_{i: y^{(i)} = k} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^\top$

Fisher’s criterion: Maximize the ratio of between-class to within-class variance after projection:

$$J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}}$$

The optimal direction is: $\mathbf{w}^* = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$

Since $\mathbf{S}_W$ is proportional to the pooled covariance estimate $\hat{\boldsymbol{\Sigma}}$, this coincides with the LDA weight vector $\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ up to scale, connecting the probabilistic and geometric viewpoints. For $K > 2$ classes, Fisher’s LDA produces up to $K - 1$ discriminant directions (the leading eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$), providing a principled method for dimensionality reduction that preserves class structure.
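
A binary-case sketch of the Fisher direction, assuming NumPy and labels in {0, 1} (the function name is illustrative):

```python
import numpy as np

def fisher_direction(X, y):
    """Compute w* = S_W^{-1} (mu_1 - mu_0), returned as a unit vector (scale does not affect J)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    S_W = np.zeros((X.shape[1], X.shape[1]))
    for k, mu in ((0, mu0), (1, mu1)):
        Xc = X[y == k] - mu                   # center each class
        S_W += Xc.T @ Xc                      # within-class scatter
    w = np.linalg.solve(S_W, mu1 - mu0)       # solve S_W w = mu_1 - mu_0
    return w / np.linalg.norm(w)
```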


Regularized Discriminant Analysis

When $D$ is large or $N$ is small, the sample covariance matrix $\hat{\boldsymbol{\Sigma}}$ may be singular or poorly conditioned. Regularized Discriminant Analysis (RDA) interpolates between LDA and QDA:

$$\hat{\boldsymbol{\Sigma}}_k(\alpha) = \alpha \hat{\boldsymbol{\Sigma}}_k + (1 - \alpha) \hat{\boldsymbol{\Sigma}}$$

where $\alpha \in [0, 1]$ controls the interpolation: $\alpha = 0$ gives LDA and $\alpha = 1$ gives QDA. Additionally, the covariance estimate can be shrunk toward a scaled identity:

$$\hat{\boldsymbol{\Sigma}}(\gamma) = \gamma \hat{\boldsymbol{\Sigma}} + (1 - \gamma) \frac{\text{tr}(\hat{\boldsymbol{\Sigma}})}{D} \mathbf{I}$$

Both $\alpha$ and $\gamma$ are tuned via cross-validation. This is closely related to the Ledoit-Wolf shrinkage estimator used in portfolio optimization and other high-dimensional covariance estimation problems.
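
A sketch of the regularized covariance for one class, assuming NumPy (the function name is illustrative, and the shrinkage step is applied after the LDA/QDA interpolation):

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha, gamma):
    """Interpolate between pooled and per-class covariance, then shrink toward a scaled identity."""
    Sigma = alpha * Sigma_k + (1 - alpha) * Sigma_pooled   # alpha=0 -> LDA, alpha=1 -> QDA
    D = Sigma.shape[0]
    return gamma * Sigma + (1 - gamma) * (np.trace(Sigma) / D) * np.eye(D)
```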


Practical Considerations

Feature scaling. Unlike logistic regression, GDA is not invariant to feature scaling because the covariance matrix explicitly models feature variances. However, the MLE automatically adapts to different scales through $\hat{\boldsymbol{\Sigma}}$, so scaling is less critical than for distance-based methods like KNN.

Computational cost. Training: $O(ND^2 + D^3)$ (covariance estimation plus inversion). Inference: $O(D^2)$ per example (a matrix-vector product with $\boldsymbol{\Sigma}^{-1}$). For high-dimensional data, the $D^3$ inversion cost dominates; use regularization or dimensionality reduction first.

When classes are not Gaussian. GDA degrades when the Gaussian assumption is strongly violated. Transformations (log, Box-Cox) can improve normality. For fundamentally non-Gaussian data, discriminative models (logistic regression, random forests) are preferred.


Summary

| Model | Covariance | Boundary | Parameters | Best When |
|---|---|---|---|---|
| LDA | Shared $\boldsymbol{\Sigma}$ | Linear | $O(D^2)$ | Limited data, approximately Gaussian |
| QDA | Per-class $\boldsymbol{\Sigma}_k$ | Quadratic | $O(KD^2)$ | Abundant data, different class shapes |
| Gaussian NB | Diagonal $\boldsymbol{\Sigma}_k$ | Quadratic (linear if variances are shared) | $O(KD)$ | High-dimensional, independent features |
| Logistic Reg | N/A (discriminative) | Linear | $O(D)$ | Non-Gaussian, abundant data |

GDA provides a complete generative model of the data, enabling density estimation, anomaly detection (via Mahalanobis distance), and synthetic data generation in addition to classification. Its connection to logistic regression through the exponential family illuminates a fundamental principle: generative and discriminative approaches can produce the same classifier, but arrive at it through different assumptions and with different sample complexity guarantees.