Naive Bayes

Naive Bayes classifiers apply Bayes’ theorem with the assumption that features are conditionally independent given the class label. Despite this strong assumption rarely holding in practice, Naive Bayes classifiers are competitive for text classification, spam filtering, and other high-dimensional problems where the independence assumption provides beneficial regularization.


Generative vs Discriminative Models

Classification models fall into two categories based on what they learn:

Discriminative models learn the posterior P(Y \mid \mathbf{X}) directly, and with it the decision boundary. Examples: logistic regression, SVMs, neural networks. They model “given these features, what’s the most likely class?”

Generative models learn the joint distribution P(\mathbf{X}, Y) = P(\mathbf{X} \mid Y) P(Y), then apply Bayes’ theorem to classify. Examples: Naive Bayes, Gaussian Discriminant Analysis, Hidden Markov Models. They model “what does data from each class look like?”

Generative models provide richer information (they can generate synthetic examples from each class) but require stronger assumptions about the data distribution. When those assumptions approximately hold, generative models can outperform discriminative models, especially with limited training data (Ng and Jordan, 2001).


Bayes’ Theorem for Classification

Given a feature vector \mathbf{x} and class label y:

P(Y = y \mid \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} \mid Y = y) \cdot P(Y = y)}{P(\mathbf{X} = \mathbf{x})}

The components:

  • P(Y = y): prior probability of class y
  • P(\mathbf{X} = \mathbf{x} \mid Y = y): likelihood of observing \mathbf{x} given class y
  • P(\mathbf{X} = \mathbf{x}): evidence (normalizing constant, same for all classes)
  • P(Y = y \mid \mathbf{X} = \mathbf{x}): posterior probability

For classification, we choose the class with the highest posterior:

\hat{y} = \arg\max_y P(Y = y \mid \mathbf{x}) = \arg\max_y P(\mathbf{x} \mid Y = y) \cdot P(Y = y)

The evidence P(\mathbf{x}) is dropped because it’s constant across classes and doesn’t affect the argmax.
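As a toy numeric illustration (the numbers here are invented for this example): suppose two classes have priors P(Y = 1) = 0.7 and P(Y = 2) = 0.3, and the observed \mathbf{x} has likelihoods P(\mathbf{x} \mid Y = 1) = 0.2 and P(\mathbf{x} \mid Y = 2) = 0.5. The unnormalized posteriors are 0.7 \cdot 0.2 = 0.14 and 0.3 \cdot 0.5 = 0.15, so the argmax selects class 2 despite its smaller prior; dividing both by the evidence P(\mathbf{x}) = 0.29 would rescale them to about 0.48 and 0.52 without changing the decision.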


The Naive Bayes Assumption

The likelihood P(\mathbf{x} \mid Y = y) for a D-dimensional feature vector requires modeling a D-dimensional distribution per class. Without simplifying assumptions, the number of parameters grows exponentially with D (for binary features, the full joint table needs 2^D - 1 entries per class).

The naive Bayes assumption: features are conditionally independent given the class label:

P(\mathbf{x} \mid Y = y) = \prod_{j=1}^{D} P(x_j \mid Y = y)

This reduces the D-dimensional density estimation problem to D one-dimensional problems. The classifier becomes:

\hat{y} = \arg\max_y P(Y = y) \prod_{j=1}^{D} P(x_j \mid Y = y)

In log space (for numerical stability):

\hat{y} = \arg\max_y \left[\log P(Y = y) + \sum_{j=1}^{D} \log P(x_j \mid Y = y)\right]
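A minimal sketch of this log-space rule in Python, assuming the per-class log priors and a per-feature log-likelihood function have already been estimated (all names are illustrative, not from the text):

```python
import numpy as np

def predict(x, log_prior, log_likelihood):
    """Pick the class maximizing log P(Y = y) + sum_j log P(x_j | Y = y).

    x:              feature vector of length D
    log_prior:      array of shape (K,) with log P(Y = y) per class
    log_likelihood: callable (j, x_j, y) -> log P(x_j | Y = y)
    """
    scores = np.array([
        log_prior[y] + sum(log_likelihood(j, x[j], y) for j in range(len(x)))
        for y in range(len(log_prior))
    ])
    return int(np.argmax(scores))
```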

Variants by Feature Distribution

The choice of P(x_j \mid Y = y) depends on the feature type.

Gaussian Naive Bayes

For continuous features, assume each feature follows a Gaussian distribution per class:

P(x_j \mid Y = y) = \frac{1}{\sqrt{2\pi \sigma_{jy}^2}} \exp\left(-\frac{(x_j - \mu_{jy})^2}{2\sigma_{jy}^2}\right)

Parameters \mu_{jy} and \sigma_{jy}^2 are estimated by maximum likelihood (the sample mean and variance of feature j within class y). Total parameters: 2DK, where K is the number of classes (a mean and a variance per feature per class).
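A sketch of the closed-form estimation and per-class scoring in NumPy (the variance floor `var_smoothing` is an illustrative safeguard, not part of the formulas above):

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Closed-form estimates for Gaussian Naive Bayes.

    X: (N, D) array of continuous features; y: (N,) array of class labels.
    var_smoothing adds a small variance floor so constant features do not
    cause a division by zero.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])            # (K,)
    means = np.array([X[y == c].mean(axis=0) for c in classes])      # (K, D)
    variances = np.array([X[y == c].var(axis=0) for c in classes])   # (K, D)
    variances += var_smoothing * X.var(axis=0).max()
    return classes, priors, means, variances

def gaussian_log_likelihood(x, means, variances):
    """Per-class sum_j log N(x_j; mu_jy, sigma_jy^2), returned as shape (K,)."""
    return (-0.5 * np.log(2 * np.pi * variances)
            - (x - means) ** 2 / (2 * variances)).sum(axis=1)
```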

Multinomial Naive Bayes

Used for count-valued features (e.g., word counts in text): each document is modeled as a bag of words drawn from a class-specific multinomial:

P(\mathbf{x} \mid Y = y) \propto \prod_{j=1}^{D} \theta_{jy}^{x_j}

where \theta_{jy} = P(\text{word } j \mid \text{class } y) is estimated from N_{jy}, the total count of word j across training documents of class y:

\hat{\theta}_{jy} = \frac{N_{jy} + \alpha}{\sum_{j'=1}^{D} N_{j'y} + D\alpha}

The \alpha term is additive (Laplace) smoothing: it prevents zero probabilities for words that appear in test documents but not in training documents of class y. With \alpha = 0, a single unseen word drives the entire class probability to zero; \alpha = 1 is the standard default.
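A sketch of the smoothed estimate in NumPy, under the bag-of-words setup above (function and variable names are illustrative):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Laplace-smoothed multinomial parameter estimates.

    X: (N, D) matrix of word counts; y: (N,) class labels.
    Returns the classes, log priors (K,), and log theta (K, D) for the log-space rule.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # N_jy: total count of word j over all training documents of class y
    counts = np.array([X[y == c].sum(axis=0) for c in classes])       # (K, D)
    smoothed = counts + alpha
    log_theta = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_theta

# Prediction for a count vector x: classes[np.argmax(log_prior + x @ log_theta.T)]
```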

Bernoulli Naive Bayes

For binary features (word present/absent, rather than word count):

P(x_j \mid Y = y) = \theta_{jy}^{x_j} (1 - \theta_{jy})^{1 - x_j}

Bernoulli NB explicitly models the absence of features, which Multinomial NB does not. For short documents, this can improve performance because the absence of a word carries signal.
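A short illustrative NumPy sketch of the corresponding log-likelihood, highlighting the explicit absence term:

```python
import numpy as np

def bernoulli_log_likelihood(x, theta):
    """Per-class log P(x | y) under Bernoulli NB.

    x:     (D,) binary vector (word present / absent)
    theta: (K, D) per-class presence probabilities theta_jy
    The (1 - x) term is the explicit "absence" contribution log(1 - theta_jy).
    """
    return (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
```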


Parameter Estimation

Prior. \hat{P}(Y = y) = \frac{N_y}{N}, where N_y is the number of training examples in class y and N is the total number of training examples.

Likelihood parameters. Estimated from class-conditional statistics:

Variant      | Parameter                 | Estimate
Gaussian     | \mu_{jy}, \sigma_{jy}^2   | Sample mean and variance of feature j in class y
Multinomial  | \theta_{jy}               | Smoothed word frequency (see above)
Bernoulli    | \theta_{jy}               | Fraction of class-y documents containing word j

All estimates are closed-form (no iterative optimization). Training is a single pass through the data: O(ND) time and O(DK) space.


Text Classification

Naive Bayes is a foundational baseline for text classification. The standard pipeline:

  1. Tokenize documents into words
  2. Vectorize using bag-of-words or TF-IDF representation
  3. Train Multinomial NB on word count features with Laplace smoothing
  4. Classify by computing log-posterior for each class

For spam filtering, sentiment analysis, and topic classification, Multinomial NB achieves competitive accuracy with minimal training time. It serves as a strong baseline before exploring more complex models.
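A minimal version of this pipeline, sketched with scikit-learn’s CountVectorizer and MultinomialNB (the two example documents and labels are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder documents and labels; substitute your own corpus.
docs = ["cheap meds online now", "meeting moved to tuesday"]
labels = ["spam", "ham"]

model = make_pipeline(
    CountVectorizer(),          # steps 1-2: tokenize into bag-of-words counts
    MultinomialNB(alpha=1.0),   # step 3: Multinomial NB with Laplace smoothing
)
model.fit(docs, labels)
print(model.predict(["free meds at the meeting"]))  # step 4: log-posterior argmax
```

Swapping CountVectorizer for TfidfVectorizer gives the TF-IDF variant of step 2.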

Connection to the tracker cost model. The TF-IDF URL embeddings in the tracker paper use a similar bag-of-tokens representation: URL paths are tokenized on delimiters, converted to TF-IDF vectors, and reduced via SVD. The key difference is that the tracker model uses these as features for a gradient boosted regressor rather than for direct Bayesian classification.


Why Naive Bayes Works Despite Wrong Assumptions

The conditional independence assumption is almost always violated in real data. Features are correlated: in text, the words “machine” and “learning” co-occur far more than independence predicts. Yet Naive Bayes often classifies correctly. Two explanations:

The posterior is what matters, not the likelihood. Even if the estimated P(\mathbf{x} \mid Y) is a poor model of the true class-conditional density, the resulting posterior P(Y \mid \mathbf{x}) may still place the decision boundary in the right location. Classification accuracy depends on the decision boundary, not on the quality of the probability estimates.

Bias-variance perspective (Domingos and Pazzani, 1997). The independence assumption introduces bias (the model is misspecified) but dramatically reduces variance (far fewer parameters to estimate). In high-dimensional, limited-data settings, this tradeoff favors Naive Bayes. As training data grows, discriminative models with fewer assumptions eventually overtake it.


Limitations

Probability calibration. Naive Bayes produces poorly calibrated probability estimates: the predicted P(Y \mid \mathbf{x}) values tend to be pushed toward 0 and 1, because the independence assumption multiplies many (often correlated) factors. The ranking is often correct (the class with the highest estimated probability is often the right one), but the probability magnitudes are unreliable. Platt scaling or isotonic regression can recalibrate the outputs if calibrated probabilities are needed.
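A sketch of recalibration with scikit-learn’s CalibratedClassifierCV, on synthetic data (the dataset and the choice of isotonic regression are illustrative assumptions):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

raw = GaussianNB().fit(X, y)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X, y)

print(raw.predict_proba(X[:3]))         # typically pushed toward 0 and 1
print(calibrated.predict_proba(X[:3]))  # usually less extreme after recalibration
```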

Correlated features. When features are strongly correlated, Naive Bayes double-counts evidence. Feature selection or PCA before Naive Bayes can mitigate this.

Continuous features. Gaussian NB assumes unimodal, symmetric distributions. For multimodal or skewed features, kernel density estimation or discretization into bins may be needed.
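One possible mitigation, sketched with scikit-learn: quantile-bin the skewed features and fit a categorical NB on the bin indices (the synthetic data and bin count are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic skewed features and toy labels, purely for illustration.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 3))
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

# Quantile-bin each feature, then treat the bin index as a categorical feature.
binner = KBinsDiscretizer(n_bins=8, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X).astype(int)

model = CategoricalNB(alpha=1.0).fit(X_binned, y)
print(model.score(X_binned, y))
```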


Summary

Property    | Naive Bayes
Type        | Generative, probabilistic
Assumption  | Features conditionally independent given class
Training    | O(ND), single pass, closed-form
Inference   | O(DK) per example
Strengths   | Fast, scales to high dimensions, good with limited data
Weaknesses  | Poor probability calibration, struggles with correlated features
Best for    | Text classification, spam filtering, high-dimensional sparse data