Naive Bayes

Naive Bayes classifiers apply Bayes’ theorem with the assumption that features are conditionally independent given the class label. Despite this strong assumption rarely holding in practice, Naive Bayes classifiers are competitive for text classification, spam filtering, and other high-dimensional problems where the independence assumption provides beneficial regularization.


Generative vs Discriminative Models

Classification models fall into two categories based on what they learn:

Discriminative models learn the posterior P(Y \mid \mathbf{X}) directly, and with it the decision boundary. Examples: logistic regression, SVMs, neural networks. They model “given these features, what’s the most likely class?”

Generative models learn the joint distribution P(\mathbf{X}, Y) = P(\mathbf{X} \mid Y) P(Y), then apply Bayes’ theorem to classify. Examples: Naive Bayes, Gaussian Discriminant Analysis, Hidden Markov Models. They model “what does data from each class look like?”

Generative models provide richer information (they can generate synthetic examples from each class) but require stronger assumptions about the data distribution. When those assumptions approximately hold, generative models can outperform discriminative models, especially with limited training data (Ng and Jordan, 2001).


Bayes’ Theorem for Classification

Given a feature vector \mathbf{x} and class label y:

P(Y = y \mid \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} \mid Y = y) \cdot P(Y = y)}{P(\mathbf{X} = \mathbf{x})}

The components:

  • P(Y = y): prior probability of class y
  • P(\mathbf{X} = \mathbf{x} \mid Y = y): likelihood of observing \mathbf{x} given class y
  • P(\mathbf{X} = \mathbf{x}): evidence (normalizing constant, same for all classes)
  • P(Y = y \mid \mathbf{X} = \mathbf{x}): posterior probability

For classification, we choose the class with the highest posterior:

\hat{y} = \arg\max_y P(Y = y \mid \mathbf{x}) = \arg\max_y P(\mathbf{x} \mid Y = y) \cdot P(Y = y)

The evidence P(\mathbf{x}) is dropped because it’s constant across classes and doesn’t affect the argmax.
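As a toy numeric illustration (the numbers here are invented for this example): suppose two classes have priors P(Y = 1) = 0.7 and P(Y = 2) = 0.3, and the observed \mathbf{x} has likelihoods P(\mathbf{x} \mid Y = 1) = 0.2 and P(\mathbf{x} \mid Y = 2) = 0.5. The unnormalized posteriors are 0.7 \cdot 0.2 = 0.14 and 0.3 \cdot 0.5 = 0.15, so the argmax selects class 2 despite its smaller prior; dividing both by the evidence P(\mathbf{x}) = 0.29 would rescale them to about 0.48 and 0.52 without changing the decision.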


The Naive Bayes Assumption

The likelihood P(\mathbf{x} \mid Y = y) for a D-dimensional feature vector requires modeling a D-dimensional distribution per class. Without simplifying assumptions, the number of parameters grows exponentially with D (for binary features, the full joint table needs 2^D - 1 entries per class).

The naive Bayes assumption: features are conditionally independent given the class label:

P(\mathbf{x} \mid Y = y) = \prod_{j=1}^{D} P(x_j \mid Y = y)

This reduces the D-dimensional density estimation problem to D one-dimensional problems. The classifier becomes:

\hat{y} = \arg\max_y P(Y = y) \prod_{j=1}^{D} P(x_j \mid Y = y)

In log space (for numerical stability):

\hat{y} = \arg\max_y \left[\log P(Y = y) + \sum_{j=1}^{D} \log P(x_j \mid Y = y)\right]
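A minimal sketch of this log-space rule in Python, assuming the per-class log priors and a per-feature log-likelihood function have already been estimated (all names are illustrative, not from the text):

```python
import numpy as np

def predict(x, log_prior, log_likelihood):
    """Pick the class maximizing log P(Y = y) + sum_j log P(x_j | Y = y).

    x:              feature vector of length D
    log_prior:      array of shape (K,) with log P(Y = y) per class
    log_likelihood: callable (j, x_j, y) -> log P(x_j | Y = y)
    """
    scores = np.array([
        log_prior[y] + sum(log_likelihood(j, x[j], y) for j in range(len(x)))
        for y in range(len(log_prior))
    ])
    return int(np.argmax(scores))
```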

Variants by Feature Distribution

The choice of P(x_j \mid Y = y) depends on the feature type.

Gaussian Naive Bayes

For continuous features, assume each feature follows a Gaussian distribution per class:

P(x_j \mid Y = y) = \frac{1}{\sqrt{2\pi \sigma_{jy}^2}} \exp\left(-\frac{(x_j - \mu_{jy})^2}{2\sigma_{jy}^2}\right)

Parameters \mu_{jy} and \sigma_{jy}^2 are estimated by maximum likelihood (the sample mean and variance of feature j within class y). Total parameters: 2DK, where K is the number of classes (a mean and a variance per feature per class).
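A sketch of the closed-form estimation and per-class scoring in NumPy (the variance floor `var_smoothing` is an illustrative safeguard, not part of the formulas above):

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Closed-form estimates for Gaussian Naive Bayes.

    X: (N, D) array of continuous features; y: (N,) array of class labels.
    var_smoothing adds a small variance floor so constant features do not
    cause a division by zero.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])            # (K,)
    means = np.array([X[y == c].mean(axis=0) for c in classes])      # (K, D)
    variances = np.array([X[y == c].var(axis=0) for c in classes])   # (K, D)
    variances += var_smoothing * X.var(axis=0).max()
    return classes, priors, means, variances

def gaussian_log_likelihood(x, means, variances):
    """Per-class sum_j log N(x_j; mu_jy, sigma_jy^2), returned as shape (K,)."""
    return (-0.5 * np.log(2 * np.pi * variances)
            - (x - means) ** 2 / (2 * variances)).sum(axis=1)
```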

Multinomial Naive Bayes

Used for count-valued features (e.g., word counts in text): each document is modeled as a bag of words drawn from a class-specific multinomial:

P(\mathbf{x} \mid Y = y) \propto \prod_{j=1}^{D} \theta_{jy}^{x_j}

where \theta_{jy} = P(\text{word } j \mid \text{class } y) is estimated from N_{jy}, the total count of word j across training documents of class y:

\hat{\theta}_{jy} = \frac{N_{jy} + \alpha}{\sum_{j'=1}^{D} N_{j'y} + D\alpha}

The \alpha term is additive (Laplace) smoothing: it prevents zero probabilities for words that appear in test documents but not in training documents of class y. With \alpha = 0, a single unseen word drives the entire class probability to zero; \alpha = 1 is the standard default.
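A sketch of the smoothed estimate in NumPy, under the bag-of-words setup above (function and variable names are illustrative):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Laplace-smoothed multinomial parameter estimates.

    X: (N, D) matrix of word counts; y: (N,) class labels.
    Returns the classes, log priors (K,), and log theta (K, D) for the log-space rule.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # N_jy: total count of word j over all training documents of class y
    counts = np.array([X[y == c].sum(axis=0) for c in classes])       # (K, D)
    smoothed = counts + alpha
    log_theta = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_theta

# Prediction for a count vector x: classes[np.argmax(log_prior + x @ log_theta.T)]
```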

Bernoulli Naive Bayes

For binary features (word present/absent, rather than word count):

P(x_j \mid Y = y) = \theta_{jy}^{x_j} (1 - \theta_{jy})^{1 - x_j}

Bernoulli NB explicitly models the absence of features, which Multinomial NB does not. For short documents, this can improve performance because the absence of a word carries signal.
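A short illustrative NumPy sketch of the corresponding log-likelihood, highlighting the explicit absence term:

```python
import numpy as np

def bernoulli_log_likelihood(x, theta):
    """Per-class log P(x | y) under Bernoulli NB.

    x:     (D,) binary vector (word present / absent)
    theta: (K, D) per-class presence probabilities theta_jy
    The (1 - x) term is the explicit "absence" contribution log(1 - theta_jy).
    """
    return (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
```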


Parameter Estimation

Prior. \hat{P}(Y = y) = \frac{N_y}{N}, where N_y is the number of training examples in class y and N is the total number of training examples.

Likelihood parameters. Estimated from class-conditional statistics:

Variant      | Parameter                 | Estimate
Gaussian     | \mu_{jy}, \sigma_{jy}^2   | Sample mean and variance of feature j in class y
Multinomial  | \theta_{jy}               | Smoothed word frequency (see above)
Bernoulli    | \theta_{jy}               | Fraction of class-y documents containing word j

All estimates are closed-form (no iterative optimization). Training is a single pass through the data: O(ND) time and O(DK) space.


Text Classification

Naive Bayes is a foundational baseline for text classification. The standard pipeline:

  1. Tokenize documents into words
  2. Vectorize using bag-of-words or TF-IDF representation
  3. Train Multinomial NB on word count features with Laplace smoothing
  4. Classify by computing log-posterior for each class

For spam filtering, sentiment analysis, and topic classification, Multinomial NB achieves competitive accuracy with minimal training time. It serves as a strong baseline before exploring more complex models.
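A minimal version of this pipeline, sketched with scikit-learn’s CountVectorizer and MultinomialNB (the two example documents and labels are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder documents and labels; substitute your own corpus.
docs = ["cheap meds online now", "meeting moved to tuesday"]
labels = ["spam", "ham"]

model = make_pipeline(
    CountVectorizer(),          # steps 1-2: tokenize into bag-of-words counts
    MultinomialNB(alpha=1.0),   # step 3: Multinomial NB with Laplace smoothing
)
model.fit(docs, labels)
print(model.predict(["free meds at the meeting"]))  # step 4: log-posterior argmax
```

Swapping CountVectorizer for TfidfVectorizer gives the TF-IDF variant of step 2.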

Connection to the tracker cost model. The TF-IDF URL embeddings in the tracker paper use a similar bag-of-tokens representation: URL paths are tokenized on delimiters, converted to TF-IDF vectors, and reduced via SVD. The key difference is that the tracker model uses these as features for a gradient boosted regressor rather than for direct Bayesian classification.


Why Naive Bayes Works Despite Wrong Assumptions

The conditional independence assumption is almost always violated in real data. Features are correlated: in text, the words “machine” and “learning” co-occur far more than independence predicts. Yet Naive Bayes often classifies correctly. Two explanations:

The posterior is what matters, not the likelihood. Even if the estimated P(\mathbf{x} \mid Y) is a poor model of the true class-conditional density, the resulting posterior P(Y \mid \mathbf{x}) may still place the decision boundary in the right location. Classification accuracy depends on the decision boundary, not on the quality of the probability estimates.

Bias-variance perspective (Domingos and Pazzani, 1997). The independence assumption introduces bias (the model is misspecified) but dramatically reduces variance (far fewer parameters to estimate). In high-dimensional, limited-data settings, this tradeoff favors Naive Bayes. As training data grows, discriminative models with fewer assumptions eventually overtake it.


Limitations

Probability calibration. Naive Bayes produces poorly calibrated probability estimates: the predicted P(Y \mid \mathbf{x}) values tend to be pushed toward 0 and 1, because the independence assumption multiplies many (often correlated) factors. The ranking is often correct (the class with the highest estimated probability is often the right one), but the probability magnitudes are unreliable. Platt scaling or isotonic regression can recalibrate the outputs if calibrated probabilities are needed.
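A sketch of recalibration with scikit-learn’s CalibratedClassifierCV, on synthetic data (the dataset and the choice of isotonic regression are illustrative assumptions):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

raw = GaussianNB().fit(X, y)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X, y)

print(raw.predict_proba(X[:3]))         # typically pushed toward 0 and 1
print(calibrated.predict_proba(X[:3]))  # usually less extreme after recalibration
```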

Correlated features. When features are strongly correlated, Naive Bayes double-counts evidence. Feature selection or PCA before Naive Bayes can mitigate this.

Continuous features. Gaussian NB assumes unimodal, symmetric distributions. For multimodal or skewed features, kernel density estimation or discretization into bins may be needed.
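One possible mitigation, sketched with scikit-learn: quantile-bin the skewed features and fit a categorical NB on the bin indices (the synthetic data and bin count are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic skewed features and toy labels, purely for illustration.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 3))
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

# Quantile-bin each feature, then treat the bin index as a categorical feature.
binner = KBinsDiscretizer(n_bins=8, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X).astype(int)

model = CategoricalNB(alpha=1.0).fit(X_binned, y)
print(model.score(X_binned, y))
```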


Summary

Property    | Naive Bayes
Type        | Generative, probabilistic
Assumption  | Features conditionally independent given class
Training    | O(ND), single pass, closed-form
Inference   | O(DK) per example
Strengths   | Fast, scales to high dimensions, good with limited data
Weaknesses  | Poor probability calibration, struggles with correlated features
Best for    | Text classification, spam filtering, high-dimensional sparse data