Naive Bayes
Naive Bayes classifiers apply Bayes’ theorem with the assumption that features are conditionally independent given the class label. Despite this strong assumption rarely holding in practice, Naive Bayes classifiers are competitive for text classification, spam filtering, and other high-dimensional problems where the independence assumption provides beneficial regularization.
Generative vs Discriminative Models
Classification models fall into two categories based on what they learn:
Discriminative models learn the decision boundary directly. Examples: logistic regression, SVMs, neural networks. They model “given these features, what’s the most likely class?”
Generative models learn the joint distribution $P(x, y)$ (typically factored as $P(x \mid y)\,P(y)$), then apply Bayes’ theorem to classify. Examples: Naive Bayes, Gaussian Discriminant Analysis, Hidden Markov Models. They model “what does data from each class look like?”
Generative models provide richer information (they can generate synthetic examples from each class) but require stronger assumptions about the data distribution. When those assumptions approximately hold, generative models can outperform discriminative models, especially with limited training data (Ng and Jordan, 2001).
Bayes’ Theorem for Classification
Given a feature vector $x = (x_1, \ldots, x_d)$ and class label $y$:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$
The components:
- $P(y)$: prior probability of class $y$
- $P(x \mid y)$: likelihood of observing $x$ given class $y$
- $P(x)$: evidence (normalizing constant, same for all classes)
- $P(y \mid x)$: posterior probability
For classification, we choose the class with the highest posterior:

$$\hat{y} = \arg\max_{y} P(x \mid y)\, P(y)$$

The evidence $P(x)$ is dropped because it’s constant across classes and doesn’t affect the argmax.
The Naive Bayes Assumption
The likelihood $P(x \mid y)$ for a $d$-dimensional feature vector requires modeling a $d$-dimensional joint distribution per class. Without simplifying assumptions, the number of parameters grows exponentially with $d$.
The naive Bayes assumption: features are conditionally independent given the class label:

$$P(x \mid y) = \prod_{j=1}^{d} P(x_j \mid y)$$
This reduces the $d$-dimensional density estimation problem to $d$ one-dimensional problems. The classifier becomes:

$$\hat{y} = \arg\max_{y} P(y) \prod_{j=1}^{d} P(x_j \mid y)$$

In log space (for numerical stability):

$$\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{j=1}^{d} \log P(x_j \mid y) \right]$$
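As a minimal sketch of this decision rule (NumPy; the `log_prior` values and per-feature log-likelihoods are made-up, illustrative numbers):

```python
import numpy as np

def predict_log_space(log_prior, feature_log_likelihoods):
    """
    log_prior: shape (K,) array of log P(y = c).
    feature_log_likelihoods: shape (K, d) array whose entry [c, j] is
        log P(x_j | y = c) evaluated at the observed value x_j
        (how that term is computed depends on the variant chosen below).
    """
    # Decision rule: argmax_c [ log P(y = c) + sum_j log P(x_j | y = c) ]
    log_posterior = log_prior + feature_log_likelihoods.sum(axis=1)
    return int(np.argmax(log_posterior)), log_posterior

# Toy illustration with K = 2 classes and d = 3 features (made-up numbers):
log_prior = np.log(np.array([0.6, 0.4]))
feat_ll = np.log(np.array([[0.8, 0.9, 0.3],
                           [0.2, 0.3, 0.6]]))
print(predict_log_space(log_prior, feat_ll))   # -> (0, log-posteriors)
```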
Variants by Feature Distribution
The choice of $P(x_j \mid y)$ depends on the feature type.
Gaussian Naive Bayes
For continuous features, assume each feature follows a Gaussian distribution per class:

$$P(x_j \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{jc}^2}} \exp\!\left(-\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2}\right)$$
Parameters $\mu_{jc}$ and $\sigma_{jc}^2$ are estimated by maximum likelihood (sample mean and variance of feature $j$ within class $c$). Total parameters: $2dK$ for $d$ features and $K$ classes (mean and variance per feature per class).
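A minimal NumPy sketch of this closed-form fit and the resulting decision rule (function names are illustrative; scikit-learn’s `GaussianNB` implements the same idea):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Closed-form MLE: per-class priors, per-feature means and variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])          # mu_{jc}
    vars_ = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9    # sigma^2_{jc} (tiny floor)
    return classes, priors, means, vars_

def predict_gaussian_nb(X, classes, priors, means, vars_):
    # Gaussian log-density per (example, class, feature), summed over features.
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * vars_)[None, :, :]
                      + diff ** 2 / vars_[None, :, :]).sum(axis=2)
    return classes[np.argmax(np.log(priors) + log_lik, axis=1)]

# Toy usage (made-up data): two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(X[:5], *params))   # mostly class 0
```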
Multinomial Naive Bayes
For count-valued features (e.g., word counts in text). Each document is modeled as a bag of words drawn from a class-specific multinomial:

$$P(x \mid y = c) \propto \prod_{j=1}^{d} \theta_{jc}^{x_j}$$

where $x_j$ is the count of word $j$ and $\theta_{jc} = P(\text{word } j \mid y = c)$ is estimated as:

$$\hat{\theta}_{jc} = \frac{N_{jc} + \alpha}{\sum_{j'=1}^{d} N_{j'c} + \alpha d}$$

with $N_{jc}$ the total count of word $j$ across training documents of class $c$.

The $\alpha$ term is Laplace smoothing (additive smoothing): it prevents zero probabilities for words that appear in test documents but not in training documents of class $c$. With $\alpha = 0$, a single unseen word drives the entire class probability to zero. $\alpha = 1$ (Laplace smoothing) is the standard default.
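A sketch of the smoothed estimator and the corresponding prediction rule, assuming `X_counts` is a document-by-word count matrix (scikit-learn’s `MultinomialNB(alpha=1.0)` is the standard implementation):

```python
import numpy as np

def fit_multinomial_nb(X_counts, y, alpha=1.0):
    """Closed-form fit with additive (Laplace) smoothing; X_counts is (n_docs, vocab)."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # N_{jc}: total count of word j over all training documents of class c.
    N = np.array([X_counts[y == c].sum(axis=0) for c in classes])
    smoothed = N + alpha
    log_theta = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))   # theta_{jc}
    return classes, log_prior, log_theta

def predict_multinomial_nb(X_counts, classes, log_prior, log_theta):
    # Log-posterior up to a class-independent constant: log prior + counts . log theta
    return classes[np.argmax(log_prior + X_counts @ log_theta.T, axis=1)]

# Toy usage (made-up counts): 4 documents, vocabulary of 3 words.
X = np.array([[3, 0, 1], [0, 2, 0], [2, 1, 0], [0, 3, 1]])
y = np.array([0, 1, 0, 1])
print(predict_multinomial_nb(X, *fit_multinomial_nb(X, y)))   # -> [0 1 0 1]
```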
Bernoulli Naive Bayes
For binary features (word present/absent, rather than word count):

$$P(x \mid y = c) = \prod_{j=1}^{d} \theta_{jc}^{x_j} (1 - \theta_{jc})^{1 - x_j}, \qquad x_j \in \{0, 1\}$$
Bernoulli NB explicitly models the absence of features, which Multinomial NB does not. For short documents, this can improve performance because the absence of a word carries signal.
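A minimal sketch of the Bernoulli likelihood term, assuming `theta` holds the fitted per-class probability that each word appears: absent words contribute $\log(1 - \theta_{jc})$ instead of being ignored.

```python
import numpy as np

def bernoulli_log_likelihood(x_bin, theta):
    """
    x_bin: (d,) binary presence vector; theta: (K, d) per-class word probabilities.
    Present words contribute log theta_{jc}; absent words contribute log(1 - theta_{jc}),
    so a missing word actively counts as evidence, unlike in the multinomial model.
    """
    return x_bin @ np.log(theta).T + (1 - x_bin) @ np.log(1 - theta).T   # shape (K,)

print(bernoulli_log_likelihood(np.array([1, 0, 1]),
                               np.array([[0.8, 0.1, 0.3],
                                         [0.2, 0.7, 0.6]])))   # made-up theta values
```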
Parameter Estimation
Prior. $\hat{P}(y = c) = N_c / N$, where $N_c$ is the count of training examples in class $c$ and $N$ is the total number of training examples.
Likelihood parameters. Estimated from class-conditional statistics:
| Variant | Parameter | Estimate |
|---|---|---|
| Gaussian | $\mu_{jc},\ \sigma_{jc}^2$ | Sample mean, variance of feature $j$ in class $c$ |
| Multinomial | $\theta_{jc}$ | Smoothed word frequency (see above) |
| Bernoulli | $\theta_{jc}$ | Fraction of class-$c$ documents containing word $j$ |
All estimates are closed-form (no iterative optimization). Training is a single pass through the data: $O(Nd)$ time and $O(Kd)$ space for $N$ examples, $d$ features, and $K$ classes.
Text Classification
Naive Bayes is a foundational baseline for text classification. The standard pipeline:
- Tokenize documents into words
- Vectorize using bag-of-words or TF-IDF representation
- Train Multinomial NB on word count features with Laplace smoothing
- Classify by computing log-posterior for each class
For spam filtering, sentiment analysis, and topic classification, Multinomial NB achieves competitive accuracy with minimal training time. It serves as a strong baseline before exploring more complex models.
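A minimal scikit-learn version of this pipeline (the toy `texts` and `labels` below are made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Bag-of-words counts -> Multinomial NB with Laplace smoothing (alpha=1.0).
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))

texts = ["free money now", "meeting at noon", "win a free prize", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

clf.fit(texts, labels)
print(clf.predict(["free lunch prize"]))   # class with the highest log-posterior
```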
Connection to the tracker cost model. The TF-IDF URL embeddings in the tracker paper use a similar bag-of-tokens representation: URL paths are tokenized on delimiters, converted to TF-IDF vectors, and reduced via SVD. The key difference is that the tracker model uses these as features for a gradient boosted regressor rather than for direct Bayesian classification.
Why Naive Bayes Works Despite Wrong Assumptions
The conditional independence assumption is almost always violated in real data. Features are correlated: in text, the words “machine” and “learning” co-occur far more than independence predicts. Yet Naive Bayes often classifies correctly. Two explanations:
The posterior is what matters, not the likelihood. Even if the estimated $P(x \mid y)$ is a poor model of the true class-conditional density, the resulting posterior may still place the decision boundary in the right location. Classification accuracy depends on the decision boundary, not on the quality of the probability estimates.
Bias-variance perspective (Domingos and Pazzani, 1997). The independence assumption introduces bias (the model is misspecified) but dramatically reduces variance (far fewer parameters to estimate). In high-dimensional, limited-data settings, this tradeoff favors Naive Bayes. As training data grows, discriminative models with fewer assumptions eventually overtake it.
Limitations
Probability calibration. Naive Bayes produces poorly calibrated probability estimates: the predicted $P(y \mid x)$ values tend to be pushed toward 0 and 1 due to the independence assumption multiplying many factors. The ranking is often correct (the class with highest estimated probability is often the right one), but the probability magnitudes are unreliable. Platt scaling or isotonic regression can recalibrate if calibrated probabilities are needed.
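A sketch of recalibration with scikit-learn’s `CalibratedClassifierCV` (the count data here is made up only so the snippet runs end to end):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB

# Made-up toy count features and binary labels.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 50))
y = rng.integers(0, 2, size=200)

# Wrap NB with cross-validated Platt scaling ("sigmoid");
# use method="isotonic" for isotonic regression instead.
calibrated_nb = CalibratedClassifierCV(MultinomialNB(alpha=1.0), method="sigmoid", cv=5)
calibrated_nb.fit(X, y)
print(calibrated_nb.predict_proba(X[:3]))   # recalibrated class probabilities
```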
Correlated features. When features are strongly correlated, Naive Bayes double-counts evidence. Feature selection or PCA before Naive Bayes can mitigate this.
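One way to set this up with scikit-learn, as a sketch (the component count is an arbitrary illustrative choice):

```python
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Rotate features into (linearly) decorrelated principal components before
# fitting Gaussian NB, so correlated raw features are not double-counted.
pca_nb = make_pipeline(PCA(n_components=10), GaussianNB())
```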
Continuous features. Gaussian NB assumes unimodal, symmetric distributions. For multimodal or skewed features, kernel density estimation or discretization into bins may be needed.
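A sketch of the discretization route with scikit-learn: quantile-bin each feature and fit Bernoulli NB on the resulting one-hot bin indicators (the bin count is an arbitrary illustrative choice).

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Each continuous feature becomes a set of binary "falls in this bin" columns,
# which sidesteps the unimodal Gaussian assumption for skewed/multimodal features.
binned_nb = make_pipeline(
    KBinsDiscretizer(n_bins=8, encode="onehot-dense", strategy="quantile"),
    BernoulliNB(alpha=1.0),
)
```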
Summary
| Property | Naive Bayes |
|---|---|
| Type | Generative, probabilistic |
| Assumption | Features conditionally independent given class |
| Training | $O(Nd)$, single pass, closed-form |
| Inference | $O(Kd)$ per example |
| Strengths | Fast, scales to high dimensions, good with limited data |
| Weaknesses | Poor probability calibration, struggles with correlated features |
| Best for | Text classification, spam filtering, high-dimensional sparse data |