1: K-Nearest Neighbors
K-Nearest Neighbors (KNN) is one of the simplest and most intuitive machine learning algorithms. It’s a great starting point for understanding supervised learning because it requires no training phase—you just store the data and make predictions at inference time.
Supervised Learning
Before diving into KNN, let’s establish the supervised learning framework.
Objective: Learn a function that maps an input to an output based on given input-output pairs.
- Given a training set (labeled examples)
- Produce a hypothesis/model (function mapping inputs to outputs)
- Use it to make predictions on unseen data
The Setup
Input is a vector $\mathbf{x} = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$.
Each scalar $x_j$ is a feature describing a characteristic of the input. Many data types (text, images, audio) can be represented as vectors.
Output types:
- Classification: $y$ is discrete (categorical)
- Regression: $y$ is continuous (a real number)
- Structured prediction: $y$ is structured (e.g., text, an image)
Training Set Notation
The training set has $N$ labeled examples:

$$\mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$$
Notation convention:
- Superscript $(i)$ = index of the data point in the dataset
- Subscript $j$ = index of the feature in the input vector
So $x^{(35)}_{12}$ means the 12th feature of the 35th data point.
Evaluation
For classification, we measure accuracy as the fraction of correct predictions:

$$\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}^{(i)} = y^{(i)}\right]$$

where $\mathbb{1}[\cdot] = 1$ if the condition is true, and 0 otherwise.
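As a concrete illustration (a minimal sketch; the arrays `y_true` and `y_pred` are made-up names for this example), the indicator-based accuracy is a one-liner in NumPy:

```python
import numpy as np

# Hypothetical example: true labels and model predictions for N = 5 points
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Mean of the indicator 1[y_hat == y] over all points
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.8
```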
Nearest Neighbours
The simplest approach: find the closest training point to your query and copy its label.
Algorithm
- Calculate the distance between every training point $\mathbf{x}^{(i)}$ and the new point $\mathbf{x}$
- Find the training point closest to $\mathbf{x}$: $i^{*} = \arg\min_{i} d(\mathbf{x}^{(i)}, \mathbf{x})$
- Predict $\hat{y} = y^{(i^{*})}$ (a code sketch appears after the distance metrics below)
Distance Metrics
Euclidean distance ($\ell_2$ distance), the default choice:

$$d(\mathbf{x}, \mathbf{z}) = \sqrt{\sum_{j=1}^{d} (x_j - z_j)^2}$$

Cosine similarity, which measures the angle between two vectors (larger means more similar):

$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{z}}{\|\mathbf{x}\|\,\|\mathbf{z}\|}$$
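Putting the 1-NN algorithm together with the Euclidean metric, here is a minimal NumPy sketch. The function and variable names (`euclidean_distance`, `nn_predict`, the toy dataset) are my own for illustration, not from the text:

```python
import numpy as np

def euclidean_distance(x, z):
    """l2 distance between two feature vectors."""
    return np.sqrt(np.sum((x - z) ** 2))

def cosine_similarity(x, z):
    """Cosine of the angle between two vectors (larger = more similar)."""
    return np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))

def nn_predict(X_train, y_train, x_query):
    """1-NN: copy the label of the closest training point (Euclidean)."""
    distances = np.array([euclidean_distance(x_i, x_query) for x_i in X_train])
    return y_train[np.argmin(distances)]

# Tiny made-up dataset: two 2-D points per class
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(nn_predict(X_train, y_train, np.array([0.2, 0.1])))  # -> 0
```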
Decision Boundaries
A decision boundary is the region where the classifier switches from predicting one class to another.
For nearest neighbors, decision boundaries are:
- Perpendicular bisectors between points of different classes
- Form a Voronoi diagram partitioning the space
The problem: 1-NN is extremely sensitive to noise. A single mislabeled point creates an “island” of wrong predictions around it.
K-Nearest Neighbours
Instead of using just the closest point, use the $k$ closest points and take a majority vote.
Algorithm
- Find the $k$ closest training points to $\mathbf{x}$
- Predict $\hat{y}$ by majority vote among the $k$ nearest neighbours (see the sketch below)
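A minimal extension of the 1-NN sketch above to $k$ neighbours with a majority vote. Again, the names are illustrative, and ties are broken arbitrarily by `Counter.most_common`:

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """k-NN: majority vote among the k closest training points (Euclidean)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy dataset: with k=3, a single noisy (mislabeled) neighbour can be outvoted
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0], [1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.1]), k=3))  # -> 0
```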
Choosing K
$k$ is a hyperparameter: a setting chosen before training that governs model behavior.
| $k$ too small | $k$ too large |
|---|---|
| Complex decision boundary | Smooth decision boundary |
| Captures fine-grained patterns | Misses local patterns |
| Sensitive to noise | Stable predictions |
| May overfit (high variance) | May underfit (high bias) |
Common heuristic: $k \approx \sqrt{N}$, where $N$ is the number of training examples.
Training, Validation, and Test Sets
We split data into three sets for different purposes:
| Set | Purpose |
|---|---|
| Training | Learn model parameters |
| Validation | Tune hyperparameters, choose between models |
| Test | Measure generalization error (use only once, at the end) |
Workflow with KNN
- Split dataset into train/validation/test
- For each candidate value of $k$, “train” KNN on the training set
- Compute accuracy on the validation set
- Select the $k$ with the best validation accuracy
- Report final accuracy on test set
Critical: Never use the test set for hyperparameter tuning. That defeats its purpose as an unbiased estimate of generalization.
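A sketch of this workflow, assuming scikit-learn is available; the dataset (Iris), split ratios, and candidate values of $k$ are chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split: 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune k on the validation set only
best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Touch the test set exactly once, with the chosen k
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(f"best k = {best_k}, test accuracy = {final_model.score(X_test, y_test):.3f}")
```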
Limitations of KNN
Curse of Dimensionality
As dimensions grow, data space expands exponentially. Points become very far apart—the “nearest” neighbor isn’t meaningfully closer than any other point.
Problem: KNN struggles with high-dimensional data.
Solutions: Add more data or reduce dimensionality (PCA, feature selection).
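One way to sketch the dimensionality-reduction workaround, assuming scikit-learn; the number of components kept (10) and the value of $k$ are arbitrary placeholders:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Project high-dimensional inputs onto 10 principal components before running KNN
knn_with_pca = make_pipeline(
    StandardScaler(),                 # both PCA and KNN are scale-sensitive
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=5),
)
# Usage (hypothetical arrays): knn_with_pca.fit(X_train, y_train); knn_with_pca.score(X_val, y_val)
```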
Sensitivity to Feature Scale
KNN relies on distance calculations, so a feature with a larger range dominates the distance.
Example: If one feature ranges 0-1 and another 0-1000, the second feature overwhelms the first.
Solution: Normalize features to mean = 0 and variance = 1:

$$x_j' = \frac{x_j - \mu_j}{\sigma_j}$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$ over the training set.
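A minimal standardization sketch (the helper `standardize` and the toy arrays are my own). Note that the statistics come from the training set only and are then applied everywhere, so no information leaks from validation or test data:

```python
import numpy as np

def standardize(X_train, X_other):
    """Scale features to zero mean and unit variance using training statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12  # avoid division by zero for constant features
    return (X_train - mu) / sigma, (X_other - mu) / sigma

# One feature in [0, 1], another in the hundreds: without scaling, the second dominates
X_train = np.array([[0.2, 150.0], [0.8, 900.0], [0.5, 400.0]])
X_test  = np.array([[0.4, 300.0]])
X_train_std, X_test_std = standardize(X_train, X_test)
```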
Computational Cost
- Training: $O(1)$, just store the data
- Inference: $O(Nd)$ per query, compute the distance to all $N$ training points in $d$ dimensions
For large datasets, this makes KNN impractical. Solutions include KD-trees, ball trees, or approximate nearest neighbor methods.
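For example, scikit-learn exposes a KD-tree index that answers neighbour queries without scanning every stored point. A sketch with random data:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))   # 10k points in 3 dimensions

tree = KDTree(X_train)                               # build the index once
dist, idx = tree.query(rng.normal(size=(1, 3)), k=5)  # 5 nearest neighbours of one query
print(idx)
```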
No Learned Representation
KNN doesn’t learn anything—it just memorizes. This means:
- No compression of knowledge
- Can’t extrapolate beyond training data
- No insight into what makes classes different
Summary
KNN is a non-parametric, instance-based learning algorithm:
- No training phase (lazy learning)
- Predictions based on local neighborhood
- Simple to understand and implement
- Works well for low-dimensional data with sufficient samples
Key decisions:
- Choice of $k$ (use the validation set)
- Distance metric (Euclidean is default)
- Feature normalization (almost always necessary)
Despite its simplicity, KNN establishes important concepts we’ll revisit: decision boundaries, bias-variance tradeoff, hyperparameter tuning, and the train/validation/test split.