K-Nearest Neighbors

K-Nearest Neighbors (KNN) is one of the simplest and most intuitive machine learning algorithms. It’s a great starting point for understanding supervised learning because it requires no training phase: you simply store the data and make predictions at inference time.


Supervised Learning

Before diving into KNN, let’s establish the supervised learning framework.

Objective: Learn a function that maps an input to an output based on given input-output pairs.

  • Given a training set (labeled examples)
  • Produce a hypothesis/model (function mapping inputs to outputs)
  • To perform predictions on unseen data

The Setup

Input is a vector \mathbf{x} \in \mathbb{R}^D:

\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix}

Each scalar x_i is a feature describing a characteristic of the input. Many data types (text, images, audio) can be represented as vectors.

Output types (the target is denoted t):

  • Classification: t is discrete (categorical)
  • Regression: t is continuous (real number)
  • Structured prediction: t is structured (e.g., text, image)

Training Set Notation

The training set has N labeled examples:

\{(\mathbf{x}^{(1)}, t^{(1)}), (\mathbf{x}^{(2)}, t^{(2)}), \cdots, (\mathbf{x}^{(N)}, t^{(N)})\}

Notation convention:

  • Superscript (i) = index of the data point in the dataset
  • Subscript j = index of the feature in the input vector

So x_{12}^{(35)} means the 12th feature of the 35th data point.

Evaluation

For classification, we measure accuracy as the fraction of correct predictions:

\text{Accuracy} = \frac{\sum_{i=1}^{N} \mathbb{I}[t^{(i)} = y^{(i)}]}{N}

where \mathbb{I}[e] = 1 if condition e is true and 0 otherwise, and y^{(i)} is the model's prediction for \mathbf{x}^{(i)}.
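
The accuracy formula is just the mean of the indicator over the dataset. A minimal NumPy sketch (the arrays t and y are illustrative, not from the source):

```python
import numpy as np

t = np.array([0, 1, 1, 0, 2])   # true labels t^(i)
y = np.array([0, 1, 0, 0, 2])   # predicted labels y^(i)

# The indicator I[t == y] is 1 where the prediction matches and 0 otherwise;
# its mean is exactly the accuracy formula above.
accuracy = np.mean(t == y)       # 4/5 = 0.8
print(accuracy)
```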


Nearest Neighbours

The simplest approach: find the closest training point to your query and copy its label.

Algorithm

  1. Calculate the distance between every training point \mathbf{x}^{(i)} and the new point \mathbf{x}
  2. Find the training point (\mathbf{x}^{(c)}, t^{(c)}) closest to \mathbf{x}:
\mathbf{x}^{(c)} = \arg\min_{\mathbf{x}^{(i)} \in \text{training set}} \text{distance}(\mathbf{x}^{(i)}, \mathbf{x})
  3. Predict y = t^{(c)}
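
A minimal sketch of 1-NN prediction in NumPy (the function and variable names are illustrative, not from the source):

```python
import numpy as np

def nearest_neighbour_predict(X_train, t_train, x):
    """Predict the label of x by copying the label of its closest training point."""
    # Step 1: Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 2: index of the closest training point
    c = np.argmin(dists)
    # Step 3: copy its label
    return t_train[c]
```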

Distance Metrics

Euclidean distance (L^2 distance), the default choice:

\|\mathbf{x}^{(a)} - \mathbf{x}^{(b)}\|_2 = \sqrt{\sum_{j=1}^{D} (x_j^{(a)} - x_j^{(b)})^2}

Cosine similarity measures the angle between vectors (a similarity rather than a distance, so the nearest neighbours are those with the largest value):

\text{cosine}(\mathbf{x}^{(a)}, \mathbf{x}^{(b)}) = \frac{\mathbf{x}^{(a)} \cdot \mathbf{x}^{(b)}}{\|\mathbf{x}^{(a)}\|_2 \, \|\mathbf{x}^{(b)}\|_2}
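
Both metrics translate directly into code. A small NumPy sketch (function names are my own):

```python
import numpy as np

def euclidean_distance(a, b):
    # L2 distance: square root of the sum of squared feature differences
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    # Dot product divided by the product of the L2 norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```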

Decision Boundaries

A decision boundary is the surface in input space where the classifier switches from predicting one class to another.

For nearest neighbors, decision boundaries are:

  • Perpendicular bisectors between points of different classes
  • Form a Voronoi diagram partitioning the space

The problem: 1-NN is extremely sensitive to noise. A single mislabeled point creates an “island” of wrong predictions around it.


K-Nearest Neighbours

Instead of using just the closest point, use the K closest points and take a majority vote.

Algorithm

  1. Find the K closest training points to \mathbf{x}
  2. Predict by majority vote among the K nearest neighbours
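
A from-scratch sketch of the K-NN classifier (names are illustrative; ties in the vote are broken arbitrarily here):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(t_train[nearest])             # count labels among the neighbours
    return votes.most_common(1)[0][0]             # most frequent label wins
```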

Choosing K

K is a hyperparameter, a setting chosen before training that governs model behavior.

| K too small | K too large |
| --- | --- |
| Complex decision boundary | Smooth decision boundary |
| Captures fine-grained patterns | Misses local patterns |
| Sensitive to noise | Stable predictions |
| May overfit (high variance) | May underfit (high bias) |

Common heuristic: K < \sqrt{N}


Training, Validation, and Test Sets

We split data into three sets for different purposes:

| Set | Purpose |
| --- | --- |
| Training | Learn model parameters |
| Validation | Tune hyperparameters, choose between models |
| Test | Measure generalization error (use only once, at the end) |

Workflow with KNN

  1. Split dataset into train/validation/test
  2. For each K \in \{1, 2, 3, \ldots\}, “train” KNN on the training set
  3. Compute accuracy on validation set
  4. Select the K with the best validation accuracy
  5. Report final accuracy on test set

Critical: Never use the test set for hyperparameter tuning. That defeats its purpose as an unbiased estimate of generalization.
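
A sketch of this workflow with scikit-learn (assuming it is installed; the synthetic dataset, split sizes, and K grid are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real dataset
X, t = make_classification(n_samples=500, n_features=10, random_state=0)

# Split into train (60%), validation (20%), test (20%)
X_trainval, X_test, t_trainval, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
X_train, X_val, t_train, t_val = train_test_split(X_trainval, t_trainval, test_size=0.25, random_state=0)

# "Train" KNN for each candidate K and score it on the validation set
best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, t_train)
    acc = knn.score(X_val, t_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Report generalization error once, on the test set, with the chosen K
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, t_train)
print(best_k, final.score(X_test, t_test))
```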


Limitations of KNN

Curse of Dimensionality

As the number of dimensions grows, the volume of the data space expands exponentially. Points end up far apart, and the “nearest” neighbor is often not meaningfully closer than any other point.

Problem: KNN struggles with high-dimensional data.

Solutions: Add more data or reduce dimensionality (PCA, feature selection).
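
A quick numerical illustration of this distance concentration (a sketch; exact numbers depend on the random seed): as D grows, the farthest point is barely farther than the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)
for D in [2, 10, 100, 1000]:
    X = rng.random((1000, D))          # 1000 uniform random points in [0, 1]^D
    q = rng.random(D)                  # a random query point
    d = np.linalg.norm(X - q, axis=1)
    # Relative contrast: how much farther is the farthest point than the nearest?
    print(D, (d.max() - d.min()) / d.min())
```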

Sensitivity to Feature Scale

KNN relies on distance calculations. A feature with a larger range dominates the distance.

Example: If one feature ranges 0-1 and another 0-1000, the second feature overwhelms the first.

Solution: Normalize features to mean = 0 and variance = 1:

x_{\text{norm}} = \frac{x - \mu}{\sigma}
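
A minimal sketch of this transform (the synthetic feature scales are illustrative). Note that \mu and \sigma are computed per feature on the training set only and then reused for validation/test data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((100, 3)) * [1, 1000, 10]   # features on very different scales
X_test = rng.random((20, 3)) * [1, 1000, 10]

mu = X_train.mean(axis=0)        # per-feature mean, from the training set only
sigma = X_train.std(axis=0)      # per-feature standard deviation

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma   # reuse training statistics; never recompute on test data
```

scikit-learn's StandardScaler performs the same transform and is the usual choice inside a pipeline.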

Computational Cost

  • Training: O(1), just store the data
  • Inference: O(ND), compute the distance to all N training points

For large datasets, this makes KNN impractical. Solutions include KD-trees, ball trees, or approximate nearest neighbor methods.
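
In scikit-learn, the algorithm parameter of KNeighborsClassifier selects the index structure. A brief sketch (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, t = make_classification(n_samples=1000, n_features=5, random_state=0)

# 'kd_tree' and 'ball_tree' build a tree index at fit time to speed up queries;
# 'brute' is the exhaustive O(ND) search; 'auto' picks based on the data.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X, t)
print(knn.predict(X[:3]))
```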

No Learned Representation

KNN doesn’t learn anything; it just memorizes. This means:

  • No compression of knowledge
  • Can’t extrapolate beyond training data
  • No insight into what makes classes different

Summary

KNN is a non-parametric, instance-based learning algorithm:

  • No training phase (lazy learning)
  • Predictions based on local neighborhood
  • Simple to understand and implement
  • Works well for low-dimensional data with sufficient samples

Key decisions:

  • Choice of K (use the validation set)
  • Distance metric (Euclidean is default)
  • Feature normalization (almost always necessary)

Despite its simplicity, KNN establishes important concepts we’ll revisit: decision boundaries, bias-variance tradeoff, hyperparameter tuning, and the train/validation/test split.


Connection to Modern Practice

Approximate Nearest Neighbors

Exact KNN with O(ND) inference is impractical for large-scale applications. Modern systems use approximate nearest neighbor (ANN) algorithms that trade a small amount of recall for dramatic speedups:

| Algorithm | Approach | Complexity | Used In |
| --- | --- | --- | --- |
| KD-Tree | Space partitioning | O(D \log N) avg | scikit-learn (low D) |
| Ball Tree | Metric tree | O(D \log N) avg | scikit-learn (moderate D) |
| LSH | Locality-sensitive hashing | O(D) per query | Near-duplicate detection |
| HNSW | Hierarchical navigable small world graph | O(D \log N) | FAISS, Pinecone, Qdrant |
| IVF | Inverted file index (cluster-based) | O(D \cdot n_{\text{probe}}) | FAISS (GPU) |

HNSW is the dominant ANN algorithm in production vector databases (used in RAG systems for semantic search). It builds a multi-layer proximity graph where higher layers provide coarse navigation and lower layers provide fine-grained search. See the RAG article for details.
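
A minimal sketch using FAISS's HNSW index (assuming the faiss package is installed; the dimension, M = 32 graph links per node, and efSearch = 64 are illustrative values, not recommendations):

```python
import numpy as np
import faiss

d = 128                                             # embedding dimension
xb = np.random.rand(10000, d).astype("float32")     # database vectors
xq = np.random.rand(5, d).astype("float32")         # query vectors

index = faiss.IndexHNSWFlat(d, 32)     # 32 = number of graph links per node (M)
index.hnsw.efSearch = 64               # breadth of the graph search at query time
index.add(xb)                          # build the multi-layer proximity graph
distances, ids = index.search(xq, 10)  # approximate 10 nearest neighbours per query
```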

KNN as a Baseline and Diagnostic

KNN serves as a useful diagnostic in modern ML workflows:

  • Baseline comparison. If a complex model doesn’t outperform KNN, the features may lack discriminative power or the task may not require learned representations.
  • Data quality check. High KNN accuracy suggests the classes are well-separated in feature space; low accuracy suggests label noise or insufficient features.
  • Embedding evaluation. After training an embedding model, KNN accuracy on the embeddings measures whether semantically similar items are close in the learned space. This is the standard evaluation protocol for contrastive learning methods (SimCLR, CLIP).

The Kernel Perspective

KNN can be viewed as kernel regression with a uniform kernel:

\hat{f}(\mathbf{x}) = \frac{\sum_{i=1}^N K(\mathbf{x}, \mathbf{x}^{(i)}) \, t^{(i)}}{\sum_{i=1}^N K(\mathbf{x}, \mathbf{x}^{(i)})}

where K(\mathbf{x}, \mathbf{x}^{(i)}) = \mathbb{I}[\mathbf{x}^{(i)} \in \text{KNN}(\mathbf{x})]. Replacing the uniform kernel with a Gaussian kernel produces kernel regression (the Nadaraya-Watson estimator), which smoothly weights neighbors by distance rather than applying a hard cutoff at K. This connects KNN to the broader kernel methods literature and Gaussian processes.
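
A sketch of the Gaussian-kernel (Nadaraya-Watson) version in NumPy for regression; the bandwidth h is an illustrative choice that plays a role analogous to K:

```python
import numpy as np

def nadaraya_watson(X_train, t_train, x, h=1.0):
    """Predict at x as a kernel-weighted average of the training targets."""
    # Gaussian kernel weights: nearby points get weight near 1, distant points near 0
    sq_dists = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-sq_dists / (2 * h ** 2))
    return np.sum(w * t_train) / np.sum(w)
```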