16: Randomized Algorithms

Introduction

A randomized algorithm uses random bits as part of its logic, making decisions based on coin flips rather than purely deterministic rules. Randomization often yields algorithms that are simpler, faster, or more elegant than their deterministic counterparts. The analysis of randomized algorithms relies on probability theory: we reason about expected running times, success probabilities, and concentration inequalities.

Las Vegas vs. Monte Carlo

Randomized algorithms fall into two categories:

Las Vegas algorithms always produce a correct answer; only the running time is random. The guarantee is: for every input, the expected running time is bounded, and the output is always correct. Randomized quicksort is the canonical example.

Monte Carlo algorithms always run in bounded time but may produce an incorrect answer with some probability. The guarantee is: for every input, the algorithm terminates in polynomial time and is correct with probability at least $1 - \delta$ for some parameter $\delta$. Karger’s minimum cut algorithm is an example. Repeated independent runs reduce the error probability exponentially: $k$ runs reduce the failure probability to $\delta^k$.
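This one-sided amplification can be made concrete with the classic Fermat primality test, a Monte Carlo algorithm: a prime always passes, while each independent round catches a typical composite with constant probability. (A sketch only; it can be fooled by Carmichael numbers, which the Miller-Rabin test handles.)

```python
import random

def fermat_test(n, k=20, rng=random):
    """Monte Carlo primality test with one-sided error.

    A prime always passes every round; a composite survives one round
    with some probability delta < 1, so k independent rounds shrink the
    failure probability to delta**k.
    """
    if n < 4:
        return n in (2, 3)
    for _ in range(k):
        a = rng.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:
            return False  # witness found: n is definitely composite
    return True  # probably prime
```

Note the asymmetry: a `False` answer is always correct, while a `True` answer is only probably correct, which is exactly the structure the amplification argument exploits.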

Randomized Quicksort

Deterministic quicksort with a fixed pivot rule (e.g., always choosing the first element) runs in $\Theta(n^2)$ on adversarial inputs. Randomized quicksort selects a pivot uniformly at random from the current subarray.

Theorem. The expected number of comparisons made by randomized quicksort is $2n \ln n + O(n) = O(n \log n)$.

Proof. Let $z_1 < z_2 < \cdots < z_n$ be the elements in sorted order. Define the indicator $X_{ij} = 1$ if $z_i$ and $z_j$ are ever compared. Two elements are compared if and only if one of them is chosen as a pivot before any element between them in sorted order. Among the $j - i + 1$ elements $z_i, z_{i+1}, \ldots, z_j$, the probability that $z_i$ or $z_j$ is chosen first is $\frac{2}{j - i + 1}$. Therefore:

$$E\left[\sum_{i < j} X_{ij}\right] = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{2}{j-i+1} = \sum_{i=1}^{n-1} \sum_{k=2}^{n-i+1} \frac{2}{k} \leq 2n \sum_{k=2}^{n} \frac{1}{k} = 2n(H_n - 1) = 2n \ln n + O(n)$$

The expected time is $O(n \log n)$ regardless of the input distribution. This is a Las Vegas algorithm: the output is always a correctly sorted array.
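A minimal Python sketch of the algorithm, using auxiliary lists for clarity rather than the usual in-place partitioning:

```python
import random

def randomized_quicksort(a, rng=random):
    """Las Vegas sort: the output is always correct; only run time is random."""
    if len(a) <= 1:
        return list(a)
    pivot = a[rng.randrange(len(a))]  # pivot chosen uniformly at random
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return randomized_quicksort(less, rng) + equal + randomized_quicksort(greater, rng)
```

Whatever pivots the coin flips produce, the result is the sorted array; the randomness only shifts how much work each recursive level does.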

Karger’s Minimum Cut Algorithm

Given an undirected multigraph $G = (V, E)$ with $n = |V|$ vertices, Karger’s algorithm finds a global minimum cut using random edge contraction:

  1. While $|V| > 2$:
    • Select an edge $(u, v)$ uniformly at random.
    • Contract $u$ and $v$: merge them into a single vertex, preserving all other edges (creating a multigraph; remove self-loops).
  2. Return the edges between the two remaining super-vertices as the cut.
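The steps above can be sketched in Python. The multigraph is a plain edge list (assumed connected), and self-loops created by contraction are discarded lazily when drawn:

```python
import random

def karger_cut(edges, rng=random):
    """One run of Karger's random-contraction algorithm.

    `edges` is a connected multigraph given as a list of (u, v) pairs.
    Returns the size of the cut between the two final super-vertices.
    """
    label = {}  # each vertex's current super-vertex label
    for u, v in edges:
        label[u] = u
        label[v] = v
    vertices = len(label)
    pool = list(edges)
    while vertices > 2:
        u, v = pool.pop(rng.randrange(len(pool)))
        lu, lv = label[u], label[v]
        if lu == lv:
            continue  # self-loop produced by an earlier contraction: discard
        for w in label:  # contract: fold lv's super-vertex into lu's
            if label[w] == lv:
                label[w] = lu
        vertices -= 1
    # the surviving non-self-loop edges are exactly the output cut
    return sum(1 for u, v in pool if label[u] != label[v])

def min_cut(edges, trials, rng=random):
    """Amplification: return the smallest cut over independent runs."""
    return min(karger_cut(edges, rng) for _ in range(trials))
```

Any single run returns some valid cut, so taking the minimum over many runs can only help; the analysis below bounds how many runs are needed.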

Theorem. The probability that Karger’s algorithm outputs a specific minimum cut is at least $\frac{2}{n(n-1)} = \binom{n}{2}^{-1}$.

Proof. Let $C$ be a minimum cut of size $k$. Every vertex has degree at least $k$ (otherwise its edges would form a smaller cut), so $|E| \geq nk/2$. At the first step, the probability of contracting an edge in $C$ is at most $k / (nk/2) = 2/n$. Conditioning on survival, after $i$ contractions the graph has $n - i$ vertices, and the probability of not contracting a cut edge at step $i+1$ is at least $1 - 2/(n-i)$. The overall survival probability is:

$$\prod_{i=0}^{n-3} \left(1 - \frac{2}{n-i}\right) = \prod_{i=0}^{n-3} \frac{n-i-2}{n-i} = \frac{2}{n(n-1)}$$

Running the algorithm $\binom{n}{2} \ln n$ times and returning the smallest cut found gives a correct answer with probability at least $1 - 1/n$. The total running time is $O(n^4 \log n)$. The Karger-Stein improvement uses recursive contraction to achieve $O(n^2 \log^3 n)$.

Hashing

Universal Hashing

A family $\mathcal{H}$ of hash functions from universe $U$ to $\{0, 1, \ldots, m-1\}$ is universal if for any two distinct keys $x, y \in U$:

$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq \frac{1}{m}$$

A classic construction for $U = \{0, 1, \ldots, p-1\}$ with $p$ prime: $h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$, choosing $a \in \{1, \ldots, p-1\}$ and $b \in \{0, \ldots, p-1\}$ uniformly at random.
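This family is a few lines of Python; the Mersenne prime $p = 2^{31} - 1$ used here is one convenient choice, valid whenever every key is smaller than $p$:

```python
import random

P = (1 << 31) - 1  # prime modulus; must exceed every key in the universe

def random_universal_hash(m, rng=random):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod m uniformly from the family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m
```

Drawing a fresh $(a, b)$ per table is the whole defense: an adversary who fixes the keys in advance cannot predict which function will hash them.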

Theorem. With a universal hash family, the expected number of collisions for any key in a hash table with $n$ keys and $m$ slots is at most $n/m$. Choosing $m = \Theta(n)$ gives $O(1)$ expected lookup time.

Universal hashing eliminates the need for assumptions about the input distribution. No adversary can construct a bad input because the hash function is chosen randomly.

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure for approximate set membership. It uses a bit array $B$ of $m$ bits and $k$ independent hash functions $h_1, \ldots, h_k$. To insert element $x$, set $B[h_i(x)] = 1$ for all $i$. To query $x$, check whether all $B[h_i(x)] = 1$; if any bit is 0, $x$ is definitely not in the set.

False positives occur but false negatives do not. After inserting $n$ elements, the false positive probability is approximately:

$$\left(1 - e^{-kn/m}\right)^k$$

This is minimized when $k = (m/n) \ln 2$, giving a false positive rate of $(1/2)^k = (0.6185)^{m/n}$. With 10 bits per element and 7 hash functions, the false positive rate is about 0.8%.
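A sketch implementation follows, deriving the $k$ hash functions from salted SHA-256 (one convenient choice among many) and picking $k$ by the optimal formula above:

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter with k salted-SHA-256 hash functions."""

    def __init__(self, m, n_expected):
        self.m = m
        # optimal number of hash functions: k = (m/n) ln 2
        self.k = max(1, round(m / n_expected * math.log(2)))
        self.bits = bytearray(m)  # one byte per bit, for clarity over space

    def _positions(self, item):
        for i in range(self.k):  # salt each of the k hashes differently
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # a 0 bit proves absence; all 1s means only "probably present"
        return all(self.bits[pos] for pos in self._positions(item))
```

The asymmetry in `__contains__` is the no-false-negatives guarantee: every inserted element set all of its probed bits, so a 0 bit is conclusive.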

Bloom filters are ubiquitous in systems: web caches, database query optimization, network routing, and spell checkers all use them to avoid expensive lookups.

Randomized Rounding

Randomized rounding is a technique for converting fractional LP relaxation solutions into integer solutions with provable approximation guarantees.

Consider Set Cover: given the LP relaxation solution $x_S^* \in [0,1]$ for each set $S$, include set $S$ in the integer solution independently with probability $\min(1, c \cdot x_S^* \cdot \ln n)$ for an appropriate constant $c$. The expected cost is $O(\ln n)$ times the LP optimum, and a Chernoff bound argument shows that all elements are covered with high probability.

More precisely, for each element $e$, the probability that $e$ is not covered in a single round is at most:

$$\prod_{S \ni e} (1 - c \cdot x_S^* \ln n) \leq \exp\left(-c \ln n \sum_{S \ni e} x_S^*\right) \leq \exp(-c \ln n) = n^{-c}$$

The second inequality uses LP feasibility, which guarantees $\sum_{S \ni e} x_S^* \geq 1$ for every element $e$. Choosing $c$ large enough and applying a union bound over all $n$ elements gives the result. This technique connects the theory of LP relaxations to combinatorial optimization in a principled way.
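The rounding step itself is simple enough to sketch. Here the fractional solution is taken as a given input (the names `sets` and `x_star` are illustrative; producing $x^*$ in practice requires an actual LP solver):

```python
import math
import random

def round_set_cover(sets, x_star, c=1.0, rng=random):
    """Randomized rounding: keep each set S independently with
    probability min(1, c * x_star[S] * ln n)."""
    n = len(set().union(*sets.values()))  # number of elements
    return {S for S in sets
            if rng.random() < min(1.0, c * x_star[S] * math.log(n))}

def round_until_cover(sets, x_star, c=1.0, rng=random, max_rounds=100):
    """Retry wrapper: re-round until the chosen sets actually cover.
    The analysis says a single round already succeeds w.h.p."""
    universe = set().union(*sets.values())
    for _ in range(max_rounds):
        chosen = round_set_cover(sets, x_star, c, rng)
        covered = set().union(set(), *(sets[S] for S in chosen))
        if covered >= universe:
            return chosen
    raise RuntimeError("repeated rounding failed; increase c or max_rounds")
```

Checking coverage and retrying converts the Monte Carlo rounding into a Las Vegas procedure, at the cost of a random number of rounds.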

Connection to Machine Learning

Randomization is fundamental to modern machine learning at every level.

Stochastic gradient descent. Rather than computing the full gradient $\nabla L(\theta) = \frac{1}{n}\sum_{i=1}^n \nabla \ell(\theta; x_i)$ over all $n$ training examples, SGD samples a random mini-batch $B \subset \{1, \ldots, n\}$ and uses $\frac{1}{|B|}\sum_{i \in B} \nabla \ell(\theta; x_i)$ as an unbiased estimator. This reduces per-iteration cost from $O(n)$ to $O(|B|)$ while preserving convergence guarantees. The noise introduced by random sampling also serves as implicit regularization, helping escape sharp local minima.
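A minimal numpy sketch of mini-batch SGD on a one-dimensional least-squares problem (a toy model for illustration, not a framework API):

```python
import numpy as np

def sgd_linear(x, y, lr=0.1, epochs=300, batch=10, rng=None):
    """Mini-batch SGD for the least-squares fit y ~ w*x + b."""
    rng = rng or np.random.default_rng(0)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        idx = rng.permutation(n)  # fresh random shuffle each epoch
        for start in range(0, n, batch):
            B = idx[start:start + batch]
            err = (w * x[B] + b) - y[B]
            # mini-batch average is an unbiased estimate of the full gradient
            w -= lr * 2 * np.mean(err * x[B])
            b -= lr * 2 * np.mean(err)
    return w, b
```

Each update touches only `batch` examples, yet in expectation it moves along the true gradient of the full loss.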

Random feature maps. Rahimi and Recht (2007) showed that kernel methods can be approximated by random features: for a shift-invariant kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$, sample random frequencies $\omega_1, \ldots, \omega_D$ from the kernel’s Fourier transform together with random phases $b_j \sim \mathrm{Uniform}[0, 2\pi]$, and approximate $k(\mathbf{x}, \mathbf{y}) \approx \frac{2}{D}\sum_{j=1}^D \cos(\omega_j^T \mathbf{x} + b_j) \cos(\omega_j^T \mathbf{y} + b_j)$. This converts a nonlinear kernel machine into a linear model on $D$-dimensional random features, reducing training from $O(n^3)$ to $O(nD^2)$. This is a Monte Carlo approximation: accuracy improves with more random features.
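A numpy sketch for the Gaussian (RBF) kernel, whose Fourier transform is again a Gaussian, following the cosine construction above:

```python
import numpy as np

def rff_features(X, D, sigma=1.0, rng=None):
    """Random Fourier features for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Frequencies are drawn from the kernel's Fourier transform,
    a Gaussian with covariance sigma^{-2} I, plus uniform random phases.
    """
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))  # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=D)          # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

The inner product of two feature vectors is a Monte Carlo estimate of the exact kernel value, with error shrinking as $O(1/\sqrt{D})$.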

Dropout. During training, dropout randomly zeros out each neuron’s activation with probability $p$ (typically 0.5 for hidden layers). This can be interpreted as training an exponential ensemble of $2^d$ sub-networks (where $d$ is the number of neurons) simultaneously. At test time, all neurons are active with weights scaled by $(1-p)$. Dropout provides regularization: it prevents co-adaptation of neurons and approximates Bayesian model averaging.
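A sketch matching this description, i.e. the original formulation with test-time scaling. (Modern frameworks usually implement "inverted" dropout instead, scaling by $1/(1-p)$ at training time so that test time is the identity.)

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True, rng=None):
    """Dropout as described above: zero each unit with probability p during
    training; at test time keep every unit and scale by (1 - p)."""
    if train:
        rng = rng or np.random.default_rng(0)
        mask = (rng.random(x.shape) >= p).astype(x.dtype)  # keep w.p. 1 - p
        return x * mask
    return x * (1.0 - p)
```

The $(1-p)$ scaling makes the expected test-time activation match the expected training-time activation of each unit.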

Random forests. Breiman’s random forest (2001) combines two sources of randomness: bagging (training each tree on a bootstrap sample of the data) and random subspace selection (considering only a random subset of $\sqrt{d}$ features at each split). These randomizations reduce correlation between trees, and the ensemble’s variance decreases as $\sigma^2 / B$ for $B$ trees with average variance $\sigma^2$ (assuming low correlation). Each tree is deterministic given its random seed, so the forest’s prediction is always well-defined and reproducible; here randomization serves variance reduction rather than a correctness guarantee.
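The $\sigma^2 / B$ claim is easy to check empirically in the idealized uncorrelated setting (real trees are positively correlated, which leaves a residual $\rho \sigma^2$ term):

```python
import numpy as np

# Average B independent estimators, each with variance sigma^2 = 1,
# and measure the variance of the ensemble mean over many repetitions.
rng = np.random.default_rng(0)
B, reps = 100, 10_000
ensemble_means = rng.normal(0.0, 1.0, size=(reps, B)).mean(axis=1)
print(np.var(ensemble_means))  # close to sigma^2 / B = 1 / 100 = 0.01
```

Averaging 100 uncorrelated estimators cuts the variance by a factor of 100, which is exactly the mechanism bagging exploits.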


This concludes the series on Data Structures and Algorithms.