16: Randomized Algorithms

Introduction

A randomized algorithm uses random bits as part of its logic, making decisions based on coin flips rather than purely deterministic rules. Randomization often yields algorithms that are simpler, faster, or more elegant than their deterministic counterparts. The analysis of randomized algorithms relies on probability theory: we reason about expected running times, success probabilities, and concentration inequalities.

Las Vegas vs. Monte Carlo

Randomized algorithms fall into two categories:

Las Vegas algorithms always produce a correct answer; only the running time is random. The guarantee is: for every input, the expected running time is bounded, and the output is always correct. Randomized quicksort is the canonical example.

Monte Carlo algorithms always run in bounded time but may produce an incorrect answer with some probability. The guarantee is: for every input, the algorithm terminates in polynomial time and is correct with probability at least $1 - \delta$ for some parameter $\delta$. Karger’s minimum cut algorithm is an example. Repeated independent runs reduce the error probability exponentially: $k$ runs reduce the failure probability to $\delta^k$.
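This one-sided amplification can be made concrete with the classic Fermat primality test, a Monte Carlo algorithm: a prime always passes, while each independent round catches a typical composite with constant probability. (A sketch only; it can be fooled by Carmichael numbers, which the Miller-Rabin test handles.)

```python
import random

def fermat_test(n, k=20, rng=random):
    """Monte Carlo primality test with one-sided error.

    A prime always passes every round; a composite survives one round
    with some probability delta < 1, so k independent rounds shrink the
    failure probability to delta**k.
    """
    if n < 4:
        return n in (2, 3)
    for _ in range(k):
        a = rng.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:
            return False  # witness found: n is definitely composite
    return True  # probably prime
```

Note the asymmetry: a `False` answer is always correct, while a `True` answer is only probably correct, which is exactly the structure the amplification argument exploits.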

Randomized Quicksort

Deterministic quicksort with a fixed pivot rule (e.g., always choosing the first element) runs in $\Theta(n^2)$ on adversarial inputs. Randomized quicksort selects a pivot uniformly at random from the current subarray.

Theorem. The expected number of comparisons made by randomized quicksort is $2n \ln n + O(n) = O(n \log n)$.

Proof. Let $z_1 < z_2 < \cdots < z_n$ be the elements in sorted order. Define the indicator $X_{ij} = 1$ if $z_i$ and $z_j$ are ever compared. Two elements are compared if and only if one of them is chosen as a pivot before any element between them in sorted order. Among the $j - i + 1$ elements $z_i, z_{i+1}, \ldots, z_j$, the probability that $z_i$ or $z_j$ is chosen first is $\frac{2}{j - i + 1}$. Therefore:

$$E\left[\sum_{i < j} X_{ij}\right] = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{2}{j-i+1} = \sum_{i=1}^{n-1} \sum_{k=2}^{n-i+1} \frac{2}{k} \leq 2n \sum_{k=2}^{n} \frac{1}{k} = 2n(H_n - 1) = 2n \ln n + O(n)$$

The expected time is $O(n \log n)$ regardless of the input distribution. This is a Las Vegas algorithm: the output is always a correctly sorted array.
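A minimal Python sketch of the algorithm, using auxiliary lists for clarity rather than the usual in-place partitioning:

```python
import random

def randomized_quicksort(a, rng=random):
    """Las Vegas sort: the output is always correct; only run time is random."""
    if len(a) <= 1:
        return list(a)
    pivot = a[rng.randrange(len(a))]  # pivot chosen uniformly at random
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return randomized_quicksort(less, rng) + equal + randomized_quicksort(greater, rng)
```

Whatever pivots the coin flips produce, the result is the sorted array; the randomness only shifts how much work each recursive level does.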

Karger’s Minimum Cut Algorithm

Given an undirected multigraph $G = (V, E)$ with $n = |V|$ vertices, Karger’s algorithm finds a global minimum cut using random edge contraction:

  1. While $|V| > 2$:
    • Select an edge $(u, v)$ uniformly at random.
    • Contract $u$ and $v$: merge them into a single vertex, preserving all other edges (creating a multigraph; remove self-loops).
  2. Return the edges between the two remaining super-vertices as the cut.
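The steps above can be sketched in Python. The multigraph is a plain edge list (assumed connected), and self-loops created by contraction are discarded lazily when drawn:

```python
import random

def karger_cut(edges, rng=random):
    """One run of Karger's random-contraction algorithm.

    `edges` is a connected multigraph given as a list of (u, v) pairs.
    Returns the size of the cut between the two final super-vertices.
    """
    label = {}  # each vertex's current super-vertex label
    for u, v in edges:
        label[u] = u
        label[v] = v
    vertices = len(label)
    pool = list(edges)
    while vertices > 2:
        u, v = pool.pop(rng.randrange(len(pool)))
        lu, lv = label[u], label[v]
        if lu == lv:
            continue  # self-loop produced by an earlier contraction: discard
        for w in label:  # contract: fold lv's super-vertex into lu's
            if label[w] == lv:
                label[w] = lu
        vertices -= 1
    # the surviving non-self-loop edges are exactly the output cut
    return sum(1 for u, v in pool if label[u] != label[v])

def min_cut(edges, trials, rng=random):
    """Amplification: return the smallest cut over independent runs."""
    return min(karger_cut(edges, rng) for _ in range(trials))
```

Any single run returns some valid cut, so taking the minimum over many runs can only help; the analysis below bounds how many runs are needed.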

Theorem. The probability that Karger’s algorithm outputs a specific minimum cut is at least $\frac{2}{n(n-1)} = \binom{n}{2}^{-1}$.

Proof. Let $C$ be a minimum cut of size $k$. Every vertex has degree at least $k$ (otherwise its edges would form a smaller cut), so $|E| \geq nk/2$. At the first step, the probability of contracting an edge in $C$ is at most $k / (nk/2) = 2/n$. Conditioning on survival, after $i$ contractions the graph has $n - i$ vertices, and the probability of not contracting a cut edge at step $i+1$ is at least $1 - 2/(n-i)$. The overall survival probability is:

$$\prod_{i=0}^{n-3} \left(1 - \frac{2}{n-i}\right) = \prod_{i=0}^{n-3} \frac{n-i-2}{n-i} = \frac{2}{n(n-1)}$$

Running the algorithm $\binom{n}{2} \ln n$ times and returning the smallest cut found gives a correct answer with probability at least $1 - 1/n$. The total running time is $O(n^4 \log n)$. The Karger-Stein improvement uses recursive contraction to achieve $O(n^2 \log^3 n)$.

Hashing

Universal Hashing

A family $\mathcal{H}$ of hash functions from universe $U$ to $\{0, 1, \ldots, m-1\}$ is universal if for any two distinct keys $x, y \in U$:

$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq \frac{1}{m}$$

A classic construction for $U = \{0, 1, \ldots, p-1\}$ with $p$ prime: $h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$, choosing $a \in \{1, \ldots, p-1\}$ and $b \in \{0, \ldots, p-1\}$ uniformly at random.
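This family is a few lines of Python; the Mersenne prime $p = 2^{31} - 1$ used here is one convenient choice, valid whenever every key is smaller than $p$:

```python
import random

P = (1 << 31) - 1  # prime modulus; must exceed every key in the universe

def random_universal_hash(m, rng=random):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod m uniformly from the family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m
```

Drawing a fresh $(a, b)$ per table is the whole defense: an adversary who fixes the keys in advance cannot predict which function will hash them.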

Theorem. With a universal hash family, the expected number of collisions for any key in a hash table with $n$ keys and $m$ slots is at most $n/m$. Choosing $m = \Theta(n)$ gives $O(1)$ expected lookup time.

Universal hashing eliminates the need for assumptions about the input distribution. No adversary can construct a bad input because the hash function is chosen randomly.

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure for approximate set membership. It uses a bit array $B$ of $m$ bits and $k$ independent hash functions $h_1, \ldots, h_k$. To insert element $x$, set $B[h_i(x)] = 1$ for all $i$. To query $x$, check whether all $B[h_i(x)] = 1$; if any bit is 0, $x$ is definitely not in the set.

False positives occur but false negatives do not. After inserting $n$ elements, the false positive probability is approximately:

$$\left(1 - e^{-kn/m}\right)^k$$

This is minimized when $k = (m/n) \ln 2$, giving a false positive rate of $(1/2)^k = (0.6185)^{m/n}$. With 10 bits per element and 7 hash functions, the false positive rate is about 0.8%.
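A sketch implementation follows, deriving the $k$ hash functions from salted SHA-256 (one convenient choice among many) and picking $k$ by the optimal formula above:

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter with k salted-SHA-256 hash functions."""

    def __init__(self, m, n_expected):
        self.m = m
        # optimal number of hash functions: k = (m/n) ln 2
        self.k = max(1, round(m / n_expected * math.log(2)))
        self.bits = bytearray(m)  # one byte per bit, for clarity over space

    def _positions(self, item):
        for i in range(self.k):  # salt each of the k hashes differently
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # a 0 bit proves absence; all 1s means only "probably present"
        return all(self.bits[pos] for pos in self._positions(item))
```

The asymmetry in `__contains__` is the no-false-negatives guarantee: every inserted element set all of its probed bits, so a 0 bit is conclusive.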

Bloom filters are ubiquitous in systems: web caches, database query optimization, network routing, and spell checkers all use them to avoid expensive lookups.

Randomized Rounding

Randomized rounding is a technique for converting fractional LP relaxation solutions into integer solutions with provable approximation guarantees.

Consider Set Cover: given the LP relaxation solution $x_S^* \in [0,1]$ for each set $S$, include set $S$ in the integer solution independently with probability $\min(1, c \cdot x_S^* \cdot \ln n)$ for an appropriate constant $c$. The expected cost is $O(\ln n)$ times the LP optimum, and a Chernoff bound argument shows that all elements are covered with high probability.

More precisely, for each element $e$, the probability that $e$ is not covered in a single round is at most:

$$\prod_{S \ni e} (1 - c \cdot x_S^* \ln n) \leq \exp\left(-c \ln n \sum_{S \ni e} x_S^*\right) \leq \exp(-c \ln n) = n^{-c}$$

The second inequality uses LP feasibility, which guarantees $\sum_{S \ni e} x_S^* \geq 1$ for every element $e$. Choosing $c$ large enough and applying a union bound over all $n$ elements gives the result. This technique connects the theory of LP relaxations to combinatorial optimization in a principled way.
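The rounding step itself is simple enough to sketch. Here the fractional solution is taken as a given input (the names `sets` and `x_star` are illustrative; producing $x^*$ in practice requires an actual LP solver):

```python
import math
import random

def round_set_cover(sets, x_star, c=1.0, rng=random):
    """Randomized rounding: keep each set S independently with
    probability min(1, c * x_star[S] * ln n)."""
    n = len(set().union(*sets.values()))  # number of elements
    return {S for S in sets
            if rng.random() < min(1.0, c * x_star[S] * math.log(n))}

def round_until_cover(sets, x_star, c=1.0, rng=random, max_rounds=100):
    """Retry wrapper: re-round until the chosen sets actually cover.
    The analysis says a single round already succeeds w.h.p."""
    universe = set().union(*sets.values())
    for _ in range(max_rounds):
        chosen = round_set_cover(sets, x_star, c, rng)
        covered = set().union(set(), *(sets[S] for S in chosen))
        if covered >= universe:
            return chosen
    raise RuntimeError("repeated rounding failed; increase c or max_rounds")
```

Checking coverage and retrying converts the Monte Carlo rounding into a Las Vegas procedure, at the cost of a random number of rounds.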

Connection to Machine Learning

Randomization is fundamental to modern machine learning at every level.

Stochastic gradient descent. Rather than computing the full gradient $\nabla L(\theta) = \frac{1}{n}\sum_{i=1}^n \nabla \ell(\theta; x_i)$ over all $n$ training examples, SGD samples a random mini-batch $B \subset \{1, \ldots, n\}$ and uses $\frac{1}{|B|}\sum_{i \in B} \nabla \ell(\theta; x_i)$ as an unbiased estimator. This reduces per-iteration cost from $O(n)$ to $O(|B|)$ while preserving convergence guarantees. The noise introduced by random sampling also serves as implicit regularization, helping escape sharp local minima.
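A minimal numpy sketch of mini-batch SGD on a one-dimensional least-squares problem (a toy model for illustration, not a framework API):

```python
import numpy as np

def sgd_linear(x, y, lr=0.1, epochs=300, batch=10, rng=None):
    """Mini-batch SGD for the least-squares fit y ~ w*x + b."""
    rng = rng or np.random.default_rng(0)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        idx = rng.permutation(n)  # fresh random shuffle each epoch
        for start in range(0, n, batch):
            B = idx[start:start + batch]
            err = (w * x[B] + b) - y[B]
            # mini-batch average is an unbiased estimate of the full gradient
            w -= lr * 2 * np.mean(err * x[B])
            b -= lr * 2 * np.mean(err)
    return w, b
```

Each update touches only `batch` examples, yet in expectation it moves along the true gradient of the full loss.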

Random feature maps. Rahimi and Recht (2007) showed that kernel methods can be approximated by random features: for a shift-invariant kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$, sample random frequencies $\omega_1, \ldots, \omega_D$ from the kernel’s Fourier transform together with random phases $b_j \sim \mathrm{Uniform}[0, 2\pi]$, and approximate $k(\mathbf{x}, \mathbf{y}) \approx \frac{2}{D}\sum_{j=1}^D \cos(\omega_j^T \mathbf{x} + b_j) \cos(\omega_j^T \mathbf{y} + b_j)$. This converts a nonlinear kernel machine into a linear model on $D$-dimensional random features, reducing training from $O(n^3)$ to $O(nD^2)$. This is a Monte Carlo approximation: accuracy improves with more random features.
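A numpy sketch for the Gaussian (RBF) kernel, whose Fourier transform is again a Gaussian, following the cosine construction above:

```python
import numpy as np

def rff_features(X, D, sigma=1.0, rng=None):
    """Random Fourier features for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Frequencies are drawn from the kernel's Fourier transform,
    a Gaussian with covariance sigma^{-2} I, plus uniform random phases.
    """
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))  # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=D)          # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

The inner product of two feature vectors is a Monte Carlo estimate of the exact kernel value, with error shrinking as $O(1/\sqrt{D})$.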

Dropout. During training, dropout randomly zeros out each neuron’s activation with probability $p$ (typically 0.5 for hidden layers). This can be interpreted as training an exponential ensemble of $2^d$ sub-networks (where $d$ is the number of neurons) simultaneously. At test time, all neurons are active with weights scaled by $(1-p)$. Dropout provides regularization: it prevents co-adaptation of neurons and approximates Bayesian model averaging.
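A sketch matching this description, i.e. the original formulation with test-time scaling. (Modern frameworks usually implement "inverted" dropout instead, scaling by $1/(1-p)$ at training time so that test time is the identity.)

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True, rng=None):
    """Dropout as described above: zero each unit with probability p during
    training; at test time keep every unit and scale by (1 - p)."""
    if train:
        rng = rng or np.random.default_rng(0)
        mask = (rng.random(x.shape) >= p).astype(x.dtype)  # keep w.p. 1 - p
        return x * mask
    return x * (1.0 - p)
```

The $(1-p)$ scaling makes the expected test-time activation match the expected training-time activation of each unit.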

Random forests. Breiman’s random forest (2001) combines two sources of randomness: bagging (training each tree on a bootstrap sample of the data) and random subspace selection (considering only a random subset of $\sqrt{d}$ features at each split). These randomizations reduce correlation between trees, and the ensemble’s variance decreases as $\sigma^2 / B$ for $B$ trees with average variance $\sigma^2$ (assuming low correlation). Each tree is deterministic given its random seed, so the forest’s prediction is always well-defined and reproducible; here randomization serves variance reduction rather than a correctness guarantee.
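The $\sigma^2 / B$ claim is easy to check empirically in the idealized uncorrelated setting (real trees are positively correlated, which leaves a residual $\rho \sigma^2$ term):

```python
import numpy as np

# Average B independent estimators, each with variance sigma^2 = 1,
# and measure the variance of the ensemble mean over many repetitions.
rng = np.random.default_rng(0)
B, reps = 100, 10_000
ensemble_means = rng.normal(0.0, 1.0, size=(reps, B)).mean(axis=1)
print(np.var(ensemble_means))  # close to sigma^2 / B = 1 / 100 = 0.01
```

Averaging 100 uncorrelated estimators cuts the variance by a factor of 100, which is exactly the mechanism bagging exploits.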


This concludes the series on Data Structures and Algorithms.