1: Probability Review

Probability theory provides the mathematical language for reasoning about uncertainty. This review covers the foundations needed for statistical inference and machine learning: random variables, distributions, expectations, and the limit theorems that justify many ML algorithms.


Random Variables

A random variable $X$ is a function from the sample space $\Omega$ to $\mathbb{R}$. It assigns a numerical value to each outcome of a random experiment.

Discrete random variables take values in a countable set. They are characterized by the probability mass function (PMF):

$$p(x) = P(X = x), \quad \sum_x p(x) = 1$$

Continuous random variables take values in a continuum. They are characterized by the probability density function (PDF):

$$f(x) \geq 0, \quad \int_{-\infty}^{\infty} f(x)\, dx = 1, \quad P(a \leq X \leq b) = \int_a^b f(x)\, dx$$

The cumulative distribution function (CDF) $F(x) = P(X \leq x)$ exists for both types. For continuous RVs, $F'(x) = f(x)$.
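The CDF–PDF relationship is easy to check numerically. A minimal stdlib-only sketch for the standard normal (the test point $x = 0.7$ and step size are arbitrary): the central difference of the CDF should match the PDF.

```python
import math

def normal_cdf(x):
    # Standard normal CDF expressed through the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Central-difference approximation of F'(x) should agree with f(x)
h = 1e-5
x = 0.7
deriv = (normal_cdf(x + h) - normal_cdf(x - h)) / (2 * h)
print(abs(deriv - normal_pdf(x)))  # very close to zero
```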


Common Distributions

Discrete

| Distribution | PMF | Mean | Variance | Use |
|---|---|---|---|---|
| Bernoulli($p$) | $p^x(1-p)^{1-x}$ | $p$ | $p(1-p)$ | Binary outcomes |
| Binomial($n, p$) | $\binom{n}{x}p^x(1-p)^{n-x}$ | $np$ | $np(1-p)$ | Count of successes in $n$ trials |
| Poisson($\lambda$) | $\frac{\lambda^x e^{-\lambda}}{x!}$ | $\lambda$ | $\lambda$ | Count of rare events |
| Geometric($p$) | $(1-p)^{x-1}p$ | $1/p$ | $(1-p)/p^2$ | Trials until first success |
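These formulas can be sanity-checked by simulation. A small sketch (parameters $n = 20$, $p = 0.3$ chosen arbitrarily) builds Binomial draws as sums of Bernoulli draws and compares the sample moments to $np$ and $np(1-p)$:

```python
import random
from statistics import mean, variance

random.seed(0)
n, p, trials = 20, 0.3, 50_000

# A Binomial(n, p) draw is a sum of n independent Bernoulli(p) draws
samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

print(mean(samples))      # close to n*p = 6.0
print(variance(samples))  # close to n*p*(1-p) = 4.2
```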

Continuous

| Distribution | PDF | Mean | Variance | Use |
|---|---|---|---|---|
| Uniform($a,b$) | $\frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | Equally likely outcomes |
| Normal($\mu, \sigma^2$) | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/2\sigma^2}$ | $\mu$ | $\sigma^2$ | Bell curve, CLT limit |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$ | $1/\lambda$ | $1/\lambda^2$ | Time between events |
| Gamma($\alpha, \beta$) | $\frac{\beta^\alpha x^{\alpha-1}e^{-\beta x}}{\Gamma(\alpha)}$ | $\alpha/\beta$ | $\alpha/\beta^2$ | Sum of exponentials |

The normal distribution is central to statistics and ML. Its importance stems from the Central Limit Theorem (below) and the fact that maximum likelihood under Gaussian noise yields the familiar MSE loss.
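To make the Gaussian-noise/MSE link explicit, assume $y_i = f(x_i) + \varepsilon_i$ with i.i.d. $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Maximizing the likelihood is then equivalent to minimizing squared error:

$$\max_f \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} e^{-(y_i - f(x_i))^2 / 2\sigma^2} \iff \min_f \sum_{i=1}^n (y_i - f(x_i))^2,$$

since taking logs turns the product into a sum and discards the terms that do not depend on $f$.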


Expectation and Variance

Expectation (mean) of a random variable:

$$E[X] = \begin{cases} \sum_x x \cdot p(x) & \text{discrete} \\ \int_{-\infty}^{\infty} x \cdot f(x)\, dx & \text{continuous} \end{cases}$$

Properties:

  • $E[aX + b] = aE[X] + b$ (linearity)
  • $E[X + Y] = E[X] + E[Y]$ (always, even if $X$ and $Y$ are dependent)
  • $E[XY] = E[X]E[Y]$ if $X, Y$ are independent (independence is sufficient, not necessary)
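Linearity holds even for dependent variables, which is easy to verify by simulation. A small sketch (distributions and constants chosen arbitrarily) builds a $Y$ that depends on $X$ and checks both properties:

```python
import random
from statistics import mean

random.seed(1)
N = 100_000
# X and Y are deliberately dependent: Y = X + noise
xs = [random.gauss(2.0, 1.0) for _ in range(N)]
ys = [x + random.gauss(0.0, 0.5) for x in xs]

# E[3X + 1] = 3 E[X] + 1
print(mean(3 * x + 1 for x in xs))          # close to 3*2 + 1 = 7
# E[X + Y] = E[X] + E[Y] even though X and Y are dependent
print(mean(x + y for x, y in zip(xs, ys)))  # close to 2 + 2 = 4
```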

Variance measures spread around the mean:

$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

Properties:

  • $\text{Var}(aX + b) = a^2 \text{Var}(X)$
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$
  • If independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
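Both variance properties can be checked the same way; in this sketch (parameters arbitrary), the shift $b$ drops out and the independent variances add:

```python
import random
from statistics import pvariance

random.seed(2)
N = 200_000
xs = [random.gauss(0.0, 2.0) for _ in range(N)]  # Var(X) = 4
zs = [random.gauss(0.0, 1.0) for _ in range(N)]  # independent, Var(Z) = 1

# Var(aX + b) = a^2 Var(X): the shift b has no effect
print(pvariance([3 * x + 5 for x in xs]))          # close to 9 * 4 = 36
# Independent sum: variances add
print(pvariance([x + z for x, z in zip(xs, zs)]))  # close to 4 + 1 = 5
```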

Covariance and correlation:

$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

$$\rho(X, Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \in [-1, 1]$$
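Both definitions translate directly into code. In this sketch (coefficients chosen so that $\text{Cov}(X,Y) = 0.8$ and $\text{Var}(Y) = 0.64 + 0.36 = 1$), the sample covariance and correlation should both land near $0.8$:

```python
import math
import random
from statistics import mean

random.seed(3)
N = 100_000
xs = [random.gauss(0, 1) for _ in range(N)]
ys = [0.8 * x + random.gauss(0, 0.6) for x in xs]  # Var(Y) = 0.64 + 0.36 = 1

mx, my = mean(xs), mean(ys)
cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
var_x = mean((x - mx) ** 2 for x in xs)
var_y = mean((y - my) ** 2 for y in ys)
rho = cov / math.sqrt(var_x * var_y)
print(cov, rho)  # both close to 0.8
```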

Joint Distributions and Independence

For two random variables (X,Y)(X, Y):

Joint PMF/PDF: $p(x, y)$ or $f(x, y)$

Marginal: $p_X(x) = \sum_y p(x, y)$ or $f_X(x) = \int f(x, y)\, dy$

Conditional: $f(y \mid x) = \frac{f(x, y)}{f_X(x)}$

Independence: $X \perp Y$ iff $f(x, y) = f_X(x) f_Y(y)$ for all $x, y$. Equivalently, $P(X \in A, Y \in B) = P(X \in A) P(Y \in B)$ for all events $A, B$.

Conditional independence: $X \perp Y \mid Z$ iff $f(x, y \mid z) = f(x \mid z) f(y \mid z)$. This is the foundation of the Naive Bayes assumption: features are conditionally independent given the class label.
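Note that conditional independence does not imply marginal independence. A tiny hand-built example (all probabilities below are made up for illustration) constructs a joint where $X$ and $Y$ factor given $Z$ yet are dependent once $Z$ is summed out:

```python
# X and Y are independent GIVEN Z by construction, but dependent marginally.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

# Joint built under the conditional-independence factorization
joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# p(x, y | z) = p(x | z) p(y | z) holds for every cell
cond_indep = all(
    abs(joint[(x, y, z)] / p_z[z] - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
    for x in (0, 1) for y in (0, 1) for z in (0, 1)
)

# Marginally, p(x, y) != p(x) p(y): summing out Z induces dependence
p_xy = {(x, y): joint[(x, y, 0)] + joint[(x, y, 1)] for x in (0, 1) for y in (0, 1)}
p_x = {x: p_xy[(x, 0)] + p_xy[(x, 1)] for x in (0, 1)}
p_y = {y: p_xy[(0, y)] + p_xy[(1, y)] for y in (0, 1)}
gap = abs(p_xy[(0, 0)] - p_x[0] * p_y[0])
print(cond_indep, gap)  # True, gap ≈ 0.048 (nonzero, so X and Y are dependent)
```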


Moment Generating Functions

The MGF of $X$ is $M_X(t) = E[e^{tX}]$ (when it exists). Useful properties:

  • $E[X^k] = M_X^{(k)}(0)$ (derivatives at zero give moments)
  • If $M_X(t) = M_Y(t)$ for all $t$ in a neighborhood of 0, then $X$ and $Y$ have the same distribution (uniqueness)
  • For independent $X, Y$: $M_{X+Y}(t) = M_X(t) M_Y(t)$

The MGF of $X \sim \mathcal{N}(\mu, \sigma^2)$ is $M_X(t) = \exp(\mu t + \sigma^2 t^2/2)$.
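The Gaussian MGF formula can be checked against a Monte Carlo estimate of $E[e^{tX}]$; a quick sketch ($\mu$, $\sigma$, and $t$ chosen arbitrarily):

```python
import math
import random
from statistics import mean

random.seed(4)
mu, sigma, t = 1.0, 0.5, 0.3
samples = [random.gauss(mu, sigma) for _ in range(200_000)]

mc_mgf = mean(math.exp(t * x) for x in samples)       # Monte Carlo E[e^{tX}]
closed_form = math.exp(mu * t + sigma**2 * t**2 / 2)  # exp(mu*t + sigma^2 t^2/2)
print(mc_mgf, closed_form)  # should agree to a few decimal places
```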


The Law of Large Numbers

Weak LLN. For i.i.d. random variables $X_1, \ldots, X_n$ with mean $\mu$ and finite variance:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu \quad \text{as } n \to \infty$$

The sample mean converges in probability to the population mean. This justifies using sample statistics as estimators.

Strong LLN. Under the same conditions: $\bar{X}_n \to \mu$ almost surely. A stronger form of convergence (it implies the weak LLN).

Implication for ML. The training loss $\frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)$ converges to the expected loss $E[\ell(f(X), Y)]$ as $n \to \infty$. This is why minimizing the empirical risk (training loss) approximately minimizes the population risk (true expected loss).
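The convergence is easy to watch numerically. A stdlib-only sketch with an Exponential(2) population, which has mean $1/2$ (the rate and sample sizes are arbitrary):

```python
import random
from statistics import mean

random.seed(5)
true_mean = 0.5  # Exponential with rate 2 has mean 1/2

# Track |sample mean - true mean| as the sample size grows
errors = []
for n in (100, 10_000, 1_000_000):
    xbar = mean(random.expovariate(2.0) for _ in range(n))
    errors.append(abs(xbar - true_mean))
print(errors)  # the gap shrinks toward zero as n grows
```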


The Central Limit Theorem

CLT. For i.i.d. random variables $X_1, \ldots, X_n$ with mean $\mu$ and variance $\sigma^2 < \infty$:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$

Equivalently: $\bar{X}_n \mathrel{\dot{\sim}} \mathcal{N}(\mu, \sigma^2/n)$ for large $n$.

The CLT is the most important theorem in statistics:

  • Justifies normal-based confidence intervals
  • Explains why the normal distribution appears so frequently in practice
  • Underpins the asymptotic normality of maximum likelihood estimators

Rate of convergence. The Berry-Esseen theorem bounds the error of the normal approximation to the standardized sample mean $Z_n$: $\sup_z |P(Z_n \leq z) - \Phi(z)| \leq \frac{C \rho}{\sigma^3 \sqrt{n}}$ where $\rho = E[|X - \mu|^3]$. The approximation improves as $O(1/\sqrt{n})$.
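A quick simulation shows the CLT at work on a decidedly non-normal population. In this sketch (sample size and repetition count arbitrary), means of Uniform(0, 1) samples are standardized; if the normal approximation is good, about 95% of them should fall in $[-1.96, 1.96]$:

```python
import math
import random

random.seed(6)
n, reps = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # Uniform(0, 1) mean and standard deviation

# Standardize each sample mean; the CLT says these are approximately N(0, 1)
zs = [(sum(random.random() for _ in range(n)) / n - mu) / (sigma / math.sqrt(n))
      for _ in range(reps)]

coverage = sum(abs(z) <= 1.96 for z in zs) / reps
print(coverage)  # close to 0.95
```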


Inequalities

Markov’s inequality. For non-negative $X$ and $a > 0$: $$P(X \geq a) \leq \frac{E[X]}{a}$$

Chebyshev’s inequality. For any $X$ with finite variance and any $k > 0$: $$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$

Hoeffding’s inequality. For i.i.d. bounded random variables $X_i \in [a, b]$:

$$P(|\bar{X}_n - \mu| \geq t) \leq 2\exp\left(-\frac{2nt^2}{(b-a)^2}\right)$$

This exponential concentration bound is far tighter than Chebyshev’s for large $n$ and is the basis for PAC learning bounds: with probability at least $1 - \delta$, the sample mean is within $\sqrt{\frac{(b-a)^2 \log(2/\delta)}{2n}}$ of the true mean.
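To see the two bounds side by side, a stdlib-only sketch (parameters arbitrary) compares Hoeffding and Chebyshev tail bounds for Bernoulli(0.5) sample means against the simulated deviation frequency; at $n = 500$ the Hoeffding bound is orders of magnitude smaller:

```python
import math
import random

random.seed(7)
n, t, reps = 500, 0.1, 5_000
sigma2 = 0.25  # variance of a Bernoulli(0.5) draw; values lie in [0, 1]

bound_hoeffding = 2 * math.exp(-2 * n * t**2)  # 2 e^{-10}, about 9.1e-5
bound_chebyshev = sigma2 / (n * t**2)          # 0.05

# Empirical frequency of |sample mean - 0.5| >= t
deviations = sum(
    abs(sum(random.random() < 0.5 for _ in range(n)) / n - 0.5) >= t
    for _ in range(reps)
)
print(deviations / reps, bound_hoeffding, bound_chebyshev)
```

Both bounds hold, but only Hoeffding reflects how rare large deviations actually are at this sample size.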


Summary

| Concept | Key Result | ML Connection |
|---|---|---|
| Expectation | $E[aX+b] = aE[X]+b$ | Expected loss, risk minimization |
| Variance | $\text{Var}(\bar{X}) = \sigma^2/n$ | Estimation precision scales as $1/n$ |
| LLN | $\bar{X}_n \to \mu$ | Training loss approximates expected loss |
| CLT | $\bar{X}_n \mathrel{\dot{\sim}} \mathcal{N}(\mu, \sigma^2/n)$ | Confidence intervals, asymptotic inference |
| Hoeffding | Exponential concentration | PAC bounds, generalization guarantees |