1: Probability Review

Probability theory provides the mathematical language for reasoning about uncertainty. This review covers the foundations needed for statistical inference and machine learning: random variables, distributions, expectations, and the limit theorems that justify many ML algorithms.


Random Variables

A random variable $X$ is a function from the sample space $\Omega$ to $\mathbb{R}$. It assigns a numerical value to each outcome of a random experiment.

Discrete random variables take values in a countable set. They are characterized by the probability mass function (PMF):

$$p(x) = P(X = x), \quad \sum_x p(x) = 1$$

Continuous random variables take values in a continuum. They are characterized by the probability density function (PDF):

$$f(x) \geq 0, \quad \int_{-\infty}^{\infty} f(x)\, dx = 1, \quad P(a \leq X \leq b) = \int_a^b f(x)\, dx$$

The cumulative distribution function (CDF) $F(x) = P(X \leq x)$ exists for both types. For continuous RVs, $F'(x) = f(x)$.
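The CDF–PDF relationship is easy to check numerically. A minimal stdlib-only sketch for the standard normal (the test point $x = 0.7$ and step size are arbitrary): the central difference of the CDF should match the PDF.

```python
import math

def normal_cdf(x):
    # Standard normal CDF expressed through the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Central-difference approximation of F'(x) should agree with f(x)
h = 1e-5
x = 0.7
deriv = (normal_cdf(x + h) - normal_cdf(x - h)) / (2 * h)
print(abs(deriv - normal_pdf(x)))  # very close to zero
```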


Common Distributions

Discrete

| Distribution | PMF | Mean | Variance | Use |
|---|---|---|---|---|
| Bernoulli($p$) | $p^x(1-p)^{1-x}$ | $p$ | $p(1-p)$ | Binary outcomes |
| Binomial($n, p$) | $\binom{n}{x}p^x(1-p)^{n-x}$ | $np$ | $np(1-p)$ | Count of successes in $n$ trials |
| Poisson($\lambda$) | $\frac{\lambda^x e^{-\lambda}}{x!}$ | $\lambda$ | $\lambda$ | Count of rare events |
| Geometric($p$) | $(1-p)^{x-1}p$ | $1/p$ | $(1-p)/p^2$ | Trials until first success |
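These formulas can be sanity-checked by simulation. A small sketch (parameters $n = 20$, $p = 0.3$ chosen arbitrarily) builds Binomial draws as sums of Bernoulli draws and compares the sample moments to $np$ and $np(1-p)$:

```python
import random
from statistics import mean, variance

random.seed(0)
n, p, trials = 20, 0.3, 50_000

# A Binomial(n, p) draw is a sum of n independent Bernoulli(p) draws
samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

print(mean(samples))      # close to n*p = 6.0
print(variance(samples))  # close to n*p*(1-p) = 4.2
```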

Continuous

| Distribution | PDF | Mean | Variance | Use |
|---|---|---|---|---|
| Uniform($a,b$) | $\frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | Equally likely outcomes |
| Normal($\mu, \sigma^2$) | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/2\sigma^2}$ | $\mu$ | $\sigma^2$ | Bell curve, CLT limit |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$ | $1/\lambda$ | $1/\lambda^2$ | Time between events |
| Gamma($\alpha, \beta$) | $\frac{\beta^\alpha x^{\alpha-1}e^{-\beta x}}{\Gamma(\alpha)}$ | $\alpha/\beta$ | $\alpha/\beta^2$ | Sum of exponentials |

The normal distribution is central to statistics and ML. Its importance stems from the Central Limit Theorem (below) and the fact that maximum likelihood under Gaussian noise yields the familiar MSE loss.
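To make the Gaussian-noise/MSE link explicit, assume $y_i = f(x_i) + \varepsilon_i$ with i.i.d. $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Maximizing the likelihood is then equivalent to minimizing squared error:

$$\max_f \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} e^{-(y_i - f(x_i))^2 / 2\sigma^2} \iff \min_f \sum_{i=1}^n (y_i - f(x_i))^2,$$

since taking logs turns the product into a sum and discards the terms that do not depend on $f$.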


Expectation and Variance

Expectation (mean) of a random variable:

$$E[X] = \begin{cases} \sum_x x \cdot p(x) & \text{discrete} \\ \int_{-\infty}^{\infty} x \cdot f(x)\, dx & \text{continuous} \end{cases}$$

Properties:

  • $E[aX + b] = aE[X] + b$ (linearity)
  • $E[X + Y] = E[X] + E[Y]$ (always, even if $X$ and $Y$ are dependent)
  • $E[XY] = E[X]E[Y]$ if $X, Y$ are independent (independence is sufficient, not necessary)
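Linearity holds even for dependent variables, which is easy to verify by simulation. A small sketch (distributions and constants chosen arbitrarily) builds a $Y$ that depends on $X$ and checks both properties:

```python
import random
from statistics import mean

random.seed(1)
N = 100_000
# X and Y are deliberately dependent: Y = X + noise
xs = [random.gauss(2.0, 1.0) for _ in range(N)]
ys = [x + random.gauss(0.0, 0.5) for x in xs]

# E[3X + 1] = 3 E[X] + 1
print(mean(3 * x + 1 for x in xs))          # close to 3*2 + 1 = 7
# E[X + Y] = E[X] + E[Y] even though X and Y are dependent
print(mean(x + y for x, y in zip(xs, ys)))  # close to 2 + 2 = 4
```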

Variance measures spread around the mean:

$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

Properties:

  • $\text{Var}(aX + b) = a^2 \text{Var}(X)$
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$
  • If independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
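Both variance properties can be checked the same way; in this sketch (parameters arbitrary), the shift $b$ drops out and the independent variances add:

```python
import random
from statistics import pvariance

random.seed(2)
N = 200_000
xs = [random.gauss(0.0, 2.0) for _ in range(N)]  # Var(X) = 4
zs = [random.gauss(0.0, 1.0) for _ in range(N)]  # independent, Var(Z) = 1

# Var(aX + b) = a^2 Var(X): the shift b has no effect
print(pvariance([3 * x + 5 for x in xs]))          # close to 9 * 4 = 36
# Independent sum: variances add
print(pvariance([x + z for x, z in zip(xs, zs)]))  # close to 4 + 1 = 5
```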

Covariance and correlation:

$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

$$\rho(X, Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \in [-1, 1]$$
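Both definitions translate directly into code. In this sketch (coefficients chosen so that $\text{Cov}(X,Y) = 0.8$ and $\text{Var}(Y) = 0.64 + 0.36 = 1$), the sample covariance and correlation should both land near $0.8$:

```python
import math
import random
from statistics import mean

random.seed(3)
N = 100_000
xs = [random.gauss(0, 1) for _ in range(N)]
ys = [0.8 * x + random.gauss(0, 0.6) for x in xs]  # Var(Y) = 0.64 + 0.36 = 1

mx, my = mean(xs), mean(ys)
cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
var_x = mean((x - mx) ** 2 for x in xs)
var_y = mean((y - my) ** 2 for y in ys)
rho = cov / math.sqrt(var_x * var_y)
print(cov, rho)  # both close to 0.8
```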

Joint Distributions and Independence

For two random variables (X,Y)(X, Y):

Joint PMF/PDF: $p(x, y)$ or $f(x, y)$

Marginal: $p_X(x) = \sum_y p(x, y)$ or $f_X(x) = \int f(x, y)\, dy$

Conditional: $f(y \mid x) = \frac{f(x, y)}{f_X(x)}$

Independence: $X \perp Y$ iff $f(x, y) = f_X(x) f_Y(y)$ for all $x, y$. Equivalently, $P(X \in A, Y \in B) = P(X \in A) P(Y \in B)$ for all events $A, B$.

Conditional independence: $X \perp Y \mid Z$ iff $f(x, y \mid z) = f(x \mid z) f(y \mid z)$. This is the foundation of the Naive Bayes assumption: features are conditionally independent given the class label.
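Note that conditional independence does not imply marginal independence. A tiny hand-built example (all probabilities below are made up for illustration) constructs a joint where $X$ and $Y$ factor given $Z$ yet are dependent once $Z$ is summed out:

```python
# X and Y are independent GIVEN Z by construction, but dependent marginally.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

# Joint built under the conditional-independence factorization
joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# p(x, y | z) = p(x | z) p(y | z) holds for every cell
cond_indep = all(
    abs(joint[(x, y, z)] / p_z[z] - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
    for x in (0, 1) for y in (0, 1) for z in (0, 1)
)

# Marginally, p(x, y) != p(x) p(y): summing out Z induces dependence
p_xy = {(x, y): joint[(x, y, 0)] + joint[(x, y, 1)] for x in (0, 1) for y in (0, 1)}
p_x = {x: p_xy[(x, 0)] + p_xy[(x, 1)] for x in (0, 1)}
p_y = {y: p_xy[(0, y)] + p_xy[(1, y)] for y in (0, 1)}
gap = abs(p_xy[(0, 0)] - p_x[0] * p_y[0])
print(cond_indep, gap)  # True, gap ≈ 0.048 (nonzero, so X and Y are dependent)
```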


Moment Generating Functions

The MGF of $X$ is $M_X(t) = E[e^{tX}]$ (when it exists). Useful properties:

  • $E[X^k] = M_X^{(k)}(0)$ (derivatives at zero give moments)
  • If $M_X(t) = M_Y(t)$ for all $t$ in a neighborhood of 0, then $X$ and $Y$ have the same distribution (uniqueness)
  • For independent $X, Y$: $M_{X+Y}(t) = M_X(t) M_Y(t)$

The MGF of $X \sim \mathcal{N}(\mu, \sigma^2)$ is $M_X(t) = \exp(\mu t + \sigma^2 t^2/2)$.
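The Gaussian MGF formula can be checked against a Monte Carlo estimate of $E[e^{tX}]$; a quick sketch ($\mu$, $\sigma$, and $t$ chosen arbitrarily):

```python
import math
import random
from statistics import mean

random.seed(4)
mu, sigma, t = 1.0, 0.5, 0.3
samples = [random.gauss(mu, sigma) for _ in range(200_000)]

mc_mgf = mean(math.exp(t * x) for x in samples)       # Monte Carlo E[e^{tX}]
closed_form = math.exp(mu * t + sigma**2 * t**2 / 2)  # exp(mu*t + sigma^2 t^2/2)
print(mc_mgf, closed_form)  # should agree to a few decimal places
```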


The Law of Large Numbers

Weak LLN. For i.i.d. random variables $X_1, \ldots, X_n$ with mean $\mu$ and finite variance:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu \quad \text{as } n \to \infty$$

The sample mean converges in probability to the population mean. This justifies using sample statistics as estimators.

Strong LLN. Under the same conditions: $\bar{X}_n \to \mu$ almost surely. A stronger form of convergence (it implies the weak LLN).

Implication for ML. The training loss $\frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)$ converges to the expected loss $E[\ell(f(X), Y)]$ as $n \to \infty$. This is why minimizing the empirical risk (training loss) approximately minimizes the population risk (true expected loss).
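The convergence is easy to watch numerically. A stdlib-only sketch with an Exponential(2) population, which has mean $1/2$ (the rate and sample sizes are arbitrary):

```python
import random
from statistics import mean

random.seed(5)
true_mean = 0.5  # Exponential with rate 2 has mean 1/2

# Track |sample mean - true mean| as the sample size grows
errors = []
for n in (100, 10_000, 1_000_000):
    xbar = mean(random.expovariate(2.0) for _ in range(n))
    errors.append(abs(xbar - true_mean))
print(errors)  # the gap shrinks toward zero as n grows
```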


The Central Limit Theorem

CLT. For i.i.d. random variables $X_1, \ldots, X_n$ with mean $\mu$ and variance $\sigma^2 < \infty$:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$

Equivalently: $\bar{X}_n \mathrel{\dot{\sim}} \mathcal{N}(\mu, \sigma^2/n)$ for large $n$.

The CLT is the most important theorem in statistics:

  • Justifies normal-based confidence intervals
  • Explains why the normal distribution appears so frequently in practice
  • Underpins the asymptotic normality of maximum likelihood estimators

Rate of convergence. The Berry-Esseen theorem bounds the error of the normal approximation to the standardized sample mean $Z_n$: $\sup_z |P(Z_n \leq z) - \Phi(z)| \leq \frac{C \rho}{\sigma^3 \sqrt{n}}$ where $\rho = E[|X - \mu|^3]$. The approximation improves as $O(1/\sqrt{n})$.
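A quick simulation shows the CLT at work on a decidedly non-normal population. In this sketch (sample size and repetition count arbitrary), means of Uniform(0, 1) samples are standardized; if the normal approximation is good, about 95% of them should fall in $[-1.96, 1.96]$:

```python
import math
import random

random.seed(6)
n, reps = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # Uniform(0, 1) mean and standard deviation

# Standardize each sample mean; the CLT says these are approximately N(0, 1)
zs = [(sum(random.random() for _ in range(n)) / n - mu) / (sigma / math.sqrt(n))
      for _ in range(reps)]

coverage = sum(abs(z) <= 1.96 for z in zs) / reps
print(coverage)  # close to 0.95
```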


Inequalities

Markov’s inequality. For non-negative $X$ and $a > 0$: $$P(X \geq a) \leq \frac{E[X]}{a}$$

Chebyshev’s inequality. For any $X$ with finite variance and any $k > 0$: $$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$

Hoeffding’s inequality. For i.i.d. bounded random variables $X_i \in [a, b]$:

$$P(|\bar{X}_n - \mu| \geq t) \leq 2\exp\left(-\frac{2nt^2}{(b-a)^2}\right)$$

This exponential concentration bound is far tighter than Chebyshev’s for large $n$ and is the basis for PAC learning bounds: with probability at least $1 - \delta$, the sample mean is within $\sqrt{\frac{(b-a)^2 \log(2/\delta)}{2n}}$ of the true mean.
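To see the two bounds side by side, a stdlib-only sketch (parameters arbitrary) compares Hoeffding and Chebyshev tail bounds for Bernoulli(0.5) sample means against the simulated deviation frequency; at $n = 500$ the Hoeffding bound is orders of magnitude smaller:

```python
import math
import random

random.seed(7)
n, t, reps = 500, 0.1, 5_000
sigma2 = 0.25  # variance of a Bernoulli(0.5) draw; values lie in [0, 1]

bound_hoeffding = 2 * math.exp(-2 * n * t**2)  # 2 e^{-10}, about 9.1e-5
bound_chebyshev = sigma2 / (n * t**2)          # 0.05

# Empirical frequency of |sample mean - 0.5| >= t
deviations = sum(
    abs(sum(random.random() < 0.5 for _ in range(n)) / n - 0.5) >= t
    for _ in range(reps)
)
print(deviations / reps, bound_hoeffding, bound_chebyshev)
```

Both bounds hold, but only Hoeffding reflects how rare large deviations actually are at this sample size.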


Summary

| Concept | Key Result | ML Connection |
|---|---|---|
| Expectation | $E[aX+b] = aE[X]+b$ | Expected loss, risk minimization |
| Variance | $\text{Var}(\bar{X}) = \sigma^2/n$ | Estimation precision scales as $1/n$ |
| LLN | $\bar{X}_n \to \mu$ | Training loss approximates expected loss |
| CLT | $\bar{X}_n \mathrel{\dot{\sim}} \mathcal{N}(\mu, \sigma^2/n)$ | Confidence intervals, asymptotic inference |
| Hoeffding | Exponential concentration | PAC bounds, generalization guarantees |