Backpropagation

Backpropagation is the algorithm that makes deep learning tractable. It computes exact gradients of a scalar loss with respect to every parameter in a network by applying the chain rule on a computational graph, in time linear in the number of operations. This article covers the algorithm, its computational graph formulation, and the optimization methods built on top of it.


Computational Graphs

Any differentiable computation can be represented as a directed acyclic graph (DAG) where:

  • Nodes represent operations (addition, multiplication, activation functions)
  • Edges carry intermediate values (tensors)
  • Leaves are inputs and parameters
  • The root is the scalar loss $\mathcal{L}$

For a two-layer network computing $\mathcal{L} = \frac{1}{2}(\mathbf{w}_2^\top \text{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + b_2 - y)^2$, the graph decomposes into:

$$\mathbf{x} \xrightarrow{\mathbf{W}_1, \mathbf{b}_1} \mathbf{z}_1 \xrightarrow{\text{ReLU}} \mathbf{h}_1 \xrightarrow{\mathbf{w}_2, b_2} \hat{y} \xrightarrow{y} \mathcal{L}$$

The forward pass evaluates nodes in topological order, caching all intermediate values. The backward pass traverses the graph in reverse topological order, computing gradients via the chain rule.


The Chain Rule on Graphs

For a scalar loss $\mathcal{L}$ that depends on a parameter $\theta$ through a chain of intermediate variables $z_1, z_2, \ldots, z_n$:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial z_n} \cdot \frac{\partial z_n}{\partial z_{n-1}} \cdots \frac{\partial z_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial \theta}$$

When a variable feeds into multiple downstream nodes (a fan-out), gradients are summed:

$$\frac{\partial \mathcal{L}}{\partial z} = \sum_{i} \frac{\partial \mathcal{L}}{\partial z_i} \cdot \frac{\partial z_i}{\partial z}$$

This is the multivariate chain rule. It ensures that every path from $z$ to $\mathcal{L}$ contributes to the gradient.
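
The fan-out rule above is the heart of reverse-mode autodiff. A minimal sketch (a hypothetical `Value` class, not any particular library's API) that accumulates gradients with `+=` so every path contributes:

```python
class Value:
    """A scalar node in a computational graph with reverse-mode autodiff."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0          # accumulates the sum over all downstream paths
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Build reverse topological order, then apply each node's local chain rule.
        order, seen = [], set()
        def topo(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Fan-out example: z feeds into both factors of z*z and into the sum,
# so its gradient is the sum over all three paths: dL/dz = 2z + 1.
z = Value(3.0)
loss = z * z + z
loss.backward()
```

Because `z` appears on multiple paths to the loss, its `.grad` is the sum 2z + 1 = 7, exactly the fan-out formula above.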


The Backpropagation Algorithm

Forward pass. For each layer $l = 1, \ldots, L$:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{h}^{(l)} = \phi(\mathbf{z}^{(l)})$$

Cache $\mathbf{z}^{(l)}$ and $\mathbf{h}^{(l)}$ at each layer.

Backward pass. Initialize with the loss gradient at the output:

$$\boldsymbol{\delta}^{(L+1)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}$$

For each layer $l = L, L-1, \ldots, 1$, compute:

$$\boldsymbol{\delta}^{(l)} = (\mathbf{W}^{(l+1)\top} \boldsymbol{\delta}^{(l+1)}) \odot \phi'(\mathbf{z}^{(l)}), \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \mathbf{h}^{(l-1)\top}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}
$$

where $\odot$ denotes element-wise multiplication, $\boldsymbol{\delta}^{(l)}$ is the error signal at layer $l$, and we take $\mathbf{W}^{(L+1)} := \mathbf{I}$ so the recursion also covers the output layer.

Complexity. The backward pass requires exactly one multiplication per edge in the computational graph, matching the forward pass in cost. Total gradient computation is $O(\text{forward pass})$, not the $O(\text{parameters}^2)$ that naive numerical differentiation would require (one extra forward pass per parameter, each itself linear in the parameter count).
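
The recursions above can be sketched in NumPy for a two-layer ReLU network with squared-error loss (shapes, names, and the random setup are illustrative); the last lines check one analytic gradient entry against a naive finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                  # input
y = np.array([[1.0]])                        # target
W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)) * 0.5, np.zeros((1, 1))

# Forward pass: cache z and h at each layer.
z1 = W1 @ x + b1
h1 = np.maximum(z1, 0.0)                     # ReLU
y_hat = W2 @ h1 + b2
loss = 0.5 * ((y_hat - y) ** 2).item()

# Backward pass: the delta recursion from the formulas above.
delta2 = y_hat - y                           # dL/dy_hat
dW2, db2 = delta2 @ h1.T, delta2
delta1 = (W2.T @ delta2) * (z1 > 0)          # (W^T delta) ⊙ ReLU'(z)
dW1, db1 = delta1 @ x.T, delta1

# Sanity check: finite difference on one entry of W1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * ((W2 @ np.maximum(W1p @ x + b1, 0.0) + b2 - y) ** 2).item()
numeric = (loss_p - loss) / eps              # should match dW1[0, 0]
```

The finite-difference check is the $O(\text{parameters}^2)$ approach backprop avoids: it is fine for verifying one entry but hopeless as a training method.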


Gradient Descent Variants

Batch Gradient Descent

Update using the gradient computed over the entire training set:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$

where $\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \ell(\hat{y}^{(i)}, y^{(i)})$. This produces the true gradient but requires $O(N)$ computation per update, which is prohibitive for large datasets.

Stochastic Gradient Descent (SGD)

Update using a single randomly sampled example:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \ell(\hat{y}^{(i_t)}, y^{(i_t)})$$

The stochastic gradient $\nabla \ell^{(i_t)}$ is an unbiased estimator of the full gradient: $\mathbb{E}[\nabla \ell^{(i_t)}] = \nabla \mathcal{L}$. The variance of this estimator introduces noise that can help escape local minima but also causes oscillation near convergence.

Mini-batch SGD

The practical compromise. Sample a batch $\mathcal{B}_t$ of size $B$:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla_{\boldsymbol{\theta}} \ell(\hat{y}^{(i)}, y^{(i)})$$

Variance scales as $O(1/B)$, so larger batches produce smoother updates. Typical batch sizes: 32–512 for most tasks, up to 4096+ for large-scale pretraining with appropriate learning rate scaling (linear scaling rule: $\eta \propto B$).
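
A minimal mini-batch SGD loop on a toy noiseless least-squares problem (the problem sizes and learning rate here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, B, eta = 256, 5, 32, 0.1
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w                                   # noiseless targets

w = np.zeros(D)
for step in range(1000):
    idx = rng.choice(N, size=B, replace=False)   # sample batch B_t
    err = X[idx] @ w - y[idx]
    grad = X[idx].T @ err / B                    # (1/B) sum of per-example grads
    w -= eta * grad                              # the mini-batch SGD update
```

Because the targets are noiseless, every per-example gradient vanishes at the solution, so the iterates converge to `true_w` despite the batch sampling noise.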


Momentum

SGD oscillates in directions of high curvature and moves slowly along directions of low curvature. Momentum accumulates a velocity vector that smooths these oscillations:

$$\mathbf{v}_{t+1} = \mu \mathbf{v}_t + \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t), \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \mathbf{v}_{t+1}$$

with $\mu \in [0, 1)$ (typically 0.9). The velocity acts as an exponential moving average of past gradients. In a quadratic loss landscape with condition number $\kappa$, SGD converges in $O(\kappa)$ steps while SGD with momentum converges in $O(\sqrt{\kappa})$.

Nesterov momentum. Evaluates the gradient at the “look-ahead” position:

$$\mathbf{v}_{t+1} = \mu \mathbf{v}_t + \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t - \eta \mu \mathbf{v}_t)$$

This provides a correction term that reduces overshooting. Nesterov accelerated gradient achieves optimal convergence rates for convex optimization.
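
Both momentum variants, written exactly as the formulas above, on a 1-D quadratic loss $\mathcal{L}(\theta) = \frac{1}{2}a\theta^2$ (the constants are illustrative):

```python
def grad(theta, a=10.0):
    """dL/dtheta for the quadratic L(theta) = 0.5 * a * theta**2."""
    return a * theta

eta, mu = 0.01, 0.9
theta_hb, v_hb = 1.0, 0.0      # heavy-ball (classical momentum) state
theta_nag, v_nag = 1.0, 0.0    # Nesterov state

for _ in range(300):
    # Classical momentum: velocity accumulates past gradients.
    v_hb = mu * v_hb + grad(theta_hb)
    theta_hb -= eta * v_hb
    # Nesterov: evaluate the gradient at the look-ahead point.
    v_nag = mu * v_nag + grad(theta_nag - eta * mu * v_nag)
    theta_nag -= eta * v_nag
```

Both trajectories spiral into the minimum at 0; on badly conditioned problems the look-ahead gradient damps the spiral sooner.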


Adaptive Learning Rate Methods

AdaGrad

Accumulates squared gradients per parameter, scaling the learning rate inversely:

$$\mathbf{G}_{t+1} = \mathbf{G}_t + \mathbf{g}_t \odot \mathbf{g}_t, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{G}_{t+1} + \epsilon}} \odot \mathbf{g}_t$$

Parameters with large historical gradients get smaller learning rates. This is effective for sparse features but problematic for dense updates: the accumulated $\mathbf{G}$ grows monotonically, eventually reducing the effective learning rate to near zero.

RMSProp

Fixes AdaGrad’s monotonic accumulation with an exponential moving average:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta) \mathbf{g}_t \odot \mathbf{g}_t, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}} \odot \mathbf{g}_t$$

with $\beta = 0.99$ typical. The moving average forgets old gradients, keeping the effective learning rate from decaying to zero.

Adam

Combines momentum (first moment) with RMSProp (second moment):

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) \mathbf{g}_t \odot \mathbf{g}_t$$

Bias correction (critical for early steps when $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$):

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \hat{\mathbf{m}}_t$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Adam is the default optimizer for most deep learning tasks.
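
A sketch of the full Adam update with bias correction in NumPy (using the stated defaults; not a library implementation):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step; t is 1-indexed so the bias correction is well-defined."""
    m = b1 * m + (1 - b1) * g              # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g          # second moment (RMSProp-style)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize L(theta) = sum(theta^2); the gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

At `t = 1` the correction divides by $1 - \beta_1 = 0.1$, so the first step uses the raw gradient scale instead of a tenth of it; without it, early updates would be far too small.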

AdamW

Decouples weight decay from the adaptive gradient step (Loshchilov and Hutter, 2019):

$$\boldsymbol{\theta}_{t+1} = (1 - \lambda\eta)\boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \hat{\mathbf{m}}_t$$

In standard Adam, L2 regularization interacts poorly with adaptive learning rates: parameters with small gradients (and thus large effective learning rates) receive disproportionately strong regularization. AdamW applies weight decay directly to the parameters, independent of the adaptive scaling. This is the standard optimizer for transformer pretraining.
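
The decoupling is visible in code: weight decay shrinks the parameters directly and never passes through the $\sqrt{\hat{\mathbf{v}}_t}$ scaling. A sketch (names and the zero-gradient demonstration are illustrative):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step: decoupled decay, then the usual Adam update."""
    theta = (1 - wd * eta) * theta         # decay is NOT scaled by 1/sqrt(v)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# With zero gradient, AdamW still shrinks the weights toward zero;
# L2-inside-Adam would produce a decay distorted by the adaptive scaling.
theta = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 101):
    theta, m, v = adamw_step(theta, np.zeros(1), m, v, t)
```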


Learning Rate Schedules

The learning rate $\eta$ is the single most important hyperparameter. Modern training uses schedules that vary $\eta$ over training.

Warmup. Start with a small $\eta$ and linearly increase over the first $T_w$ steps. This stabilizes training when the model’s initial random predictions produce large, noisy gradients. Standard for transformers ($T_w \sim$ 1–10% of total steps).

Cosine annealing. After warmup, decay $\eta$ following a cosine curve:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

This produces a smooth decay that spends more time at moderate learning rates than linear or exponential schedules.
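
A warmup-plus-cosine schedule as a pure function of the step index (a sketch; the step counts and peak rate are illustrative assumptions):

```python
import math

def lr_at(t, total_steps, warmup_steps, eta_max, eta_min=0.0):
    """Learning rate at step t: linear warmup, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps          # linear warmup
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

# 1000 total steps, 10% warmup, peak rate 3e-4.
schedule = [lr_at(t, 1000, 100, 3e-4) for t in range(1000)]
```

Writing the schedule as a stateless function of `t` makes it trivial to resume training from a checkpoint at the correct learning rate.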

Step decay. Reduce $\eta$ by a factor (typically 10×) at fixed epochs. Common in vision (e.g., at epochs 30, 60, 90 for ImageNet). Less popular now than cosine annealing.

One-cycle policy (Smith, 2018). Ramp up then ramp down in a single cycle. Enables training with much larger peak learning rates, which can improve generalization.


Weight Initialization

Proper initialization prevents gradients from vanishing or exploding in the first forward pass.

Xavier/Glorot Initialization

For layers with sigmoid or tanh activations:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right) \quad \text{or} \quad W_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$

Derived by requiring $\text{Var}(h^{(l)}) = \text{Var}(h^{(l-1)})$ and $\text{Var}(\delta^{(l)}) = \text{Var}(\delta^{(l+1)})$, maintaining constant variance in both the forward and backward pass.

Kaiming/He Initialization

For ReLU activations, which zero out half the distribution:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

The factor of 2 compensates for ReLU killing half the signal. Without this correction, variance halves at each layer, causing exponential decay in deep networks.
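
The variance argument can be checked empirically: propagate random inputs through a deep ReLU stack with He scaling versus the uncorrected $1/n_{\text{in}}$ scaling (a sketch with illustrative layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, samples = 256, 20, 500

# He initialization: Var(W) = 2/n_in keeps activation scale roughly constant.
he = rng.normal(size=(n, samples))
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
    he = np.maximum(W @ he, 0.0)
var_he = he.var()

# Without the factor of 2, the signal variance halves at every ReLU layer.
naive = rng.normal(size=(n, samples))
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
    naive = np.maximum(W @ naive, 0.0)
var_naive = naive.var()
```

After 20 layers the uncorrected stack has lost roughly a factor of $2^{20}$ in variance, while the He-initialized stack stays at order one.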


Batch Normalization

Batch normalization (Ioffe and Szegedy, 2015) normalizes layer inputs to zero mean and unit variance, then applies a learned affine transform:

$$\hat{z}_j = \frac{z_j - \mu_{\mathcal{B},j}}{\sqrt{\sigma^2_{\mathcal{B},j} + \epsilon}}, \qquad \tilde{z}_j = \gamma_j \hat{z}_j + \beta_j$$

where $\mu_{\mathcal{B},j}$ and $\sigma^2_{\mathcal{B},j}$ are the mean and variance computed over the mini-batch, and $\gamma_j, \beta_j$ are learned scale and shift parameters.

Why it works. Batch norm:

  1. Reduces internal covariate shift: each layer receives inputs with stable statistics
  2. Smooths the loss landscape: enables higher learning rates without divergence
  3. Acts as a regularizer: the batch statistics introduce noise proportional to $1/\sqrt{B}$
  4. Enables training of very deep networks that would otherwise be unstable

Layer normalization normalizes across features rather than across the batch, making it suitable for variable-length sequences and small batch sizes. It is the standard normalization for transformers:

$$\hat{z}_j = \frac{z_j - \mu_j}{\sqrt{\sigma^2_j + \epsilon}}, \quad \mu_j = \frac{1}{D}\sum_{d=1}^D z_{j,d}$$

RMSNorm (Zhang and Sennrich, 2019) drops the mean centering, normalizing only by the root mean square. Used in LLaMA and many modern LLMs for computational efficiency:

$$\hat{z}_j = \frac{z_j}{\text{RMS}(z_j)} \cdot \gamma_j, \quad \text{RMS}(z) = \sqrt{\frac{1}{D}\sum_{d=1}^D z_d^2}$$
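
Both normalizations reduce to a few lines over the feature axis (a NumPy sketch; the scalar `gamma`/`beta` handling is illustrative):

```python
import numpy as np

def layer_norm(z, gamma, beta, eps=1e-5):
    """Normalize each row over its features, then apply the affine transform."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

def rms_norm(z, gamma, eps=1e-5):
    """RMSNorm: skip mean centering, divide by the root mean square only."""
    rms = np.sqrt((z ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * z / rms

z = np.array([[1.0, 2.0, 3.0, 4.0]])
ln = layer_norm(z, gamma=1.0, beta=0.0)   # zero mean, unit variance per row
rn = rms_norm(z, gamma=1.0)               # unit mean square per row
```

RMSNorm saves the mean computation and subtraction per token, which is why several large models adopt it.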

Residual Connections

Deep networks suffer from the degradation problem: adding more layers can increase training error (not just test error), even when the additional layers could in principle learn the identity. Residual connections (He et al., 2016) address this by providing shortcut paths:

$$\mathbf{h}^{(l)} = \mathbf{h}^{(l-1)} + \mathcal{F}(\mathbf{h}^{(l-1)}; \mathbf{W}^{(l)})$$

where $\mathcal{F}$ is the residual function (typically two conv layers or an MLP block). The network only needs to learn the residual $\mathcal{F}$, which is easier to optimize than the full mapping. When $\mathcal{F} \approx 0$, the layer reduces to the identity, making it easy for the network to “skip” unnecessary layers.

Gradient flow. The gradient through a residual block is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l-1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} \left(1 + \frac{\partial \mathcal{F}}{\partial \mathbf{h}^{(l-1)}}\right)$$

The additive 1 ensures that gradients can flow directly to early layers without passing through potentially vanishing multipliers. This is why ResNets can train networks with 100+ layers while plain networks fail beyond ~20 layers.
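
The additive 1 can be verified numerically on a 1-D residual block $h + F(h)$ (the choice $F(h) = \frac{1}{2}\tanh h$ is an illustrative residual function):

```python
import math

def F(h):
    """A toy residual function standing in for a conv/MLP block."""
    return 0.5 * math.tanh(h)

def dF(h):
    """Its derivative: 0.5 * (1 - tanh(h)^2)."""
    return 0.5 * (1.0 - math.tanh(h) ** 2)

h = 0.3
analytic = 1.0 + dF(h)          # the additive 1 from the shortcut path

# Finite-difference derivative of the full block output h + F(h).
eps = 1e-6
numeric = ((h + eps + F(h + eps)) - (h + F(h))) / eps
```

Even if `dF` were driven to zero (a saturated or dead block), the derivative of the block stays at 1, so upstream layers still receive gradient.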

Residual connections are foundational to modern architectures: every transformer layer uses them, wrapping both the attention and feedforward sublayers.


Gradient Pathologies

Vanishing Gradients

In deep networks with sigmoid/tanh activations, the gradient at layer $l$ scales as:

$$\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}}\right\| \approx \prod_{k=l+1}^{L} \|\mathbf{W}^{(k)}\| \cdot \|\phi'(\mathbf{z}^{(k)})\| \cdot \left\|\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}\right\|$$

Since $\max|\sigma'(z)| = 0.25$ and $\max|\tanh'(z)| = 1.0$, the product shrinks exponentially with depth for sigmoid activations. Mitigations: ReLU activations, residual connections, proper initialization, normalization layers.

Exploding Gradients

When weight matrices have spectral radius $> 1$, gradients grow exponentially. This manifests as NaN losses or parameter updates that overshoot wildly. Mitigations:

Gradient clipping by norm:

$$\mathbf{g} \leftarrow \frac{\mathbf{g}}{\max(1, \|\mathbf{g}\|/c)}$$

This rescales the gradient vector to have norm at most $c$ while preserving direction. Standard in RNN and transformer training ($c = 1.0$ typical).
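
Norm clipping is a two-line function (sketch):

```python
import numpy as np

def clip_by_norm(g, c):
    """Rescale g to norm at most c; leave it untouched if already small."""
    norm = np.linalg.norm(g)
    return g / max(1.0, norm / c)

big = np.array([3.0, 4.0])        # norm 5: gets rescaled to norm c
small = np.array([0.3, 0.4])      # norm 0.5: passes through unchanged
clipped = clip_by_norm(big, 1.0)
untouched = clip_by_norm(small, 1.0)
```

Note the `max(1, ...)`: gradients already inside the ball are left alone, so clipping only activates on the rare exploding step.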

Gradient clipping by value clips each component independently to $[-c, c]$. Simpler, but it distorts the gradient direction.


Summary

| Component | Purpose | Key Formula |
|---|---|---|
| Backpropagation | Compute gradients via reverse-mode autodiff | $\boldsymbol{\delta}^{(l)} = (\mathbf{W}^{(l+1)\top}\boldsymbol{\delta}^{(l+1)}) \odot \phi'(\mathbf{z}^{(l)})$ |
| SGD | Stochastic optimization | $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \mathbf{g}$ |
| Adam | Adaptive per-parameter learning rates | First + second moment EMA with bias correction |
| AdamW | Decoupled weight decay | Standard for transformer training |
| Batch/Layer Norm | Stabilize activations | Normalize → learned affine transform |
| Residual connections | Enable gradient flow in deep networks | $\mathbf{h}^{(l)} = \mathbf{h}^{(l-1)} + \mathcal{F}(\mathbf{h}^{(l-1)})$ |
| Gradient clipping | Prevent exploding gradients | Rescale $\mathbf{g}$ if $\|\mathbf{g}\| > c$ |

Backpropagation provides exact gradients; the optimizer converts gradients into parameter updates; normalization and residual connections stabilize training across depth. Together, these components enable training networks with billions of parameters.