Backpropagation

Backpropagation is the algorithm that makes deep learning tractable. It computes exact gradients of a scalar loss with respect to every parameter in a network by applying the chain rule on a computational graph, in time linear in the number of operations. This article covers the algorithm, its computational graph formulation, and the optimization methods built on top of it.


Computational Graphs

Any differentiable computation can be represented as a directed acyclic graph (DAG) where:

  • Nodes represent operations (addition, multiplication, activation functions)
  • Edges carry intermediate values (tensors)
  • Leaves are inputs and parameters
  • The root is the scalar loss $\mathcal{L}$

For a two-layer network computing $\mathcal{L} = \frac{1}{2}(\mathbf{w}_2^\top \text{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + b_2 - y)^2$, the graph decomposes into:

$$\mathbf{x} \xrightarrow{\mathbf{W}_1, \mathbf{b}_1} \mathbf{z}_1 \xrightarrow{\text{ReLU}} \mathbf{h}_1 \xrightarrow{\mathbf{w}_2, b_2} \hat{y} \xrightarrow{y} \mathcal{L}$$

The forward pass evaluates nodes in topological order, caching all intermediate values. The backward pass traverses the graph in reverse topological order, computing gradients via the chain rule.


The Chain Rule on Graphs

For a scalar loss $\mathcal{L}$ that depends on a parameter $\theta$ through a chain of intermediate variables $z_1, z_2, \ldots, z_n$:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial z_n} \cdot \frac{\partial z_n}{\partial z_{n-1}} \cdots \frac{\partial z_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial \theta}$$

When a variable feeds into multiple downstream nodes (a fan-out), gradients are summed:

$$\frac{\partial \mathcal{L}}{\partial z} = \sum_{i} \frac{\partial \mathcal{L}}{\partial z_i} \cdot \frac{\partial z_i}{\partial z}$$

This is the multivariate chain rule. It ensures that every path from $z$ to $\mathcal{L}$ contributes to the gradient.
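
The fan-out rule above is the heart of reverse-mode autodiff. A minimal sketch (a hypothetical `Value` class, not any particular library's API) that accumulates gradients with `+=` so every path contributes:

```python
class Value:
    """A scalar node in a computational graph with reverse-mode autodiff."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0          # accumulates the sum over all downstream paths
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Build reverse topological order, then apply each node's local chain rule.
        order, seen = [], set()
        def topo(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Fan-out example: z feeds into both factors of z*z and into the sum,
# so its gradient is the sum over all three paths: dL/dz = 2z + 1.
z = Value(3.0)
loss = z * z + z
loss.backward()
```

Because `z` appears on multiple paths to the loss, its `.grad` is the sum 2z + 1 = 7, exactly the fan-out formula above.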


The Backpropagation Algorithm

Forward pass. For each layer $l = 1, \ldots, L$:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{h}^{(l)} = \phi(\mathbf{z}^{(l)})$$

Cache $\mathbf{z}^{(l)}$ and $\mathbf{h}^{(l)}$ at each layer.

Backward pass. Initialize with the loss gradient at the output:

$$\boldsymbol{\delta}^{(L+1)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}$$

For each layer $l = L, L-1, \ldots, 1$, compute:

$$\boldsymbol{\delta}^{(l)} = (\mathbf{W}^{(l+1)\top} \boldsymbol{\delta}^{(l+1)}) \odot \phi'(\mathbf{z}^{(l)}), \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \mathbf{h}^{(l-1)\top}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}
$$

where $\odot$ denotes element-wise multiplication, $\boldsymbol{\delta}^{(l)}$ is the error signal at layer $l$, and we take $\mathbf{W}^{(L+1)} := \mathbf{I}$ so the recursion also covers the output layer.

Complexity. The backward pass requires exactly one multiplication per edge in the computational graph, matching the forward pass in cost. Total gradient computation is $O(\text{forward pass})$, not the $O(\text{parameters}^2)$ that naive numerical differentiation would require (one extra forward pass per parameter, each itself linear in the parameter count).
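
The recursions above can be sketched in NumPy for a two-layer ReLU network with squared-error loss (shapes, names, and the random setup are illustrative); the last lines check one analytic gradient entry against a naive finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                  # input
y = np.array([[1.0]])                        # target
W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)) * 0.5, np.zeros((1, 1))

# Forward pass: cache z and h at each layer.
z1 = W1 @ x + b1
h1 = np.maximum(z1, 0.0)                     # ReLU
y_hat = W2 @ h1 + b2
loss = 0.5 * ((y_hat - y) ** 2).item()

# Backward pass: the delta recursion from the formulas above.
delta2 = y_hat - y                           # dL/dy_hat
dW2, db2 = delta2 @ h1.T, delta2
delta1 = (W2.T @ delta2) * (z1 > 0)          # (W^T delta) ⊙ ReLU'(z)
dW1, db1 = delta1 @ x.T, delta1

# Sanity check: finite difference on one entry of W1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * ((W2 @ np.maximum(W1p @ x + b1, 0.0) + b2 - y) ** 2).item()
numeric = (loss_p - loss) / eps              # should match dW1[0, 0]
```

The finite-difference check is the $O(\text{parameters}^2)$ approach backprop avoids: it is fine for verifying one entry but hopeless as a training method.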


Gradient Descent Variants

Batch Gradient Descent

Update using the gradient computed over the entire training set:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$

where $\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \ell(\hat{y}^{(i)}, y^{(i)})$. This produces the true gradient but requires $O(N)$ computation per update, which is prohibitive for large datasets.

Stochastic Gradient Descent (SGD)

Update using a single randomly sampled example:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \ell(\hat{y}^{(i_t)}, y^{(i_t)})$$

The stochastic gradient $\nabla \ell^{(i_t)}$ is an unbiased estimator of the full gradient: $\mathbb{E}[\nabla \ell^{(i_t)}] = \nabla \mathcal{L}$. The variance of this estimator introduces noise that can help escape local minima but also causes oscillation near convergence.

Mini-batch SGD

The practical compromise. Sample a batch $\mathcal{B}_t$ of size $B$:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla_{\boldsymbol{\theta}} \ell(\hat{y}^{(i)}, y^{(i)})$$

Variance scales as $O(1/B)$, so larger batches produce smoother updates. Typical batch sizes: 32–512 for most tasks, up to 4096+ for large-scale pretraining with appropriate learning rate scaling (linear scaling rule: $\eta \propto B$).
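
A minimal mini-batch SGD loop on a toy noiseless least-squares problem (the problem sizes and learning rate here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, B, eta = 256, 5, 32, 0.1
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w                                   # noiseless targets

w = np.zeros(D)
for step in range(1000):
    idx = rng.choice(N, size=B, replace=False)   # sample batch B_t
    err = X[idx] @ w - y[idx]
    grad = X[idx].T @ err / B                    # (1/B) sum of per-example grads
    w -= eta * grad                              # the mini-batch SGD update
```

Because the targets are noiseless, every per-example gradient vanishes at the solution, so the iterates converge to `true_w` despite the batch sampling noise.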


Momentum

SGD oscillates in directions of high curvature and moves slowly along directions of low curvature. Momentum accumulates a velocity vector that smooths these oscillations:

$$\mathbf{v}_{t+1} = \mu \mathbf{v}_t + \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t), \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \mathbf{v}_{t+1}$$

with $\mu \in [0, 1)$ (typically 0.9). The velocity acts as an exponential moving average of past gradients. In a quadratic loss landscape with condition number $\kappa$, SGD converges in $O(\kappa)$ steps while SGD with momentum converges in $O(\sqrt{\kappa})$.

Nesterov momentum. Evaluates the gradient at the “look-ahead” position:

$$\mathbf{v}_{t+1} = \mu \mathbf{v}_t + \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t - \eta \mu \mathbf{v}_t)$$

This provides a correction term that reduces overshooting. Nesterov accelerated gradient achieves optimal convergence rates for convex optimization.
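
Both momentum variants, written exactly as the formulas above, on a 1-D quadratic loss $\mathcal{L}(\theta) = \frac{1}{2}a\theta^2$ (the constants are illustrative):

```python
def grad(theta, a=10.0):
    """dL/dtheta for the quadratic L(theta) = 0.5 * a * theta**2."""
    return a * theta

eta, mu = 0.01, 0.9
theta_hb, v_hb = 1.0, 0.0      # heavy-ball (classical momentum) state
theta_nag, v_nag = 1.0, 0.0    # Nesterov state

for _ in range(300):
    # Classical momentum: velocity accumulates past gradients.
    v_hb = mu * v_hb + grad(theta_hb)
    theta_hb -= eta * v_hb
    # Nesterov: evaluate the gradient at the look-ahead point.
    v_nag = mu * v_nag + grad(theta_nag - eta * mu * v_nag)
    theta_nag -= eta * v_nag
```

Both trajectories spiral into the minimum at 0; on badly conditioned problems the look-ahead gradient damps the spiral sooner.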


Adaptive Learning Rate Methods

AdaGrad

Accumulates squared gradients per parameter, scaling the learning rate inversely:

$$\mathbf{G}_{t+1} = \mathbf{G}_t + \mathbf{g}_t \odot \mathbf{g}_t, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{G}_{t+1} + \epsilon}} \odot \mathbf{g}_t$$

Parameters with large historical gradients get smaller learning rates. This is effective for sparse features but problematic for dense updates: the accumulated $\mathbf{G}$ grows monotonically, eventually reducing the effective learning rate to near zero.

RMSProp

Fixes AdaGrad’s monotonic accumulation with an exponential moving average:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta) \mathbf{g}_t \odot \mathbf{g}_t, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}} \odot \mathbf{g}_t$$

with $\beta = 0.99$ typical. The moving average forgets old gradients, keeping the effective learning rate from decaying to zero.

Adam

Combines momentum (first moment) with RMSProp (second moment):

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) \mathbf{g}_t \odot \mathbf{g}_t$$

Bias correction (critical for early steps when $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$):

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \hat{\mathbf{m}}_t$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Adam is the default optimizer for most deep learning tasks.
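
A sketch of the full Adam update with bias correction in NumPy (using the stated defaults; not a library implementation):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step; t is 1-indexed so the bias correction is well-defined."""
    m = b1 * m + (1 - b1) * g              # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g          # second moment (RMSProp-style)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize L(theta) = sum(theta^2); the gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

At `t = 1` the correction divides by $1 - \beta_1 = 0.1$, so the first step uses the raw gradient scale instead of a tenth of it; without it, early updates would be far too small.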

AdamW

Decouples weight decay from the adaptive gradient step (Loshchilov and Hutter, 2019):

$$\boldsymbol{\theta}_{t+1} = (1 - \lambda\eta)\boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \hat{\mathbf{m}}_t$$

In standard Adam, L2 regularization interacts poorly with adaptive learning rates: parameters with small gradients (and thus large effective learning rates) receive disproportionately strong regularization. AdamW applies weight decay directly to the parameters, independent of the adaptive scaling. This is the standard optimizer for transformer pretraining.
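
The decoupling is visible in code: weight decay shrinks the parameters directly and never passes through the $\sqrt{\hat{\mathbf{v}}_t}$ scaling. A sketch (names and the zero-gradient demonstration are illustrative):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step: decoupled decay, then the usual Adam update."""
    theta = (1 - wd * eta) * theta         # decay is NOT scaled by 1/sqrt(v)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# With zero gradient, AdamW still shrinks the weights toward zero;
# L2-inside-Adam would produce a decay distorted by the adaptive scaling.
theta = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 101):
    theta, m, v = adamw_step(theta, np.zeros(1), m, v, t)
```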


Learning Rate Schedules

The learning rate $\eta$ is the single most important hyperparameter. Modern training uses schedules that vary $\eta$ over training.

Warmup. Start with a small $\eta$ and linearly increase over the first $T_w$ steps. This stabilizes training when the model’s initial random predictions produce large, noisy gradients. Standard for transformers ($T_w \sim$ 1–10% of total steps).

Cosine annealing. After warmup, decay $\eta$ following a cosine curve:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

This produces a smooth decay that spends more time at moderate learning rates than linear or exponential schedules.
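
A warmup-plus-cosine schedule as a pure function of the step index (a sketch; the step counts and peak rate are illustrative assumptions):

```python
import math

def lr_at(t, total_steps, warmup_steps, eta_max, eta_min=0.0):
    """Learning rate at step t: linear warmup, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps          # linear warmup
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

# 1000 total steps, 10% warmup, peak rate 3e-4.
schedule = [lr_at(t, 1000, 100, 3e-4) for t in range(1000)]
```

Writing the schedule as a stateless function of `t` makes it trivial to resume training from a checkpoint at the correct learning rate.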

Step decay. Reduce $\eta$ by a factor (typically 10×) at fixed epochs. Common in vision (e.g., at epochs 30, 60, 90 for ImageNet). Less popular now than cosine annealing.

One-cycle policy (Smith, 2018). Ramp up then ramp down in a single cycle. Enables training with much larger peak learning rates, which can improve generalization.


Weight Initialization

Proper initialization prevents gradients from vanishing or exploding in the first forward pass.

Xavier/Glorot Initialization

For layers with sigmoid or tanh activations:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right) \quad \text{or} \quad W_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$

Derived by requiring $\text{Var}(h^{(l)}) = \text{Var}(h^{(l-1)})$ and $\text{Var}(\delta^{(l)}) = \text{Var}(\delta^{(l+1)})$, maintaining constant variance in both the forward and backward pass.

Kaiming/He Initialization

For ReLU activations, which zero out half the distribution:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

The factor of 2 compensates for ReLU killing half the signal. Without this correction, variance halves at each layer, causing exponential decay in deep networks.
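
The variance argument can be checked empirically: propagate random inputs through a deep ReLU stack with He scaling versus the uncorrected $1/n_{\text{in}}$ scaling (a sketch with illustrative layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, samples = 256, 20, 500

# He initialization: Var(W) = 2/n_in keeps activation scale roughly constant.
he = rng.normal(size=(n, samples))
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
    he = np.maximum(W @ he, 0.0)
var_he = he.var()

# Without the factor of 2, the signal variance halves at every ReLU layer.
naive = rng.normal(size=(n, samples))
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
    naive = np.maximum(W @ naive, 0.0)
var_naive = naive.var()
```

After 20 layers the uncorrected stack has lost roughly a factor of $2^{20}$ in variance, while the He-initialized stack stays at order one.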


Batch Normalization

Batch normalization (Ioffe and Szegedy, 2015) normalizes layer inputs to zero mean and unit variance, then applies a learned affine transform:

$$\hat{z}_j = \frac{z_j - \mu_{\mathcal{B},j}}{\sqrt{\sigma^2_{\mathcal{B},j} + \epsilon}}, \qquad \tilde{z}_j = \gamma_j \hat{z}_j + \beta_j$$

where $\mu_{\mathcal{B},j}$ and $\sigma^2_{\mathcal{B},j}$ are the mean and variance computed over the mini-batch, and $\gamma_j, \beta_j$ are learned scale and shift parameters.

Why it works. Batch norm:

  1. Reduces internal covariate shift: each layer receives inputs with stable statistics
  2. Smooths the loss landscape: enables higher learning rates without divergence
  3. Acts as a regularizer: the batch statistics introduce noise proportional to $1/\sqrt{B}$
  4. Enables training of very deep networks that would otherwise be unstable

Layer normalization normalizes across features rather than across the batch, making it suitable for variable-length sequences and small batch sizes. It is the standard normalization for transformers:

$$\hat{z}_j = \frac{z_j - \mu_j}{\sqrt{\sigma^2_j + \epsilon}}, \quad \mu_j = \frac{1}{D}\sum_{d=1}^D z_{j,d}$$

RMSNorm (Zhang and Sennrich, 2019) drops the mean centering, normalizing only by the root mean square. Used in LLaMA and many modern LLMs for computational efficiency:

$$\hat{z}_j = \frac{z_j}{\text{RMS}(z_j)} \cdot \gamma_j, \quad \text{RMS}(z) = \sqrt{\frac{1}{D}\sum_{d=1}^D z_d^2}$$
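
Both normalizations reduce to a few lines over the feature axis (a NumPy sketch; the scalar `gamma`/`beta` handling is illustrative):

```python
import numpy as np

def layer_norm(z, gamma, beta, eps=1e-5):
    """Normalize each row over its features, then apply the affine transform."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

def rms_norm(z, gamma, eps=1e-5):
    """RMSNorm: skip mean centering, divide by the root mean square only."""
    rms = np.sqrt((z ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * z / rms

z = np.array([[1.0, 2.0, 3.0, 4.0]])
ln = layer_norm(z, gamma=1.0, beta=0.0)   # zero mean, unit variance per row
rn = rms_norm(z, gamma=1.0)               # unit mean square per row
```

RMSNorm saves the mean computation and subtraction per token, which is why several large models adopt it.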

Residual Connections

Deep networks suffer from the degradation problem: adding more layers can increase training error (not just test error), even when the additional layers could in principle learn the identity. Residual connections (He et al., 2016) address this by providing shortcut paths:

$$\mathbf{h}^{(l)} = \mathbf{h}^{(l-1)} + \mathcal{F}(\mathbf{h}^{(l-1)}; \mathbf{W}^{(l)})$$

where $\mathcal{F}$ is the residual function (typically two conv layers or an MLP block). The network only needs to learn the residual $\mathcal{F}$, which is easier to optimize than the full mapping. When $\mathcal{F} \approx 0$, the layer reduces to the identity, making it easy for the network to “skip” unnecessary layers.

Gradient flow. The gradient through a residual block is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l-1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} \left(1 + \frac{\partial \mathcal{F}}{\partial \mathbf{h}^{(l-1)}}\right)$$

The additive 1 ensures that gradients can flow directly to early layers without passing through potentially vanishing multipliers. This is why ResNets can train networks with 100+ layers while plain networks fail beyond ~20 layers.
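
The additive 1 can be verified numerically on a 1-D residual block $h + F(h)$ (the choice $F(h) = \frac{1}{2}\tanh h$ is an illustrative residual function):

```python
import math

def F(h):
    """A toy residual function standing in for a conv/MLP block."""
    return 0.5 * math.tanh(h)

def dF(h):
    """Its derivative: 0.5 * (1 - tanh(h)^2)."""
    return 0.5 * (1.0 - math.tanh(h) ** 2)

h = 0.3
analytic = 1.0 + dF(h)          # the additive 1 from the shortcut path

# Finite-difference derivative of the full block output h + F(h).
eps = 1e-6
numeric = ((h + eps + F(h + eps)) - (h + F(h))) / eps
```

Even if `dF` were driven to zero (a saturated or dead block), the derivative of the block stays at 1, so upstream layers still receive gradient.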

Residual connections are foundational to modern architectures: every transformer layer uses them, wrapping both the attention and feedforward sublayers.


Gradient Pathologies

Vanishing Gradients

In deep networks with sigmoid/tanh activations, the gradient at layer $l$ scales as:

$$\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}}\right\| \approx \prod_{k=l+1}^{L} \|\mathbf{W}^{(k)}\| \cdot \|\phi'(\mathbf{z}^{(k)})\| \cdot \left\|\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}\right\|$$

Since $\max|\sigma'(z)| = 0.25$ and $\max|\tanh'(z)| = 1.0$, the product shrinks exponentially with depth for sigmoid activations. Mitigations: ReLU activations, residual connections, proper initialization, normalization layers.

Exploding Gradients

When weight matrices have spectral radius $> 1$, gradients grow exponentially. This manifests as NaN losses or parameter updates that overshoot wildly. Mitigations:

Gradient clipping by norm:

$$\mathbf{g} \leftarrow \frac{\mathbf{g}}{\max(1, \|\mathbf{g}\|/c)}$$

This rescales the gradient vector to have norm at most $c$ while preserving direction. Standard in RNN and transformer training ($c = 1.0$ typical).
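
Norm clipping is a two-line function (sketch):

```python
import numpy as np

def clip_by_norm(g, c):
    """Rescale g to norm at most c; leave it untouched if already small."""
    norm = np.linalg.norm(g)
    return g / max(1.0, norm / c)

big = np.array([3.0, 4.0])        # norm 5: gets rescaled to norm c
small = np.array([0.3, 0.4])      # norm 0.5: passes through unchanged
clipped = clip_by_norm(big, 1.0)
untouched = clip_by_norm(small, 1.0)
```

Note the `max(1, ...)`: gradients already inside the ball are left alone, so clipping only activates on the rare exploding step.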

Gradient clipping by value clips each component independently to $[-c, c]$. Simpler, but it distorts the gradient direction.


Summary

| Component | Purpose | Key Formula |
|---|---|---|
| Backpropagation | Compute gradients via reverse-mode autodiff | $\boldsymbol{\delta}^{(l)} = (\mathbf{W}^{(l+1)\top}\boldsymbol{\delta}^{(l+1)}) \odot \phi'(\mathbf{z}^{(l)})$ |
| SGD | Stochastic optimization | $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \mathbf{g}$ |
| Adam | Adaptive per-parameter learning rates | First + second moment EMA with bias correction |
| AdamW | Decoupled weight decay | Standard for transformer training |
| Batch/Layer Norm | Stabilize activations | Normalize → learned affine transform |
| Residual connections | Enable gradient flow in deep networks | $\mathbf{h}^{(l)} = \mathbf{h}^{(l-1)} + \mathcal{F}(\mathbf{h}^{(l-1)})$ |
| Gradient clipping | Prevent exploding gradients | Rescale $\mathbf{g}$ if $\|\mathbf{g}\| > c$ |

Backpropagation provides exact gradients; the optimizer converts gradients into parameter updates; normalization and residual connections stabilize training across depth. Together, these components enable training networks with billions of parameters.