Backpropagation
Backpropagation is the algorithm that makes deep learning tractable. It computes exact gradients of a scalar loss with respect to every parameter in a network by applying the chain rule on a computational graph, in time linear in the number of operations. This article covers the algorithm, its computational graph formulation, and the optimization methods built on top of it.
Computational Graphs
Any differentiable computation can be represented as a directed acyclic graph (DAG) where:
- Nodes represent operations (addition, multiplication, activation functions)
- Edges carry intermediate values (tensors)
- Leaves are inputs and parameters
- The root is the scalar loss
For a two-layer network computing $\mathcal{L} = \ell\big(W^{(2)}\sigma(W^{(1)}x + b^{(1)}) + b^{(2)},\, y\big)$, the graph decomposes into:

$$x \;\to\; z^{(1)} = W^{(1)}x + b^{(1)} \;\to\; a^{(1)} = \sigma(z^{(1)}) \;\to\; z^{(2)} = W^{(2)}a^{(1)} + b^{(2)} \;\to\; \mathcal{L}$$
The forward pass evaluates nodes in topological order, caching all intermediate values. The backward pass traverses the graph in reverse topological order, computing gradients via the chain rule.
The Chain Rule on Graphs
For a scalar loss $\mathcal{L}$ that depends on a parameter $w$ through a chain of intermediate variables $z_1 \to z_2 \to \cdots \to z_k$:

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial z_k}\,\frac{\partial z_k}{\partial z_{k-1}}\cdots\frac{\partial z_2}{\partial z_1}\,\frac{\partial z_1}{\partial w}$$
When a variable $z$ feeds into multiple downstream nodes $u_1, \ldots, u_n$ (a fan-out), gradients are summed:

$$\frac{\partial \mathcal{L}}{\partial z} = \sum_{i=1}^{n} \frac{\partial \mathcal{L}}{\partial u_i}\,\frac{\partial u_i}{\partial z}$$
This is the multivariate chain rule. It ensures that every path from $w$ to $\mathcal{L}$ contributes to the gradient.
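The summation rule is exactly what a reverse-mode autodiff engine implements: gradients accumulate with `+=` as each node is visited in reverse topological order. A minimal sketch, with class and method names that are illustrative (in the spirit of scalar autodiff libraries like micrograd), not a real library API:

```python
# Minimal reverse-mode autodiff over a scalar DAG.

class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # set by the op that created this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(out)/d(self) = 1
            other.grad += out.grad      # += implements the fan-out sum
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Build a forward topological order, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0                 # dL/dL = 1
        for v in reversed(topo):
            v._backward()

x, w = Value(2.0), Value(3.0)
loss = x * w + x                        # x fans out into two nodes
loss.backward()
print(x.grad)                           # 4.0 = w.data + 1: both paths summed
```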
The Backpropagation Algorithm
Forward pass. For each layer $l = 1, \ldots, L$:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma(z^{(l)})$$

Cache $z^{(l)}$ and $a^{(l)}$ at each layer.

Backward pass. Initialize with the loss gradient at the output:

$$\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'(z^{(L)})$$

For each layer $l = L-1, \ldots, 1$, propagate the error backward:

$$\delta^{(l)} = \big((W^{(l+1)})^\top \delta^{(l+1)}\big) \odot \sigma'(z^{(l)})$$

and read off the parameter gradients at every layer:

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \big(a^{(l-1)}\big)^\top, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$$

where $\odot$ denotes element-wise multiplication and $\delta^{(l)} = \partial \mathcal{L} / \partial z^{(l)}$ is the error signal at layer $l$.
Complexity. The backward pass requires exactly one multiplication per edge in the computational graph, matching the forward pass in cost. Total gradient computation is $O(E)$ for a graph with $E$ edges, not the $O(P \cdot E)$ that naive numerical differentiation (one perturbed forward pass per each of the $P$ parameters) would require.
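A hand-written forward/backward pass for a two-layer MLP makes the recursion concrete. The shapes, ReLU activation, and squared-error loss below are illustrative assumptions, not prescribed above:

```python
import numpy as np

# Forward/backward pass for a two-layer MLP on one example.

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 1

W1 = rng.normal(0, np.sqrt(2 / n_in), (n_hidden, n_in))      # He init
b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(0, np.sqrt(2 / n_hidden), (n_out, n_hidden))
b2 = np.zeros((n_out, 1))

x = rng.normal(size=(n_in, 1))
y = np.array([[1.0]])

# Forward pass: cache z and a at each layer.
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)                  # ReLU
z2 = W2 @ a1 + b2                       # linear output layer
loss = 0.5 * np.sum((z2 - y) ** 2)

# Backward pass: error signals delta^(l) in reverse order.
delta2 = z2 - y                         # dL/dz2 for squared error
dW2 = delta2 @ a1.T
db2 = delta2
delta1 = (W2.T @ delta2) * (z1 > 0)     # backprop through W2 and ReLU
dW1 = delta1 @ x.T
db1 = delta1
```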
Gradient Descent Variants
Batch Gradient Descent
Update using the gradient computed over the entire training set:

$$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)$$

where $\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i)$. This produces the true gradient but requires $O(N)$ computation per update, which is prohibitive for large datasets.
Stochastic Gradient Descent (SGD)
Update using a single randomly sampled example:

$$\theta \leftarrow \theta - \eta \nabla_\theta \ell(\theta; x_i, y_i), \qquad i \sim \mathrm{Uniform}\{1, \ldots, N\}$$

The stochastic gradient is an unbiased estimator of the full gradient: $\mathbb{E}_i\big[\nabla_\theta \ell(\theta; x_i, y_i)\big] = \nabla_\theta \mathcal{L}(\theta)$. The variance of this estimator introduces noise that can help escape local minima but also causes oscillation near convergence.
Mini-batch SGD
The practical compromise. Sample a batch $B$ of size $m$:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{m} \sum_{i \in B} \nabla_\theta \ell(\theta; x_i, y_i)$$

Variance scales as $1/m$, so larger batches produce smoother updates. Typical batch sizes: 32–512 for most tasks, up to 4096+ for large-scale pretraining with appropriate learning rate scaling (linear scaling rule: $\eta \propto m$).
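A minimal training-loop sketch. `grad_fn(theta, batch)` is a hypothetical helper returning the gradient averaged over a batch (e.g., computed via backprop as above):

```python
import numpy as np

# Mini-batch SGD: reshuffle each epoch, step once per batch.

def minibatch_sgd(theta, grad_fn, data, lr=0.1, batch_size=32, epochs=10):
    rng = np.random.default_rng(0)
    n = len(data)
    for _ in range(epochs):
        perm = rng.permutation(n)                    # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in perm[start:start + batch_size]]
            theta = theta - lr * grad_fn(theta, batch)
    return theta
```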
Momentum
SGD oscillates in directions of high curvature and moves slowly along directions of low curvature. Momentum accumulates a velocity vector that smooths these oscillations:

$$v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_t$$

with $\beta \in [0, 1)$ (typically 0.9). The velocity acts as an exponential moving average of past gradients. In a quadratic loss landscape with condition number $\kappa$, SGD converges in $O(\kappa)$ steps while SGD with momentum converges in $O(\sqrt{\kappa})$.
Nesterov momentum. Evaluates the gradient at the “look-ahead” position:

$$v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t - \eta \beta v_{t-1}), \qquad \theta_{t+1} = \theta_t - \eta v_t$$

This provides a correction term that reduces overshooting. Nesterov accelerated gradient achieves the optimal $O(1/t^2)$ convergence rate for smooth convex optimization.
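Side by side, the two variants differ only in where the gradient is evaluated. A sketch on a toy quadratic (the objective $f(\theta) = \lVert\theta\rVert^2$ is an illustrative assumption):

```python
import numpy as np

# One step each of classical and Nesterov momentum.

def grad(theta):
    return 2.0 * theta                  # gradient of f(theta) = ||theta||^2

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
beta, lr = 0.9, 0.01

# Classical: accumulate the gradient at the current point, then step.
v = beta * v + grad(theta)
theta = theta - lr * v

# Nesterov: evaluate the gradient at the look-ahead point instead.
v = beta * v + grad(theta - lr * beta * v)
theta = theta - lr * v
```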
Adaptive Learning Rate Methods
AdaGrad
Accumulates squared gradients per parameter (all operations element-wise), scaling the learning rate inversely:

$$G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$

Parameters with large historical gradients get smaller learning rates. This is effective for sparse features but problematic for dense updates: the accumulator $G_t$ grows monotonically, eventually reducing the effective learning rate to near zero.
RMSProp
Fixes AdaGrad’s monotonic accumulation with an exponential moving average:

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$

with $\rho = 0.9$ typical. The moving average forgets old gradients, keeping the effective learning rate from decaying to zero.
Adam
Combines momentum (first moment) with RMSProp (second moment):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Bias correction (critical for early steps, when $m_t$ and $v_t$ are biased toward their zero initialization):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Adam is the default optimizer for most deep learning tasks.
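A single Adam step with the bias correction spelled out; the function name and signature are illustrative, following the update equations above:

```python
import numpy as np

# One Adam step. t is the 1-indexed step count; m and v carry state.

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```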
AdamW
Decouples weight decay from the adaptive gradient step (Loshchilov and Hutter, 2019):

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$
In standard Adam, L2 regularization interacts poorly with adaptive learning rates: parameters with small gradients (and thus large effective learning rates) receive disproportionately strong regularization. AdamW applies weight decay directly to the parameters, independent of the adaptive scaling. This is the standard optimizer for transformer pretraining.
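Relative to the Adam step sketched above, the change is one line: the decay term is added outside the adaptive scaling (again an illustrative sketch, not a library implementation):

```python
import numpy as np

# AdamW step: moment updates match adam_step above; weight decay
# wd * theta is applied outside the adaptive scaling.

def adamw_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```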
Learning Rate Schedules
The learning rate $\eta$ is the single most important hyperparameter. Modern training uses schedules that vary it over the course of training.
Warmup. Start with a small $\eta$ and linearly increase it to the peak value over the first $T_{\text{warmup}}$ steps. This stabilizes training when the model’s initial random predictions produce large, noisy gradients. Standard for transformers ($T_{\text{warmup}} \approx$ 1–10% of total steps).
Cosine annealing. After warmup, decay following a cosine curve:

$$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$
This produces a smooth decay that spends more time at moderate learning rates than linear or exponential schedules.
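Warmup and cosine decay compose naturally into a single schedule function; the parameter names below are illustrative:

```python
import math

# Linear warmup to eta_max, then cosine decay to eta_min.

def lr_schedule(step, total_steps, warmup_steps, eta_max, eta_min=0.0):
    if step < warmup_steps:
        return eta_max * step / max(1, warmup_steps)        # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```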
Step decay. Reduce $\eta$ by a fixed factor (typically 10×) at scheduled epochs. Common in vision (e.g., at epochs 30, 60, 90 for ImageNet). Less popular now than cosine annealing.
One-cycle policy (Smith, 2018). Ramp $\eta$ up and then down in a single cycle. Enables training with much larger peak learning rates, which can improve generalization.
Weight Initialization
Proper initialization prevents gradients from vanishing or exploding in the first forward pass.
Xavier/Glorot Initialization
For layers with sigmoid or tanh activations:

$$W_{ij} \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right) \quad \text{or} \quad W_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

Derived by requiring $\operatorname{Var}(a^{(l)}) = \operatorname{Var}(a^{(l-1)})$ for activations and the analogous condition for gradients, maintaining constant variance in both the forward and backward pass.
Kaiming/He Initialization
For ReLU activations, which zero out half the distribution:

$$W_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}}}\right)$$

The factor of 2 compensates for ReLU killing half the signal. Without this correction, variance halves at each layer, causing exponential decay in deep networks.
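Both initializers in a few lines of NumPy, using the same `(n_out, n_in)` weight-shape convention as the backprop example above (an illustrative choice):

```python
import numpy as np

# Xavier (uniform) and He (normal) initializers for a dense layer.

def xavier_init(n_in, n_out, rng):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_init(n_in, n_out, rng):
    # Variance 2/n_in compensates for ReLU zeroing half the inputs.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
W = he_init(4, 8, rng)                   # for a 4 -> 8 ReLU layer
```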
Batch Normalization
Batch normalization (Ioffe and Szegedy, 2015) normalizes layer inputs to zero mean and unit variance, then applies a learned affine transform:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance computed over the mini-batch, and $\gamma$, $\beta$ are learned scale and shift parameters.
Why it works. Batch norm:
- Reduces internal covariate shift: each layer receives inputs with stable statistics
- Smooths the loss landscape: enables higher learning rates without divergence
- Acts as a regularizer: the estimated batch statistics introduce noise that shrinks as the batch size grows
- Enables training of very deep networks that would otherwise be unstable
Layer normalization normalizes across features rather than across the batch, making it suitable for variable-length sequences and small batch sizes. It is the standard normalization for transformers:

$$y = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$$
RMSNorm (Zhang and Sennrich, 2019) drops the mean centering, normalizing only by the root mean square. Used in LLaMA and many modern LLMs for computational efficiency:

$$y = \gamma \odot \frac{x}{\operatorname{RMS}(x)}, \qquad \operatorname{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}$$
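Both normalizations over the feature axis in NumPy; the epsilon values and broadcasting conventions are illustrative:

```python
import numpy as np

# LayerNorm and RMSNorm over the last (feature) axis.

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms               # no mean centering, no shift
```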
Residual Connections
Deep networks suffer from the degradation problem: adding more layers can increase training error (not just test error), even when the additional layers could in principle learn the identity. Residual connections (He et al., 2016) address this by providing shortcut paths:

$$y = F(x) + x$$

where $F$ is the residual function (typically two conv layers or an MLP block). The network only needs to learn the residual $F(x) = H(x) - x$ relative to the desired mapping $H$, which is easier to optimize than the full mapping. When $F(x) = 0$, the layer reduces to the identity, making it easy for the network to “skip” unnecessary layers.
Gradient flow. The gradient through a residual block is:

$$\frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}$$

The additive identity term ensures that gradients can flow directly to early layers without passing through potentially vanishing multipliers. This is why ResNets can train networks with 100+ layers while plain networks fail beyond roughly 20 layers.
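A residual block is one addition on top of an ordinary two-layer transform; a NumPy sketch with illustrative shapes ($W_1$: hidden × $d$, $W_2$: $d$ × hidden, so the output matches $x$):

```python
import numpy as np

# Residual block y = x + F(x) with F a two-layer MLP.

def residual_block(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0)       # residual branch: layer 1 + ReLU
    return x + (W2 @ h + b2)             # shortcut: add the input back
```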
Residual connections are foundational to modern architectures: every transformer layer uses them, wrapping both the attention and feedforward sublayers.
Gradient Pathologies
Vanishing Gradients
In deep networks with sigmoid/tanh activations, the gradient at layer $l$ scales as:

$$\frac{\partial \mathcal{L}}{\partial a^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \prod_{k=l+1}^{L} W^{(k)\top} \operatorname{diag}\!\big(\sigma'(z^{(k)})\big)$$

Since $\sigma'(z) \le 1/4$ for the sigmoid and typical weight norms are of order 1, each factor has norm well below 1, so the product shrinks exponentially with depth. Mitigations: ReLU activations, residual connections, proper initialization, normalization layers.
Exploding Gradients
When weight matrices have spectral radius $\rho(W) > 1$, gradients grow exponentially. This manifests as NaN losses or parameter updates that overshoot wildly. Mitigations:
Gradient clipping by norm:

$$g \leftarrow g \cdot \min\left(1,\; \frac{c}{\lVert g \rVert_2}\right)$$

This rescales the gradient vector to have norm at most $c$ while preserving its direction. Standard in RNN and transformer training ($c = 1.0$ is typical).

Gradient clipping by value clips each component independently to $[-c, c]$. Simpler, but it distorts the gradient direction.
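Both clipping strategies in NumPy, with $c = 1.0$ as a commonly used default:

```python
import numpy as np

# Clipping by global norm vs. by value.

def clip_by_norm(g, c=1.0):
    norm = np.linalg.norm(g)
    return g * min(1.0, c / (norm + 1e-12))   # rescale, keep direction

def clip_by_value(g, c=1.0):
    return np.clip(g, -c, c)                  # per-component, distorts direction
```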
Summary
| Component | Purpose | Key Formula |
|---|---|---|
| Backpropagation | Compute gradients via reverse-mode autodiff | $\delta^{(l)} = \big((W^{(l+1)})^\top \delta^{(l+1)}\big) \odot \sigma'(z^{(l)})$ |
| SGD | Stochastic optimization | $\theta \leftarrow \theta - \eta \nabla_\theta \ell(\theta; x_i, y_i)$ |
| Adam | Adaptive per-parameter learning rates | First + second moment EMA with bias correction |
| AdamW | Decoupled weight decay | Standard for transformer training |
| Batch/Layer Norm | Stabilize activations | Normalize → learned affine transform |
| Residual connections | Enable gradient flow in deep networks | $y = F(x) + x$ |
| Gradient clipping | Prevent exploding gradients | Rescale $g$ if $\lVert g \rVert_2 > c$ |
Backpropagation provides exact gradients; the optimizer converts gradients into parameter updates; normalization and residual connections stabilize training across depth. Together, these components enable training networks with billions of parameters.