Neural Networks
Neural networks are parameterized function approximators composed of alternating linear transformations and element-wise nonlinearities. This article develops the feedforward architecture from the single-neuron perceptron through deep multilayer networks, with emphasis on the representational consequences of depth and activation choice.
The Perceptron
The perceptron computes a linear function of the input followed by a threshold:

$\hat{y} = \mathrm{step}(\mathbf{w}^\top \mathbf{x} + b)$

where $\mathrm{step}(z) = 1$ if $z \ge 0$ and $0$ otherwise. The weight vector $\mathbf{w}$ and bias $b$ define a hyperplane in input space. The perceptron classifies inputs by which side of this hyperplane they fall on.
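As a concrete illustration, here is a minimal NumPy sketch of the perceptron forward pass, applied to the AND gate (which is linearly separable); the particular weights are one choice that realizes it, not canonical values:

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron forward pass: threshold a linear function of the input."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# The AND gate is linearly separable; w = [1, 1], b = -1.5 is one
# hyperplane that separates (1, 1) from the other three inputs.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))  # only (1, 1) maps to 1
```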
Geometric interpretation. The decision boundary $\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}$ is a $(d-1)$-dimensional hyperplane with normal vector $\mathbf{w}$. The bias $b$ controls the offset from the origin. The perceptron can represent any linearly separable function, and only linearly separable functions.
The XOR problem. The function $\mathrm{XOR}(x_1, x_2)$ is not linearly separable: no single hyperplane in $\mathbb{R}^2$ can separate the positive examples $\{(0,1), (1,0)\}$ from the negative examples $\{(0,0), (1,1)\}$. Minsky and Papert (1969) formalized this limitation, demonstrating that single-layer perceptrons cannot compute parity functions. This motivated the development of multilayer architectures.
Multilayer Perceptrons
A multilayer perceptron (MLP) composes multiple layers of linear transformations with nonlinear activations. For a network with $L$ hidden layers:

$\mathbf{h}^{(0)} = \mathbf{x}, \qquad \mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)}), \qquad l = 1, \dots, L$

where $W^{(l)}$ are weight matrices, $\mathbf{b}^{(l)}$ are bias vectors, and $\sigma$ is a nonlinear activation function applied element-wise. The pre-activation values $\mathbf{z}^{(l)}$ and post-activation values $\mathbf{h}^{(l)}$ are cached during the forward pass for use in backpropagation.
Why nonlinearity is essential. Without activation functions, a composition of linear maps is itself linear: $W_2(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2)$. The network collapses to a single linear transformation regardless of depth. Nonlinear activations break this degeneracy and enable the network to represent nonlinear decision boundaries.
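The collapse of stacked linear layers can be verified numerically; a small sketch with arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)
x = rng.normal(size=3)

# Two "layers" with no activation in between ...
two_layer = W2 @ (W1 @ x + b1) + b2

# ... equal a single linear layer with collapsed weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layer, one_layer)
```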
Activation Functions
The choice of activation function has significant consequences for optimization dynamics, gradient flow, and representational capacity.
Sigmoid
The sigmoid $\sigma(z) = 1/(1 + e^{-z})$ squashes inputs to $(0, 1)$. Its derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ has maximum value $1/4$ at $z = 0$ and decays exponentially for large $|z|$. This causes the vanishing gradient problem: in deep networks, gradients flowing through many sigmoid layers shrink exponentially, making early layers nearly untrainable. Additionally, sigmoid outputs are not zero-centered, which introduces systematic bias in gradient updates for downstream weights.
Tanh
Tanh maps to $(-1, 1)$ and is zero-centered, resolving the bias issue. However, it still saturates for large $|z|$, producing vanishing gradients in deep networks. It is related to the sigmoid by $\tanh(z) = 2\sigma(2z) - 1$.
ReLU
The Rectified Linear Unit, $\mathrm{ReLU}(z) = \max(0, z)$, is the default activation in modern deep learning. Its gradient is exactly 1 for positive inputs, eliminating vanishing gradients in the active regime. Computation is a single comparison, making it significantly faster than exponential-based activations.
Dying ReLU problem. If a neuron’s pre-activation is negative for all training examples (e.g., due to a large negative bias update), its gradient is permanently zero and the neuron stops learning. This can affect a significant fraction of neurons in wide networks with aggressive learning rates.
Variants
Leaky ReLU: $\mathrm{LeakyReLU}(z) = \max(\alpha z, z)$ with small $\alpha$ (typically 0.01). Provides a non-zero gradient for negative inputs, preventing dead neurons.
GELU (Gaussian Error Linear Unit): $\mathrm{GELU}(z) = z\,\Phi(z)$, where $\Phi$ is the standard Gaussian CDF. Used in BERT, GPT, and most modern transformer architectures. GELU smoothly gates the input by its own magnitude, combining the benefits of ReLU (sparse activation) with a smooth, non-zero gradient everywhere. Approximated as $0.5\,z\,\big(1 + \tanh\big[\sqrt{2/\pi}\,(z + 0.044715\,z^3)\big]\big)$.
SiLU / Swish: $\mathrm{SiLU}(z) = z \cdot \sigma(z)$. Discovered through automated activation function search (Ramachandran et al., 2017). Closely related to GELU, used in EfficientNet and many vision architectures.
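The activations above are a few lines each in NumPy; a sketch that also checks the identities stated in this section (the sigmoid derivative peak, the tanh relation, and the accuracy of the GELU tanh approximation):

```python
import numpy as np
from math import erf, sqrt, pi

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

def gelu_exact(z):
    # z * Phi(z), with the standard Gaussian CDF via the error function.
    return z * 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gelu_tanh(z):
    # The tanh approximation quoted above.
    return 0.5 * z * (1.0 + np.tanh(sqrt(2.0 / pi) * (z + 0.044715 * z**3)))

def silu(z):
    return z * sigmoid(z)

z = 1.3
# sigmoid'(z) = sigmoid(z)(1 - sigmoid(z)) peaks at 1/4 when z = 0.
assert abs(sigmoid(0.0) * (1 - sigmoid(0.0)) - 0.25) < 1e-12
# tanh(z) = 2*sigmoid(2z) - 1.
assert abs(np.tanh(z) - (2 * sigmoid(2 * z) - 1)) < 1e-12
# The tanh approximation of GELU agrees to within ~1e-3.
assert abs(gelu_exact(z) - gelu_tanh(z)) < 1e-3
```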
Forward Propagation
The forward pass computes the network’s output for a given input by sequentially applying each layer’s transformation. For a two-hidden-layer network with ReLU activations and a regression output:

$\mathbf{h}^{(1)} = \mathrm{ReLU}(W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}), \qquad \mathbf{h}^{(2)} = \mathrm{ReLU}(W^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)}), \qquad \hat{y} = W^{(3)} \mathbf{h}^{(2)} + \mathbf{b}^{(3)}$
The intermediate representations $\mathbf{h}^{(1)}, \mathbf{h}^{(2)}$ are learned features. Unlike hand-crafted feature engineering, the network discovers useful representations automatically through gradient-based optimization. Each hidden layer computes a progressively more abstract representation of the input.
Parameter count. For a network with layer widths $d_0, d_1, \dots, d_L$, the total parameter count is $\sum_{l=1}^{L} (d_{l-1} d_l + d_l)$. A network with input dimension 784, two hidden layers of 256 and 128 units, and 10 output classes has $(784 \cdot 256 + 256) + (256 \cdot 128 + 128) + (128 \cdot 10 + 10) = 235{,}146$ parameters.
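A minimal NumPy sketch of this forward pass and the parameter count (the helper names `init_mlp` and `forward` are illustrative, not a standard API):

```python
import numpy as np

def init_mlp(widths, rng):
    """widths = [d0, d1, ..., dL]; returns one (W, b) pair per layer."""
    return [(rng.normal(scale=0.01, size=(widths[i + 1], widths[i])),
             np.zeros(widths[i + 1]))
            for i in range(len(widths) - 1)]

def forward(params, x):
    h = x
    for i, (W, b) in enumerate(params):
        z = W @ h + b
        # ReLU on hidden layers, linear (no activation) on the output layer.
        h = np.maximum(0.0, z) if i < len(params) - 1 else z
    return h

rng = np.random.default_rng(0)
params = init_mlp([784, 256, 128, 10], rng)
n_params = sum(W.size + b.size for W, b in params)
print(n_params)  # 235146, matching the count computed above
```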
Output Layers and Loss Functions
The output layer and loss function are chosen jointly based on the prediction task.
Regression
Output: linear (no activation). Loss: mean squared error, $\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$.
For heteroscedastic targets or zero-inflated distributions, alternative losses such as Tweedie, Huber, or quantile regression losses may be more appropriate. The tracker cost estimation model uses Tweedie loss for exactly this reason: the zero-inflated transfer size distribution renders MSE suboptimal.
Binary Classification
Output: sigmoid activation producing $\hat{p} \in (0, 1)$. Loss: binary cross-entropy, $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \big]$.
Cross-entropy is the negative log-likelihood under a Bernoulli model, making it the maximum likelihood objective for binary classification. It produces stronger gradients than MSE when the prediction is confident but wrong.
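The "stronger gradients" claim can be made concrete by differentiating both losses with respect to the logit $z$ (with $p = \sigma(z)$, the chain rule gives $\partial \mathcal{L}_{\mathrm{BCE}}/\partial z = p - y$ and $\partial \mathcal{L}_{\mathrm{MSE}}/\partial z = 2(p - y)\,p\,(1 - p)$); a sketch for a confidently wrong prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# True label y = 1, but the logit is very negative: confident and wrong.
y, z = 1.0, -6.0
p = sigmoid(z)  # ~0.0025

# Gradient of BCE w.r.t. the logit: p - y. Stays large when wrong.
grad_bce = p - y
# Gradient of MSE w.r.t. the logit: 2 (p - y) p (1 - p). Vanishes
# as the sigmoid saturates, so learning stalls exactly when the
# prediction most needs correcting.
grad_mse = 2 * (p - y) * p * (1 - p)

print(abs(grad_bce), abs(grad_mse))  # BCE gradient is orders of magnitude larger
```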
Multi-class Classification
Output: softmax activation producing a probability distribution over $K$ classes: $\mathrm{softmax}(\mathbf{z})_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$.
Loss: categorical cross-entropy (negative log-likelihood of the multinomial), $\mathcal{L} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$.
The softmax function is invariant to additive constants ($\mathrm{softmax}(\mathbf{z} + c\mathbf{1}) = \mathrm{softmax}(\mathbf{z})$), so the logits encode relative, not absolute, class scores. In practice, the maximum logit is subtracted for numerical stability before exponentiation.
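A sketch of the numerically stable softmax, verifying both the shift invariance and that the max-subtraction trick avoids overflow:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)

# Shift invariance: adding a constant to every logit leaves the output unchanged.
assert np.allclose(p, softmax(z + 100.0))

# Without the shift, np.exp(1000.0) would overflow to inf; with it, the
# result is finite and well-defined.
assert np.isfinite(softmax(np.array([1000.0, 999.0]))).all()
```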
The Universal Approximation Theorem
Theorem (Cybenko, 1989; Hornik, 1991). A feedforward network with a single hidden layer of finite width and a non-polynomial activation function can approximate any continuous function on a compact subset of $\mathbb{R}^d$ to arbitrary precision.
Formally: for any continuous $f : K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^d$, any $\varepsilon > 0$, and any non-polynomial continuous activation $\sigma$, there exist $N \in \mathbb{N}$, $\mathbf{w}_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$, $a_i \in \mathbb{R}$ such that:

$\sup_{\mathbf{x} \in K} \Big| f(\mathbf{x}) - \sum_{i=1}^{N} a_i\, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i) \Big| < \varepsilon$
What this means. Neural networks have sufficient representational capacity to approximate any reasonable target function. The theorem is existential: it guarantees the existence of suitable weights but says nothing about whether gradient descent can find them, or how many neurons are required. In practice, the required width for a single hidden layer may be exponentially large in the input dimension.
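For some targets the required width is tiny: a one-hidden-layer ReLU network with just two neurons represents $|x|$ exactly, since $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$. A sketch of this hand-constructed network (the weights are written down, not learned, which is precisely the gap between existence and trainability discussed here):

```python
import numpy as np

# Hidden layer: two neurons computing ReLU(x) and ReLU(-x).
W1, b1 = np.array([[1.0], [-1.0]]), np.zeros(2)
# Output layer: sum the two hidden activations.
a, c = np.array([1.0, 1.0]), 0.0

def net(x):
    h = np.maximum(0.0, W1 @ np.array([x]) + b1)
    return a @ h + c

# The network equals |x| everywhere, not just approximately.
xs = np.linspace(-3.0, 3.0, 101)
assert all(abs(net(x) - abs(x)) < 1e-12 for x in xs)
```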
What this does not mean. The theorem does not guarantee:
- That the required width is tractable
- That gradient-based optimization will converge to the approximating solution
- That the learned function will generalize to unseen data
- Anything about the sample complexity of learning
The gap between approximation theory (what networks can represent) and learning theory (what networks will learn from finite data) is a central theme in deep learning theory.
Depth vs. Width
The universal approximation theorem shows that a single hidden layer suffices in principle, but deeper networks achieve the same approximation with exponentially fewer parameters in many cases.
Depth efficiency. Telgarsky (2016) proved that there exist functions computable by narrow networks of depth $\Theta(k^3)$ that cannot be approximated by networks of depth $O(k)$ unless their width is exponential in $k$. Intuitively, depth enables hierarchical composition: a deep network can build complex features by composing simple ones, while a shallow network must represent the same complexity in a single layer.
Practical implications. Modern architectures are deep (tens to hundreds of layers) rather than wide. ResNets, transformers, and most production models achieve their performance through depth. The key enablers of training deep networks are residual connections (He et al., 2016), batch/layer normalization, and careful initialization schemes. These are covered in the backpropagation article.
Expressiveness of ReLU Networks
ReLU networks compute piecewise linear functions. Each neuron partitions the input space with a hyperplane (where its pre-activation $\mathbf{w}^\top \mathbf{x} + b = 0$), and the network output is a different linear function in each region of the resulting partition.
A single hidden layer with $n$ ReLU neurons can produce at most $\sum_{j=0}^{d} \binom{n}{j}$ linear regions in $\mathbb{R}^d$ (though this bound is rarely tight). A deep ReLU network with $L$ layers of width $n$ can produce a number of linear regions exponential in $L$, on the order of $(n/d)^{d(L-1)} n^d$ (Montúfar et al., 2014). This is the formal basis for depth efficiency: deeper networks can carve the input space into exponentially more decision regions.
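Linear regions can be counted directly in one dimension, where the single-layer bound specializes to $n + 1$ (each neuron contributes at most one breakpoint at $x = -b_i / w_i$). A sketch that counts regions of a random one-hidden-layer net by tracking where the activation pattern changes on a fine grid:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                  # hidden width
w = rng.normal(size=n)                 # hidden weights (1-D input)
b = rng.normal(size=n)                 # hidden biases

xs = np.linspace(-10.0, 10.0, 100_001)
# Activation pattern at each x: which neurons are "on" (pre-activation > 0).
patterns = (np.outer(xs, w) + b) > 0   # shape (len(xs), n)

# Each change in the pattern between adjacent grid points marks a new
# linear region; some breakpoints may fall outside the grid, so this
# counts at most n + 1 regions.
n_regions = 1 + np.count_nonzero(np.any(patterns[1:] != patterns[:-1], axis=1))
print(n_regions)
assert n_regions <= n + 1
```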
Summary
| Concept | Key Idea |
|---|---|
| Perceptron | Single linear classifier, limited to linearly separable functions |
| MLP | Composition of linear layers + nonlinearities, can approximate any continuous function |
| Activation functions | Enable nonlinearity; ReLU is default, GELU for transformers |
| Universal approximation | Width suffices in theory; depth is more efficient in practice |
| Forward pass | Sequential layer-by-layer computation; intermediate activations are learned features |
| Output/loss pairing | Sigmoid + BCE for binary, softmax + cross-entropy for multi-class, linear + MSE for regression |
The forward pass computes predictions; the next article covers backpropagation, which computes the gradients needed to train these networks via gradient descent.