Neural Networks

Neural networks are parameterized function approximators composed of alternating linear transformations and element-wise nonlinearities. This article develops the feedforward architecture from the single-neuron perceptron through deep multilayer networks, with emphasis on the representational consequences of depth and activation choice.


The Perceptron

The perceptron computes a linear function of the input followed by a threshold:

$$y = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$

where $\sigma$ is a step function: $\sigma(z) = \mathbb{I}[z > 0]$. The weight vector $\mathbf{w} \in \mathbb{R}^D$ and bias $b \in \mathbb{R}$ define a hyperplane in input space. The perceptron classifies inputs by which side of this hyperplane they fall on.

Geometric interpretation. The decision boundary $\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}$ is a $(D-1)$-dimensional hyperplane with normal vector $\mathbf{w}$. The bias $b$ controls the offset from the origin. The perceptron can represent any linearly separable function, and only linearly separable functions.

The XOR problem. The function $\text{XOR}(x_1, x_2) = x_1 \oplus x_2$ is not linearly separable: no single hyperplane in $\mathbb{R}^2$ can separate the positive examples $\{(0,1), (1,0)\}$ from the negative examples $\{(0,0), (1,1)\}$. Minsky and Papert (1969) formalized this limitation, demonstrating that single-layer perceptrons cannot compute parity functions. This motivated the development of multilayer architectures.
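A minimal numpy sketch of these two facts. The AND weights and the search grid are illustrative choices, not from the article; the grid search is a demonstration, not a proof:

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: linear function followed by a hard threshold."""
    return (w @ x + b > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# AND is linearly separable: one hand-picked hyperplane suffices.
w_and, b_and = np.array([1.0, 1.0]), -1.5
print([perceptron(x, w_and, b_and) for x in X])  # [0, 0, 0, 1]

# XOR is not: a coarse grid search over hyperplanes finds no separator.
y_xor = np.array([0, 1, 1, 0])
grid = np.linspace(-2, 2, 21)
found = any(
    all(perceptron(x, np.array([w1, w2]), b) == t for x, t in zip(X, y_xor))
    for w1 in grid for w2 in grid for b in grid
)
print(found)  # False -- consistent with the non-separability argument
```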


Multilayer Perceptrons

A multilayer perceptron (MLP) composes multiple layers of linear transformations with nonlinear activations. For a network with $L$ hidden layers:

$$\mathbf{h}^{(0)} = \mathbf{x}$$
$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad l = 1, \ldots, L$$
$$\mathbf{h}^{(l)} = \phi(\mathbf{z}^{(l)}), \quad l = 1, \ldots, L$$
$$\hat{y} = \mathbf{W}^{(L+1)} \mathbf{h}^{(L)} + \mathbf{b}^{(L+1)}$$

where $\mathbf{W}^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ are weight matrices, $\mathbf{b}^{(l)} \in \mathbb{R}^{d_l}$ are bias vectors, and $\phi$ is a nonlinear activation function applied element-wise. The pre-activation values $\mathbf{z}^{(l)}$ and post-activation values $\mathbf{h}^{(l)}$ are cached during the forward pass for use in backpropagation.

Why nonlinearity is essential. Without activation functions, a composition of linear maps is itself linear: $\mathbf{W}^{(2)}(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)} = \mathbf{W}'\mathbf{x} + \mathbf{b}'$. The network collapses to a single linear transformation regardless of depth. Nonlinear activations break this degeneracy and enable the network to represent nonlinear decision boundaries.
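A quick numerical check of this collapse (random shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers...
two_layer = W2 @ (W1 @ x + b1) + b2
# ...equal a single linear layer with W' = W2 W1 and b' = W2 b1 + b2.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layer, collapsed))  # True
```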


Activation Functions

The choice of activation function has significant consequences for optimization dynamics, gradient flow, and representational capacity.

Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z))$$

The sigmoid squashes inputs to $(0, 1)$. Its derivative has maximum value $1/4$ at $z = 0$ and decays exponentially for $|z| \gg 0$. This causes the vanishing gradient problem: in deep networks, gradients flowing through many sigmoid layers shrink exponentially, making early layers nearly untrainable. Additionally, sigmoid outputs are not zero-centered, which introduces systematic bias in gradient updates for downstream weights.

Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \quad \tanh'(z) = 1 - \tanh^2(z)$$

Tanh maps to $(-1, 1)$ and is zero-centered, resolving the bias issue. However, it still saturates for large $|z|$, producing vanishing gradients in deep networks. It is related to the sigmoid by $\tanh(z) = 2\sigma(2z) - 1$.

ReLU

$$\text{ReLU}(z) = \max(0, z), \quad \text{ReLU}'(z) = \mathbb{I}[z > 0]$$

The Rectified Linear Unit is the default activation in modern deep learning. Its gradient is exactly 1 for positive inputs, eliminating vanishing gradients in the active regime. Computation is a single comparison, making it significantly faster than exponential-based activations.

Dying ReLU problem. If a neuron’s pre-activation is negative for all training examples (e.g., due to a large negative bias update), its gradient is permanently zero and the neuron stops learning. This can affect a significant fraction of neurons in wide networks with aggressive learning rates.

Variants

Leaky ReLU: $f(z) = \max(\alpha z, z)$ with small $\alpha > 0$ (typically 0.01). Provides a non-zero gradient for negative inputs, preventing dead neurons.

GELU (Gaussian Error Linear Unit): $f(z) = z \cdot \Phi(z)$ where $\Phi$ is the standard Gaussian CDF. Used in BERT, GPT, and most modern transformer architectures. GELU smoothly gates the input by its own magnitude, combining the benefits of ReLU (sparse activation) with a smooth, non-zero gradient everywhere. Approximated as $0.5z\left(1 + \tanh\left[\sqrt{2/\pi}\,(z + 0.044715z^3)\right]\right)$.

SiLU / Swish: $f(z) = z \cdot \sigma(z)$. Discovered through automated activation function search (Ramachandran et al., 2017). Closely related to GELU; used in EfficientNet and many vision architectures.
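The activations above are simple enough to implement directly; a minimal numpy sketch, using the GELU tanh approximation from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def gelu(z):
    # tanh approximation quoted above
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def silu(z):
    return z * sigmoid(z)

z = np.linspace(-3, 3, 7)
for name, f in [("sigmoid", sigmoid), ("tanh", np.tanh), ("relu", relu),
                ("leaky_relu", leaky_relu), ("gelu", gelu), ("silu", silu)]:
    print(f"{name:>10}: {np.round(f(z), 3)}")
```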


Forward Propagation

The forward pass computes the network’s output for a given input by sequentially applying each layer’s transformation. For a two-hidden-layer network with ReLU activations and a regression output:

$$\mathbf{h}^{(1)} = \text{ReLU}(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$$
$$\mathbf{h}^{(2)} = \text{ReLU}(\mathbf{W}^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)})$$
$$\hat{y} = \mathbf{w}^{(3)\top} \mathbf{h}^{(2)} + b^{(3)}$$

The intermediate representations $\mathbf{h}^{(1)}, \mathbf{h}^{(2)}$ are learned features. Unlike hand-crafted feature engineering, the network discovers useful representations automatically through gradient-based optimization. Each hidden layer computes a progressively more abstract representation of the input.
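A minimal numpy sketch of this forward pass; the layer widths and random weights are illustrative, not trained values:

```python
import numpy as np

def forward(x, params):
    """Forward pass for the two-hidden-layer regression network above."""
    W1, b1, W2, b2, w3, b3 = params
    h1 = np.maximum(0.0, W1 @ x + b1)   # first learned representation
    h2 = np.maximum(0.0, W2 @ h1 + b2)  # second, more abstract representation
    return w3 @ h2 + b3                 # linear regression head

rng = np.random.default_rng(0)
d0, d1, d2 = 4, 8, 6                    # illustrative widths
params = (rng.normal(size=(d1, d0)) * 0.1, np.zeros(d1),
          rng.normal(size=(d2, d1)) * 0.1, np.zeros(d2),
          rng.normal(size=d2) * 0.1, 0.0)
print(forward(rng.normal(size=d0), params))
```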

Parameter count. For a network with layer widths $[d_0, d_1, \ldots, d_{L+1}]$, the total parameter count is $\sum_{l=1}^{L+1}(d_{l-1} d_l + d_l)$. A network with input dimension 784, two hidden layers of 256 and 128 units, and 10 output classes has $784 \cdot 256 + 256 + 256 \cdot 128 + 128 + 128 \cdot 10 + 10 = 235{,}146$ parameters.
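As a sanity check, a small helper (hypothetical, not from the article) that computes this sum from a width list:

```python
def param_count(widths):
    """Total parameters for layer widths [d0, d1, ..., d_{L+1}]:
    each layer contributes d_in * d_out weights plus d_out biases."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(widths, widths[1:]))

print(param_count([784, 256, 128, 10]))  # 235146, matching the text
```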


Output Layers and Loss Functions

The output layer and loss function are chosen jointly based on the prediction task.

Regression

Output: linear (no activation). Loss: mean squared error.

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^N (\hat{y}^{(i)} - y^{(i)})^2$$

For heteroscedastic targets or zero-inflated distributions, alternative losses such as Tweedie, Huber, or quantile regression losses may be more appropriate. The tracker cost estimation model uses Tweedie loss for exactly this reason: the zero-inflated transfer size distribution renders MSE suboptimal.
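A hedged sketch of two of these losses, MSE and Huber; the threshold $\delta$ and the toy zero-inflated targets are illustrative choices (Tweedie is omitted for brevity):

```python
import numpy as np

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def huber(y_hat, y, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails,
    so large residuals contribute bounded gradients."""
    r = np.abs(y_hat - y)
    quad = 0.5 * r**2
    lin = delta * (r - 0.5 * delta)
    return np.mean(np.where(r <= delta, quad, lin))

y = np.array([0.0, 0.0, 0.0, 12.0])    # illustrative zero-inflated targets
y_hat = np.array([0.1, -0.2, 0.0, 3.0])
print(mse(y_hat, y), huber(y_hat, y))  # Huber penalizes the outlier less
```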

Binary Classification

Output: sigmoid activation producing $\hat{p} = \sigma(\mathbf{w}^\top \mathbf{h}^{(L)})$. Loss: binary cross-entropy.

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \left[y^{(i)} \log \hat{p}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{p}^{(i)})\right]$$

Cross-entropy is the negative log-likelihood under a Bernoulli model, making it the maximum likelihood objective for binary classification. It produces stronger gradients than MSE when the prediction is confident but wrong.
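In practice this loss is computed directly from the logits rather than from $\hat{p}$, using a standard algebraic reformulation that avoids overflow; a minimal sketch with illustrative inputs:

```python
import numpy as np

def bce_from_logits(z, y):
    """Numerically stable binary cross-entropy from logits z:
    -[y log sigmoid(z) + (1-y) log(1 - sigmoid(z))]
    rewritten as max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

z = np.array([5.0, -3.0, 0.5])  # logits, i.e. pre-sigmoid scores
y = np.array([1.0, 0.0, 1.0])
print(bce_from_logits(z, y))
```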

Multi-class Classification

Output: softmax activation producing a probability distribution over KK classes.

$$\hat{p}_k = \text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

Loss: categorical cross-entropy (negative log-likelihood of the multinomial).

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K y_k^{(i)} \log \hat{p}_k^{(i)}$$

The softmax function is invariant to additive constants ($\text{softmax}(\mathbf{z}) = \text{softmax}(\mathbf{z} + c)$), so the logits $\mathbf{z}$ encode relative, not absolute, class scores. In practice, the maximum logit is subtracted for numerical stability before exponentiation.
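A minimal sketch of the max-subtraction trick and the resulting cross-entropy (integer-label format and batch values are illustrative):

```python
import numpy as np

def softmax(z):
    """Softmax with the max-logit subtraction described above:
    the shift changes nothing mathematically but prevents overflow."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood; labels are integer class indices."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.2]])
print(np.allclose(softmax(logits), softmax(logits + 100.0)))  # shift invariance
print(cross_entropy(logits, np.array([0, 1])))
```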


The Universal Approximation Theorem

Theorem (Cybenko, 1989; Hornik, 1991). A feedforward network with a single hidden layer of finite width and a non-polynomial activation function can approximate any continuous function on a compact subset of $\mathbb{R}^D$ to arbitrary precision. (Cybenko proved the sigmoidal case and Hornik broad generalizations; the non-polynomial characterization is due to Leshno et al., 1993.)

Formally: for any continuous $f: [0,1]^D \to \mathbb{R}$, any $\varepsilon > 0$, and any non-polynomial continuous activation $\phi$, there exist $n \in \mathbb{N}$, $\mathbf{W} \in \mathbb{R}^{n \times D}$, $\mathbf{v} \in \mathbb{R}^n$, $\mathbf{b} \in \mathbb{R}^n$ such that:

$$\sup_{\mathbf{x} \in [0,1]^D} \left| f(\mathbf{x}) - \sum_{j=1}^n v_j \phi(\mathbf{w}_j^\top \mathbf{x} + b_j) \right| < \varepsilon$$

What this means. Neural networks have sufficient representational capacity to approximate any reasonable target function. The theorem is existential: it guarantees the existence of suitable weights but says nothing about whether gradient descent can find them, or how many neurons are required. In practice, the required width for a single hidden layer may be exponentially large in the input dimension.
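The existence claim can be made concrete by sidestepping gradient descent entirely: freeze random hidden weights and solve for the output layer by least squares. A sketch, where the width, target function, weight scales, and seed are all illustrative choices:

```python
import numpy as np

# Fit a single-hidden-layer network to sin(2*pi*x) with random features:
# hidden weights are frozen at random values, output weights are solved
# in closed form.
rng = np.random.default_rng(0)
n = 200                                    # hidden width
x = np.linspace(0, 1, 500)[:, None]
f = np.sin(2 * np.pi * x[:, 0])

W = rng.normal(scale=10.0, size=(n, 1))    # hidden weights w_j
b = rng.uniform(-10.0, 10.0, size=n)       # hidden biases b_j
H = np.tanh(x @ W.T + b)                   # phi(w_j^T x + b_j)

v, *_ = np.linalg.lstsq(H, f, rcond=None)  # output weights v_j
print(np.max(np.abs(H @ v - f)))           # sup-error on the grid, typically small
```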

What this does not mean. The theorem does not guarantee:

  • That the required width is tractable
  • That gradient-based optimization will converge to the approximating solution
  • That the learned function will generalize to unseen data
  • Anything about the sample complexity of learning

The gap between approximation theory (what networks can represent) and learning theory (what networks will learn from finite data) is a central theme in deep learning theory.


Depth vs. Width

The universal approximation theorem shows that a single hidden layer suffices in principle, but deeper networks achieve the same approximation with exponentially fewer parameters in many cases.

Depth efficiency. Telgarsky (2016) proved a depth separation: there exist functions computable by deep networks of modest width that cannot be approximated by significantly shallower networks without exponentially many units. Intuitively, depth enables hierarchical composition: a deep network can build complex features by composing simple ones, while a shallow network must represent the same complexity in a single layer.
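The canonical construction behind such results is the sawtooth: composing a three-ReLU "hat" function with itself $k$ times produces $2^k$ linear pieces from only $3k$ units, whereas a one-hidden-layer ReLU network needs width on the order of the number of pieces. A sketch (the grid resolution and $k$ are illustrative, and this is the textbook construction, not Telgarsky's exact statement):

```python
import numpy as np

def hat(x):
    """Triangle 'hat' on [0,1], written with three ReLU units."""
    r = lambda t: np.maximum(0.0, t)
    return 2 * r(x) - 4 * r(x - 0.5) + 2 * r(x - 1.0)

x = np.linspace(0, 1, 2001)
y, k = x, 4
for _ in range(k):          # k-fold composition: a depth-k ReLU network
    y = hat(y)

# Count linear pieces by counting changes in the local slope.
slopes = np.diff(y) / np.diff(x)
pieces = 1 + np.sum(~np.isclose(np.diff(slopes), 0.0))
print(pieces)               # 2**k = 16 pieces from only 3k = 12 ReLU units
```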

Practical implications. Modern architectures are deep (tens to hundreds of layers) rather than wide. ResNets, transformers, and most production models achieve their performance through depth. The key enablers of training deep networks are residual connections (He et al., 2016), batch/layer normalization, and careful initialization schemes. These are covered in the backpropagation article.


Expressiveness of ReLU Networks

ReLU networks compute piecewise linear functions. Each neuron partitions the input space with a hyperplane (where $\mathbf{w}^\top \mathbf{x} + b = 0$), and the network output is a different linear function in each region of the resulting partition.

A single hidden layer with $n$ ReLU neurons can produce at most $\sum_{j=0}^{D} \binom{n}{j}$ linear regions in $\mathbb{R}^D$, the maximum number of regions into which $n$ hyperplanes can partition the space. A deep ReLU network with $L$ layers of width $n$ can realize on the order of $(n/D)^{(L-1)D} \cdot n^D$ linear regions (Montúfar et al., 2014), growing exponentially in depth. This is the formal basis for depth efficiency: deeper networks can carve the input space into exponentially more decision regions.
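Regions can be estimated empirically: each input region corresponds to a distinct pattern of active neurons, so counting distinct activation patterns over a dense grid gives a lower bound on the region count. A sketch with illustrative widths, depth, and grid (sampling undercounts):

```python
import numpy as np

# Estimate the linear regions of a small random ReLU net on [-1,1]^2
# by counting distinct activation patterns over a dense grid.
rng = np.random.default_rng(0)
D, width, depth = 2, 8, 3
Ws = [rng.normal(size=(width, D))] + [rng.normal(size=(width, width))
                                      for _ in range(depth - 1)]
bs = [rng.normal(size=width) for _ in range(depth)]

g = np.linspace(-1, 1, 300)
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

patterns = []
H = X
for W, b in zip(Ws, bs):
    Z = H @ W.T + b
    patterns.append(Z > 0)                # which neurons are active
    H = np.maximum(0.0, Z)
codes = np.concatenate(patterns, axis=1)
print(len(np.unique(codes, axis=0)))      # distinct regions hit by the grid
```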


Summary

  • Perceptron: single linear classifier, limited to linearly separable functions
  • MLP: composition of linear layers and nonlinearities; can approximate any continuous function
  • Activation functions: enable nonlinearity; ReLU is the default, GELU for transformers
  • Universal approximation: width suffices in theory; depth is more efficient in practice
  • Forward pass: sequential layer-by-layer computation; intermediate activations are learned features
  • Output/loss pairing: sigmoid + BCE for binary, softmax + cross-entropy for multi-class, linear + MSE for regression

The forward pass computes predictions; the next article covers backpropagation, which computes the gradients needed to train these networks via gradient descent.