Transformers and Attention

The transformer architecture (Vaswani et al., 2017) replaced recurrence with attention as the fundamental mechanism for sequence modeling. This article develops the architecture from the attention primitive through full encoder-decoder and decoder-only variants, covering positional encoding schemes, computational considerations, and the design choices that have made decoder-only transformers the dominant paradigm in modern language modeling.


Attention as a Differentiable Dictionary

Attention can be understood as a soft lookup in a differentiable dictionary. Given a query $\mathbf{q}$, a set of keys $\{\mathbf{k}_1, \ldots, \mathbf{k}_n\}$, and corresponding values $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$, attention computes a weighted combination of values where the weights reflect query-key similarity.

Scaled dot-product attention. For queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{m \times d_k}$, and values $V \in \mathbb{R}^{m \times d_v}$:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The matrix $QK^\top \in \mathbb{R}^{n \times m}$ contains all pairwise dot products between queries and keys. The softmax is applied row-wise, producing a row-stochastic matrix of attention weights $A \in \mathbb{R}^{n \times m}$ in which each row sums to 1. The output is then a convex combination of value vectors.

Why scale by $\sqrt{d_k}$? If the entries of $\mathbf{q}$ and $\mathbf{k}$ are independent with mean 0 and variance 1, their dot product $\mathbf{q}^\top \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$. For large $d_k$, the dot products grow in magnitude, pushing softmax inputs into regions where gradients are vanishingly small. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping the softmax in a regime with useful gradients throughout training.

The dictionary analogy. The query asks a question (“what information do I need?”), the keys describe the available entries (“here is what I contain”), and the values are the actual content to retrieve. Unlike a hard dictionary lookup that returns a single entry, attention returns a weighted mixture of all values, with weights determined by query-key compatibility. This soft retrieval is differentiable, enabling end-to-end training.
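As a concrete sketch of this soft lookup, here is scaled dot-product attention in plain NumPy (all names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, m) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # outputs are convex combinations of rows of V

# Toy example: 3 queries against 4 key/value pairs.
rng = np.random.default_rng(0)
out, A = attention(rng.normal(size=(3, 8)),
                   rng.normal(size=(4, 8)),
                   rng.normal(size=(4, 5)))
assert out.shape == (3, 5)
assert np.allclose(A.sum(axis=-1), 1.0)  # each row is a probability distribution
```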


Multi-Head Attention

A single attention function captures one type of relationship. Multi-head attention runs $h$ parallel attention operations, each with its own learned projections, allowing the model to jointly attend to information from different representation subspaces.

For head $i$:

$$\text{head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_\text{model} \times d_v}$. The heads are concatenated and projected:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O$$

where $W^O \in \mathbb{R}^{hd_v \times d_\text{model}}$.

Standard configuration. Typical choices are $h = 8$ to $16$ heads with $d_k = d_v = d_\text{model} / h$. This keeps the total computational cost comparable to single-head attention with full dimensionality. For example, with $d_\text{model} = 768$ and $h = 12$, each head operates in $d_k = 64$ dimensions.
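The bookkeeping of splitting $d_\text{model}$ into $h$ heads can be sketched in NumPy as follows (a minimal self-attention version assuming $d_k = d_v = d_\text{model}/h$; function and variable names are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Multi-head self-attention. All weight matrices are (d_model, d_model);
    the per-head projections are realized by reshaping the feature axis."""
    n, d_model = X.shape
    d_k = d_model // h

    def project(W):
        # (n, d_model) -> (h, n, d_k): one slice per head
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = project(W_q), project(W_k), project(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, h, n = 768, 12, 10
X = rng.normal(size=(n, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)]
out = multi_head_attention(X, *Ws, h=h)
assert out.shape == (n, d_model)
```

Note that the total work matches single-head attention at full width: $h$ heads of dimension $d_\text{model}/h$ cost the same as one head of dimension $d_\text{model}$.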

What different heads learn. Empirical analysis (Clark et al., 2019; Voita et al., 2019) shows that different heads specialize in different linguistic phenomena:

| Head type | Attends to |
|---|---|
| Positional | Adjacent or fixed-offset tokens |
| Syntactic | Dependency relations (subject-verb, modifier-noun) |
| Semantic | Coreference, entity relationships |
| Delimiter | Sentence boundaries, special tokens |

Not all heads are equally important. Voita et al. (2019) showed that many heads can be pruned with minimal performance degradation, suggesting redundancy in standard configurations.


Self-Attention vs Cross-Attention

The attention mechanism can be deployed in different configurations depending on where queries, keys, and values originate.

Self-attention. All three inputs derive from the same sequence: $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $X \in \mathbb{R}^{n \times d_\text{model}}$. Each position attends to all positions in the same sequence, allowing the model to capture arbitrary pairwise dependencies regardless of distance. This is the core operation in both encoder and decoder blocks.

Cross-attention. Queries come from one sequence (typically the decoder), while keys and values come from another (typically the encoder output):

$$Q = X_\text{decoder} W^Q, \quad K = X_\text{encoder} W^K, \quad V = X_\text{encoder} W^V$$

This allows the decoder to selectively attend to relevant parts of the encoder representation. Cross-attention appears in encoder-decoder architectures (translation, summarization) and in multimodal models (attending to image features while generating text).

Causal (masked) self-attention. For autoregressive generation, position $i$ must not attend to positions $j > i$ (future tokens). This is enforced by adding a mask $M$ to the attention logits before softmax:

$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

$$\text{CausalAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)V$$

The $-\infty$ entries become 0 after softmax, effectively zeroing out attention to future positions. This preserves the autoregressive factorization $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$.
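A minimal NumPy sketch of the causal mask (illustrative, not an optimized implementation):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i may only attend to positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Entries strictly above the diagonal get -inf before the softmax.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)   # stability; -inf stays -inf
    w = np.exp(scores)                             # exp(-inf) = 0
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 4))
_, A = causal_attention(Q, K, V)
# Attention weights above the diagonal are exactly zero.
assert np.allclose(np.triu(A, k=1), 0.0)
```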


Positional Encoding

Self-attention is permutation-equivariant: shuffling the input sequence and applying attention yields the same result as applying attention and then shuffling. This means the architecture has no inherent notion of position. Positional encodings inject order information.

Sinusoidal Positional Encoding

Vaswani et al. (2017) proposed fixed encodings based on sinusoids at different frequencies:

$$\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_\text{model}}}\right) \qquad \text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_\text{model}}}\right)$$

Each pair of dimensions $(2i, 2i+1)$ oscillates at a different frequency, from wavelength $2\pi$ (for $i = 0$) to wavelength $\approx 10000 \cdot 2\pi$ (for $i = d_\text{model}/2 - 1$). The encoding is added to the token embeddings.

Why sinusoids? For any fixed offset $k$, the encoding at position $\text{pos} + k$ can be expressed as a linear function of the encoding at position $\text{pos}$. Specifically, $\text{PE}(\text{pos}+k)$ can be computed from $\text{PE}(\text{pos})$ via a rotation matrix that depends only on $k$, not on $\text{pos}$. This makes it possible for attention to learn relative position patterns from absolute position encodings.
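The sinusoidal table is straightforward to construct directly from the formulas; a NumPy sketch (illustrative names, assuming an even $d_\text{model}$):

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]          # (n, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d/2)
    angles = pos / (10000 ** (2 * i / d_model))    # (n, d/2)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_encoding(128, 64)
# Position 0 encodes as [sin 0, cos 0, ...] = [0, 1, 0, 1, ...]
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```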

Learned Positional Embeddings

BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) use learned position embeddings: a trainable matrix $E_\text{pos} \in \mathbb{R}^{L_\text{max} \times d_\text{model}}$ where $L_\text{max}$ is the maximum sequence length. Simple and effective, but the model cannot generalize to positions beyond $L_\text{max}$ without extrapolation techniques.

Rotary Position Embeddings (RoPE)

Su et al. (2021) proposed encoding position by rotating query and key vectors in 2D subspaces. For a vector $\mathbf{x} \in \mathbb{R}^{d}$ at position $m$, RoPE applies a block-diagonal rotation:

$$f(\mathbf{x}, m) = \begin{pmatrix} x_1 \cos m\theta_1 - x_2 \sin m\theta_1 \\ x_1 \sin m\theta_1 + x_2 \cos m\theta_1 \\ \vdots \\ x_{d-1} \cos m\theta_{d/2} - x_d \sin m\theta_{d/2} \\ x_{d-1} \sin m\theta_{d/2} + x_d \cos m\theta_{d/2} \end{pmatrix}$$

where $\theta_j = 10000^{-2j/d}$. The key property is that the dot product between rotated queries and keys depends only on relative position:

$$f(\mathbf{q}, m)^\top f(\mathbf{k}, n) = g(\mathbf{q}, \mathbf{k}, m - n)$$

RoPE encodes relative position without additive embeddings, preserving the dot-product structure of attention. It has become the standard choice for modern LLMs (LLaMA, Mistral, Qwen).
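The rotation and its relative-position property can be checked numerically. The sketch below applies RoPE to a single vector, pairing consecutive dimensions as in the matrix above (names are illustrative):

```python
import numpy as np

def rope(x, m):
    """Rotate each consecutive pair (x_{2j}, x_{2j+1}) by angle m * theta_j,
    with theta_j = 10000^(-2j/d)."""
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The dot product depends only on the relative offset m - n:
a = rope(q, 10) @ rope(k, 3)      # offset 7
b = rope(q, 107) @ rope(k, 100)   # offset 7, shifted absolute positions
assert np.allclose(a, b)
```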

ALiBi (Attention with Linear Biases)

Press et al. (2022) take a different approach: no positional encoding at all. Instead, ALiBi adds a static, non-learned bias to attention scores based on the distance between query and key positions:

$$\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} - \lambda \cdot |i - j|\right)$$

where $\lambda$ is a head-specific slope (set geometrically, not learned). Closer tokens receive higher attention scores. ALiBi enables strong length extrapolation with zero additional parameters.
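A sketch of the ALiBi bias tensor, assuming the geometric slope schedule $2^{-8i/h}$ for a power-of-two head count (the symmetric $|i - j|$ distance matches the formula above; in a causal model only the $j \leq i$ entries survive the mask anyway):

```python
import numpy as np

def alibi_bias(n, num_heads):
    """Static ALiBi bias -slope_h * |i - j| per head; slopes form a
    geometric sequence 2^(-8/h), 2^(-16/h), ... (assumed schedule)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])  # |i - j|
    return -slopes[:, None, None] * dist                          # (h, n, n)

bias = alibi_bias(16, 8)
assert bias.shape == (8, 16, 16)
assert np.all(np.diag(bias[0]) == 0)   # zero penalty at distance 0
assert bias[0, 0, 1] == -0.5           # first head's slope is 1/2 for h = 8
```

This tensor is simply added to the attention logits before the softmax; nothing about it is learned or position-embedded.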


The Transformer Block

The fundamental building block of the transformer stacks attention with a position-wise feedforward network, connected by residual connections and layer normalization.

Pre-Norm Transformer Block

The modern standard (pre-norm) arrangement:

$$\mathbf{h}' = \mathbf{x} + \text{MultiHead}(\text{LN}(\mathbf{x}))$$
$$\mathbf{h} = \mathbf{h}' + \text{FFN}(\text{LN}(\mathbf{h}'))$$

where $\text{LN}$ is layer normalization and $\text{FFN}$ is a position-wise feedforward network:

$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}W_1 + \mathbf{b}_1)W_2 + \mathbf{b}_2$$

with $W_1 \in \mathbb{R}^{d_\text{model} \times d_\text{ff}}$ and $W_2 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}$. The standard expansion factor is $d_\text{ff} = 4 \cdot d_\text{model}$.
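Putting the two sublayer equations together, a minimal pre-norm block in NumPy (an identity function stands in for multi-head attention, and GELU uses the common tanh approximation; all names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pre_norm_block(x, attn, W1, b1, W2, b2):
    """h' = x + Attn(LN(x));  h = h' + FFN(LN(h'))."""
    h = x + attn(layer_norm(x))
    return h + gelu(layer_norm(h) @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, n = 32, 6
x = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, 4 * d)) * 0.02, np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)) * 0.02, np.zeros(d)
identity_attn = lambda z: z   # stand-in for the multi-head attention sublayer
out = pre_norm_block(x, identity_attn, W1, b1, W2, b2)
assert out.shape == (n, d)
```

Note how both sublayers are pure residual additions: with small initial weights, the block is close to an identity map, which is exactly the property that makes pre-norm stacks easy to train at depth.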

Pre-Norm vs Post-Norm

The original transformer (Vaswani et al., 2017) applied layer normalization after the residual connection (post-norm):

$$\mathbf{h} = \text{LN}(\mathbf{x} + \text{MultiHead}(\mathbf{x}))$$

Pre-norm places the normalization before the sublayer. This has a significant practical advantage: at initialization, the residual path is approximately an identity function, so gradients flow cleanly through the full depth of the network. Post-norm requires careful learning rate warmup to avoid training instability. Pre-norm is now standard in virtually all large-scale models (GPT-3, PaLM, LLaMA).

Some architectures (PaLM) further simplify by using RMSNorm instead of LayerNorm, removing the mean-centering step.

Parameter Count per Block

For a single transformer block with $d_\text{model} = d$ and $d_\text{ff} = 4d$:

| Component | Parameters |
|---|---|
| $W^Q, W^K, W^V$ (attention) | $3d^2$ |
| $W^O$ (output projection) | $d^2$ |
| $W_1$ (FFN up-projection) | $4d^2$ |
| $W_2$ (FFN down-projection) | $4d^2$ |
| LayerNorm ($\times 2$) | $4d$ |
| Total per block | $\approx 12d^2$ |

For GPT-3 ($d = 12288$, 96 layers), this gives roughly $12 \times 12288^2 \times 96 \approx 174\text{B}$ parameters, consistent with the reported 175B (including embeddings).
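The estimate is easy to check by direct arithmetic:

```python
# Per-block parameters with d_model = d and d_ff = 4d:
# attention projections 3d^2, output d^2, FFN up 4d^2, FFN down 4d^2,
# plus two LayerNorms (scale + bias) at 2d each.
d, n_layers = 12288, 96
per_block = 3 * d**2 + d**2 + 4 * d**2 + 4 * d**2 + 4 * d
total = per_block * n_layers
assert round(total / 1e9) == 174   # ~174B before embeddings
```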


Encoder-Decoder Architecture

The original transformer (Vaswani et al., 2017) follows an encoder-decoder structure designed for sequence-to-sequence tasks.

Encoder. A stack of $N$ identical blocks, each containing:

  1. Multi-head self-attention (bidirectional — every position attends to every other position)
  2. Position-wise FFN
  3. Residual connections and layer normalization around each sublayer

The encoder processes the full input sequence in parallel, producing a contextualized representation $H_\text{enc} \in \mathbb{R}^{n \times d_\text{model}}$.

Decoder. A stack of $N$ identical blocks, each containing:

  1. Masked multi-head self-attention (causal — prevents attending to future positions)
  2. Multi-head cross-attention (queries from decoder, keys/values from $H_\text{enc}$)
  3. Position-wise FFN
  4. Residual connections and layer normalization

The decoder generates output tokens autoregressively, attending to both its own previous outputs (via masked self-attention) and the encoder representation (via cross-attention).

Use cases. Encoder-decoder architectures are natural for tasks with distinct input and output sequences: machine translation, abstractive summarization, speech recognition. T5 (Raffel et al., 2020) cast all NLP tasks into a text-to-text format using this architecture. BART (Lewis et al., 2020) combined a bidirectional encoder with an autoregressive decoder for denoising pretraining.


Decoder-Only Architecture

GPT (Radford et al., 2018) simplified the transformer to a decoder-only stack: $N$ blocks of masked self-attention and FFN, with no encoder and no cross-attention. The model is trained with a causal language modeling objective:

$$\mathcal{L} = -\sum_{i=1}^{n} \log p(x_i \mid x_1, \ldots, x_{i-1})$$
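The objective is ordinary token-level cross-entropy over next-token predictions; a NumPy sketch over per-position logits (names are illustrative):

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Negative log-likelihood of targets under the model's next-token
    distributions. logits: (n, vocab); targets: (n,) token ids, where
    logits[i] is the model's prediction for targets[i] given the prefix."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 100))            # 6 positions, vocab of 100
targets = rng.integers(0, 100, size=6)
loss = causal_lm_loss(logits, targets)
assert loss > 0
```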

Why decoder-only dominates. Several factors explain the convergence toward decoder-only:

  1. Architectural simplicity. One block type, one attention pattern. Fewer design decisions, easier to scale.
  2. Unified pretraining and generation. The same autoregressive objective serves both “understanding” (in-context) and generation. No separate pretraining objectives needed.
  3. In-context learning. Decoder-only models naturally support few-shot prompting — examples and task are concatenated into a single sequence, and the causal structure processes them left-to-right.
  4. Scaling properties. Empirically, decoder-only models show smooth, predictable scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). The simplicity of the architecture reduces confounds in scaling analysis.

GPT-2 (Radford et al., 2019) demonstrated zero-shot task transfer. GPT-3 (Brown et al., 2020) demonstrated few-shot in-context learning at 175B parameters. This line of work established decoder-only transformers as the default architecture for large language models.


Encoder-Only Architecture

BERT (Devlin et al., 2018) uses only the encoder with bidirectional self-attention. Every position can attend to every other position — there is no causal mask.

Pretraining objectives:

  • Masked Language Modeling (MLM). Randomly mask 15% of input tokens; the model predicts the masked tokens from their bidirectional context. This forces the model to build rich contextual representations.
  • Next Sentence Prediction (NSP). Classify whether two segments are consecutive (later shown to be unnecessary by RoBERTa).

Strengths. Bidirectional attention means each token’s representation is informed by the full context, making BERT representations powerful for:

  • Classification (sentiment, NLI)
  • Token-level tasks (NER, POS tagging)
  • Sentence embeddings and retrieval
  • Extractive question answering

Limitation. Encoder-only models cannot generate text autoregressively. The bidirectional attention pattern means there is no natural left-to-right factorization of the output distribution. This limits their applicability to discriminative and embedding tasks.


Why Transformers Replaced RNNs

Recurrent neural networks (LSTMs, GRUs) process sequences token by token, maintaining a hidden state that summarizes the history. Transformers replaced them for three fundamental reasons.

Parallelism. RNNs have $O(n)$ sequential operations: each time step depends on the previous hidden state. Self-attention computes all pairwise interactions in $O(1)$ sequential steps (with $O(n^2)$ parallel work). On modern hardware with massive parallelism, this is dramatically faster for training.

Long-range dependencies. In RNNs, information from token $i$ must pass through $O(n)$ transformations to reach token $j$, with vanishing or exploding gradients along the way. In self-attention, every pair of positions is connected by a single attention operation — the path length is $O(1)$.

Representational capacity. Each layer of self-attention can implement arbitrary pairwise interactions, while each RNN step applies the same transition function. Transformers can express a richer class of sequence-to-sequence functions per layer.

The quadratic cost. The obvious drawback: self-attention is $O(n^2)$ in both time and memory with respect to sequence length. For a sequence of length $n$ with dimension $d$, computing the attention matrix requires $O(n^2 d)$ operations and $O(n^2)$ memory. This motivates several lines of work:

| Approach | Complexity | Mechanism |
|---|---|---|
| Sparse attention (Beltagy et al., 2020) | $O(n\sqrt{n})$ | Attend to local window + global tokens |
| Linear attention (Katharopoulos et al., 2020) | $O(n)$ | Kernel approximation of softmax |
| Flash Attention (Dao et al., 2022) | $O(n^2)$ time, $O(n)$ memory | IO-aware tiling, no materialized attention matrix |

Flash Attention does not reduce the asymptotic time complexity but eliminates the memory bottleneck, making standard attention practical for much longer sequences.


Computational Considerations

KV Cache

During autoregressive generation, the model produces one token at a time. Without caching, generating token $t$ requires recomputing the keys and values for all $t$ previous tokens from scratch, so the redundant projection work alone grows as $O(n^2)$ over a sequence of length $n$ (and the repeated attention over the growing prefix costs $O(n^3)$ in total).

The KV cache stores the key and value projections from all previous positions. At each generation step, only the new token's query, key, and value are computed; the new key and value are appended to the cache, and attention is computed between the single new query and all cached keys/values. This reduces the per-step projection cost from $O(n \cdot d^2)$ to $O(d^2)$, while the attention dot products against the cached keys still cost $O(n \cdot d)$ per step.

Memory cost. For a model with $L$ layers, $h$ heads, and $d_k$ dimensions per head, the KV cache for a sequence of length $n$ holds $2 \cdot L \cdot h \cdot d_k \cdot n$ values (the factor of 2 covers keys and values). At the dimensions of LLaMA-2 70B ($L = 80$, $h = 64$, $d_k = 128$) with full multi-head attention, a sequence of length $n = 4096$ in FP16 requires about 10 GiB, a significant fraction of total GPU memory, and the cost scales linearly with batch size.
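The figure follows directly from the formula (sizes here are for the stated configuration, assuming full multi-head attention):

```python
# KV cache size = 2 (keys and values) * L * h * d_k * n * bytes per element.
L, h, d_k, n = 80, 64, 128, 4096   # LLaMA-2-70B-like dimensions, full MHA
bytes_fp16 = 2
cache_bytes = 2 * L * h * d_k * n * bytes_fp16
assert cache_bytes / 2**30 == 10.0  # 10 GiB for a single 4096-token sequence
```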

Flash Attention

Standard attention materializes the full $n \times n$ attention matrix, requiring $O(n^2)$ memory. Flash Attention (Dao et al., 2022) reformulates the computation using tiling and online softmax to avoid ever storing the full matrix.

Key ideas:

  1. Tiling. Partition $Q$, $K$, $V$ into blocks that fit in GPU SRAM (fast on-chip memory). Compute partial attention within tiles.
  2. Online softmax. Maintain running statistics (max and sum) to compute the exact softmax incrementally across tiles, without needing the full row of logits.
  3. Recomputation. In the backward pass, recompute attention weights from $Q$, $K$, $V$ rather than storing them. This trades compute for memory.

The result: exact attention (no approximation) with $O(n)$ memory instead of $O(n^2)$. Flash Attention achieves 2-4x wall-clock speedup over standard PyTorch attention by reducing HBM (high-bandwidth memory) reads/writes, which are the true bottleneck on modern GPUs.
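The online-softmax idea at the heart of this algorithm can be demonstrated for a single query row: process the scores block by block, keeping only a running max, a running normalizer, and a running weighted sum (a NumPy sketch of the recurrence, not the tiled GPU kernel):

```python
import numpy as np

def online_softmax_weighted_sum(scores, V, block=4):
    """Streaming, numerically stable softmax(scores) @ V for one query row,
    never holding the full row of probabilities at once."""
    m = -np.inf                       # running max of seen scores
    s = 0.0                           # running sum of exp(score - m)
    acc = np.zeros(V.shape[-1])       # running weighted sum of values
    for start in range(0, len(scores), block):
        sc = scores[start:start + block]
        vb = V[start:start + block]
        m_new = max(m, sc.max())
        # Rescale previous statistics to the new max, then absorb the block.
        correction = np.exp(m - m_new)
        s = s * correction + np.exp(sc - m_new).sum()
        acc = acc * correction + np.exp(sc - m_new) @ vb
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
scores = rng.normal(size=16)
V = rng.normal(size=(16, 8))
p = np.exp(scores - scores.max())
ref = (p / p.sum()) @ V               # reference: full-row softmax
assert np.allclose(online_softmax_weighted_sum(scores, V), ref)
```

The `correction` factor is the key trick: earlier partial sums are retroactively rescaled whenever a larger score appears, so the final result is exact, not approximate.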

Grouped Query Attention (GQA)

Multi-head attention requires separate $K$ and $V$ projections for each head, which inflates the KV cache. Grouped Query Attention (Ainslie et al., 2023) reduces this by sharing key and value heads across groups of query heads.

| Variant | KV heads | Query heads | KV cache size |
|---|---|---|---|
| Multi-Head Attention (MHA) | $h$ | $h$ | $2Lhd_k n$ |
| Multi-Query Attention (MQA) | $1$ | $h$ | $2Ld_k n$ |
| Grouped Query Attention (GQA) | $g$ | $h$ | $2Lgd_k n$ |

GQA with $g$ groups interpolates between MHA ($g = h$) and MQA ($g = 1$). LLaMA 2 70B uses GQA with 8 KV heads and 64 query heads ($g = 8$), reducing KV cache by 8x relative to MHA with minimal quality degradation. Mistral 7B similarly uses GQA. The reduction in KV cache size directly increases the maximum batch size that fits in memory during inference, improving throughput.
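At the shape level, GQA shares each of the $g$ KV heads among $h/g$ consecutive query heads, which the toy implementation below realizes by simply repeating the KV heads (illustrative NumPy, ignoring masking and the learned projections):

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_groups):
    """Q: (h, n, d_k); K, V: (g, n, d_k). Each KV head serves h/g query heads;
    here we expand KV heads by repetition to make the sharing explicit."""
    h, _, d_k = Q.shape
    repeat = h // n_groups
    K = np.repeat(K, repeat, axis=0)    # (h, n, d_k)
    V = np.repeat(V, repeat, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
h, g, n, d_k = 8, 2, 5, 16
Q = rng.normal(size=(h, n, d_k))
K = rng.normal(size=(g, n, d_k))       # only g KV heads are stored/cached
V = rng.normal(size=(g, n, d_k))
out = grouped_query_attention(Q, K, V, g)
assert out.shape == (h, n, d_k)
```

Only the $(g, n, d_k)$ tensors need to live in the KV cache; the repetition is a cheap view-style expansion at attention time.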


Key Papers

| Paper | Year | Contribution |
|---|---|---|
| Vaswani et al., Attention Is All You Need | 2017 | The transformer architecture: scaled dot-product attention, multi-head attention, encoder-decoder structure |
| Devlin et al., BERT | 2018 | Bidirectional encoder with masked language modeling; demonstrated that pretraining + finetuning dominates task-specific architectures |
| Radford et al., Improving Language Understanding by Generative Pre-Training (GPT) | 2018 | Decoder-only transformer pretrained with causal LM; showed generative pretraining transfers to discriminative tasks |
| Radford et al., Language Models are Unsupervised Multitask Learners (GPT-2) | 2019 | Scaled decoder-only to 1.5B parameters; demonstrated zero-shot task transfer |
| Brown et al., Language Models are Few-Shot Learners (GPT-3) | 2020 | 175B-parameter decoder-only model; established in-context learning as a paradigm |
| Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding | 2021 | RoPE: relative position encoding via rotation, now standard in open-source LLMs |
| Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 2022 | IO-aware attention algorithm achieving $O(n)$ memory; changed how attention is implemented in practice |
| Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | 2023 | Grouped Query Attention for efficient KV caching; adopted by LLaMA 2, Mistral |