Transformers and Attention

The transformer architecture (Vaswani et al., 2017) replaced recurrence with attention as the fundamental mechanism for sequence modeling. This article develops the architecture from the attention primitive through full encoder-decoder and decoder-only variants, covering positional encoding schemes, computational considerations, and the design choices that have made decoder-only transformers the dominant paradigm in modern language modeling.


Attention as a Differentiable Dictionary

Attention can be understood as a soft lookup in a differentiable dictionary. Given a query $\mathbf{q}$, a set of keys $\{\mathbf{k}_1, \ldots, \mathbf{k}_n\}$, and corresponding values $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$, attention computes a weighted combination of values where the weights reflect query-key similarity.

Scaled dot-product attention. For queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{m \times d_k}$, and values $V \in \mathbb{R}^{m \times d_v}$:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The matrix $QK^\top \in \mathbb{R}^{n \times m}$ contains all pairwise dot products between queries and keys. The softmax is applied row-wise, producing a row-stochastic matrix of attention weights $A \in \mathbb{R}^{n \times m}$ in which each row sums to 1. The output is then a convex combination of value vectors.

Why scale by $\sqrt{d_k}$? If the entries of $\mathbf{q}$ and $\mathbf{k}$ are independent with mean 0 and variance 1, their dot product $\mathbf{q}^\top \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$. For large $d_k$, the dot products grow in magnitude, pushing softmax inputs into regions where gradients are vanishingly small. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping the softmax in a regime with useful gradients throughout training.

The dictionary analogy. The query asks a question (“what information do I need?”), the keys describe the available entries (“here is what I contain”), and the values are the actual content to retrieve. Unlike a hard dictionary lookup that returns a single entry, attention returns a weighted mixture of all values, with weights determined by query-key compatibility. This soft retrieval is differentiable, enabling end-to-end training.
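As a concrete sketch of this soft lookup, here is scaled dot-product attention in plain NumPy (all names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, m) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # outputs are convex combinations of rows of V

# Toy example: 3 queries against 4 key/value pairs.
rng = np.random.default_rng(0)
out, A = attention(rng.normal(size=(3, 8)),
                   rng.normal(size=(4, 8)),
                   rng.normal(size=(4, 5)))
assert out.shape == (3, 5)
assert np.allclose(A.sum(axis=-1), 1.0)  # each row is a probability distribution
```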


Multi-Head Attention

A single attention function captures one type of relationship. Multi-head attention runs $h$ parallel attention operations, each with its own learned projections, allowing the model to jointly attend to information from different representation subspaces.

For head $i$:

$$\text{head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_\text{model} \times d_v}$. The heads are concatenated and projected:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O$$

where $W^O \in \mathbb{R}^{hd_v \times d_\text{model}}$.

Standard configuration. Typical choices are $h = 8$ to $16$ heads with $d_k = d_v = d_\text{model} / h$. This keeps the total computational cost comparable to single-head attention with full dimensionality. For example, with $d_\text{model} = 768$ and $h = 12$, each head operates in $d_k = 64$ dimensions.
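The bookkeeping of splitting $d_\text{model}$ into $h$ heads can be sketched in NumPy as follows (a minimal self-attention version assuming $d_k = d_v = d_\text{model}/h$; function and variable names are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Multi-head self-attention. All weight matrices are (d_model, d_model);
    the per-head projections are realized by reshaping the feature axis."""
    n, d_model = X.shape
    d_k = d_model // h

    def project(W):
        # (n, d_model) -> (h, n, d_k): one slice per head
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = project(W_q), project(W_k), project(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, h, n = 768, 12, 10
X = rng.normal(size=(n, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)]
out = multi_head_attention(X, *Ws, h=h)
assert out.shape == (n, d_model)
```

Note that the total work matches single-head attention at full width: $h$ heads of dimension $d_\text{model}/h$ cost the same as one head of dimension $d_\text{model}$.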

What different heads learn. Empirical analysis (Clark et al., 2019; Voita et al., 2019) shows that different heads specialize in different linguistic phenomena:

| Head type | Attends to |
|---|---|
| Positional | Adjacent or fixed-offset tokens |
| Syntactic | Dependency relations (subject-verb, modifier-noun) |
| Semantic | Coreference, entity relationships |
| Delimiter | Sentence boundaries, special tokens |

Not all heads are equally important. Voita et al. (2019) showed that many heads can be pruned with minimal performance degradation, suggesting redundancy in standard configurations.


Self-Attention vs Cross-Attention

The attention mechanism can be deployed in different configurations depending on where queries, keys, and values originate.

Self-attention. All three inputs derive from the same sequence: $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $X \in \mathbb{R}^{n \times d_\text{model}}$. Each position attends to all positions in the same sequence, allowing the model to capture arbitrary pairwise dependencies regardless of distance. This is the core operation in both encoder and decoder blocks.

Cross-attention. Queries come from one sequence (typically the decoder), while keys and values come from another (typically the encoder output):

$$Q = X_\text{decoder} W^Q, \quad K = X_\text{encoder} W^K, \quad V = X_\text{encoder} W^V$$

This allows the decoder to selectively attend to relevant parts of the encoder representation. Cross-attention appears in encoder-decoder architectures (translation, summarization) and in multimodal models (attending to image features while generating text).

Causal (masked) self-attention. For autoregressive generation, position $i$ must not attend to positions $j > i$ (future tokens). This is enforced by adding a mask $M$ to the attention logits before softmax:

$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

$$\text{CausalAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)V$$

The $-\infty$ entries become 0 after softmax, effectively zeroing out attention to future positions. This preserves the autoregressive factorization $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$.
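A minimal NumPy sketch of the causal mask (illustrative, not an optimized implementation):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i may only attend to positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Entries strictly above the diagonal get -inf before the softmax.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)   # stability; -inf stays -inf
    w = np.exp(scores)                             # exp(-inf) = 0
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 4))
_, A = causal_attention(Q, K, V)
# Attention weights above the diagonal are exactly zero.
assert np.allclose(np.triu(A, k=1), 0.0)
```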


Positional Encoding

Self-attention is permutation-equivariant: shuffling the input sequence and applying attention yields the same result as applying attention and then shuffling. This means the architecture has no inherent notion of position. Positional encodings inject order information.

Sinusoidal Positional Encoding

Vaswani et al. (2017) proposed fixed encodings based on sinusoids at different frequencies:

$$\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_\text{model}}}\right) \qquad \text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_\text{model}}}\right)$$

Each pair of dimensions $(2i, 2i+1)$ oscillates at a different frequency, from wavelength $2\pi$ (for $i = 0$) to wavelength $\approx 10000 \cdot 2\pi$ (for $i = d_\text{model}/2 - 1$). The encoding is added to the token embeddings.

Why sinusoids? For any fixed offset $k$, the encoding at position $\text{pos} + k$ can be expressed as a linear function of the encoding at position $\text{pos}$. Specifically, $\text{PE}(\text{pos}+k)$ can be computed from $\text{PE}(\text{pos})$ via a rotation matrix that depends only on $k$, not on $\text{pos}$. This makes it possible for attention to learn relative position patterns from absolute position encodings.
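The sinusoidal table is straightforward to construct directly from the formulas; a NumPy sketch (illustrative names, assuming an even $d_\text{model}$):

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]          # (n, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d/2)
    angles = pos / (10000 ** (2 * i / d_model))    # (n, d/2)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_encoding(128, 64)
# Position 0 encodes as [sin 0, cos 0, ...] = [0, 1, 0, 1, ...]
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```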

Learned Positional Embeddings

BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) use learned position embeddings: a trainable matrix $E_\text{pos} \in \mathbb{R}^{L_\text{max} \times d_\text{model}}$ where $L_\text{max}$ is the maximum sequence length. Simple and effective, but the model cannot generalize to positions beyond $L_\text{max}$ without extrapolation techniques.

Rotary Position Embeddings (RoPE)

Su et al. (2021) proposed encoding position by rotating query and key vectors in 2D subspaces. For a vector $\mathbf{x} \in \mathbb{R}^{d}$ at position $m$, RoPE applies a block-diagonal rotation:

$$f(\mathbf{x}, m) = \begin{pmatrix} x_1 \cos m\theta_1 - x_2 \sin m\theta_1 \\ x_1 \sin m\theta_1 + x_2 \cos m\theta_1 \\ \vdots \\ x_{d-1} \cos m\theta_{d/2} - x_d \sin m\theta_{d/2} \\ x_{d-1} \sin m\theta_{d/2} + x_d \cos m\theta_{d/2} \end{pmatrix}$$

where $\theta_j = 10000^{-2j/d}$. The key property is that the dot product between rotated queries and keys depends only on relative position:

$$f(\mathbf{q}, m)^\top f(\mathbf{k}, n) = g(\mathbf{q}, \mathbf{k}, m - n)$$

RoPE encodes relative position without additive embeddings, preserving the dot-product structure of attention. It has become the standard choice for modern LLMs (LLaMA, Mistral, Qwen).
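The rotation and its relative-position property can be checked numerically. The sketch below applies RoPE to a single vector, pairing consecutive dimensions as in the matrix above (names are illustrative):

```python
import numpy as np

def rope(x, m):
    """Rotate each consecutive pair (x_{2j}, x_{2j+1}) by angle m * theta_j,
    with theta_j = 10000^(-2j/d)."""
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The dot product depends only on the relative offset m - n:
a = rope(q, 10) @ rope(k, 3)      # offset 7
b = rope(q, 107) @ rope(k, 100)   # offset 7, shifted absolute positions
assert np.allclose(a, b)
```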

ALiBi (Attention with Linear Biases)

Press et al. (2022) take a different approach: no positional encoding at all. Instead, ALiBi adds a static, non-learned bias to attention scores based on the distance between query and key positions:

$$\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} - \lambda \cdot |i - j|\right)$$

where $\lambda$ is a head-specific slope (set geometrically, not learned). Closer tokens receive higher attention scores. ALiBi enables strong length extrapolation with zero additional parameters.
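A sketch of the ALiBi bias tensor, assuming the geometric slope schedule $2^{-8i/h}$ for a power-of-two head count (the symmetric $|i - j|$ distance matches the formula above; in a causal model only the $j \leq i$ entries survive the mask anyway):

```python
import numpy as np

def alibi_bias(n, num_heads):
    """Static ALiBi bias -slope_h * |i - j| per head; slopes form a
    geometric sequence 2^(-8/h), 2^(-16/h), ... (assumed schedule)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])  # |i - j|
    return -slopes[:, None, None] * dist                          # (h, n, n)

bias = alibi_bias(16, 8)
assert bias.shape == (8, 16, 16)
assert np.all(np.diag(bias[0]) == 0)   # zero penalty at distance 0
assert bias[0, 0, 1] == -0.5           # first head's slope is 1/2 for h = 8
```

This tensor is simply added to the attention logits before the softmax; nothing about it is learned or position-embedded.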


The Transformer Block

The fundamental building block of the transformer stacks attention with a position-wise feedforward network, connected by residual connections and layer normalization.

Pre-Norm Transformer Block

The modern standard (pre-norm) arrangement:

$$\mathbf{h}' = \mathbf{x} + \text{MultiHead}(\text{LN}(\mathbf{x}))$$
$$\mathbf{h} = \mathbf{h}' + \text{FFN}(\text{LN}(\mathbf{h}'))$$

where $\text{LN}$ is layer normalization and $\text{FFN}$ is a position-wise feedforward network:

$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}W_1 + \mathbf{b}_1)W_2 + \mathbf{b}_2$$

with $W_1 \in \mathbb{R}^{d_\text{model} \times d_\text{ff}}$ and $W_2 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}$. The standard expansion factor is $d_\text{ff} = 4 \cdot d_\text{model}$.
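Putting the two sublayer equations together, a minimal pre-norm block in NumPy (an identity function stands in for multi-head attention, and GELU uses the common tanh approximation; all names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pre_norm_block(x, attn, W1, b1, W2, b2):
    """h' = x + Attn(LN(x));  h = h' + FFN(LN(h'))."""
    h = x + attn(layer_norm(x))
    return h + gelu(layer_norm(h) @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, n = 32, 6
x = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, 4 * d)) * 0.02, np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)) * 0.02, np.zeros(d)
identity_attn = lambda z: z   # stand-in for the multi-head attention sublayer
out = pre_norm_block(x, identity_attn, W1, b1, W2, b2)
assert out.shape == (n, d)
```

Note how both sublayers are pure residual additions: with small initial weights, the block is close to an identity map, which is exactly the property that makes pre-norm stacks easy to train at depth.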

Pre-Norm vs Post-Norm

The original transformer (Vaswani et al., 2017) applied layer normalization after the residual connection (post-norm):

$$\mathbf{h} = \text{LN}(\mathbf{x} + \text{MultiHead}(\mathbf{x}))$$

Pre-norm places the normalization before the sublayer. This has a significant practical advantage: at initialization, the residual path is approximately an identity function, so gradients flow cleanly through the full depth of the network. Post-norm requires careful learning rate warmup to avoid training instability. Pre-norm is now standard in virtually all large-scale models (GPT-3, PaLM, LLaMA).

Some architectures (PaLM) further simplify by using RMSNorm instead of LayerNorm, removing the mean-centering step.

Parameter Count per Block

For a single transformer block with $d_\text{model} = d$ and $d_\text{ff} = 4d$:

| Component | Parameters |
|---|---|
| $W^Q, W^K, W^V$ (attention) | $3d^2$ |
| $W^O$ (output projection) | $d^2$ |
| $W_1$ (FFN up-projection) | $4d^2$ |
| $W_2$ (FFN down-projection) | $4d^2$ |
| LayerNorm ($\times 2$) | $4d$ |
| Total per block | $\approx 12d^2$ |

For GPT-3 ($d = 12288$, 96 layers), this gives roughly $12 \times 12288^2 \times 96 \approx 174\text{B}$ parameters, consistent with the reported 175B (including embeddings).
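The estimate is easy to check by direct arithmetic:

```python
# Per-block parameters with d_model = d and d_ff = 4d:
# attention projections 3d^2, output d^2, FFN up 4d^2, FFN down 4d^2,
# plus two LayerNorms (scale + bias) at 2d each.
d, n_layers = 12288, 96
per_block = 3 * d**2 + d**2 + 4 * d**2 + 4 * d**2 + 4 * d
total = per_block * n_layers
assert round(total / 1e9) == 174   # ~174B before embeddings
```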


Encoder-Decoder Architecture

The original transformer (Vaswani et al., 2017) follows an encoder-decoder structure designed for sequence-to-sequence tasks.

Encoder. A stack of $N$ identical blocks, each containing:

  1. Multi-head self-attention (bidirectional — every position attends to every other position)
  2. Position-wise FFN
  3. Residual connections and layer normalization around each sublayer

The encoder processes the full input sequence in parallel, producing a contextualized representation $H_\text{enc} \in \mathbb{R}^{n \times d_\text{model}}$.

Decoder. A stack of $N$ identical blocks, each containing:

  1. Masked multi-head self-attention (causal — prevents attending to future positions)
  2. Multi-head cross-attention (queries from decoder, keys/values from $H_\text{enc}$)
  3. Position-wise FFN
  4. Residual connections and layer normalization

The decoder generates output tokens autoregressively, attending to both its own previous outputs (via masked self-attention) and the encoder representation (via cross-attention).

Use cases. Encoder-decoder architectures are natural for tasks with distinct input and output sequences: machine translation, abstractive summarization, speech recognition. T5 (Raffel et al., 2020) cast all NLP tasks into a text-to-text format using this architecture. BART (Lewis et al., 2020) combined a bidirectional encoder with an autoregressive decoder for denoising pretraining.


Decoder-Only Architecture

GPT (Radford et al., 2018) simplified the transformer to a decoder-only stack: $N$ blocks of masked self-attention and FFN, with no encoder and no cross-attention. The model is trained with a causal language modeling objective:

$$\mathcal{L} = -\sum_{i=1}^{n} \log p(x_i \mid x_1, \ldots, x_{i-1})$$
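The objective is ordinary token-level cross-entropy over next-token predictions; a NumPy sketch over per-position logits (names are illustrative):

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Negative log-likelihood of targets under the model's next-token
    distributions. logits: (n, vocab); targets: (n,) token ids, where
    logits[i] is the model's prediction for targets[i] given the prefix."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 100))            # 6 positions, vocab of 100
targets = rng.integers(0, 100, size=6)
loss = causal_lm_loss(logits, targets)
assert loss > 0
```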

Why decoder-only dominates. Several factors explain the convergence toward decoder-only:

  1. Architectural simplicity. One block type, one attention pattern. Fewer design decisions, easier to scale.
  2. Unified pretraining and generation. The same autoregressive objective serves both “understanding” (in-context) and generation. No separate pretraining objectives needed.
  3. In-context learning. Decoder-only models naturally support few-shot prompting — examples and task are concatenated into a single sequence, and the causal structure processes them left-to-right.
  4. Scaling properties. Empirically, decoder-only models show smooth, predictable scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). The simplicity of the architecture reduces confounds in scaling analysis.

GPT-2 (Radford et al., 2019) demonstrated zero-shot task transfer. GPT-3 (Brown et al., 2020) demonstrated few-shot in-context learning at 175B parameters. This line of work established decoder-only transformers as the default architecture for large language models.


Encoder-Only Architecture

BERT (Devlin et al., 2018) uses only the encoder with bidirectional self-attention. Every position can attend to every other position — there is no causal mask.

Pretraining objectives:

  • Masked Language Modeling (MLM). Randomly mask 15% of input tokens; the model predicts the masked tokens from their bidirectional context. This forces the model to build rich contextual representations.
  • Next Sentence Prediction (NSP). Classify whether two segments are consecutive (later shown to be unnecessary by RoBERTa).

Strengths. Bidirectional attention means each token’s representation is informed by the full context, making BERT representations powerful for:

  • Classification (sentiment, NLI)
  • Token-level tasks (NER, POS tagging)
  • Sentence embeddings and retrieval
  • Extractive question answering

Limitation. Encoder-only models cannot generate text autoregressively. The bidirectional attention pattern means there is no natural left-to-right factorization of the output distribution. This limits their applicability to discriminative and embedding tasks.


Why Transformers Replaced RNNs

Recurrent neural networks (LSTMs, GRUs) process sequences token by token, maintaining a hidden state that summarizes the history. Transformers replaced them for three fundamental reasons.

Parallelism. RNNs have $O(n)$ sequential operations: each time step depends on the previous hidden state. Self-attention computes all pairwise interactions in $O(1)$ sequential steps (with $O(n^2)$ parallel work). On modern hardware with massive parallelism, this is dramatically faster for training.

Long-range dependencies. In RNNs, information from token $i$ must pass through $O(n)$ transformations to reach token $j$, with vanishing or exploding gradients along the way. In self-attention, every pair of positions is connected by a single attention operation — the path length is $O(1)$.

Representational capacity. Each layer of self-attention can implement arbitrary pairwise interactions, while each RNN step applies the same transition function. Transformers can express a richer class of sequence-to-sequence functions per layer.

The quadratic cost. The obvious drawback: self-attention is $O(n^2)$ in both time and memory with respect to sequence length. For a sequence of length $n$ with dimension $d$, computing the attention matrix requires $O(n^2 d)$ operations and $O(n^2)$ memory. This motivates several lines of work:

| Approach | Complexity | Mechanism |
|---|---|---|
| Sparse attention (Beltagy et al., 2020) | $O(n\sqrt{n})$ | Attend to local window + global tokens |
| Linear attention (Katharopoulos et al., 2020) | $O(n)$ | Kernel approximation of softmax |
| Flash Attention (Dao et al., 2022) | $O(n^2)$ time, $O(n)$ memory | IO-aware tiling, no materialized attention matrix |

Flash Attention does not reduce the asymptotic time complexity but eliminates the memory bottleneck, making standard attention practical for much longer sequences.


Computational Considerations

KV Cache

During autoregressive generation, the model produces one token at a time. Without caching, generating token $t$ requires recomputing the keys and values for all $t$ previous tokens from scratch, so the redundant projection work alone grows as $O(n^2)$ over a sequence of length $n$ (and the repeated attention over the growing prefix costs $O(n^3)$ in total).

The KV cache stores the key and value projections from all previous positions. At each generation step, only the new token's query, key, and value are computed; the new key and value are appended to the cache, and attention is computed between the single new query and all cached keys/values. This reduces the per-step projection cost from $O(n \cdot d^2)$ to $O(d^2)$, while the attention dot products against the cached keys still cost $O(n \cdot d)$ per step.

Memory cost. For a model with $L$ layers, $h$ heads, and $d_k$ dimensions per head, the KV cache for a sequence of length $n$ holds $2 \cdot L \cdot h \cdot d_k \cdot n$ values (the factor of 2 covers keys and values). At the dimensions of LLaMA-2 70B ($L = 80$, $h = 64$, $d_k = 128$) with full multi-head attention, a sequence of length $n = 4096$ in FP16 requires about 10 GiB, a significant fraction of total GPU memory, and the cost scales linearly with batch size.
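The figure follows directly from the formula (sizes here are for the stated configuration, assuming full multi-head attention):

```python
# KV cache size = 2 (keys and values) * L * h * d_k * n * bytes per element.
L, h, d_k, n = 80, 64, 128, 4096   # LLaMA-2-70B-like dimensions, full MHA
bytes_fp16 = 2
cache_bytes = 2 * L * h * d_k * n * bytes_fp16
assert cache_bytes / 2**30 == 10.0  # 10 GiB for a single 4096-token sequence
```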

Flash Attention

Standard attention materializes the full $n \times n$ attention matrix, requiring $O(n^2)$ memory. Flash Attention (Dao et al., 2022) reformulates the computation using tiling and online softmax to avoid ever storing the full matrix.

Key ideas:

  1. Tiling. Partition $Q$, $K$, $V$ into blocks that fit in GPU SRAM (fast on-chip memory). Compute partial attention within tiles.
  2. Online softmax. Maintain running statistics (max and sum) to compute the exact softmax incrementally across tiles, without needing the full row of logits.
  3. Recomputation. In the backward pass, recompute attention weights from $Q$, $K$, $V$ rather than storing them. This trades compute for memory.

The result: exact attention (no approximation) with $O(n)$ memory instead of $O(n^2)$. Flash Attention achieves 2-4x wall-clock speedup over standard PyTorch attention by reducing HBM (high-bandwidth memory) reads/writes, which are the true bottleneck on modern GPUs.
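The online-softmax idea at the heart of this algorithm can be demonstrated for a single query row: process the scores block by block, keeping only a running max, a running normalizer, and a running weighted sum (a NumPy sketch of the recurrence, not the tiled GPU kernel):

```python
import numpy as np

def online_softmax_weighted_sum(scores, V, block=4):
    """Streaming, numerically stable softmax(scores) @ V for one query row,
    never holding the full row of probabilities at once."""
    m = -np.inf                       # running max of seen scores
    s = 0.0                           # running sum of exp(score - m)
    acc = np.zeros(V.shape[-1])       # running weighted sum of values
    for start in range(0, len(scores), block):
        sc = scores[start:start + block]
        vb = V[start:start + block]
        m_new = max(m, sc.max())
        # Rescale previous statistics to the new max, then absorb the block.
        correction = np.exp(m - m_new)
        s = s * correction + np.exp(sc - m_new).sum()
        acc = acc * correction + np.exp(sc - m_new) @ vb
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
scores = rng.normal(size=16)
V = rng.normal(size=(16, 8))
p = np.exp(scores - scores.max())
ref = (p / p.sum()) @ V               # reference: full-row softmax
assert np.allclose(online_softmax_weighted_sum(scores, V), ref)
```

The `correction` factor is the key trick: earlier partial sums are retroactively rescaled whenever a larger score appears, so the final result is exact, not approximate.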

Grouped Query Attention (GQA)

Multi-head attention requires separate $K$ and $V$ projections for each head, which inflates the KV cache. Grouped Query Attention (Ainslie et al., 2023) reduces this by sharing key and value heads across groups of query heads.

| Variant | KV heads | Query heads | KV cache size |
|---|---|---|---|
| Multi-Head Attention (MHA) | $h$ | $h$ | $2Lhd_k n$ |
| Multi-Query Attention (MQA) | $1$ | $h$ | $2Ld_k n$ |
| Grouped Query Attention (GQA) | $g$ | $h$ | $2Lgd_k n$ |

GQA with $g$ groups interpolates between MHA ($g = h$) and MQA ($g = 1$). LLaMA 2 70B uses GQA with 8 KV heads and 64 query heads ($g = 8$), reducing KV cache by 8x relative to MHA with minimal quality degradation. Mistral 7B similarly uses GQA. The reduction in KV cache size directly increases the maximum batch size that fits in memory during inference, improving throughput.
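At the shape level, GQA shares each of the $g$ KV heads among $h/g$ consecutive query heads, which the toy implementation below realizes by simply repeating the KV heads (illustrative NumPy, ignoring masking and the learned projections):

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_groups):
    """Q: (h, n, d_k); K, V: (g, n, d_k). Each KV head serves h/g query heads;
    here we expand KV heads by repetition to make the sharing explicit."""
    h, _, d_k = Q.shape
    repeat = h // n_groups
    K = np.repeat(K, repeat, axis=0)    # (h, n, d_k)
    V = np.repeat(V, repeat, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
h, g, n, d_k = 8, 2, 5, 16
Q = rng.normal(size=(h, n, d_k))
K = rng.normal(size=(g, n, d_k))       # only g KV heads are stored/cached
V = rng.normal(size=(g, n, d_k))
out = grouped_query_attention(Q, K, V, g)
assert out.shape == (h, n, d_k)
```

Only the $(g, n, d_k)$ tensors need to live in the KV cache; the repetition is a cheap view-style expansion at attention time.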


Key Papers

| Paper | Year | Contribution |
|---|---|---|
| Vaswani et al., Attention Is All You Need | 2017 | The transformer architecture: scaled dot-product attention, multi-head attention, encoder-decoder structure |
| Devlin et al., BERT | 2018 | Bidirectional encoder with masked language modeling; demonstrated that pretraining + finetuning dominates task-specific architectures |
| Radford et al., Improving Language Understanding by Generative Pre-Training (GPT) | 2018 | Decoder-only transformer pretrained with causal LM; showed generative pretraining transfers to discriminative tasks |
| Radford et al., Language Models are Unsupervised Multitask Learners (GPT-2) | 2019 | Scaled decoder-only to 1.5B parameters; demonstrated zero-shot task transfer |
| Brown et al., Language Models are Few-Shot Learners (GPT-3) | 2020 | 175B-parameter decoder-only model; established in-context learning as a paradigm |
| Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding | 2021 | RoPE: relative position encoding via rotation, now standard in open-source LLMs |
| Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 2022 | IO-aware attention algorithm achieving $O(n)$ memory; changed how attention is implemented in practice |
| Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | 2023 | Grouped Query Attention for efficient KV caching; adopted by LLaMA 2, Mistral |