Transformers and Attention
The transformer architecture (Vaswani et al., 2017) replaced recurrence with attention as the fundamental mechanism for sequence modeling. This article develops the architecture from the attention primitive through full encoder-decoder and decoder-only variants, covering positional encoding schemes, computational considerations, and the design choices that have made decoder-only transformers the dominant paradigm in modern language modeling.
Attention as a Differentiable Dictionary
Attention can be understood as a soft lookup in a differentiable dictionary. Given a query $q$, a set of keys $\{k_1, \dots, k_m\}$, and corresponding values $\{v_1, \dots, v_m\}$, attention computes a weighted combination of values where the weights reflect query-key similarity.
Scaled dot-product attention. For queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{m \times d_k}$, and values $V \in \mathbb{R}^{m \times d_v}$:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The matrix $QK^\top \in \mathbb{R}^{n \times m}$ contains all pairwise dot products between queries and keys. The softmax is applied row-wise, producing a stochastic matrix of attention weights where each row sums to 1. The output is then a convex combination of value vectors.
Why scale by $\sqrt{d_k}$? If the entries of $q$ and $k$ are independent with mean 0 and variance 1, their dot product $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$. For large $d_k$, the dot products grow in magnitude, pushing softmax inputs into regions where gradients are vanishingly small. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping the softmax in a regime with useful gradients throughout training.
The dictionary analogy. The query asks a question (“what information do I need?”), the keys describe the available entries (“here is what I contain”), and the values are the actual content to retrieve. Unlike a hard dictionary lookup that returns a single entry, attention returns a weighted mixture of all values, with weights determined by query-key compatibility. This soft retrieval is differentiable, enabling end-to-end training.
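The operation above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not a production kernel; the function name and the shapes of the random inputs are our own choices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights                     # convex combination of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)         # (4, 8)
print(w.sum(axis=-1))    # [1. 1. 1. 1.]
```

Note that each output row is a weighted mixture of all six value vectors, never a hard selection of one.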
Multi-Head Attention
A single attention function captures one type of relationship. Multi-head attention runs $h$ parallel attention operations, each with its own learned projections, allowing the model to jointly attend to information from different representation subspaces.
For head $i$:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$. The heads are concatenated and projected:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$

where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
Standard configuration. Typical choices are $h = 8$ to $128$ heads with $d_k = d_v = d_{\text{model}} / h$. This keeps the total computational cost comparable to single-head attention with full dimensionality. For example, with $d_{\text{model}} = 4096$ and $h = 32$, each head operates in $d_k = 128$ dimensions.
What different heads learn. Empirical analysis (Clark et al., 2019; Voita et al., 2019) shows that different heads specialize in different linguistic phenomena:
| Head type | Attends to |
|---|---|
| Positional | Adjacent or fixed-offset tokens |
| Syntactic | Dependency relations (subject-verb, modifier-noun) |
| Semantic | Coreference, entity relationships |
| Delimiter | Sentence boundaries, special tokens |
Not all heads are equally important. Voita et al. (2019) showed that many heads can be pruned with minimal performance degradation, suggesting redundancy in standard configurations.
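Multi-head attention is usually implemented with a single large projection followed by a reshape, rather than $h$ separate matrix multiplies. A simplified NumPy sketch (our own helper names; biases and masking omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Project, split d_model into h heads, attend per head, concat, project."""
    n, d_model = X.shape
    d_k = d_model // h
    # (n, d_model) -> (h, n, d_k): one attention problem per head
    def split(W):
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 5, 32, 4
X = rng.normal(size=(n, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, h=h)
print(out.shape)   # (5, 32)
```

The per-head dimension $d_k = d_{\text{model}}/h$ is what keeps the total cost comparable to a single full-width head.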
Self-Attention vs Cross-Attention
The attention mechanism can be deployed in different configurations depending on where queries, keys, and values originate.
Self-attention. All three inputs derive from the same sequence: $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $X \in \mathbb{R}^{n \times d_{\text{model}}}$ is the input sequence. Each position attends to all positions in the same sequence, allowing the model to capture arbitrary pairwise dependencies regardless of distance. This is the core operation in both encoder and decoder blocks.
Cross-attention. Queries come from one sequence (typically the decoder), while keys and values come from another (typically the encoder output):

$$Q = X_{\text{dec}}W^Q, \qquad K = X_{\text{enc}}W^K, \qquad V = X_{\text{enc}}W^V$$
This allows the decoder to selectively attend to relevant parts of the encoder representation. Cross-attention appears in encoder-decoder architectures (translation, summarization) and in multimodal models (attending to image features while generating text).
Causal (masked) self-attention. For autoregressive generation, position $i$ must not attend to positions $j > i$ (future tokens). This is enforced by adding a mask $M$ to the attention logits before softmax:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$

The $-\infty$ entries become 0 after softmax, effectively zeroing out attention to future positions. This preserves the autoregressive factorization $p(x_1, \dots, x_n) = \prod_{t=1}^{n} p(x_t \mid x_{<t})$.
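The mask can be built with an upper-triangular boolean matrix. A small NumPy sketch (illustrative only; here the same matrix plays the role of $Q$, $K$, and $V$, with projections omitted):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i attends only to positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)          # -inf -> weight 0 after softmax
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = causal_attention(X, X, X)
print(np.allclose(w, np.tril(w)))  # True: no weight on future positions
```

Because the mask is applied to the logits rather than the weights, each row still sums to exactly 1 over the allowed positions.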
Positional Encoding
Self-attention is permutation-equivariant: shuffling the input sequence and applying attention yields the same result as applying attention and then shuffling. This means the architecture has no inherent notion of position. Positional encodings inject order information.
Sinusoidal Positional Encoding
Vaswani et al. (2017) proposed fixed encodings based on sinusoids at different frequencies:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Each dimension oscillates at a different frequency, from wavelength $2\pi$ (for $i = 0$) to wavelength $10000 \cdot 2\pi$ (for $i = d_{\text{model}}/2 - 1$). The encoding is added to the token embeddings.
Why sinusoids? For any fixed offset $k$, the encoding at position $pos + k$ can be expressed as a linear function of the encoding at position $pos$. Specifically, $PE_{pos+k}$ can be computed from $PE_{pos}$ via a rotation matrix that depends only on $k$, not on $pos$. This makes it possible for attention to learn relative position patterns from absolute position encodings.
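The encoding table is straightforward to generate. A short NumPy sketch (function name ours; assumes an even $d_{\text{model}}$):

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_pos)[:, None]              # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))  # one frequency per dimension pair
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)      # (50, 16)
print(pe[0, :4])     # position 0: sin terms are 0, cos terms are 1
```

Each column pair traces a sinusoid of a distinct wavelength, which is what makes the fixed-offset linear relationship hold.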
Learned Positional Embeddings
BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) use learned position embeddings: a trainable matrix $P \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$ where $L_{\max}$ is the maximum sequence length. Simple and effective, but the model cannot generalize to positions beyond $L_{\max}$ without extrapolation techniques.
Rotary Position Embeddings (RoPE)
Su et al. (2021) proposed encoding position by rotating query and key vectors in 2D subspaces. For a vector $x$ at position $m$, RoPE applies a block-diagonal rotation:

$$R_{\Theta,m} = \mathrm{diag}\!\left(R_{m\theta_1}, \dots, R_{m\theta_{d/2}}\right), \qquad R_{m\theta_i} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$$

where $\theta_i = 10000^{-2(i-1)/d}$. The key property is that the dot product between rotated queries and keys depends only on relative position:

$$(R_{\Theta,m}\,q)^\top (R_{\Theta,n}\,k) = q^\top R_{\Theta,\,n-m}\,k$$
RoPE encodes relative position without additive embeddings, preserving the dot-product structure of attention. It has become the standard choice for modern LLMs (LLaMA, Mistral, Qwen).
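The relative-position property is easy to verify numerically. A minimal sketch that rotates consecutive dimension pairs (function name and test positions are our own; real implementations typically split dimensions differently for efficiency):

```python
import numpy as np

def rope(x, m, base=10000):
    """Rotate consecutive dimension pairs of x by angles m * theta_i."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Same relative offset (4) at different absolute positions -> same dot product
a = rope(q, 3) @ rope(k, 7)
b = rope(q, 13) @ rope(k, 17)
print(np.isclose(a, b))  # True
```

Shifting both positions by the same amount leaves every query-key dot product unchanged, which is exactly the relative-position invariance stated above.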
ALiBi (Attention with Linear Biases)
Press et al. (2022) take a different approach: no positional encoding at all. Instead, ALiBi adds a static, non-learned bias to attention scores based on the distance between query and key positions:

$$\text{score}_{ij} = q_i \cdot k_j - m \cdot (i - j), \qquad j \le i$$

where $m$ is a head-specific slope (set geometrically across heads, not learned). Closer tokens receive higher attention scores. ALiBi enables strong length extrapolation with zero additional parameters.
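The bias matrix depends only on positions, so it can be precomputed once. A sketch assuming the slope schedule $m_k = 2^{-8k/h}$ for head $k$ (the geometric sequence used in the ALiBi paper for $h$ heads; function name ours):

```python
import numpy as np

def alibi_bias(n, n_heads):
    """Per-head bias -m_h * (i - j) for j <= i; zero on and above the diagonal."""
    # Geometric slopes: for 8 heads this gives 1/2, 1/4, ..., 1/256
    slopes = 2.0 ** (-8 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    dist = -(i - j).clip(min=0)          # -(i - j) below diagonal, 0 elsewhere
    return slopes[:, None, None] * dist  # (n_heads, n, n), added to causal logits

bias = alibi_bias(5, 8)
print(bias.shape)     # (8, 5, 5)
print(bias[0, 4, 0])  # -2.0: slope 1/2 times distance 4
```

Because the slopes differ per head, some heads decay attention quickly with distance while others retain a nearly uniform view.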
The Transformer Block
The fundamental building block of the transformer stacks attention with a position-wise feedforward network, connected by residual connections and layer normalization.
Pre-Norm Transformer Block
The modern standard (pre-norm) arrangement:

$$x \leftarrow x + \text{MultiHead}(\text{LN}(x)), \qquad x \leftarrow x + \text{FFN}(\text{LN}(x))$$

where $\text{LN}$ is layer normalization and $\text{FFN}$ is a position-wise feedforward network:

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

with $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$. The standard expansion factor is $d_{\text{ff}} = 4\,d_{\text{model}}$.
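A pre-norm block can be sketched directly from these equations. This simplified NumPy version omits the learnable LayerNorm gain/bias and stubs out attention with a zero function, since only the residual wiring is being illustrated (all names are ours):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)   # learnable gain/bias omitted

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to d_ff = 4*d_model, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def pre_norm_block(x, attn, ffn_params):
    """x = x + Attn(LN(x)); x = x + FFN(LN(x))."""
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x), *ffn_params)
    return x

rng = np.random.default_rng(0)
d, d_ff, n = 16, 64, 5
params = (rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff),
          rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d))
zero_attn = lambda h: h * 0.0            # stand-in for self-attention
out = pre_norm_block(rng.normal(size=(n, d)), zero_attn, params)
print(out.shape)  # (5, 16)
```

Note that the residual path runs straight through both sublayers without normalization, which is the source of pre-norm's training stability.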
Pre-Norm vs Post-Norm
The original transformer (Vaswani et al., 2017) applied layer normalization after the residual connection (post-norm):

$$x \leftarrow \text{LN}(x + \text{Sublayer}(x))$$
Pre-norm places the normalization before the sublayer. This has a significant practical advantage: at initialization, the residual path is approximately an identity function, so gradients flow cleanly through the full depth of the network. Post-norm requires careful learning rate warmup to avoid training instability. Pre-norm is now standard in virtually all large-scale models (GPT-3, PaLM, LLaMA).
Some architectures (PaLM) further simplify by using RMSNorm instead of LayerNorm, removing the mean-centering step.
Parameter Count per Block
For a single transformer block with hidden size $d_{\text{model}}$ and $d_{\text{ff}} = 4\,d_{\text{model}}$:

| Component | Parameters |
|---|---|
| $W^Q, W^K, W^V$ (attention) | $3\,d_{\text{model}}^2$ |
| $W^O$ (output projection) | $d_{\text{model}}^2$ |
| $W_1$ (FFN up-projection) | $4\,d_{\text{model}}^2$ |
| $W_2$ (FFN down-projection) | $4\,d_{\text{model}}^2$ |
| LayerNorm (x2) | $4\,d_{\text{model}}$ |
| Total per block | $\approx 12\,d_{\text{model}}^2$ |
For GPT-3 ($d_{\text{model}} = 12288$, 96 layers), this gives roughly $96 \times 12 \times 12288^2 \approx 174\text{B}$ parameters, consistent with the reported 175B (including embeddings).
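The arithmetic is worth checking directly:

```python
d_model, n_layers = 12288, 96
per_block = 12 * d_model**2      # 3d^2 (QKV) + d^2 (output) + 8d^2 (FFN)
total = n_layers * per_block
print(f"{total / 1e9:.1f}B")     # 173.9B, plus embeddings -> ~175B
```

The biases and LayerNorm parameters are $O(d_{\text{model}})$ per block and negligible next to the $O(d_{\text{model}}^2)$ matrices.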
Encoder-Decoder Architecture
The original transformer (Vaswani et al., 2017) follows an encoder-decoder structure designed for sequence-to-sequence tasks.
Encoder. A stack of $N$ identical blocks ($N = 6$ in the original paper), each containing:
- Multi-head self-attention (bidirectional — every position attends to every other position)
- Position-wise FFN
- Residual connections and layer normalization around each sublayer
The encoder processes the full input sequence in parallel, producing a contextualized representation $H \in \mathbb{R}^{n \times d_{\text{model}}}$.
Decoder. A stack of $N$ identical blocks ($N = 6$ in the original paper), each containing:
- Masked multi-head self-attention (causal — prevents attending to future positions)
- Multi-head cross-attention (queries from decoder, keys/values from the encoder output $H$)
- Position-wise FFN
- Residual connections and layer normalization
The decoder generates output tokens autoregressively, attending to both its own previous outputs (via masked self-attention) and the encoder representation (via cross-attention).
Use cases. Encoder-decoder architectures are natural for tasks with distinct input and output sequences: machine translation, abstractive summarization, speech recognition. T5 (Raffel et al., 2020) cast all NLP tasks into a text-to-text format using this architecture. BART (Lewis et al., 2020) combined a bidirectional encoder with an autoregressive decoder for denoising pretraining.
Decoder-Only Architecture
GPT (Radford et al., 2018) simplified the transformer to a decoder-only stack: blocks of masked self-attention and FFN, with no encoder and no cross-attention. The model is trained with a causal language modeling objective:

$$\mathcal{L} = -\sum_{t=1}^{n} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})$$
Why decoder-only dominates. Several factors explain the convergence toward decoder-only:
- Architectural simplicity. One block type, one attention pattern. Fewer design decisions, easier to scale.
- Unified pretraining and generation. The same autoregressive objective serves both “understanding” (in-context) and generation. No separate pretraining objectives needed.
- In-context learning. Decoder-only models naturally support few-shot prompting — examples and task are concatenated into a single sequence, and the causal structure processes them left-to-right.
- Scaling properties. Empirically, decoder-only models show smooth, predictable scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). The simplicity of the architecture reduces confounds in scaling analysis.
GPT-2 (Radford et al., 2019) demonstrated zero-shot task transfer. GPT-3 (Brown et al., 2020) demonstrated few-shot in-context learning at 175B parameters. This line of work established decoder-only transformers as the default architecture for large language models.
Encoder-Only Architecture
BERT (Devlin et al., 2018) uses only the encoder with bidirectional self-attention. Every position can attend to every other position — there is no causal mask.
Pretraining objectives:
- Masked Language Modeling (MLM). Randomly mask 15% of input tokens; the model predicts the masked tokens from their bidirectional context. This forces the model to build rich contextual representations.
- Next Sentence Prediction (NSP). Classify whether two segments are consecutive (later shown to be unnecessary by RoBERTa).
Strengths. Bidirectional attention means each token’s representation is informed by the full context, making BERT representations powerful for:
- Classification (sentiment, NLI)
- Token-level tasks (NER, POS tagging)
- Sentence embeddings and retrieval
- Extractive question answering
Limitation. Encoder-only models cannot generate text autoregressively. The bidirectional attention pattern means there is no natural left-to-right factorization of the output distribution. This limits their applicability to discriminative and embedding tasks.
Why Transformers Replaced RNNs
Recurrent neural networks (LSTMs, GRUs) process sequences token by token, maintaining a hidden state that summarizes the history. Transformers replaced them for three fundamental reasons.
Parallelism. RNNs have $O(n)$ sequential operations: each time step depends on the previous hidden state. Self-attention computes all pairwise interactions in $O(1)$ sequential steps (with $O(n^2)$ parallel work). On modern hardware with massive parallelism, this is dramatically faster for training.
Long-range dependencies. In RNNs, information from token $i$ must pass through $O(|j - i|)$ transformations to reach token $j$, with vanishing or exploding gradients along the way. In self-attention, every pair of positions is connected by a single attention operation: the path length is $O(1)$.
Representational capacity. Each layer of self-attention can implement arbitrary pairwise interactions, while each RNN step applies the same transition function. Transformers can express a richer class of sequence-to-sequence functions per layer.
The quadratic cost. The obvious drawback: self-attention is $O(n^2)$ in both time and memory with respect to sequence length. For a sequence of length $n$ with dimension $d$, computing the attention matrix requires $O(n^2 d)$ operations and $O(n^2)$ memory. This motivates several lines of work:
| Approach | Complexity | Mechanism |
|---|---|---|
| Sparse attention (Beltagy et al., 2020) | $O(n \cdot w)$ for window size $w$ | Attend to local window + global tokens |
| Linear attention (Katharopoulos et al., 2020) | $O(n)$ | Kernel approximation of softmax |
| Flash Attention (Dao et al., 2022) | $O(n^2)$ time, $O(n)$ memory | IO-aware tiling, no materialized attention matrix |
Flash Attention does not reduce the asymptotic time complexity but eliminates the memory bottleneck, making standard attention practical for much longer sequences.
Computational Considerations
KV Cache
During autoregressive generation, the model generates one token at a time. Without caching, generating token $t$ requires recomputing attention over all previous tokens, leading to $O(n^3)$ total attention computation for a sequence of length $n$.
The KV cache stores the key and value projections from all previous positions. At each generation step, only the new token's query, key, and value are computed. The new key and value are appended to the cache, and attention is computed between the single new query and all cached keys/values. This reduces the per-step projection cost from $O(n d^2)$ to $O(d^2)$ (the attention computation itself is $O(n d)$ for the dot products).
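A cached decode loop produces exactly the same outputs as full causal attention over the whole sequence. An illustrative single-head NumPy sketch (names ours; real caches are preallocated tensors rather than Python lists):

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def generate_with_cache(X, W_q, W_k, W_v):
    """Decode step by step, appending each new K/V to the cache."""
    d_k = W_k.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for x_t in X:                      # one "token" per step
        q = x_t @ W_q
        k_cache.append(x_t @ W_k)      # only the new K/V are computed
        v_cache.append(x_t @ W_v)
        K, V = np.stack(k_cache), np.stack(v_cache)
        w = softmax(q @ K.T / np.sqrt(d_k))
        outputs.append(w @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cached = generate_with_cache(X, Wq, Wk, Wv)

# Full causal attention over the whole sequence gives the same result
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
scores[np.triu_indices(n, k=1)] = -np.inf
full = softmax(scores) @ V
print(np.allclose(cached, full))  # True
```

The equivalence holds because causal masking already prevents earlier positions from seeing later ones, so their keys and values never need to be recomputed.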
Memory cost. For a model with $L$ layers, $h$ heads, and $d_{\text{head}}$ dimensions per head, the KV cache for a sequence of length $n$ requires $2 \cdot L \cdot h \cdot d_{\text{head}} \cdot n$ stored values (factor of 2 for keys and values). For LLaMA-2 70B ($L = 80$, $h = 64$, $d_{\text{head}} = 128$), assuming full multi-head attention, a 16K-token sequence in FP16 costs approximately 40 GB, a significant fraction of total GPU memory.
Flash Attention
Standard attention materializes the full $n \times n$ attention matrix, requiring $O(n^2)$ memory. Flash Attention (Dao et al., 2022) reformulates the computation using tiling and online softmax to avoid ever storing the full matrix.
Key ideas:
- Tiling. Partition $Q$, $K$, $V$ into blocks that fit in GPU SRAM (fast on-chip memory). Compute partial attention within tiles.
- Online softmax. Maintain running statistics (max and sum) to compute the exact softmax incrementally across tiles, without needing the full row of logits.
- Recomputation. In the backward pass, recompute attention weights from $Q$, $K$, $V$ rather than storing them. This trades compute for memory.
The result: exact attention (no approximation) with $O(n)$ memory instead of $O(n^2)$. Flash Attention achieves 2-4x wall-clock speedup over standard PyTorch attention by reducing HBM (high-bandwidth memory) reads/writes, which are the true bottleneck on modern GPUs.
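The online-softmax idea can be demonstrated for a single attention row: process the logits in tiles, keep a running max and running sum, and rescale the accumulator whenever the max changes. A simplified NumPy sketch (our own function name; Flash Attention does this per tile of rows on-chip):

```python
import numpy as np

def online_softmax_weighted_sum(scores, V, block=4):
    """One attention row, processed in tiles with running max/sum statistics."""
    m = -np.inf                    # running max of logits seen so far
    s = 0.0                        # running sum of exp(logit - m)
    acc = np.zeros(V.shape[1])     # running weighted sum of values
    for start in range(0, len(scores), block):
        sc, vb = scores[start:start+block], V[start:start+block]
        m_new = max(m, sc.max())
        correction = np.exp(m - m_new)    # rescale statistics from earlier tiles
        p = np.exp(sc - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ vb
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
scores, V = rng.normal(size=10), rng.normal(size=(10, 4))
tiled = online_softmax_weighted_sum(scores, V)
p = np.exp(scores - scores.max())
exact = (p / p.sum()) @ V
print(np.allclose(tiled, exact))  # True
```

The final division by the running sum happens once at the end, so no tile ever needs the full row of logits, which is exactly what removes the $O(n^2)$ memory term.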
Grouped Query Attention (GQA)
Multi-head attention requires separate $K$ and $V$ projections for each head, which inflates the KV cache. Grouped Query Attention (Ainslie et al., 2023) reduces this by sharing key and value heads across groups of query heads.
| Variant | KV heads | Query heads | KV cache size |
|---|---|---|---|
| Multi-Head Attention (MHA) | $h$ | $h$ | $2\,L\,h\,d_{\text{head}}\,n$ |
| Multi-Query Attention (MQA) | 1 | $h$ | $2\,L\,d_{\text{head}}\,n$ |
| Grouped Query Attention (GQA) | $g$ | $h$ | $2\,L\,g\,d_{\text{head}}\,n$ |
GQA with $g$ groups interpolates between MHA ($g = h$) and MQA ($g = 1$). LLaMA 2 70B uses GQA with 8 KV heads and 64 query heads ($g = 8$), reducing KV cache by 8x relative to MHA with minimal quality degradation. Mistral 7B similarly uses GQA. The reduction in KV cache size directly increases the maximum batch size that fits in memory during inference, improving throughput.
Key Papers
| Paper | Year | Contribution |
|---|---|---|
| Vaswani et al., Attention Is All You Need | 2017 | The transformer architecture: scaled dot-product attention, multi-head attention, encoder-decoder structure |
| Devlin et al., BERT | 2018 | Bidirectional encoder with masked language modeling; demonstrated that pretraining + finetuning dominates task-specific architectures |
| Radford et al., Improving Language Understanding by Generative Pre-Training (GPT) | 2018 | Decoder-only transformer pretrained with causal LM; showed generative pretraining transfers to discriminative tasks |
| Radford et al., Language Models are Unsupervised Multitask Learners (GPT-2) | 2019 | Scaled decoder-only to 1.5B parameters; demonstrated zero-shot task transfer |
| Brown et al., Language Models are Few-Shot Learners (GPT-3) | 2020 | 175B parameter decoder-only model; established in-context learning as a paradigm |
| Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding | 2021 | RoPE: relative position encoding via rotation, now standard in open-source LLMs |
| Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 2022 | IO-aware attention algorithm achieving $O(n)$ memory; changed how attention is implemented in practice |
| Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | 2023 | Grouped Query Attention for efficient KV caching; adopted by LLaMA 2, Mistral |