Large Language Models

Large language models (LLMs) are autoregressive transformer models trained on massive text corpora to predict the next token. This article covers the pretraining objectives, scaling phenomena, alignment techniques, and architectural decisions that define modern LLMs.


Pretraining Objectives

Causal Language Modeling (CLM)

The standard objective for decoder-only models (GPT family, LLaMA, Claude). Given a sequence of tokens x_1, x_2, \ldots, x_T, maximize the log-likelihood:

\mathcal{L}_{\text{CLM}} = \sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)

Each token is predicted from its left context only, enforced by a causal attention mask. The model learns a conditional distribution over the vocabulary at each position.

Why next-token prediction works. Predicting the next token requires compressing all preceding context into a representation sufficient for prediction. For natural text, this requires modeling syntax, semantics, world knowledge, reasoning patterns, and stylistic conventions. The loss provides dense supervision at every position in every training sequence.
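The objective can be written out concretely. A minimal sketch in pure Python, with hand-written per-position distributions standing in for a model's outputs:

```python
import math

def clm_loss(logprobs, tokens):
    """Negative log-likelihood of a token sequence under a causal LM.

    logprobs[t] is a dict mapping each vocabulary token to its
    log-probability given tokens[:t] (a stand-in for model output)."""
    return -sum(logprobs[t][tok] for t, tok in enumerate(tokens))

# Toy 3-token vocabulary; each position's distribution conditions only
# on the left context (hand-written here for illustration).
logprobs = [
    {"the": math.log(0.7), "cat": math.log(0.2), "sat": math.log(0.1)},
    {"the": math.log(0.1), "cat": math.log(0.6), "sat": math.log(0.3)},
    {"the": math.log(0.1), "cat": math.log(0.1), "sat": math.log(0.8)},
]
loss = clm_loss(logprobs, ["the", "cat", "sat"])
```

Minimizing this loss is equivalent to maximizing the summed log-likelihood above; in training, the distributions come from a softmax over the model's logits at each position.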

Masked Language Modeling (MLM)

The BERT objective. Randomly mask 15% of input tokens and predict them from bidirectional context:

\mathcal{L}_{\text{MLM}} = \sum_{t \in \mathcal{M}} \log P(x_t \mid \mathbf{x}_{\setminus \mathcal{M}}; \theta)

where \mathcal{M} is the set of masked positions. Of the 15% selected, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. This prevents the model from learning a shortcut based on the [MASK] token.
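The 80/10/10 corruption scheme is easy to sketch. A toy implementation (illustrative only; real pipelines operate on token IDs and batch the corruption):

```python
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, rng=None):
    """BERT-style corruption: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random vocab token, 10% left unchanged.
    Returns (corrupted tokens, list of target positions)."""
    rng = rng or random.Random(0)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: position is a target but the token stays unchanged
    return out, targets

tokens = ["tok%d" % i for i in range(50)]
corrupted, targets = mlm_mask(tokens, vocab=["cat", "dog", "sat"])
```

The loss is then computed only at the returned target positions, matching the sum over \mathcal{M} above.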

MLM produces bidirectional representations useful for classification and extraction tasks, but cannot generate text autoregressively. This architectural limitation is why decoder-only models dominate the current landscape.

Prefix Language Modeling

A hybrid: bidirectional attention over a prefix, causal attention over the remainder. Used in T5, UL2, and some encoder-decoder models. Enables both understanding (prefix) and generation (continuation) in a single architecture.


Tokenization

LLMs operate on tokens, not characters or words. The tokenizer defines the vocabulary and segmentation.

Byte Pair Encoding (BPE)

The dominant algorithm (GPT-2, GPT-4, LLaMA). Starting from a character-level vocabulary, iteratively merge the most frequent adjacent pair:

  1. Initialize vocabulary with all individual bytes/characters
  2. Count all adjacent token pairs in the corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until vocabulary reaches target size (32K–100K typical)

BPE produces a variable-length tokenization: common words become single tokens, rare words are split into subword units. The string “transformer” might tokenize as [“trans”, “former”] while “the” is a single token.
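The merge loop above can be sketched in a few lines. A toy trainer (the word-frequency input and tuple-of-characters representation are simplifications; production tokenizers work on bytes with pre-tokenization):

```python
from collections import Counter

def bpe_train(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dict.
    Each word starts as a tuple of single characters."""
    corpus = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for toks, c in corpus.items():
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        merged = {}
        for toks, c in corpus.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(toks[i] + toks[i + 1]); i += 2
                else:
                    out.append(toks[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + c
        corpus = merged
    return merges

# "lo" then "low" become single tokens because they are most frequent.
merges = bpe_train({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
```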

Vocabulary size tradeoffs. Larger vocabularies reduce sequence length (fewer tokens per document, faster inference) but increase embedding table size and make each token’s embedding less well-trained. Smaller vocabularies increase sequence length but handle rare words and multilingual text more gracefully.

SentencePiece

A language-agnostic tokenizer that treats the input as a raw stream of Unicode text (whitespace included), requiring no pre-tokenization or language-specific preprocessing. Implements both BPE and unigram language model tokenization. Used by LLaMA, Mistral, and most multilingual models. The unigram variant maintains a large initial vocabulary and prunes tokens that least reduce the corpus likelihood, producing a probabilistically motivated segmentation.


Scaling Laws

Kaplan et al. (2020) and Hoffmann et al. (2022) established empirical power laws relating model performance to compute, data, and parameters.

Neural Scaling Laws (Kaplan et al., 2020)

Test loss follows a power law in each variable when the others are not bottlenecked:

L(N) \propto N^{-\alpha_N}, \quad L(D) \propto D^{-\alpha_D}, \quad L(C) \propto C^{-\alpha_C}

where N = parameters, D = dataset tokens, and C = compute (FLOPs). The fitted exponents \alpha_N \approx 0.076 and \alpha_D \approx 0.095 suggest that scaling data is slightly more efficient than scaling parameters.

Chinchilla Scaling (Hoffmann et al., 2022)

For a fixed compute budget CC, the optimal allocation scales parameters and data equally:

N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}

This overturned the prior assumption that larger models should be trained on relatively less data. Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) at roughly the same compute budget. The practical implication: most early LLMs were significantly undertrained relative to their parameter count.

Implications for practitioners. Given a compute budget, training a smaller model on more data typically outperforms training a larger model on less data. This has pushed the field toward longer training runs on larger datasets (LLaMA 3: 8B parameters on 15T tokens, far exceeding the Chinchilla-optimal ratio).
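Under the common approximation C ≈ 6ND and the Chinchilla rule of thumb of roughly 20 tokens per parameter (both assumptions, not exact fits), the compute-optimal allocation can be computed directly:

```python
import math

def chinchilla_optimal(flops, tokens_per_param=20.0):
    """Compute-optimal (N, D) under C ~= 6*N*D and D ~= 20*N
    (rules of thumb derived from Hoffmann et al., 2022).
    Solving 6 * N * (20 * N) = C gives N = sqrt(C / 120)."""
    n = math.sqrt(flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

# Chinchilla's approximate budget of ~5.8e23 FLOPs recovers
# roughly 70B parameters and 1.4T tokens.
n_opt, d_opt = chinchilla_optimal(5.76e23)
```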


Emergent Abilities

Certain capabilities appear abruptly as models scale, absent at smaller scales and present at larger ones (Wei et al., 2022):

  • Chain-of-thought reasoning: Models below ~60B parameters show minimal improvement from “let’s think step by step” prompting. Above this threshold, CoT dramatically improves mathematical and logical reasoning.
  • In-context learning: The ability to learn new tasks from few-shot examples in the prompt, without parameter updates. Improves consistently with scale.
  • Instruction following: Larger models more reliably follow complex, multi-step instructions.

Controversy. Schaeffer et al. (2023) argued that emergence is partly an artifact of evaluation metrics: when measured with continuous metrics rather than binary accuracy, many “emergent” abilities show smooth improvement with scale. The debate is unresolved, but the practical observation stands: larger models qualitatively differ in their capabilities.


Alignment: From Pretraining to Assistants

Pretrained LLMs are next-token predictors, not helpful assistants. Alignment techniques bridge this gap.

Supervised Fine-Tuning (SFT)

Train on curated (instruction, response) pairs:

\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log P(y_t \mid \mathbf{x}, y_1, \ldots, y_{t-1}; \theta)

where \mathbf{x} is the instruction and y is the desired response. The loss is computed only on the response tokens; instruction tokens contribute to context but not to the gradient. SFT datasets typically contain 10K–100K high-quality examples.
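The response-only loss can be sketched with an explicit mask. A toy version in pure Python (the log-probabilities are hand-written stand-ins for model outputs):

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Masked mean negative log-likelihood. token_logprobs[t] is the
    model's log-prob of the observed token at position t; loss_mask[t]
    is 1 for response tokens and 0 for instruction tokens."""
    n = sum(loss_mask)
    return -sum(lp * m for lp, m in zip(token_logprobs, loss_mask)) / n

# Three instruction tokens (masked out) followed by two response tokens.
logprobs = [math.log(0.2), math.log(0.4), math.log(0.3),
            math.log(0.9), math.log(0.8)]
mask = [0, 0, 0, 1, 1]
loss = sft_loss(logprobs, mask)
```

In practice the same effect is achieved by setting instruction-token labels to an ignore index so the cross-entropy skips them.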

Reinforcement Learning from Human Feedback (RLHF)

The three-stage alignment pipeline (Ouyang et al., 2022):

Stage 1: Reward model training. Collect human preference data: for each prompt, generate two responses and have a human label which is better. Train a reward model R(x, y) on these preferences using the Bradley-Terry model:

P(y_1 \succ y_2 \mid x) = \sigma(R(x, y_1) - R(x, y_2))

Stage 2: RL fine-tuning. Optimize the policy (LLM) to maximize the reward while staying close to the SFT model via a KL penalty:

\max_\theta \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[R(x, y) - \beta\, \text{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})\right]

The KL constraint prevents reward hacking (the model finding degenerate outputs that exploit the reward model without being genuinely helpful). Optimized using PPO (Proximal Policy Optimization).

Stage 3: Iteration. Collect new preference data on the RLHF model’s outputs, retrain the reward model, and repeat.

Direct Preference Optimization (DPO)

Rafailov et al. (2023) showed that the RLHF objective can be reparameterized to eliminate the reward model entirely:

\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]

where y_w and y_l are the preferred and dispreferred responses. DPO is simpler to implement (a standard supervised training loop, no RL infrastructure) and has become one of the dominant alignment methods.
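The DPO loss for a single preference pair reduces to a few lines, given sequence log-probabilities under the policy and the frozen reference model (the numbers below are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (y_w, y_l) pair. logp_* are sequence log-probs
    under the policy; ref_logp_* under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

# If the policy already prefers y_w more than the reference does,
# the margin is positive and the loss falls below log(2), its value
# at zero margin.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0,
                ref_logp_w=-11.0, ref_logp_l=-11.0, beta=0.1)
```

Gradients flow only through logp_w and logp_l; the reference terms are constants, which is what makes this a plain supervised objective.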


Architectural Choices in Modern LLMs

The transformer decoder block has been refined across model generations. Key decisions:

| Choice | GPT-2/3 | LLaMA | Modern (2025+) |
|---|---|---|---|
| Normalization | Post-norm | Pre-RMSNorm | Pre-RMSNorm |
| Activation | GELU | SiLU/Swish | SiLU |
| Position encoding | Learned absolute | RoPE | RoPE |
| Attention | Multi-head | Multi-head / GQA | GQA |
| Vocabulary size | 50K | 32K | 128K+ |
| Context length | 1K–2K | 4K | 128K–1M+ |

Grouped Query Attention (GQA). Instead of separate K/V projections per head, share K/V across groups of query heads. LLaMA 2 70B uses 8 KV heads for 64 query heads, reducing KV cache memory by 8x with minimal quality loss.
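The query-to-KV-head mapping is just integer grouping. A sketch (the head counts match the LLaMA 2 70B description above; the contiguous-blocks grouping is the usual convention, assumed here):

```python
def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    """In GQA, the query heads are split into n_kv_heads contiguous
    groups, each group sharing a single K/V head."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# LLaMA 2 70B layout: 64 query heads share 8 KV heads (groups of 8).
mapping = [kv_head_for_query(h, 64, 8) for h in range(64)]
```

Because only 8 K/V projections are computed and cached instead of 64, the KV cache shrinks by 8x, as stated above.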

RoPE (Rotary Position Embeddings). Encode position by rotating query and key vectors:

f(\mathbf{q}, m) = \mathbf{q}\, e^{im\theta}

The dot product \langle f(\mathbf{q}, m), f(\mathbf{k}, n) \rangle depends only on the relative position m - n rather than on absolute positions, naturally encoding relative distance. RoPE-based models can also be extended beyond their training context length via scaling techniques such as position interpolation.
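The relative-position property can be checked numerically by treating each pair of dimensions as one complex number, as in the complex-valued RoPE formulation. A pure-Python sketch with toy vectors (base 10000 is the standard choice):

```python
import cmath

def rope(vec, pos, theta_base=10000.0):
    """Apply rotary embedding to a vector of complex pairs:
    pair i is rotated by angle pos * theta_base**(-2i/d)."""
    d = 2 * len(vec)  # real dimension
    return [z * cmath.exp(1j * pos * theta_base ** (-2 * i / d))
            for i, z in enumerate(vec)]

def dot(a, b):
    # Real inner product of complex-pair vectors: sum of Re(a * conj(b)).
    return sum((x * y.conjugate()).real for x, y in zip(a, b))

q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.7j, -1 + 0.2j]
# <f(q, m), f(k, n)> depends only on m - n: both scores below use m - n = 2.
s1 = dot(rope(q, 5), rope(k, 3))
s2 = dot(rope(q, 12), rope(k, 10))
```

The rotations on q and k compose to a single rotation by (m - n) inside the inner product, which is why s1 and s2 agree.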


Inference Optimization

KV Cache

In autoregressive generation, each new token attends to all previous tokens. Without caching, every generation step would recompute the keys and values for the entire prefix, costing O(T^2) work per step and O(T^3) over a generation of length T. The KV cache stores the key and value vectors from previous positions, so each step computes K/V only for the single new token and attends over the cache in O(T), for O(T^2) total attention.

Memory cost: for a model with L layers, model dimension d, and sequence length T, the cache holds 2 \times L \times T \times d values (2 for K and V). For a 70B-class model (80 layers, d = 8192) at 4K context, this is roughly 10 GB per sequence in fp16 before GQA reduces it.
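This cost is a one-line calculation. A sketch (the 80-layer, d = 8192 shape is a LLaMA-2-70B-like assumption):

```python
def kv_cache_bytes(n_layers, seq_len, kv_dim, bytes_per_elem=2, batch=1):
    """KV cache size in bytes: 2 (K and V) x layers x positions x KV
    dimension x bytes per element (2 for fp16) x batch size."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_elem * batch

# Full multi-head attention: every query head has its own K/V.
full = kv_cache_bytes(80, 4096, 8192)       # ~10.7 GB per sequence
# GQA with 8 KV heads for 64 query heads shrinks kv_dim by 8x.
gqa = kv_cache_bytes(80, 4096, 8192 // 8)   # ~1.3 GB per sequence
```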

Speculative Decoding

Use a small draft model to generate kk candidate tokens, then verify all kk in parallel with the large model. When the draft model’s tokens are accepted, this produces kk tokens in the time of one large-model forward pass. Acceptance rates of 70–90% are typical, yielding 2–3x speedup with no quality degradation.
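A greedy variant of the draft-then-verify loop can be sketched with toy next-token functions (full speculative sampling uses rejection sampling over the two models' distributions to preserve the output distribution exactly; this simplification keeps only the accept-longest-agreeing-prefix structure):

```python
def speculative_step(draft_next, target_next, context, k):
    """One round of greedy speculative decoding: the draft model
    proposes k tokens; the target model checks them in parallel and
    keeps the longest agreeing prefix plus its own next token.
    draft_next/target_next map a context tuple to the next token."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(tuple(ctx))
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposed:
        if target_next(tuple(ctx)) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # The target's own prediction at the first disagreement (or after
    # full acceptance) comes free from the same verification pass.
    accepted.append(target_next(tuple(ctx)))
    return accepted

# Toy models that happen to agree everywhere: all k drafts are
# accepted, yielding k + 1 tokens per large-model pass.
target = lambda ctx: "b" if len(ctx) % 2 else "a"
draft = lambda ctx: "b" if len(ctx) % 2 else "a"
out = speculative_step(draft, target, ("a",), k=3)
```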

Quantization

Reduce parameter precision from fp16/bf16 to int8 or int4:

  • Post-training quantization (PTQ): Quantize weights after training. GPTQ, AWQ achieve near-lossless int4 quantization for most models.
  • Quantization-aware training (QAT): Simulate quantization during training. Higher quality but requires retraining.

INT4 quantization reduces model size by 4x and increases throughput proportionally on hardware with INT4 support.
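The basic mechanics can be shown with symmetric per-tensor int8 quantization (a deliberate simplification; GPTQ and AWQ use per-group scales plus error compensation or activation-aware scaling):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: scale = max|w| / 127, then round."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.5, 0.33, 0.9, -0.07]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Rounding error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The same round-to-grid idea underlies int4; the finer the per-group scaling, the smaller the worst-case step and hence the error.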


Evaluation

LLM evaluation is notoriously difficult because the output space is open-ended.

Benchmark suites: MMLU (knowledge), GSM8K (math), HumanEval (code), HellaSwag (commonsense), TruthfulQA (factuality). These measure specific capabilities but don’t capture overall quality.

LLM-as-judge: Use a strong model to evaluate outputs of other models. Chatbot Arena uses pairwise comparisons with Elo ratings. This correlates well with human preferences but introduces the judge model’s own biases.

Perplexity: \text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right). Measures how well the model predicts held-out text; lower is better. Useful for comparing models with the same tokenizer, but not across different tokenizations.
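As a quick sanity check on the formula, a model that assigns probability 1/4 to every token of a held-out text has perplexity exactly 4:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Uniform probability 1/4 at every position -> PPL = 4, regardless
# of sequence length.
ppl = perplexity([math.log(0.25)] * 10)
```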


Summary

| Stage | What Happens | Key Innovation |
|---|---|---|
| Tokenization | Raw text → token IDs | BPE/SentencePiece for subword segmentation |
| Pretraining | Next-token prediction on web-scale data | Scaling laws determine optimal compute allocation |
| SFT | Fine-tune on instruction-response pairs | Teaches format and task completion |
| RLHF/DPO | Align with human preferences | KL-constrained reward maximization |
| Inference | Autoregressive generation with KV cache | Speculative decoding, quantization for speed |

The LLM stack is a pipeline: tokenization determines the model’s vocabulary, pretraining builds broad capabilities through next-token prediction, alignment shapes those capabilities into useful behavior, and inference optimization makes deployment tractable. Each stage involves fundamental tradeoffs (vocabulary size vs. sequence length, model size vs. data size, helpfulness vs. safety) that define the design space of modern language models.