Fine-Tuning and Parameter-Efficient Methods
Fine-tuning adapts a pretrained model to a specific task or domain. Full fine-tuning updates all parameters, which is increasingly impractical for billion-parameter models. Parameter-efficient fine-tuning (PEFT) methods achieve comparable performance by updating a small fraction of parameters, reducing compute, memory, and storage requirements.
Full Fine-Tuning
Standard fine-tuning initializes from pretrained weights and optimizes the full parameter set on task-specific data:

$$\theta^* = \arg\min_{\theta} \mathcal{L}_{\text{task}}(\theta), \qquad \theta_{\text{init}} = \theta_{\text{pretrained}}$$
When to full fine-tune:
- Sufficient task-specific data (>10K examples)
- The task distribution differs substantially from pretraining
- Maximum performance is required and compute budget permits
- The resulting model can be served independently
Practical considerations. For a 7B parameter model trained in bf16 with mixed precision, full fine-tuning requires ~14GB for parameters, ~14GB for gradients, ~28GB for an fp32 master copy of the weights, and ~56GB for optimizer states (Adam stores fp32 first and second moments), totaling ~112GB of GPU memory before activations. Gradient checkpointing and DeepSpeed ZeRO can reduce this and shard it across multiple GPUs.
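To make the accounting concrete, a minimal sketch of the standard ~16-bytes-per-parameter rule of thumb (activations and temporary buffers are excluded, so real usage is higher):

```python
# Rule-of-thumb memory accounting for mixed-precision full fine-tuning.

def training_memory_gb(n_params: float) -> dict[str, float]:
    gb = 1e9
    return {
        "params_bf16": n_params * 2 / gb,        # model weights in bf16
        "grads_bf16": n_params * 2 / gb,         # gradients in bf16
        "master_fp32": n_params * 4 / gb,        # fp32 master copy of weights
        "adam_moments_fp32": n_params * 8 / gb,  # first + second moments in fp32
    }

parts = training_memory_gb(7e9)
print(parts, "total:", sum(parts.values()))      # ~112 GB for a 7B model
```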
Low-Rank Adaptation (LoRA)
LoRA (Hu et al., 2021) is the dominant PEFT method. The core insight: the weight updates during fine-tuning have low intrinsic rank. Instead of updating a full weight matrix $W \in \mathbb{R}^{d \times k}$, decompose the update into two low-rank matrices:

$$\Delta W = BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.
Initialization. $A$ is initialized with a random Gaussian and $B$ with zeros, so $\Delta W = BA = 0$ at the start. This ensures the model begins from the exact pretrained weights.
Scaling. The update is scaled by $\alpha / r$:

$$h = Wx + \frac{\alpha}{r} BAx$$

where $\alpha$ is a hyperparameter (typically set to $r$ or $2r$). This scaling keeps the magnitude of the update roughly independent of the rank $r$, simplifying hyperparameter transfer across different rank settings.
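A minimal PyTorch sketch of this parameterization (`LoRALinear` is an illustrative name, not a library class; the Gaussian scale for $A$ is one common choice):

```python
import math

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze pretrained W (and bias)
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        # A ~ Gaussian, B = 0, so BA = 0 and training starts from W exactly
        self.A = nn.Parameter(torch.randn(r, k) / math.sqrt(r))
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scaling = alpha / r           # keeps update magnitude ~independent of r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 131072 = r * (d + k), ~0.78% of the 4096 x 4096 base weight
```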
Parameter reduction. For a weight matrix of size $d \times k$ with rank $r$:
- Full fine-tuning: $dk$ parameters
- LoRA: $r(d + k)$ parameters
- For $d = k = 4096$ and $r = 16$: $16 \times 8192 = 131{,}072$ vs. $4096^2 \approx 16.8\text{M}$ (0.78%)
Where to apply LoRA. Typically applied to the attention projection matrices ($W_q$, $W_k$, $W_v$, $W_o$). Applying LoRA to all linear layers (including the FFN) can improve performance at the cost of more trainable parameters. Empirically, the attention projections provide the best accuracy-per-parameter ratio.
Rank selection. Common values: $r \in \{4, 8, 16, 32, 64\}$. Higher rank increases capacity but also overfitting risk. For most tasks, $r = 8$ or $r = 16$ is a robust default. The optimal rank depends on the gap between the pretraining and task distributions: larger gaps require higher rank.
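As a concrete configuration, a sketch using the Hugging Face peft library (module names assume a Llama-style architecture; the checkpoint name and hyperparameter values are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=16,                    # rank: 8-16 is a robust default range
    lora_alpha=32,           # alpha = 2r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # typically ~0.1-1% of all parameters
```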
Merging and Serving
At inference time, LoRA adapters can be merged into the base weights: $W' = W + \frac{\alpha}{r} BA$. This adds zero inference latency. Alternatively, multiple LoRA adapters can be served simultaneously with a shared base model, switching adapters per request. This is the basis for multi-tenant serving.
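The merge itself is a single rank-$r$ update to the dense weight. A standalone sketch with illustrative shapes and values:

```python
import torch

# W' = W + (alpha / r) * B @ A
d, k, r, alpha = 4096, 4096, 16, 32
W = torch.randn(d, k)            # frozen pretrained weight
A = torch.randn(r, k) * 0.01     # trained adapter factors
B = torch.randn(d, r) * 0.01
W_merged = W + (alpha / r) * (B @ A)   # a single dense matrix again

# To swap adapters later, subtract the update back out:
W_restored = W_merged - (alpha / r) * (B @ A)
```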
QLoRA
QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization of the base model:
- Quantize the base model to 4-bit NormalFloat (NF4), a data type optimized for normally distributed weights
- Apply LoRA adapters in bf16/fp16
- Backpropagate gradients through the frozen quantized weights into the adapters, using double quantization (quantizing the quantization constants themselves) to further reduce memory
Memory reduction. A 65B model that requires ~130GB in fp16 fits in ~33GB with QLoRA, enabling fine-tuning on a single 48GB GPU. The quality loss from 4-bit quantization of the base model is minimal because the LoRA adapters compensate.
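One way to set this up with the transformers, bitsandbytes, and peft libraries (a sketch; exact argument names can vary across versions, and the checkpoint is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 base weights
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",         # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)   # e.g. casts norms, enables input grads
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```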
Paged optimizers. QLoRA uses paged optimizers (optimizer states allocated in unified memory and paged to CPU on demand) to absorb memory spikes during training, preventing out-of-memory errors that would otherwise occur with long sequences.
Adapter Layers
Houlsby et al. (2019) insert small bottleneck modules between transformer layers:

$$h \leftarrow h + W_{\text{up}}\, \sigma(W_{\text{down}} h)$$

where $W_{\text{down}} \in \mathbb{R}^{r \times d}$ projects to a low-dimensional bottleneck, $\sigma$ is a nonlinearity, and $W_{\text{up}} \in \mathbb{R}^{d \times r}$ projects back. Adapters add sequential computation (unlike LoRA, which is parallel), introducing a small inference latency. They have been largely superseded by LoRA in practice.
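A sketch of the bottleneck structure (the class name is illustrative; the near-zero init, in the spirit of Houlsby et al., makes the adapter start as a no-op):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # W_down
        self.act = nn.GELU()                         # sigma
        self.up = nn.Linear(bottleneck, d_model)     # W_up
        nn.init.zeros_(self.up.weight)               # start as a near-identity
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Runs in sequence after a sublayer, hence the small added latency
        return h + self.up(self.act(self.down(h)))
```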
Prefix Tuning and Prompt Tuning
Prefix tuning (Li and Liang, 2021) prepends learnable continuous vectors to the key and value sequences at each attention layer:

$$K' = [P_K; K], \qquad V' = [P_V; V]$$

where $P_K, P_V \in \mathbb{R}^{\ell \times d}$ are learnable prefix matrices of length $\ell$. Only the prefix parameters are trained; the rest of the model is frozen.
Prompt tuning (Lester et al., 2021) is a simplified version that prepends learnable embeddings only at the input layer. It uses fewer parameters but is less expressive. With sufficiently large models (>10B), prompt tuning approaches full fine-tuning performance.
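A minimal sketch of the prompt-tuning mechanics (the class name is illustrative; prefix tuning performs the analogous concatenation on $K$ and $V$ inside each attention layer):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt tuning sketch: learnable embeddings prepended at the input layer."""

    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq, d_model); the frozen LM consumes the result
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```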
Comparison
| Method | Trainable params | Memory | Inference overhead | Merging |
|---|---|---|---|---|
| Full fine-tuning | 100% | Full model + optimizer | None | N/A |
| LoRA | 0.1–1% | Base + adapters + optimizer for adapters | None (merged) | Yes |
| QLoRA | 0.1–1% | 4-bit base + bf16 adapters | None (merged) | Yes |
| Adapters | 0.5–5% | Base + adapters | Small (sequential) | No |
| Prefix tuning | <0.1% | Base + prefix | Small (longer sequence) | No |
| Prompt tuning | <0.01% | Base + soft prompts | Minimal | No |
When to Fine-Tune vs. Prompt
| Scenario | Recommended Approach |
|---|---|
| Task is well-represented in pretraining data | Few-shot prompting or prompt tuning |
| Specific output format required | SFT or LoRA on format examples |
| Domain-specific terminology or knowledge | LoRA with domain data |
| Maximum task performance needed | Full fine-tuning or high-rank LoRA |
| Limited labeled data (<100 examples) | Few-shot prompting, no fine-tuning |
| Serving multiple tasks from one model | Multiple LoRA adapters with shared base |
| Constrained GPU budget | QLoRA |
The general trend: As base models improve, the gap between prompting and fine-tuning narrows for standard tasks. Fine-tuning remains essential for domain adaptation, format control, and tasks where the pretraining distribution is a poor match.
Training Best Practices
Learning rate. PEFT methods use higher learning rates than full fine-tuning: 1e-4 to 3e-4 for LoRA (vs. 1e-5 to 5e-5 for full fine-tuning). The adapter parameters are trained from scratch over a much smaller space and tolerate larger steps than the pretrained weights do.
Epochs. 1–3 epochs for most tasks. Overfitting is the primary risk with PEFT on small datasets; monitor validation loss and use early stopping.
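An illustrative transformers Trainer setup reflecting these two points (`model`, `train_ds`, and `val_ds` are assumed to exist; argument names follow recent library versions):

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,              # PEFT range: 1e-4 to 3e-4
    num_train_epochs=3,
    evaluation_strategy="steps",     # renamed `eval_strategy` in newer versions
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```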
Data quality over quantity. For instruction tuning, 1K high-quality examples often outperform 100K noisy examples. Curate data carefully: remove duplicates, filter for quality, ensure diversity.
Evaluation. Always hold out a test set. For generation tasks, automated metrics (ROUGE, BERTScore) correlate weakly with quality; use LLM-as-judge or human evaluation on a sample.
Summary
| Concept | Key Insight |
|---|---|
| Full fine-tuning | Updates all parameters; maximum expressiveness, maximum cost |
| LoRA | Low-rank weight updates; 0.1–1% parameters, zero inference overhead |
| QLoRA | 4-bit base + LoRA; fine-tune 65B models on a single GPU |
| Adapter layers | Bottleneck modules; superseded by LoRA in practice |
| Prefix/prompt tuning | Learnable soft tokens; minimal parameters, weaker than LoRA |
The key insight across all PEFT methods: fine-tuning weight updates are low-dimensional relative to the full parameter space. Exploiting this structure enables adaptation of models that would otherwise be too large to fine-tune, democratizing access to task-specific LLMs.