Fine-Tuning and Parameter-Efficient Methods

Fine-tuning adapts a pretrained model to a specific task or domain. Full fine-tuning updates all parameters, which is increasingly impractical for billion-parameter models. Parameter-efficient fine-tuning (PEFT) methods achieve comparable performance by updating a small fraction of parameters, reducing compute, memory, and storage requirements.


Full Fine-Tuning

Standard fine-tuning initializes from pretrained weights $\theta_0$ and optimizes the full parameter set on task-specific data:

$$\theta^* = \arg\min_\theta \sum_{(x,y) \in \mathcal{D}_{\text{task}}} \mathcal{L}(f_\theta(x), y)$$
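A toy numpy sketch of this objective, where a least-squares model, random data, and mean squared error are hypothetical stand-ins for $f_\theta$, $\mathcal{D}_{\text{task}}$, and $\mathcal{L}$: starting from a "pretrained" $\theta_0$, gradient descent drives the parameters to the task optimum.

```python
import numpy as np

# Toy stand-in for full fine-tuning: every parameter in theta is updated.
rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(4,))            # "pretrained" parameters
X = rng.normal(size=(64, 4))               # task inputs
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true                         # task targets

theta = theta_0.copy()
lr = 0.05
for _ in range(500):
    # Gradient of mean squared error over the task dataset
    grad = 2 * X.T @ (X @ theta - y) / len(X)
    theta -= lr * grad

print(np.allclose(theta, theta_true, atol=1e-3))  # True: converged to the task optimum
```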

When to use full fine-tuning:

  • Sufficient task-specific data (>10K examples)
  • The task distribution differs substantially from pretraining
  • Maximum performance is required and compute budget permits
  • The resulting model can be served independently

Practical considerations. For a 7B parameter model trained in bf16 with Adam, the standard accounting is ~14GB for bf16 parameters, ~14GB for bf16 gradients, and ~84GB of fp32 optimizer state (a master copy of the weights plus Adam's first and second moments): roughly 16 bytes per parameter, or ~112GB of GPU memory before activations. Gradient checkpointing reduces activation memory, and DeepSpeed ZeRO shards parameters, gradients, and optimizer states across multiple GPUs.
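This back-of-envelope arithmetic can be checked in a few lines (a sketch assuming the standard mixed-precision layout of bf16 parameters and gradients plus fp32 master weights and Adam moments; real runs add activation memory and fragmentation on top):

```python
# Memory estimate for full fine-tuning a 7B model with Adam in bf16 mixed precision.
n_params = 7e9
gb = 1e9

params_bf16 = n_params * 2 / gb        # 14 GB: bf16 weights (2 bytes each)
grads_bf16 = n_params * 2 / gb         # 14 GB: bf16 gradients
master_fp32 = n_params * 4 / gb        # 28 GB: fp32 master copy of weights
adam_moments = n_params * 2 * 4 / gb   # 56 GB: fp32 first + second moments

total = params_bf16 + grads_bf16 + master_fp32 + adam_moments
print(f"{total:.0f} GB")  # -> 112 GB
```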


Low-Rank Adaptation (LoRA)

LoRA (Hu et al., 2021) is the dominant PEFT method. The core insight: the weight updates during fine-tuning have low intrinsic rank. Instead of updating a full weight matrix $\mathbf{W} \in \mathbb{R}^{d \times k}$, decompose the update into two low-rank matrices:

$$\mathbf{W}' = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$

where $\mathbf{B} \in \mathbb{R}^{d \times r}$, $\mathbf{A} \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

Initialization. $\mathbf{A}$ is initialized with a random Gaussian and $\mathbf{B}$ with zeros, so $\Delta\mathbf{W} = \mathbf{B}\mathbf{A} = 0$ at the start. This ensures the model begins from the exact pretrained weights.

Scaling. The update is scaled by $\alpha/r$:

$$\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \frac{\alpha}{r} \mathbf{B}\mathbf{A}\mathbf{x}$$

where $\alpha$ is a hyperparameter (typically set to $r$ or $2r$). This scaling keeps the magnitude of the update roughly independent of the rank $r$, simplifying hyperparameter transfer across different rank settings.
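A minimal numpy sketch of this forward pass (names and sizes are illustrative): the zero-initialized $\mathbf{B}$ makes the adapted layer match the frozen pretrained layer exactly at the start of training.

```python
import numpy as np

# LoRA linear layer sketch: frozen W0 plus scaled low-rank update (alpha/r) * B @ A.
rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 8, 16

W0 = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, random Gaussian init
B = np.zeros((d, r))                      # trainable, zero init

def lora_forward(x):
    # h = W0 x + (alpha / r) B A x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
print(np.allclose(lora_forward(x), W0 @ x))  # True: BA = 0 at initialization
```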

Parameter reduction. For a weight matrix of size $d \times k$ with rank $r$:

  • Full fine-tuning: $d \times k$ parameters
  • LoRA: $r \times (d + k)$ parameters
  • For $d = k = 4096$ and $r = 16$: $16.8\text{M} \to 131\text{K}$ (0.78%)
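The counts above in a few lines:

```python
# Parameter-count arithmetic for d = k = 4096, r = 16.
d = k = 4096
r = 16

full = d * k         # 16,777,216 (~16.8M) full fine-tuning parameters
lora = r * (d + k)   # 131,072 (~131K) LoRA parameters

print(f"{lora / full:.2%}")  # -> 0.78%
```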

Where to apply LoRA. Typically applied to the attention projection matrices ($\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$, $\mathbf{W}_O$). Applying it to all linear layers (including the FFN) can improve performance at the cost of more trainable parameters. Empirically, attention projections provide the best accuracy-per-parameter ratio.

Rank selection. Common values: $r \in \{4, 8, 16, 32, 64\}$. Higher rank increases capacity but also overfitting risk. For most tasks, $r = 16$ is a robust default. The optimal rank depends on the gap between the pretraining and task distributions: larger gaps call for higher rank.

Merging and Serving

At inference time, LoRA adapters can be merged into the base weights: $\mathbf{W}' = \mathbf{W}_0 + \frac{\alpha}{r}\mathbf{B}\mathbf{A}$. The merged model has the same shape as the base model, so this adds zero inference latency. Alternatively, multiple LoRA adapters can be served simultaneously with a shared base model, switching adapters per request. This is the basis for multi-tenant serving.
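A quick numpy check that merging is exact (illustrative sizes; a nonzero $\mathbf{B}$ stands in for a trained adapter):

```python
import numpy as np

# Folding the adapter into the base weight gives identical outputs with
# no extra matmul at inference time.
rng = np.random.default_rng(1)
d, k, r, alpha = 64, 32, 8, 16

W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))   # nonzero: pretend training has happened

W_merged = W0 + (alpha / r) * B @ A   # one-time merge

x = rng.normal(size=(k,))
adapter_out = W0 @ x + (alpha / r) * (B @ (A @ x))   # unmerged path
print(np.allclose(W_merged @ x, adapter_out))         # True
```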


QLoRA

QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization of the base model:

  1. Quantize $\mathbf{W}_0$ to 4-bit NormalFloat (NF4), a data type optimized for normally distributed weights
  2. Apply LoRA adapters in bf16/fp16
  3. Backpropagate through the frozen 4-bit weights (dequantized on the fly) into the adapters; double quantization (quantizing the per-block quantization constants themselves) saves additional memory
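A simplified numpy sketch of block-wise 4-bit quantization with per-block absmax constants. This uses uniform levels for clarity; actual NF4 instead places its 16 levels at quantiles of a standard normal distribution, and this sketch omits double quantization.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    # Split into blocks, store one absmax scale per block (the "quantization
    # constant"), and round each entry to a 4-bit integer in [-7, 7].
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    q = np.round(blocks / scales * 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q, scales, shape):
    # Reverse the mapping: integers back to (approximate) floats.
    return (q / 7.0 * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)

# Per-element error is bounded by half a quantization step: scale / 14.
print(np.abs(w - w_hat).max() <= s.max() / 14 + 1e-9)  # True
```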

Memory reduction. A 65B model that requires ~130GB in fp16 fits in ~33GB with QLoRA, enabling fine-tuning on a single 48GB GPU. The quality loss from 4-bit quantization of the base model is minimal because the LoRA adapters compensate.

Paged optimizers. QLoRA allocates optimizer states in paged (unified) memory so they can be offloaded to CPU RAM on demand, absorbing memory spikes during training and preventing out-of-memory errors that would otherwise occur with long sequences.


Adapter Layers

Houlsby et al. (2019) insert small bottleneck modules between transformer layers:

$$\mathbf{h} \leftarrow \mathbf{h} + f(\mathbf{h}\mathbf{W}_{\text{down}})\mathbf{W}_{\text{up}}$$

where $\mathbf{W}_{\text{down}} \in \mathbb{R}^{d \times r}$ projects to a low-dimensional bottleneck, $f$ is a nonlinearity, and $\mathbf{W}_{\text{up}} \in \mathbb{R}^{r \times d}$ projects back. Adapters add sequential computation (unlike LoRA, whose branch runs in parallel with the frozen layer), introducing a small inference latency. They have been largely superseded by LoRA in practice.
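A minimal numpy sketch of this module (illustrative names and sizes): zero-initializing $\mathbf{W}_{\text{up}}$ makes the adapter an identity at the start, mirroring the usual near-identity initialization.

```python
import numpy as np

# Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
rng = np.random.default_rng(0)
d, r = 64, 8

W_down = rng.normal(scale=0.02, size=(d, r))   # trainable down-projection
W_up = np.zeros((r, d))                        # trainable, zero init

def adapter(h):
    relu = lambda z: np.maximum(z, 0.0)
    return h + relu(h @ W_down) @ W_up         # residual around the bottleneck

h = rng.normal(size=(10, d))                   # a sequence of 10 hidden states
print(np.allclose(adapter(h), h))              # True: identity at initialization
```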


Prefix Tuning and Prompt Tuning

Prefix tuning (Li and Liang, 2021) prepends learnable continuous vectors to the key and value sequences at each attention layer:

$$\text{head}_i = \text{Attention}(\mathbf{Q}, [\mathbf{P}_K^{(i)}; \mathbf{K}], [\mathbf{P}_V^{(i)}; \mathbf{V}])$$

where $\mathbf{P}_K, \mathbf{P}_V \in \mathbb{R}^{l \times d_k}$ are learnable prefix matrices of length $l$. Only the prefix parameters are trained; the rest of the model is frozen.
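A numpy sketch of this mechanism for a single head (shapes and names are illustrative): learnable prefix rows are prepended to the keys and values, while queries come only from the actual tokens. In training, only the prefix matrices would receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, d_k = 4, 10, 16        # prefix length, sequence length, head dimension

P_K = rng.normal(scale=0.02, size=(l, d_k))   # learnable prefix keys
P_V = rng.normal(scale=0.02, size=(l, d_k))   # learnable prefix values
Q = rng.normal(size=(n, d_k))                  # from frozen projections
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

K_ext = np.concatenate([P_K, K], axis=0)       # (l + n, d_k)
V_ext = np.concatenate([P_V, V], axis=0)
attn = softmax(Q @ K_ext.T / np.sqrt(d_k))     # each query also attends to the prefix
head = attn @ V_ext
print(head.shape)  # (10, 16)
```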

Prompt tuning (Lester et al., 2021) is a simplified variant that prepends learnable embeddings only at the input layer. It uses fewer parameters but is less expressive. With sufficiently large models (>10B parameters), prompt tuning approaches full fine-tuning performance.


Comparison

| Method | Trainable params | Memory | Inference overhead | Merging |
|---|---|---|---|---|
| Full fine-tuning | 100% | Full model + optimizer | None | N/A |
| LoRA | 0.1-1% | Base + adapters + optimizer for adapters | None (merged) | Yes |
| QLoRA | 0.1-1% | 4-bit base + bf16 adapters | None (merged) | Yes |
| Adapters | 0.5-5% | Base + adapters | Small (sequential) | No |
| Prefix tuning | <0.1% | Base + prefix | Small (longer sequence) | No |
| Prompt tuning | <0.01% | Base + soft prompts | Minimal | No |

When to Fine-Tune vs. Prompt

| Scenario | Recommended Approach |
|---|---|
| Task is well-represented in pretraining data | Few-shot prompting or prompt tuning |
| Specific output format required | SFT or LoRA on format examples |
| Domain-specific terminology or knowledge | LoRA with domain data |
| Maximum task performance needed | Full fine-tuning or high-rank LoRA |
| Limited labeled data (<100 examples) | Few-shot prompting, no fine-tuning |
| Serving multiple tasks from one model | Multiple LoRA adapters with shared base |
| Constrained GPU budget | QLoRA |

The general trend: As base models improve, the gap between prompting and fine-tuning narrows for standard tasks. Fine-tuning remains essential for domain adaptation, format control, and tasks where the pretraining distribution is a poor match.


Training Best Practices

Learning rate. PEFT methods use higher learning rates than full fine-tuning: 1e-4 to 3e-4 for LoRA (vs. 1e-5 to 5e-5 for full fine-tuning). With only a small set of parameters trainable and the base model frozen, larger steps remain stable.

Epochs. 1-3 epochs for most tasks. Overfitting is the primary risk with PEFT on small datasets; monitor validation loss and use early stopping.

Data quality over quantity. For instruction tuning, 1K high-quality examples often outperform 100K noisy examples. Curate data carefully: remove duplicates, filter for quality, ensure diversity.

Evaluation. Always hold out a test set. For generation tasks, automated metrics (ROUGE, BERTScore) correlate weakly with quality; use LLM-as-judge or human evaluation on a sample.


Summary

| Concept | Key Insight |
|---|---|
| Full fine-tuning | Updates all parameters; maximum expressiveness, maximum cost |
| LoRA | Low-rank weight updates; 0.1-1% of parameters, zero inference overhead |
| QLoRA | 4-bit base + LoRA; fine-tune 65B models on a single GPU |
| Adapter layers | Bottleneck modules; superseded by LoRA in practice |
| Prefix/prompt tuning | Learnable soft tokens; minimal parameters, weaker than LoRA |

The key insight across all PEFT methods: fine-tuning weight updates are low-dimensional relative to the full parameter space. Exploiting this structure enables adaptation of models that would otherwise be too large to fine-tune, democratizing access to task-specific LLMs.