Fine-Tuning and Parameter-Efficient Methods

Fine-tuning adapts a pretrained model to a specific task or domain. Full fine-tuning updates all parameters, which is increasingly impractical for billion-parameter models. Parameter-efficient fine-tuning (PEFT) methods achieve comparable performance by updating a small fraction of parameters, reducing compute, memory, and storage requirements.


Full Fine-Tuning

Standard fine-tuning initializes from pretrained weights $\theta_0$ and optimizes the full parameter set on task-specific data:

$$\theta^* = \arg\min_\theta \sum_{(x,y) \in \mathcal{D}_{\text{task}}} \mathcal{L}(f_\theta(x), y)$$
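A toy numpy sketch of this objective, where a least-squares model, random data, and mean squared error are hypothetical stand-ins for $f_\theta$, $\mathcal{D}_{\text{task}}$, and $\mathcal{L}$: starting from a "pretrained" $\theta_0$, gradient descent drives the parameters to the task optimum.

```python
import numpy as np

# Toy stand-in for full fine-tuning: every parameter in theta is updated.
rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(4,))            # "pretrained" parameters
X = rng.normal(size=(64, 4))               # task inputs
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true                         # task targets

theta = theta_0.copy()
lr = 0.05
for _ in range(500):
    # Gradient of mean squared error over the task dataset
    grad = 2 * X.T @ (X @ theta - y) / len(X)
    theta -= lr * grad

print(np.allclose(theta, theta_true, atol=1e-3))  # True: converged to the task optimum
```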

When to use full fine-tuning:

  • Sufficient task-specific data (>10K examples)
  • The task distribution differs substantially from pretraining
  • Maximum performance is required and compute budget permits
  • The resulting model can be served independently

Practical considerations. For a 7B parameter model trained in bf16 with Adam, the standard accounting is ~14GB for bf16 parameters, ~14GB for bf16 gradients, and ~84GB of fp32 optimizer state (a master copy of the weights plus Adam's first and second moments): roughly 16 bytes per parameter, or ~112GB of GPU memory before activations. Gradient checkpointing reduces activation memory, and DeepSpeed ZeRO shards parameters, gradients, and optimizer states across multiple GPUs.
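This back-of-envelope arithmetic can be checked in a few lines (a sketch assuming the standard mixed-precision layout of bf16 parameters and gradients plus fp32 master weights and Adam moments; real runs add activation memory and fragmentation on top):

```python
# Memory estimate for full fine-tuning a 7B model with Adam in bf16 mixed precision.
n_params = 7e9
gb = 1e9

params_bf16 = n_params * 2 / gb        # 14 GB: bf16 weights (2 bytes each)
grads_bf16 = n_params * 2 / gb         # 14 GB: bf16 gradients
master_fp32 = n_params * 4 / gb        # 28 GB: fp32 master copy of weights
adam_moments = n_params * 2 * 4 / gb   # 56 GB: fp32 first + second moments

total = params_bf16 + grads_bf16 + master_fp32 + adam_moments
print(f"{total:.0f} GB")  # -> 112 GB
```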


Low-Rank Adaptation (LoRA)

LoRA (Hu et al., 2021) is the dominant PEFT method. The core insight: the weight updates during fine-tuning have low intrinsic rank. Instead of updating a full weight matrix $\mathbf{W} \in \mathbb{R}^{d \times k}$, decompose the update into two low-rank matrices:

$$\mathbf{W}' = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$

where $\mathbf{B} \in \mathbb{R}^{d \times r}$, $\mathbf{A} \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

Initialization. $\mathbf{A}$ is initialized with a random Gaussian and $\mathbf{B}$ with zeros, so $\Delta\mathbf{W} = \mathbf{B}\mathbf{A} = 0$ at the start. This ensures the model begins from the exact pretrained weights.

Scaling. The update is scaled by $\alpha/r$:

$$\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \frac{\alpha}{r} \mathbf{B}\mathbf{A}\mathbf{x}$$

where $\alpha$ is a hyperparameter (typically set to $r$ or $2r$). This scaling keeps the magnitude of the update roughly independent of the rank $r$, simplifying hyperparameter transfer across different rank settings.
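A minimal numpy sketch of this forward pass (names and sizes are illustrative): the zero-initialized $\mathbf{B}$ makes the adapted layer match the frozen pretrained layer exactly at the start of training.

```python
import numpy as np

# LoRA linear layer sketch: frozen W0 plus scaled low-rank update (alpha/r) * B @ A.
rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 8, 16

W0 = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, random Gaussian init
B = np.zeros((d, r))                      # trainable, zero init

def lora_forward(x):
    # h = W0 x + (alpha / r) B A x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
print(np.allclose(lora_forward(x), W0 @ x))  # True: BA = 0 at initialization
```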

Parameter reduction. For a weight matrix of size $d \times k$ with rank $r$:

  • Full fine-tuning: $d \times k$ parameters
  • LoRA: $r \times (d + k)$ parameters
  • For $d = k = 4096$ and $r = 16$: $16.8\text{M} \to 131\text{K}$ (0.78%)
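The counts above in a few lines:

```python
# Parameter-count arithmetic for d = k = 4096, r = 16.
d = k = 4096
r = 16

full = d * k         # 16,777,216 (~16.8M) full fine-tuning parameters
lora = r * (d + k)   # 131,072 (~131K) LoRA parameters

print(f"{lora / full:.2%}")  # -> 0.78%
```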

Where to apply LoRA. Typically applied to the attention projection matrices ($\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$, $\mathbf{W}_O$). Applying it to all linear layers (including the FFN) can improve performance at the cost of more trainable parameters. Empirically, attention projections provide the best accuracy-per-parameter ratio.

Rank selection. Common values: $r \in \{4, 8, 16, 32, 64\}$. Higher rank increases capacity but also overfitting risk. For most tasks, $r = 16$ is a robust default. The optimal rank depends on the gap between the pretraining and task distributions: larger gaps call for higher rank.

Merging and Serving

At inference time, LoRA adapters can be merged into the base weights: $\mathbf{W}' = \mathbf{W}_0 + \frac{\alpha}{r}\mathbf{B}\mathbf{A}$. The merged model has the same shape as the base model, so this adds zero inference latency. Alternatively, multiple LoRA adapters can be served simultaneously with a shared base model, switching adapters per request. This is the basis for multi-tenant serving.
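A quick numpy check that merging is exact (illustrative sizes; a nonzero $\mathbf{B}$ stands in for a trained adapter):

```python
import numpy as np

# Folding the adapter into the base weight gives identical outputs with
# no extra matmul at inference time.
rng = np.random.default_rng(1)
d, k, r, alpha = 64, 32, 8, 16

W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))   # nonzero: pretend training has happened

W_merged = W0 + (alpha / r) * B @ A   # one-time merge

x = rng.normal(size=(k,))
adapter_out = W0 @ x + (alpha / r) * (B @ (A @ x))   # unmerged path
print(np.allclose(W_merged @ x, adapter_out))         # True
```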


QLoRA

QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization of the base model:

  1. Quantize $\mathbf{W}_0$ to 4-bit NormalFloat (NF4), a data type optimized for normally distributed weights
  2. Apply LoRA adapters in bf16/fp16
  3. Backpropagate through the frozen 4-bit weights (dequantized on the fly) into the adapters; double quantization (quantizing the per-block quantization constants themselves) saves additional memory
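A simplified numpy sketch of block-wise 4-bit quantization with per-block absmax constants. This uses uniform levels for clarity; actual NF4 instead places its 16 levels at quantiles of a standard normal distribution, and this sketch omits double quantization.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    # Split into blocks, store one absmax scale per block (the "quantization
    # constant"), and round each entry to a 4-bit integer in [-7, 7].
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    q = np.round(blocks / scales * 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q, scales, shape):
    # Reverse the mapping: integers back to (approximate) floats.
    return (q / 7.0 * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)

# Per-element error is bounded by half a quantization step: scale / 14.
print(np.abs(w - w_hat).max() <= s.max() / 14 + 1e-9)  # True
```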

Memory reduction. A 65B model that requires ~130GB in fp16 fits in ~33GB with QLoRA, enabling fine-tuning on a single 48GB GPU. The quality loss from 4-bit quantization of the base model is minimal because the LoRA adapters compensate.

Paged optimizers. QLoRA allocates optimizer states in paged (unified) memory so they can be offloaded to CPU RAM on demand, absorbing memory spikes during training and preventing out-of-memory errors that would otherwise occur with long sequences.


Adapter Layers

Houlsby et al. (2019) insert small bottleneck modules between transformer layers:

$$\mathbf{h} \leftarrow \mathbf{h} + f(\mathbf{h}\mathbf{W}_{\text{down}})\mathbf{W}_{\text{up}}$$

where $\mathbf{W}_{\text{down}} \in \mathbb{R}^{d \times r}$ projects to a low-dimensional bottleneck, $f$ is a nonlinearity, and $\mathbf{W}_{\text{up}} \in \mathbb{R}^{r \times d}$ projects back. Adapters add sequential computation (unlike LoRA, whose branch runs in parallel with the frozen layer), introducing a small inference latency. They have been largely superseded by LoRA in practice.
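A minimal numpy sketch of this module (illustrative names and sizes): zero-initializing $\mathbf{W}_{\text{up}}$ makes the adapter an identity at the start, mirroring the usual near-identity initialization.

```python
import numpy as np

# Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
rng = np.random.default_rng(0)
d, r = 64, 8

W_down = rng.normal(scale=0.02, size=(d, r))   # trainable down-projection
W_up = np.zeros((r, d))                        # trainable, zero init

def adapter(h):
    relu = lambda z: np.maximum(z, 0.0)
    return h + relu(h @ W_down) @ W_up         # residual around the bottleneck

h = rng.normal(size=(10, d))                   # a sequence of 10 hidden states
print(np.allclose(adapter(h), h))              # True: identity at initialization
```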


Prefix Tuning and Prompt Tuning

Prefix tuning (Li and Liang, 2021) prepends learnable continuous vectors to the key and value sequences at each attention layer:

$$\text{head}_i = \text{Attention}(\mathbf{Q}, [\mathbf{P}_K^{(i)}; \mathbf{K}], [\mathbf{P}_V^{(i)}; \mathbf{V}])$$

where $\mathbf{P}_K, \mathbf{P}_V \in \mathbb{R}^{l \times d_k}$ are learnable prefix matrices of length $l$. Only the prefix parameters are trained; the rest of the model is frozen.
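A numpy sketch of this mechanism for a single head (shapes and names are illustrative): learnable prefix rows are prepended to the keys and values, while queries come only from the actual tokens. In training, only the prefix matrices would receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, d_k = 4, 10, 16        # prefix length, sequence length, head dimension

P_K = rng.normal(scale=0.02, size=(l, d_k))   # learnable prefix keys
P_V = rng.normal(scale=0.02, size=(l, d_k))   # learnable prefix values
Q = rng.normal(size=(n, d_k))                  # from frozen projections
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

K_ext = np.concatenate([P_K, K], axis=0)       # (l + n, d_k)
V_ext = np.concatenate([P_V, V], axis=0)
attn = softmax(Q @ K_ext.T / np.sqrt(d_k))     # each query also attends to the prefix
head = attn @ V_ext
print(head.shape)  # (10, 16)
```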

Prompt tuning (Lester et al., 2021) is a simplified variant that prepends learnable embeddings only at the input layer. It uses fewer parameters but is less expressive. With sufficiently large models (>10B parameters), prompt tuning approaches full fine-tuning performance.


Comparison

| Method | Trainable params | Memory | Inference overhead | Merging |
|---|---|---|---|---|
| Full fine-tuning | 100% | Full model + optimizer | None | N/A |
| LoRA | 0.1-1% | Base + adapters + optimizer for adapters | None (merged) | Yes |
| QLoRA | 0.1-1% | 4-bit base + bf16 adapters | None (merged) | Yes |
| Adapters | 0.5-5% | Base + adapters | Small (sequential) | No |
| Prefix tuning | <0.1% | Base + prefix | Small (longer sequence) | No |
| Prompt tuning | <0.01% | Base + soft prompts | Minimal | No |

When to Fine-Tune vs. Prompt

| Scenario | Recommended Approach |
|---|---|
| Task is well-represented in pretraining data | Few-shot prompting or prompt tuning |
| Specific output format required | SFT or LoRA on format examples |
| Domain-specific terminology or knowledge | LoRA with domain data |
| Maximum task performance needed | Full fine-tuning or high-rank LoRA |
| Limited labeled data (<100 examples) | Few-shot prompting, no fine-tuning |
| Serving multiple tasks from one model | Multiple LoRA adapters with shared base |
| Constrained GPU budget | QLoRA |

The general trend: As base models improve, the gap between prompting and fine-tuning narrows for standard tasks. Fine-tuning remains essential for domain adaptation, format control, and tasks where the pretraining distribution is a poor match.


Training Best Practices

Learning rate. PEFT methods use higher learning rates than full fine-tuning: 1e-4 to 3e-4 for LoRA (vs. 1e-5 to 5e-5 for full fine-tuning). With only a small set of parameters trainable and the base model frozen, larger steps remain stable.

Epochs. 1-3 epochs for most tasks. Overfitting is the primary risk with PEFT on small datasets; monitor validation loss and use early stopping.

Data quality over quantity. For instruction tuning, 1K high-quality examples often outperform 100K noisy examples. Curate data carefully: remove duplicates, filter for quality, ensure diversity.

Evaluation. Always hold out a test set. For generation tasks, automated metrics (ROUGE, BERTScore) correlate weakly with quality; use LLM-as-judge or human evaluation on a sample.


Summary

| Concept | Key Insight |
|---|---|
| Full fine-tuning | Updates all parameters; maximum expressiveness, maximum cost |
| LoRA | Low-rank weight updates; 0.1-1% of parameters, zero inference overhead |
| QLoRA | 4-bit base + LoRA; fine-tune 65B models on a single GPU |
| Adapter layers | Bottleneck modules; superseded by LoRA in practice |
| Prefix/prompt tuning | Learnable soft tokens; minimal parameters, weaker than LoRA |

The key insight across all PEFT methods: fine-tuning weight updates are low-dimensional relative to the full parameter space. Exploiting this structure enables adaptation of models that would otherwise be too large to fine-tune, democratizing access to task-specific LLMs.