Parameter Efficient Fine-Tuning

263-5354-00L — Large Language Models · ETH Zürich · Mrinmaya Sachan
Section 01

Full Fine-Tuning & Its Limitations

The standard way to adapt a pretrained language model to a downstream task is full fine-tuning: attach a small task-specific head (e.g., a linear classifier of dimension k, the number of classes) on top of the pretrained encoder, and update all parameters of the network by backpropagating gradients from the task loss. While this strategy achieves strong results, it comes with several serious drawbacks as models grow in size.

Problem 1 — Computational & Storage Cost

Full fine-tuning requires computing and storing gradients for every parameter in the network. When a model like GPT-3 has 175 billion parameters, this becomes extremely expensive. Worse, if we want to deploy the model on multiple tasks simultaneously, we must make a separate copy of the entire model for each task. In the extreme case of personalization — one model per user — this is clearly infeasible.

Problem 2 — Overfitting on Small Datasets

Fine-tuning all parameters of a massive model on a small labeled dataset can easily lead to overfitting, since the model has far more capacity than needed for the limited training signal.

Problem 3 — Catastrophic Forgetting

When all weights are updated, the model may catastrophically forget the knowledge it acquired during pretraining. After fine-tuning, the model might not only fail on other tasks — it may become unusable even for tasks it previously handled well.

Key Question

Can we find a way to adapt the model just a little bit — retaining most of the original model — while also making fine-tuning for multiple tasks more space-efficient?

This motivates Parameter Efficient Fine-Tuning (PEFT): the idea of fine-tuning only a small subset of parameters for each task, thereby alleviating storage costs, mitigating catastrophic forgetting, and improving performance on small datasets. The lecture covers four main families of PEFT methods: partial fine-tuning, adapters, prefix tuning, and LoRA.

Section 02

Partial Fine-Tuning

The simplest form of parameter-efficient fine-tuning is to select a subset of the existing model parameters to update while keeping the rest frozen. This family of methods is also known as specification-based tuning (Ding et al., 2022). No new parameters are introduced — instead, certain existing parameters are designated as trainable based on heuristics or learned criteria.

Heuristic Specification: Fine-Tuning Top Layers

The most straightforward approach is to freeze most of the model and update only the final few layers. Lee et al. (2019) showed that fine-tuning only the final fourth of the layers of BERT and RoBERTa achieves approximately 90% of the performance of full fine-tuning. The intuition is that lower layers capture general linguistic features while upper layers encode more task-specific representations.
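A minimal PyTorch sketch of this heuristic, using a toy 12-layer stack as a stand-in for BERT's encoder (all names and dimensions here are illustrative):

```python
import torch.nn as nn

# Toy "encoder": 12 identical layers standing in for BERT's Transformer stack.
encoder = nn.ModuleList([nn.Linear(64, 64) for _ in range(12)])
head = nn.Linear(64, 3)  # task-specific head (k = 3 classes), always trainable

k_layers = 3  # train only the top fourth of the 12 layers (cf. Lee et al., 2019)
for i, layer in enumerate(encoder):
    trainable = i >= len(encoder) - k_layers
    for p in layer.parameters():
        p.requires_grad = trainable

n_trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in encoder.parameters())
print(f"trainable encoder params: {n_trainable}/{n_total}")  # 12480/49920
```

Only parameters with `requires_grad=True` receive gradients, so in practice the optimizer is built from `filter(lambda p: p.requires_grad, model.parameters())`.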

2.1 — BitFit

BitFit (Zaken et al., 2022)

BitFit takes the heuristic approach further by freezing all weight matrices and optimizing only the bias terms inside the model. Bias terms appear in both the attention mechanism and the MLP feedforward layers of a Transformer. The equations below show where the bias terms reside inside each Transformer sublayer:

Attention projections:
  Q(x) = Wq x + bq    K(x) = Wk x + bk    V(x) = Wv x + bv

MLP / feedforward sublayer:
  h2  = Dropout(W1 · h1 + b1)
  h3  = gLN1 ⊙ (h2 + x − μ) / σ + bLN1
  h4  = GELU(W2 · h3 + b2)
  h5  = Dropout(W3 · h4 + b3)
  out = gLN2 ⊙ (h5 + h3 − μ) / σ + bLN2

where:
• bq (query bias) and b2 (middle-of-MLP bias) are the two bias terms BitFit updates
• all other bias terms (bk, bv, b1, b3, bLN1, bLN2) are frozen
• Wq, Wk, Wv, W1, W2, W3 and the LayerNorm gains gLN1, gLN2 are frozen

BitFit fine-tunes only two key bias components — the query bias and the middle-of-MLP bias — amounting to just half of the bias parameters in the model and only 0.04% of all model parameters. Despite this extreme reduction, BitFit reproduces over 95% of full fine-tuning performance on several benchmarks.
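In code, the BitFit recipe reduces to freezing every parameter whose name does not end in `bias`. A PyTorch sketch (a full BitFit run would further restrict training to the query bias and the middle-of-MLP bias):

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """Freeze all weights; leave only bias terms trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")

# Toy model; the principle is identical for a full Transformer.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
apply_bitfit(model)
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_trainable)  # 36 bias parameters (32 + 4) out of 676 total
```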

Insight

Empirical results show that even using a small random set of parameters for tuning can yield passable results on benchmarks like GLUE. This suggests the model's pretrained representations are already very powerful — only minimal adaptation is needed. However, different bias terms have different functionalities during adaptation, and the trick has only been validated on smaller-scale models.

2.2 — Diff Pruning

Diff Pruning (Guo et al., 2021)

Rather than manually or heuristically choosing which parameters to update, Diff Pruning learns which parts of the network to modify. It reparameterizes the fine-tuned model parameters as the sum of the pretrained parameters and a sparse update vector:

θFT = θLM + δDiff

where:
• θFT = fine-tuned model parameters
• θLM = original pretrained language model parameters
• δDiff = sparse difference vector (the learned update)

The key challenge is to encourage δDiff to be as sparse as possible. This is achieved by regularizing the fine-tuning objective with the L0-norm (or a differentiable approximation thereof) of the update vector.
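The reparameterization can be sketched as follows (PyTorch; the paper uses a hard-concrete relaxation of the L0 norm, for which a plain sigmoid gate and a placeholder task loss stand in here):

```python
import torch

theta_lm = torch.randn(1000)                               # pretrained params (frozen)
w = torch.zeros(1000, requires_grad=True)                  # magnitude of the update
log_alpha = torch.full((1000,), -3.0, requires_grad=True)  # gate logits

z = torch.sigmoid(log_alpha)        # relaxed binary mask in (0, 1)
delta = z * w                       # sparse-by-construction difference vector
theta_ft = theta_lm + delta         # fine-tuned parameters: theta_lm + delta

l0_penalty = z.sum()                # differentiable proxy for ||delta||_0
task_loss = (theta_ft ** 2).mean()  # placeholder for the real task loss
loss = task_loss + 1e-3 * l0_penalty
loss.backward()                     # gradients flow to w and log_alpha only
```

Only `w` and `log_alpha` receive gradients; the pretrained vector stays untouched, matching θFT = θLM + δDiff.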

Limitation

Diff pruning introduces new parameters during the learning phase, which means it actually consumes more GPU memory than full fine-tuning. This limits its applicability to very large language models.

Section 03

Adapter Tuning

Instead of selecting existing parameters to fine-tune, adapter tuning takes a different approach: keep the entire pretrained model frozen and insert small, trainable modules — called adapters — into the model architecture. Only these new adapter parameters are updated during fine-tuning, and only these need to be stored per task.

Architecture

The standard approach (Houlsby et al., 2019) places a two-layer feedforward neural network with a bottleneck after each sublayer of the Transformer — both after the multi-head attention sublayer and after the feed-forward sublayer. The adapter first projects the hidden representation down to a smaller dimension, applies a nonlinearity, and then projects back up to the original dimension. A skip connection is added around the adapter:

h ← h + f(h Wdown) Wup

where:
• h = hidden state (input and output of the adapter)
• Wdown ∈ ℝd×m = down-projection matrix (d → m, with m ≪ d)
• Wup ∈ ℝm×d = up-projection matrix (m → d)
• f = nonlinear activation function (e.g., ReLU)
• m = adapter size (bottleneck dimension), a hyperparameter

The bottleneck dimension m (the "adapter size") controls the trade-off between capacity and efficiency. A smaller bottleneck means fewer parameters but potentially less expressive power.
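The adapter equation translates directly into a small PyTorch module. A sketch: zero-initializing the up-projection makes the adapter an exact identity at the start of training, a simplification of the near-identity initialization in Houlsby et al.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: h <- h + f(h Wdown) Wup, with skip connection."""
    def __init__(self, d: int, m: int):
        super().__init__()
        self.down = nn.Linear(d, m)  # d -> m, with m << d
        self.up = nn.Linear(m, d)    # m -> d
        self.act = nn.ReLU()
        nn.init.zeros_(self.up.weight)  # start as the identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

adapter = Adapter(d=768, m=64)  # m is the bottleneck hyperparameter
h = torch.randn(2, 10, 768)
out = adapter(h)                # same shape as the input
```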

Results on GLUE Benchmark

Evaluated on the GLUE benchmark with BERT Large, adapters achieve impressive results:

Comparable performance: Adapters achieved a mean GLUE score of 80.0, compared to 80.4 for full fine-tuning.
Dataset-dependent adapter size: The optimal bottleneck dimension varies — 256 was chosen for MNLI (a large dataset), while 8 was sufficient for RTE (a smaller dataset).
Graceful degradation: Restricting adapter size to 64 across all tasks led to only a small decrease in accuracy (79.6).
Dramatic storage savings: To solve all 9 GLUE tasks, full fine-tuning requires 9× the total BERT parameters. Adapters require only 1.3× parameters (~3.6% of parameters per task).

3.1 — MAD-X: Cross-Lingual Transfer

MAD-X Framework (Pfeiffer et al., 2020)

The MAD-X (Multiple ADapters for Cross-lingual transfer) framework demonstrates a powerful application of adapters to cross-lingual NLP. Multilingual models like mBERT and XLM-R enable cross-lingual transfer, but a single model has limited capacity — it cannot cover unlimited languages, and better performance on low-resource languages often hurts high-resource performance. MAD-X addresses this by learning separate adapters for each language and task.

The framework comprises three types of adapters:

1. Language Adapters — One adapter per language, trained using masked language modeling (MLM). These capture language-specific knowledge while keeping the multilingual model frozen.

2. Task Adapters — Stacked on top of language adapters to capture task-specific knowledge. During task fine-tuning, language adapters are frozen and only task adapters are trained.

3. Invertible Adapters — Placed on top of the embedding layer to bridge the mismatch between the pretrained multilingual vocabulary and the target language's vocabulary. Since input and output embeddings are tied, the inverses of these adapters are placed before the output embedding layer.

The invertible adapter splits an input embedding vector e into two equal halves and applies coupled transformations:

Forward pass:
  o1 = F(e2) + e1
  o2 = G(o1) + e2
  o = [o1, o2]

Inverse (for the output layer):
  e2 = o2 − G(o1)
  e1 = o1 − F(e2)
  e = [e1, e2]

where:
  F(x) = ReLU(x Wdown,F) Wup,F
  G(x) = ReLU(x Wdown,G) Wup,G

• F, G = nonlinear transformations with the same structure as standard adapters, minus the residual connection
• Invertible adapters are trained together with language adapters using MLM and are frozen during task fine-tuning

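The coupling structure makes inversion exact, which a short sketch verifies (PyTorch; dimensions are illustrative):

```python
import torch
import torch.nn as nn

def coupling_fn(d_half: int, m: int) -> nn.Sequential:
    # Adapter-style transform without the residual: ReLU(x Wdown) Wup
    return nn.Sequential(nn.Linear(d_half, m, bias=False), nn.ReLU(),
                         nn.Linear(m, d_half, bias=False))

class InvertibleAdapter(nn.Module):
    def __init__(self, d: int, m: int = 16):
        super().__init__()
        self.F = coupling_fn(d // 2, m)
        self.G = coupling_fn(d // 2, m)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        e1, e2 = e.chunk(2, dim=-1)  # split the embedding into two halves
        o1 = self.F(e2) + e1
        o2 = self.G(o1) + e2
        return torch.cat([o1, o2], dim=-1)

    def inverse(self, o: torch.Tensor) -> torch.Tensor:
        o1, o2 = o.chunk(2, dim=-1)
        e2 = o2 - self.G(o1)         # undo the second coupling
        e1 = o1 - self.F(e2)         # undo the first coupling
        return torch.cat([e1, e2], dim=-1)

ia = InvertibleAdapter(d=32)
e = torch.randn(4, 32)
roundtrip = ia.inverse(ia(e))        # recovers e exactly, by construction
```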
Section 04

Prefix Tuning

Prefix tuning (Li & Liang, 2021) is a PEFT method with roots in prompting. While standard fine-tuning updates all Transformer parameters and requires storing a full model copy for each task, prefix tuning takes a fundamentally different approach: it freezes all Transformer parameters and instead optimizes a set of continuous, task-specific prefix vectors that are prepended to the key and value matrices at every layer of the model.

Key Idea

For each task, a sequence of trainable "prefix" vectors is prepended to the hidden states. Only these prefix parameters need to be stored per task, making prefix tuning highly modular and space-efficient. The rest of the model is shared and frozen.

Implementation Details

The prefix activations for all layers are learned. For an autoregressive LM, one set of prefix tokens is used (the number of prefix tokens is a hyperparameter). For an encoder-decoder model, two sets of prefixes are used (one for the encoder, one for the decoder). These prefix activations are drawn from a trainable matrix Pθ, while the remaining activations in the model are computed normally by the Transformer.

The training objective is the standard log-likelihood, but only the prefix parameters are optimized:

maxθ log P(y | x; ϕ, θ) = maxθ Σi log P(yi | h<i; ϕ, θ)

where:
• θ = trainable prefix parameters
• ϕ = frozen pretrained LM parameters
• hi = [hi(1); … ; hi(n)] = concatenation of the activations of all n layers at time step i
• If time step i is within the prefix: hi = Pθ[i, :] (read directly from the trainable prefix matrix)
• Otherwise: hi is computed as usual by the pretrained LM
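In implementation terms, the prefix amounts to extra key/value pairs that every query can attend to. A single-head sketch (PyTorch; real prefix tuning learns separate prefixes for every layer, and the dimensions here are illustrative):

```python
import torch
import torch.nn.functional as F

d, n_prefix, seq_len = 64, 5, 10
P_k = torch.randn(n_prefix, d, requires_grad=True)  # trainable prefix keys
P_v = torch.randn(n_prefix, d, requires_grad=True)  # trainable prefix values

# Queries/keys/values produced by the frozen LM for the actual input:
q, k, v = (torch.randn(seq_len, d) for _ in range(3))

k_full = torch.cat([P_k, k], dim=0)  # (n_prefix + seq_len, d)
v_full = torch.cat([P_v, v], dim=0)
attn = F.softmax(q @ k_full.T / d ** 0.5, dim=-1)
out = attn @ v_full  # (seq_len, d): each position attends over prefix + input
```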

Reparameterization Trick

In practice, directly updating the Pθ parameters leads to unstable optimization and a slight drop in performance. To address this, the prefix matrix is reparameterized through a smaller matrix Pθ' composed with a large feedforward neural network:

Pθ[i, :] = MLPθ(P'θ[i, :])

where:
• P'θ = smaller trainable matrix
• MLPθ = feedforward network used for the reparameterization
• This adds parameters during training but stabilizes optimization
• Once training is complete, the MLP can be dropped and only Pθ needs to be saved
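A sketch of the trick (PyTorch; all sizes are illustrative):

```python
import torch
import torch.nn as nn

n_prefix, d_small, d_model = 10, 64, 768
P_small = nn.Parameter(torch.randn(n_prefix, d_small))  # the smaller matrix P'
reparam = nn.Sequential(nn.Linear(d_small, 512), nn.Tanh(),
                        nn.Linear(512, d_model))        # reparameterization MLP

P = reparam(P_small)  # full prefix matrix, shape (n_prefix, d_model)
# ... optimize P_small and reparam against the task loss ...
P_final = P.detach()  # after training: drop the MLP and save only the prefix
```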

Results

With only 0.1% of the model's parameters, prefix tuning outperforms other lightweight baselines and achieves performance comparable to full fine-tuning. This is remarkable efficiency — the task-specific storage for each new task is negligible compared to storing an entire model copy.

Limitations

Prefix tuning can be difficult to optimize (hence the need for reparameterization). Additionally, the prefix tokens consume part of the model's sequence length, which reduces the length available for processing actual task input. This can be problematic for tasks requiring long context windows.

Section 05

LoRA — Low-Rank Adaptation

While bottleneck adapters are effective, they introduce additional computational overhead through extra sequential layers that must be processed at every forward pass. This latency cannot be mitigated through model parallelism due to the sequential processing requirement. To tackle this, Hu et al. (2022) introduced LoRA (Low-Rank Adaptation), a method that leverages trainable low-rank matrices to efficiently approximate weight updates without adding inference latency.

Motivation

Standard fine-tuning maximizes the log-likelihood of the fine-tuning data by updating all model weights:

Standard fine-tuning:
  maxΦ Σ(x,y)∈D Σt=1…|y| log PΦ(yt | x, y<t)
  Weights after fine-tuning: Φ0 + ΔΦ, with |ΔΦ| = |Φ0| — as large as the entire model!

LoRA's approach:
  New weights: Φ0 + ΔΦ(Θ), with |Θ| ≪ |Φ0| — the update is parameterized by far fewer parameters

The key insight is to parameterize the weight update ΔW with a low-rank factorization, exploiting the hypothesis that the necessary weight changes during fine-tuning lie in a low-dimensional subspace.

Architecture

For each weight matrix W0 ∈ ℝd×k in the pretrained model (such as the query or value projection matrices in multi-head attention), LoRA decomposes the weight update as:

ΔW = B A

where:
• W0 ∈ ℝd×k = original pretrained weight matrix (frozen during training)
• B ∈ ℝd×r = low-rank "up" matrix
• A ∈ ℝr×k = low-rank "down" matrix
• r = rank of the decomposition, with r ≪ min(d, k)
• Only A and B are trainable; W0 remains unchanged

Forward Pass

Given an input x to a layer, the output h is computed as follows:

Before LoRA (standard):
  h = W0 x

After LoRA:
  h = W0 x + (α/r) · ΔW x = W0 x + (α/r) · B A x

where:
• α = scaling factor (tuning it is akin to adjusting the learning rate)
• α/r = coefficient that controls the magnitude of the low-rank update
• The scaling keeps the update from overpowering the pretrained weights
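The whole layer fits in a few lines. A PyTorch sketch, where W0 stands in for a pretrained projection and r, α are hyperparameters:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + (alpha/r) * B A x, with W0 frozen and only A, B trainable."""
    def __init__(self, d: int, k: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)  # pretrained weight (frozen)
        self.W0.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so BA = 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W0(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d=64, k=64)
x = torch.randn(3, 64)
out = layer(x)  # identical to W0 x at initialization, since B = 0
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```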

Initialization

The initialization strategy is carefully designed to ensure training starts from the pretrained model's behavior:

B is initialized to zero — this means ΔW = BA = 0 at the start of training, so the model begins with exactly the pretrained weights.
A is initialized with random Gaussian values — sampled from 𝒩(0, σ²), providing the initial "direction" for adaptation.

Key Insight

Because B = 0 at initialization, the model starts at exactly the pretrained solution and gradually learns task-specific adjustments. This is a crucial design choice that preserves the pretrained model's performance from the very first training step.

Advantages of LoRA

1. No additional inference latency. Unlike adapters, LoRA does not add sequential depth to the model. At deployment time, the low-rank matrices can be merged directly into the pretrained weight: Wdeployed = W0 + BA. The model then runs at the same speed as the original.

2. Flexibility. LoRA can be applied to any weight matrix in the Transformer — attention projections (Wq, Wk, Wv), feed-forward layers, or other components. Practitioners can target whichever parts most affect their task.

3. Efficiency. Only a small fraction of parameters is trainable, which dramatically reduces memory requirements for gradient computation and optimizer states.

4. Strong performance. Empirical results show LoRA can match or even exceed the performance of full fine-tuning on downstream tasks — with better accuracy, fewer trainable parameters, and lower storage requirements.

5. Preserved generalization. Despite the reduced parameter count, LoRA maintains (or enhances) the model's capacity for generalization, which is crucial when applying fine-tuned models to scenarios or datasets that differ from the training data.
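The merge in advantage 1 is easy to verify numerically: folding BA into W0 changes nothing about the layer's output (PyTorch, toy dimensions):

```python
import torch

d, k, r, alpha = 64, 64, 8, 16.0
W0 = torch.randn(d, k)        # pretrained weight
A = torch.randn(r, k) * 0.01  # learned low-rank factors
B = torch.randn(d, r)
scale = alpha / r

W_deployed = W0 + scale * (B @ A)  # one-time merge at deployment

x = torch.randn(5, k)
unmerged = x @ W0.T + scale * ((x @ A.T) @ B.T)  # training-time compute path
merged = x @ W_deployed.T                        # single matmul, no extra latency
```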

Section 06

Comparing PEFT Methods

Benefits of LoRA over Adapters

Adapter modules introduce high inference latency because adapter layers must be processed sequentially. This additional depth requires more synchronous GPU operations, and the overhead cannot be mitigated through model parallelism. LoRA avoids this entirely — it does not introduce any additional depth, and its low-rank updates can be folded into the base weights at deployment time.

Benefits of LoRA over Prefix Tuning

Prefix tuning faces two notable challenges that LoRA avoids. First, it is difficult to optimize (requiring reparameterization tricks). Second, the prefix tokens consume part of the model's sequence length, reducing the context window available for actual task input. LoRA has neither issue.

Method Comparison

Method        | % Params           | New Params?           | Inference Cost             | Key Trade-off
Top Layer FT  | ~25%               | No                    | Same                       | Simple, but ~90% of full FT performance
BitFit        | 0.04%              | No                    | Same                       | Extremely few params; ~95% of full FT
Diff Pruning  | Sparse             | Yes (during training) | Same                       | Learned sparsity; more GPU memory than full FT
Adapters      | ~3.6%              | Yes                   | Higher (sequential layers) | Strong performance; adds inference latency
Prefix Tuning | ~0.1%              | Yes                   | Same                       | Very few params; hard to optimize; shortens usable sequence length
LoRA          | Small (rank-dep.)  | Yes (mergeable)       | Same (after merging)       | No extra latency; strong performance; flexible

Empirical Overview

A comparison of PEFT methods applied to T0-3B (Sanh et al., 2022) shows that methods like LoRA, (IA)³, and standard adapters all achieve competitive accuracy while updating far fewer parameters than full fine-tuning. Notably, LoRA and (IA)³ achieve the strongest performance among PEFT methods, approaching or matching full fine-tuning accuracy with orders of magnitude fewer trainable parameters.

Section 07

Summary & Key Takeaways

Parameter Efficient Fine-Tuning addresses the fundamental challenge of adapting massive pretrained language models to specific tasks without incurring prohibitive computational, storage, and performance costs. The lecture covered a progression of increasingly sophisticated methods:

1. Partial Fine-Tuning (specification-based tuning) — The simplest approach: freeze most parameters, update a chosen subset. BitFit shows that updating only 0.04% of parameters (bias terms) can recover over 95% of full fine-tuning performance. Diff Pruning learns which parameters to update via L0 regularization, but at higher memory cost.

2. Adapter Tuning — Insert small bottleneck modules into each Transformer layer; only these adapters are trained per task. Achieves performance comparable to full fine-tuning with ~3.6% of parameters and extends naturally to cross-lingual settings (MAD-X). Drawback: adds inference latency.

3. Prefix Tuning — Prepend trainable continuous vectors to each layer's key/value activations. Extremely parameter-efficient (~0.1%), but can be hard to optimize and reduces the available sequence length.

4. LoRA — Decompose weight updates into low-rank matrices (ΔW = BA). No additional inference latency (the matrices can be merged). Flexible, efficient, and matches or exceeds full fine-tuning performance. Currently the most widely adopted PEFT method.

The Big Picture

All PEFT methods share a core philosophy: the pretrained model already contains powerful, general-purpose representations, and adapting it to a new task requires far fewer degrees of freedom than the total number of model parameters. By constraining the update — whether through selecting a subset, adding bottleneck modules, learning prefix vectors, or factorizing updates into low-rank matrices — we can achieve strong task performance while keeping storage, computation, and forgetting under control.