The standard way to adapt a pretrained language model to a downstream task is full fine-tuning: attach a small task-specific head (e.g., a linear classifier of dimension k, the number of classes) on top of the pretrained encoder, and update all parameters of the network by backpropagating gradients from the task loss. While this strategy achieves strong results, it comes with several serious drawbacks as models grow in size.
Full fine-tuning requires computing and storing gradients for every parameter in the network. When a model like GPT-3 has 175 billion parameters, this becomes extremely expensive. Worse, if we want to deploy the model on multiple tasks simultaneously, we must make a separate copy of the entire model for each task. In the extreme case of personalization — one model per user — this is clearly infeasible.
Fine-tuning all parameters of a massive model on a small labeled dataset can easily lead to overfitting, since the model has far more capacity than needed for the limited training signal.
When all weights are updated, the model may catastrophically forget the knowledge it acquired during pretraining. After fine-tuning, the model might not only fail on other tasks — it may become unusable even for tasks it previously handled well.
Can we find a way to adapt the model just a little bit — retaining most of the original model — while also making fine-tuning for multiple tasks more space-efficient?
This motivates Parameter Efficient Fine-Tuning (PEFT): the idea of fine-tuning only a small subset of parameters for each task, thereby alleviating storage costs, mitigating catastrophic forgetting, and improving performance on small datasets. The lecture covers four main families of PEFT methods: partial fine-tuning, adapters, prefix tuning, and LoRA.
The simplest form of parameter-efficient fine-tuning is to select a subset of the existing model parameters to update while keeping the rest frozen. This family of methods is also known as specification-based tuning (Ding et al., 2022). No new parameters are introduced — instead, certain existing parameters are designated as trainable based on heuristics or learned criteria.
The most straightforward approach is to freeze most of the model and only update the final few layers. Lee et al. (2019) showed that fine-tuning only one-fourth of the final layers of BERT and RoBERTa can produce approximately 90% of the performance of full fine-tuning. The intuition here is that lower layers capture general linguistic features while upper layers encode more task-specific representations.
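The layer-selection heuristic above can be sketched as a simple filter over parameter names. This is an illustrative example, not tied to any real checkpoint: the `layer.{i}` naming scheme, 12-layer depth, and `head` module are assumptions.

```python
# Hypothetical sketch of partial fine-tuning: mark only the top fourth of
# Transformer layers (plus the task head) as trainable; freeze the rest.

def trainable_param_names(num_layers: int, param_names: list) -> set:
    """Select parameters in the top num_layers // 4 layers and the head."""
    cutoff = num_layers - num_layers // 4  # first trainable layer index
    trainable = set()
    for name in param_names:
        if name.startswith("head."):
            trainable.add(name)           # always train the task head
        elif name.startswith("layer."):
            layer_idx = int(name.split(".")[1])
            if layer_idx >= cutoff:       # only the final fourth of layers
                trainable.add(name)
    return trainable

# Toy 12-layer model: layers 9, 10, 11 and the head end up trainable.
names = [f"layer.{i}.{p}" for i in range(12) for p in ("attn.weight", "mlp.weight")]
names += ["embeddings.weight", "head.weight"]
chosen = trainable_param_names(12, names)
```

In a real framework the same selection would be applied by toggling each parameter's gradient flag rather than by collecting names.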
2.1 — BitFit

BitFit takes the heuristic approach further by freezing all weight matrices and only optimizing the bias terms inside the model. Bias terms appear in both the attention mechanism and the MLP feed-forward layers of a Transformer. The equations below show where the bias terms reside inside each Transformer sublayer (BitFit updates only the bias vectors $\mathbf{b}$; every weight matrix $W$ stays frozen):

$$\mathbf{Q} = W_q\,\mathbf{x} + \mathbf{b}_q, \qquad \mathbf{K} = W_k\,\mathbf{x} + \mathbf{b}_k, \qquad \mathbf{V} = W_v\,\mathbf{x} + \mathbf{b}_v$$

$$\mathrm{MLP}(\mathbf{h}) = W_{m_2}\,\sigma\big(W_{m_1}\,\mathbf{h} + \mathbf{b}_{m_1}\big) + \mathbf{b}_{m_2}$$
BitFit fine-tunes only two key bias components — the query bias and the middle-of-MLP bias — amounting to just half of the bias parameters in the model and only 0.04% of all model parameters. Despite this extreme reduction, BitFit reproduces over 95% of full fine-tuning performance on several benchmarks.
Empirical results show that even using a small random set of parameters for tuning can yield passable results on benchmarks like GLUE. This suggests the model's pretrained representations are already very powerful — only minimal adaptation is needed. However, different bias terms have different functionalities during adaptation, and the trick has only been validated on smaller-scale models.
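The BitFit selection rule is just "train everything whose name ends in bias." A minimal sketch with a hypothetical state dict (names and shapes are stand-ins for a real Transformer's parameters) shows how tiny the trainable fraction is:

```python
# BitFit-style selection sketch: freeze every weight matrix, train only biases.
import numpy as np

params = {
    "attn.query.weight": np.zeros((768, 768)),
    "attn.query.bias":   np.zeros(768),
    "mlp.fc1.weight":    np.zeros((3072, 768)),
    "mlp.fc1.bias":      np.zeros(3072),
}

# The selection rule: a parameter is trainable iff it is a bias vector.
trainable = {name for name in params if name.endswith(".bias")}

# Fraction of all parameters that receive gradients.
frac = sum(params[n].size for n in trainable) / sum(p.size for p in params.values())
```

Even in this two-sublayer toy, the bias vectors account for well under 1% of the parameters; in a full model the fraction drops to the 0.04–0.1% range cited above.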
Rather than manually or heuristically choosing which parameters to update, Diff Pruning learns which parts of the network to modify. It reparameterizes the fine-tuned model parameters as the sum of the pretrained parameters and a sparse update vector:

$$\theta_{\text{task}} = \theta_{\text{pretrained}} + \delta_{\text{task}}$$
The key challenge is to encourage the update vector $\delta_{\text{task}}$ to be as sparse as possible. This is achieved by regularizing the fine-tuning objective with the L0-norm of $\delta_{\text{task}}$ (or a differentiable approximation thereof, since the L0-norm itself is not differentiable).
Diff pruning introduces new parameters during the learning phase, which means it actually consumes more GPU memory than full fine-tuning. This limits its applicability to very large language models.
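The sparse-delta reparameterization can be illustrated on a toy objective. Note an assumption: Diff Pruning itself uses a hard-concrete relaxation of the L0 penalty, while this sketch substitutes the simpler L1 proximal step (soft-thresholding), which also drives most entries of the update to exactly zero:

```python
# Toy illustration of theta_task = theta_pretrained + delta with a sparsity
# penalty. Soft-thresholding stands in for the paper's L0 relaxation.
import numpy as np

theta_pre = np.zeros(10)
theta_target = theta_pre.copy()
theta_target[[2, 7]] = 2.0        # the task only needs two coordinates to move

delta = np.zeros(10)
lr, lam = 0.1, 0.5
for _ in range(200):
    grad = 2.0 * (theta_pre + delta - theta_target)  # gradient of squared loss
    delta -= lr * grad                               # gradient step
    # Proximal step for the L1 penalty: shrink toward zero, clip at zero.
    delta = np.sign(delta) * np.maximum(np.abs(delta) - lr * lam, 0.0)
```

After optimization, only the two coordinates the task actually needs are nonzero, so storing the task amounts to storing a sparse vector.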
Instead of selecting existing parameters to fine-tune, adapter tuning takes a different approach: keep the entire pretrained model frozen and insert small, trainable modules — called adapters — into the model architecture. Only these new adapter parameters are updated during fine-tuning, and only these need to be stored per task.
The standard approach (Houlsby et al., 2019) places a two-layer feedforward neural network with a bottleneck after each sublayer of the Transformer — both after the multi-head attention sublayer and after the feed-forward sublayer. The adapter first projects the hidden representation down to a smaller dimension, applies a nonlinearity, and then projects back up to the original dimension. A skip connection is added around the adapter:

$$\mathrm{Adapter}(\mathbf{h}) = \mathbf{h} + W_{\text{up}}\, f\big(W_{\text{down}}\,\mathbf{h}\big), \qquad W_{\text{down}} \in \mathbb{R}^{m \times d},\; W_{\text{up}} \in \mathbb{R}^{d \times m}$$
The bottleneck dimension m (the "adapter size") controls the trade-off between capacity and efficiency. A smaller bottleneck means fewer parameters but potentially less expressive power.
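The bottleneck forward pass can be sketched in a few lines. The shapes (d=16, m=4) are illustrative, and zero-initializing the up-projection is a simplification of Houlsby et al.'s near-identity initialization:

```python
# Bottleneck adapter sketch: down-project, nonlinearity, up-project, skip.
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4                                 # hidden size d, adapter size m
W_down = rng.normal(0, 0.02, size=(m, d))    # down-projection to bottleneck
W_up = np.zeros((d, m))                      # up-projection, zero at init

def adapter(h):
    z = np.maximum(W_down @ h, 0.0)          # bottleneck + ReLU
    return h + W_up @ z                      # skip connection around the adapter

h = rng.normal(size=d)
out = adapter(h)
```

Because of the skip connection and the zeroed up-projection, the adapter starts out as an exact identity map, so inserting it does not perturb the pretrained model's behavior before training.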
Evaluated on the GLUE benchmark with BERT Large, adapters achieve impressive results: performance within about 0.4% of full fine-tuning while training only around 3.6% of the parameters per task.
The MAD-X (Multiple ADapters for Cross-lingual transfer) framework demonstrates a powerful application of adapters to cross-lingual NLP. Multilingual models like mBERT and XLM-R enable cross-lingual transfer, but a single model has limited capacity — it cannot cover unlimited languages, and better performance on low-resource languages often hurts high-resource performance. MAD-X addresses this by learning separate adapters for each language and task.
The framework comprises three types of adapters:
Language Adapters — One adapter per language, trained using masked language modeling (MLM). These capture language-specific knowledge while keeping the multilingual model frozen.
Task Adapters — Stacked on top of language adapters to capture task-specific knowledge. During task fine-tuning, language adapters are frozen and only task adapters are trained.
Invertible Adapters — Placed on top of the embedding layer to bridge the mismatch between the pretrained multilingual vocabulary and the target language's vocabulary. Since input and output embeddings are tied, the inverses of these adapters are placed before the output embedding layer.
The invertible adapter splits an input embedding vector $\mathbf{e}$ into two equal halves $\mathbf{e}_1, \mathbf{e}_2$ and applies coupled transformations in the style of NICE coupling layers, so the mapping is invertible by construction:

$$\mathbf{o}_1 = \mathbf{e}_1 + F(\mathbf{e}_2), \qquad \mathbf{o}_2 = \mathbf{e}_2 + G(\mathbf{o}_1)$$

The inverse is recovered by subtraction: first $\mathbf{e}_2 = \mathbf{o}_2 - G(\mathbf{o}_1)$, then $\mathbf{e}_1 = \mathbf{o}_1 - F(\mathbf{e}_2)$.
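A small numpy sketch makes the invertibility concrete. Here $F$ and $G$ are hypothetical single linear maps (MAD-X's exact parameterization differs); the point is that additive coupling can always be undone by subtraction:

```python
# NICE-style additive coupling: split the embedding into halves, transform
# each half conditioned on the other, and invert exactly by subtraction.
import numpy as np

rng = np.random.default_rng(1)
half = 4
F = rng.normal(size=(half, half))   # stand-in for the first coupling function
G = rng.normal(size=(half, half))   # stand-in for the second coupling function

def forward(e):
    e1, e2 = e[:half], e[half:]
    o1 = e1 + F @ e2
    o2 = e2 + G @ o1
    return np.concatenate([o1, o2])

def inverse(o):
    o1, o2 = o[:half], o[half:]
    e2 = o2 - G @ o1                # undo the second shift first
    e1 = o1 - F @ e2                # then undo the first
    return np.concatenate([e1, e2])

e = rng.normal(size=2 * half)
recon = inverse(forward(e))
```

Exact invertibility is what allows the same adapter (and its inverse) to be shared between the input and the tied output embedding layer.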
Prefix tuning (Li & Liang, 2021) is a PEFT method with roots in prompting. While standard fine-tuning updates all Transformer parameters and requires storing a full model copy for each task, prefix tuning takes a fundamentally different approach: it freezes all Transformer parameters and instead optimizes a set of continuous, task-specific prefix vectors that are prepended to the key and value matrices at every layer of the model.
For each task, a sequence of trainable "prefix" vectors is prepended to the hidden states. Only these prefix parameters need to be stored per task, making prefix tuning highly modular and space-efficient. The rest of the model is shared and frozen.
The prefix activations for all layers are learned. For an autoregressive LM, one set of prefix tokens is used (the number of prefix tokens is a hyperparameter). For an encoder-decoder model, two sets of prefixes are used (one for the encoder, one for the decoder). These prefix activations are drawn from a trainable matrix Pθ, while the remaining activations in the model are computed normally by the Transformer.
The training objective is the standard log-likelihood, but only the prefix parameters $\theta$ are optimized while the pretrained LM parameters $\phi$ stay frozen:

$$\max_{\theta}\; \log p_{\phi,\theta}(y \mid x)$$
In practice, directly updating the Pθ parameters leads to unstable optimization and a slight drop in performance. To address this, the prefix matrix is reparameterized through a smaller matrix Pθ' composed with a large feedforward neural network:

$$P_\theta[i,:] = \mathrm{MLP}_\theta\big(P'_\theta[i,:]\big)$$

Once training finishes, only $P_\theta$ needs to be kept; the reparameterization network can be discarded.
With only 0.1% of the model's parameters, prefix tuning outperforms other lightweight baselines and achieves performance comparable to full fine-tuning. This is remarkable efficiency — the task-specific storage for each new task is negligible compared to storing an entire model copy.
Prefix tuning can be difficult to optimize (hence the need for reparameterization). Additionally, the prefix tokens consume part of the model's sequence length, which reduces the length available for processing actual task input. This can be problematic for tasks requiring long context windows.
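The mechanics described above, for a single attention layer, can be sketched as follows: trainable prefix vectors are concatenated in front of the frozen key and value matrices, so every query position can also attend to the learned prefix positions. Shapes are illustrative:

```python
# Prefix tuning sketch for one attention layer: prepend trainable prefix
# key/value vectors before computing attention; the rest stays frozen.
import numpy as np

rng = np.random.default_rng(2)
seq_len, prefix_len, d = 6, 3, 8

Q = rng.normal(size=(seq_len, d))          # computed by the frozen model
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))
P_k = rng.normal(size=(prefix_len, d))     # trainable prefix keys
P_v = rng.normal(size=(prefix_len, d))     # trainable prefix values

K_full = np.concatenate([P_k, K], axis=0)  # (prefix_len + seq_len, d)
V_full = np.concatenate([P_v, V], axis=0)

scores = Q @ K_full.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over prefix + sequence
out = weights @ V_full                           # (seq_len, d)
```

The sketch also makes the sequence-length cost visible: attention is computed over prefix_len + seq_len positions, so the prefix directly competes with task input for context.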
While bottleneck adapters are effective, they introduce additional computational overhead through extra sequential layers that must be processed at every forward pass. This latency cannot be mitigated through model parallelism due to the sequential processing requirement. To tackle this, Hu et al. (2022) introduced LoRA (Low-Rank Adaptation), a method that leverages trainable low-rank matrices to efficiently approximate weight updates without adding inference latency.
Standard fine-tuning maximizes the log-likelihood of the fine-tuning data $\mathcal{Z}$ by updating all model weights $\Phi$:

$$\max_{\Phi}\; \sum_{(x,y)\in\mathcal{Z}} \;\sum_{t=1}^{|y|} \log p_{\Phi}\big(y_t \mid x, y_{<t}\big)$$
The key insight is to parameterize the weight update ΔW with a low-rank factorization, exploiting the hypothesis that the necessary weight changes during fine-tuning lie in a low-dimensional subspace.
For each weight matrix $W_0 \in \mathbb{R}^{d \times k}$ in the pretrained model (such as the query or value projection matrices in multi-head attention), LoRA decomposes the weight update as:

$$\Delta W = BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$
Given an input $\mathbf{x}$ to the layer, the output $\mathbf{h}$ adds the low-rank update to the frozen path (in practice the update is scaled by a constant $\alpha/r$):

$$\mathbf{h} = W_0\,\mathbf{x} + \Delta W\,\mathbf{x} = W_0\,\mathbf{x} + BA\,\mathbf{x}$$
The initialization strategy is carefully designed to ensure training starts from the pretrained model's behavior: $A$ is initialized with a random Gaussian, and $B$ is initialized to zero, so $\Delta W = BA = 0$ at the start of training.
Because B = 0 at initialization, the model starts at exactly the pretrained solution and gradually learns task-specific adjustments. This is a crucial design choice that preserves the pretrained model's performance from the very first training step.
No additional inference latency. Unlike adapters, LoRA does not add sequential depth to the model. At deployment time, the low-rank matrices can be merged directly into the pretrained weight: Wdeployed = W0 + BA. The model then runs at the same speed as the original.
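Both properties — exact pretrained behavior at initialization and zero-cost merging at deployment — can be checked in a short numpy sketch (dimensions and the α/r scaling follow the description above; the values are illustrative):

```python
# LoRA sketch: h = W0 x + (alpha/r) * B A x, with A Gaussian-initialized and
# B zero-initialized, then merged into a single matrix for deployment.
import numpy as np

rng = np.random.default_rng(3)
d, k, r, alpha = 8, 8, 2, 4
W0 = rng.normal(size=(d, k))               # frozen pretrained weight
A = rng.normal(0, 0.02, size=(r, k))       # Gaussian init
B = np.zeros((d, r))                        # zero init -> delta W = 0 at start
scale = alpha / r

x = rng.normal(size=k)
h_init = W0 @ x + scale * (B @ (A @ x))     # identical to the pretrained output

B = rng.normal(size=(d, r))                 # pretend training has updated B
h_lora = W0 @ x + scale * (B @ (A @ x))     # unmerged: two matmul paths
W_merged = W0 + scale * (B @ A)             # fold the update into W0 once
h_merged = W_merged @ x                     # merged: a single matmul, no overhead
```

The merged and unmerged paths agree exactly, which is why LoRA adds no inference latency: after merging, the deployed model is just an ordinary weight matrix.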
Flexibility. LoRA can be applied to any weight matrix in the Transformer — attention projections (Wq, Wk, Wv), feed-forward layers, or other components. Practitioners can target whichever parts most impact their task.
Efficiency. Only a fraction of parameters are trainable. This dramatically reduces memory requirements for gradient computation and optimizer states.
Strong performance. Empirical results show LoRA can match or even exceed the performance of full fine-tuning on downstream tasks — with better accuracy, fewer trainable parameters, and lower storage requirements.
Preserved generalization. Despite the reduced parameter count, LoRA maintains (or enhances) the model's capacity for generalization, which is crucial when applying fine-tuned models to scenarios or datasets that differ from the training data.
Adapter modules introduce high inference latency because adapter layers must be processed sequentially. This additional depth requires more synchronous GPU operations, and the overhead cannot be mitigated through model parallelism. LoRA avoids this entirely — it does not introduce any additional depth, and its low-rank updates can be folded into the base weights at deployment time.
Prefix tuning faces two notable challenges that LoRA avoids. First, it is difficult to optimize (requiring reparameterization tricks). Second, the prefix tokens consume part of the model's sequence length, reducing the context window available for actual task input. LoRA has neither issue.
| Method | % Params | New Params? | Inference Cost | Key Trade-off |
|---|---|---|---|---|
| Top Layer FT | ~25% | No | Same | Simple but ~90% of full FT performance |
| BitFit | 0.04% | No | Same | Extremely few params; ~95% of full FT |
| Diff Pruning | Sparse | Yes (during training) | Same | Learned sparsity; more GPU memory than full FT |
| Adapters | ~3.6% | Yes | Higher (sequential layers) | Strong performance; adds inference latency |
| Prefix Tuning | ~0.1% | Yes | Same | Very few params; hard to optimize; reduces seq length |
| LoRA | Small (rank-dependent) | Yes (but mergeable) | Same (after merging) | No extra latency; strong performance; flexible |
A comparison of PEFT methods applied to T0-3B (Sanh et al., 2022) shows that methods like LoRA, (IA)³, and standard adapters all achieve competitive accuracy while updating far fewer parameters than full fine-tuning. Notably, LoRA and (IA)³ achieve the strongest performance among PEFT methods, approaching or matching full fine-tuning accuracy with orders of magnitude fewer trainable parameters.
Parameter Efficient Fine-Tuning addresses the fundamental challenge of adapting massive pretrained language models to specific tasks without incurring prohibitive computational, storage, and performance costs. The lecture covered a progression of increasingly sophisticated methods:
Partial Fine-Tuning (Specification-based tuning) — The simplest approach: freeze most parameters, update a chosen subset. BitFit shows that updating only 0.04% of parameters (bias terms) can recover 95% of full fine-tuning performance. Diff Pruning learns which parameters to update via L0 regularization, but at higher memory cost.
Adapter Tuning — Insert small bottleneck modules into each Transformer layer. Only these adapters are trained per task. Achieves comparable performance to full fine-tuning with ~3.6% of parameters. Extends naturally to cross-lingual settings (MAD-X). Drawback: adds inference latency.
Prefix Tuning — Prepend trainable continuous vectors to each layer's key/value matrices. Extremely parameter-efficient (~0.1%), but can be hard to optimize and reduces available sequence length.
LoRA — Decompose weight updates into low-rank matrices (ΔW = BA). No additional inference latency (matrices can be merged). Flexible, efficient, and matches or exceeds full fine-tuning performance. Currently the most widely adopted PEFT method.
All PEFT methods share a core philosophy: the pretrained model already contains powerful, general-purpose representations, and adapting it to a new task requires far fewer degrees of freedom than the total number of model parameters. By constraining the update — whether through selecting a subset, adding bottleneck modules, learning prefix vectors, or factorizing updates into low-rank matrices — we can achieve strong task performance while keeping storage, computation, and forgetting under control.