Transfer learning is the idea of using knowledge gained from training on one task to solve other, related tasks. It draws inspiration from the psychological concept of "transfer of learning" — for instance, a person who already speaks Dutch or English will learn Swiss German more efficiently.
In neural networks, this dates back to Bozinovski & Fulgosi (1976), who asked: if a network fθ is trained on a first task, does that enable learning a second task with fewer samples (positive transfer) or more (negative transfer)?
Later, Pratt (1992) introduced the Discriminability-Based Transfer (DBT) algorithm, which reused pretrained parameters after rescaling them by how well they discriminated the new task's data, reaching the same final performance in fewer epochs.
Consider a language model pLM(y; θ) over Σ* trained on corpus 𝒟. Consider a target task 𝒯 posed as learning f : ℒ → 𝒴.
Transfer learning occurs if parameterizing f as fθ̂, with θ̂ ⊆ θ, allows for more efficient learning of 𝒯 compared to initializing f with randomly sampled parameters θ′.
"Efficient" is measured in number of training samples and/or training iterations, ceteris paribus.
McCann et al., 2017
One of the earliest approaches to transfer learning with contextualized representations: pretrain an attentional sequence-to-sequence machine-translation model, then reuse its encoder outputs (Context Vectors, CoVe) as features for downstream tasks.
The transfer mechanism: concatenate GloVe embeddings with CoVe vectors, w̃ = [GloVe(w); CoVe(w)].
This concatenated vector serves as the input embedding to downstream models.
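As a minimal sketch of this concatenation (the dimensions are illustrative, 300-d GloVe and 600-d CoVe, and the vectors are stand-ins rather than real pretrained embeddings):

```python
import numpy as np

def combine(glove_vec, cove_vec):
    """Concatenate a static GloVe vector with a contextual CoVe vector.

    In practice GloVe(w) is a lookup into a pretrained table and CoVe(w)
    is the output of the pretrained MT encoder; both are placeholders here.
    """
    return np.concatenate([glove_vec, cove_vec])

glove = np.zeros(300)  # stand-in for GloVe(w)
cove = np.zeros(600)   # stand-in for CoVe(w)
assert combine(glove, cove).shape == (900,)
```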
CoVe requires a parallel translation corpus for pretraining — expensive and limited compared to raw text. ELMo will solve this.
Peters et al., 2018
ELMo switched the pretraining task from machine translation to language modeling (requiring only raw text) and became one of the first truly successful transfer learning models in NLP.
ELMo trains two separate language models using stacked LSTM layers (L layers each): a forward LM, p(t₁, …, t_N) = ∏ₜ p(tₜ | t₁, …, tₜ₋₁), and a backward LM, p(t₁, …, t_N) = ∏ₜ p(tₜ | tₜ₊₁, …, t_N).
Forward layers produce →h^LM_{t,l} and backward layers produce ←h^LM_{t,l} for each token position t and layer l ∈ [0, L].
Parameters →θ, ←θ, and shared parameters θ′ (token representation and softmax) are optimized jointly by maximizing Σₜ [ log p(tₜ | t₁, …, tₜ₋₁; θ′, →θ) + log p(tₜ | tₜ₊₁, …, t_N; θ′, ←θ) ].
The combined representation at each layer: h^LM_{t,l} = [→h^LM_{t,l}; ←h^LM_{t,l}].
The ELMo representation, a scaled convex combination over all layers: ELMoₜ = γ · Σ_{l=0}^{L} sₗ · h^LM_{t,l}, where the weights sₗ are softmax-normalized and γ is a learned scalar.
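A toy sketch of this per-token layer mixing, assuming the biLM layer outputs are already computed (shapes and values below are illustrative):

```python
import numpy as np

def elmo_combine(layer_states, s_raw, gamma):
    """Softmax-normalize the per-layer scores s_raw and return the scaled
    convex combination gamma * sum_l s_l * h_l.

    layer_states: array of shape (L+1, dim), one vector per biLM layer
    for a single token position. All names here are illustrative.
    """
    s = np.exp(s_raw - s_raw.max())
    s /= s.sum()                        # softmax -> convex weights
    return gamma * (s[:, None] * layer_states).sum(axis=0)

h = np.stack([np.ones(4), 2 * np.ones(4), 3 * np.ones(4)])  # L+1 = 3 layers
out = elmo_combine(h, np.zeros(3), gamma=1.0)  # zero scores -> equal weights
# equal weights average the layers: (1 + 2 + 3) / 3 = 2 in every dimension
```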
Simply concatenate ELMo representations with existing input representations: the downstream input for token t becomes [xₜ; ELMoₜ].
Only the input dimension changes — the rest of the downstream network stays unchanged.
ELMo beat SOTA on six benchmarks: SQuAD (QA), SNLI (NLI), SRL, Coreference Resolution, NER, and SST-5 (Sentiment).
Devlin et al., 2019
BERT replaced LSTMs with Transformers and introduced a new paradigm: pretrain a bidirectional Transformer encoder, then fine-tune all parameters end-to-end.
BERT is a multi-layer bidirectional Transformer encoder that reads the entire sequence at once.
Uses WordPiece tokenization (30K vocab). Each token's input is the sum of three embeddings: token + segment + position.
Special tokens: [CLS] (prepended; used for classification) and [SEP] (separates sentence pairs). Segment embeddings indicate Sentence A vs. B.
A — Masked Language Modeling (MLM)
Predict randomly masked tokens using both left and right context, overcoming the unidirectional limitation of standard language models.
Masking strategy (to reduce pretrain–finetune mismatch): select 15% of token positions; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged.
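BERT's standard masking recipe (select ~15% of positions; of those, 80% become [MASK], 10% a random token, 10% stay unchanged) can be sketched as follows; the function and parameter names are illustrative:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, p_select=0.15, seed=0):
    """Apply the 80/10/10 masking recipe.

    Returns (corrupted tokens, positions the model must predict).
    """
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < p_select:
            targets.append(i)          # the model must recover tokens[i]
            r = rng.random()
            if r < 0.8:
                out[i] = MASK          # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: random token
            # else 10%: keep the original token unchanged
    return out, targets
```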
B — Next Sentence Prediction (NSP)
Given sentences A and B, predict whether B actually follows A. Training data: 50% real next sentences (IsNext), 50% random sentences (NotNext). This helps learn inter-sentence relationships.
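A hedged sketch of how such pairs might be assembled from a list of consecutive sentences (a real pipeline would also avoid sampling the true successor for NotNext):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build NSP training pairs: for each adjacent pair, keep the true
    successor half the time (IsNext), otherwise swap in a random
    sentence (NotNext). Names here are illustrative."""
    rng = random.Random(seed)
    pairs = []
    for a, b in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            pairs.append((a, b, "IsNext"))
        else:
            pairs.append((a, rng.choice(sentences), "NotNext"))
    return pairs
```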
BERT's paradigm is simpler than ELMo's, with no task-specific architecture needed: just plug in task-specific inputs and outputs and fine-tune all parameters end-to-end.
BERT can also serve as a feature extractor (like ELMo) — extract embeddings without fine-tuning.
BERT substantially outperformed both ELMo and GPT on the GLUE benchmark, confirming the power of bidirectional pretraining with Transformers.
5.1 — RoBERTa
Liu et al., 2019 — Robustly Optimized BERT
Same architecture as BERT, with improved pretraining: dynamic masking, no NSP objective, much larger batches, roughly ten times more data, and longer training.
5.2 — XLNet
Yang et al., 2019 — Permutation Language Modeling
Problem: BERT's [MASK] token appears during pretraining but never during fine-tuning.
Solution: Maximize expected log-likelihood over all possible permutations of the factorization order. The original sequence order is preserved — only the attention mask changes. Since all permutations share parameters θ, each token eventually "sees" every other token, achieving bidirectional context without [MASK].
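One way to make this concrete: for a sampled factorization order, the attention mask lets each position attend only to positions earlier in that order, while the sequence itself stays in place. A small illustrative sketch:

```python
import numpy as np

def perm_attention_mask(order):
    """Given a factorization order (a permutation of positions), build a
    boolean mask where mask[i, j] is True iff position i may attend to
    position j, i.e. j comes before i in the sampled order. The sequence
    order is untouched; only this mask changes per permutation."""
    n = len(order)
    rank = {pos: t for t, pos in enumerate(order)}
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            mask[i, j] = rank[j] < rank[i]
    return mask

m = perm_attention_mask([2, 0, 3, 1])
# position 0 is second in this order, so it may attend only to position 2
```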
5.3 — ALBERT
Lan et al., 2019 — A Lite BERT
Two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. Result: reduces BERTBASE from 108M → 12M parameters with a moderate performance drop.
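A large share of that reduction can come from factorizing the embedding table (the other lever is sharing parameters across layers). With illustrative sizes V = 30K, H = 768, and a small projection dimension E = 128:

```python
# Parameter count of the token-embedding table alone, with illustrative
# sizes; these match common BERT/ALBERT configurations but are not
# quoted from the paper.
V, H, E = 30_000, 768, 128

full = V * H              # BERT-style: one V x H table
factored = V * E + E * H  # ALBERT: V x E lookup, then an E x H projection

print(full, factored)     # 23_040_000 vs 3_938_304
```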
5.4 — ELECTRA
Clark et al., 2020 — Replaced Token Detection
Uses a discriminative pretraining objective with a generator-discriminator setup: a small masked LM (the generator) fills masked positions with plausible samples, and the main model (the discriminator) predicts, for every token, whether it is original or replaced.
After pretraining, the generator is discarded. The discriminator is fine-tuned on downstream tasks.
Every token provides a training signal (not just ~15% masked tokens), making training far more computationally efficient. Matches RoBERTa with fewer FLOPs.
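A toy sketch of the corruption-and-label step, with the generator reduced to a pluggable sampling function (all names here are illustrative):

```python
import random

def replaced_token_detection(tokens, generator_sample, p_mask=0.15, seed=0):
    """Toy ELECTRA-style corruption: mask ~15% of positions, let a
    'generator' propose replacements, and emit a per-token 0/1 label for
    the discriminator (1 = replaced). Every position gets a label,
    which is where the extra training signal comes from."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < p_mask:
            proposal = generator_sample(tokens, i, rng)
            corrupted[i] = proposal
            # if the generator happens to sample the original, it counts as real
            labels.append(int(proposal != tok))
        else:
            labels.append(0)
    return corrupted, labels
```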
Radford et al., 2018 / 2019; Brown et al., 2020
GPT models are decoder-only Transformers pretrained with standard (unidirectional, left-to-right) language modeling: maximize Σᵢ log p(tᵢ | t₁, …, tᵢ₋₁; θ).
GPT-1 (2018)
Pretrain with next-token prediction, then fine-tune by transforming task inputs into a single text sequence (using delimiter tokens). The final hidden state → fully connected layer → prediction.
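For example, a two-input task such as entailment might be linearized like this (the delimiter tokens are illustrative placeholders, not GPT-1's actual vocabulary entries):

```python
def format_entailment(premise, hypothesis, start="<s>", delim="$", end="<e>"):
    """Linearize a two-input classification task into one token stream,
    in the spirit of GPT-1's input transformations: start token,
    premise, delimiter, hypothesis, end token."""
    return f"{start} {premise} {delim} {hypothesis} {end}"

seq = format_entailment("A man is sleeping.", "A person is awake.")
# -> "<s> A man is sleeping. $ A person is awake. <e>"
```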
GPT-2 (2019)
Scaled up with larger models and data. Key discovery: large-scale generative pretraining produces unsupervised multi-task learners, and the 1.5B-parameter model achieved non-trivial zero-shot performance on downstream tasks without any fine-tuning.
GPT-3 (2020)
Scaled to 175 billion parameters (100× GPT-2). Demonstrated powerful in-context learning:
In-context learning is a form of transfer learning that requires no parameter updates at all.
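A few-shot prompt is just careful string assembly; a sketch (the task and formatting are illustrative):

```python
def few_shot_prompt(examples, query, instruction="Translate English to French:"):
    """Assemble a few-shot prompt: an instruction, k demonstrations,
    then the query. The model adapts purely from this context, with
    no gradient updates."""
    lines = [instruction]
    for src, tgt in examples:
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")       # the model continues from here
    return "\n".join(lines)

prompt = few_shot_prompt([("sea otter", "loutre de mer")], "cheese")
```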
Encoder-decoder models provide bidirectional understanding (encoder) + fluent generation (decoder), making them ideal for translation, summarization, and general seq2seq tasks.
T5 — Text-to-Text Transfer Transformer
Raffel et al., 2020
T5 frames all NLP tasks as text-to-text problems: encoder receives a task prefix + input; decoder generates the answer.
Pretraining — Text Infilling: Randomly mask contiguous spans and train the model to predict them. Special sentinel tokens (<X>, <Y>) mark masked span boundaries.
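A minimal sketch of this span corruption, using the same <X>, <Y> sentinel notation (span positions are chosen by hand here; T5 samples them randomly):

```python
def corrupt_spans(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel in the input and
    collect 'sentinel + original tokens' as the target, following the
    text-infilling setup."""
    inp, tgt, prev = [], [], 0
    for sent, (start, end) in zip(sentinels, spans):
        inp += tokens[prev:start] + [sent]   # sentinel marks the removed span
        tgt += [sent] + tokens[start:end]    # target reproduces the span
        prev = end
    inp += tokens[prev:]
    return " ".join(inp), " ".join(tgt)

toks = "Thank you for inviting me to your party last week".split()
inp, tgt = corrupt_spans(toks, [(1, 3), (6, 7)])
# inp -> "Thank <X> inviting me to <Y> party last week"
# tgt -> "<X> you for <Y> your"
```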
BART
Lewis et al., 2020
Investigated multiple noise functions: token masking, token deletion, text infilling, sentence permutation, and document rotation. Best combination: text infilling + sentence permutation. Achieved SOTA on several generation tasks.
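A simplified sketch of that best combination (sentence permutation plus text infilling), with span sampling reduced to a single short span:

```python
import random

def bart_noise(sentences, rng=None):
    """Toy version of BART's best noising combination: permute sentence
    order, then replace one contiguous token span with a single [MASK]
    (text infilling). Real BART samples many spans with Poisson lengths."""
    rng = rng or random.Random(0)
    perm = sentences[:]
    rng.shuffle(perm)                  # sentence permutation
    tokens = " ".join(perm).split()
    start = rng.randrange(len(tokens))
    length = rng.randrange(0, 3)       # spans may have length 0
    noised = tokens[:start] + ["[MASK]"] + tokens[start + length:]
    return " ".join(noised)
```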
The same pretraining recipes extend to many languages simultaneously (with a larger shared vocabulary): mBERT, XLM-RoBERTa, mBART, and mT5.
These enable few-shot or zero-shot cross-lingual transfer: fine-tune on English, then evaluate on other languages seen during pretraining.
| Architecture | Type | Examples | Objective | Best For |
|---|---|---|---|---|
| Encoder | Bidirectional | BERT, RoBERTa, XLNet, ALBERT, ELECTRA | MLM, NSP, Perm. LM, RTD | Classification, NER, QA |
| Decoder | Unidirectional (autoregressive) | GPT-1, GPT-2, GPT-3 | Next-token prediction | Generation, in-context learning |
| Enc-Dec | Bidir. encoder + autoregressive decoder | T5, BART | Text infilling, denoising | Translation, summarization, seq2seq |
The transformative workflow that emerged: pretrain once on large unlabeled corpora, then adapt the same model to each downstream task by fine-tuning or prompting.
This paradigm eliminated the need for millions of task-specific labeled examples, removed the burden of designing task-specific architectures, and enabled a single pretrained model to be adapted to virtually any NLP task with minimal engineering.
Peters et al. (2018), McCann et al. (2017), Devlin et al. (2019), Liu et al. (2019), Yang et al. (2019), Lan et al. (2019), Clark et al. (2020), Radford et al. (2018, 2019), Brown et al. (2020), Raffel et al. (2020), Lewis et al. (2020).