Transfer Learning in NLP

ETH Zürich · 263-5354-00L Large Language Models · Prof. Mrinmaya Sachan
Section 01

What is Transfer Learning?

Transfer learning is the idea of using knowledge gained from training on one task to solve other, related tasks. It draws inspiration from the psychological concept of "transfer of learning" — for instance, a person who already speaks Dutch or English will learn Swiss German more efficiently.

In neural networks, this dates back to Bozinovski & Fulgosi (1976), who asked: if a network fθ is trained on a first task, does that enable learning a second task with fewer samples (positive transfer) or more (negative transfer)?

Later, Pratt (1992) introduced the Discriminability-Based Transfer (DBT) algorithm, which reused pretrained parameters after rescaling them by how well they discriminated the new task's data, reaching the same final performance in fewer epochs.

Key Terminology

Pretrained Model: The network trained on the source task (typically language modeling).
Pretraining: The learning process of training on the source task.
Fine-tuning: Updating pretrained weights for a new target task.
Multi-task Learning: Learning multiple tasks jointly (vs. sequentially in transfer learning).

Why Language Modeling?

It is trivially scalable — only raw text is needed, no expensive human annotation.
Input space equals output space, making it a self-supervised task.
With pretrained LMs, tasks that required millions of labeled samples now need only thousands.

Formal Definition

Definition — Transfer Learning for LMs

Consider a language model pLM(y; θ) over Σ* trained on corpus 𝒟. Consider a target task 𝒯 posed as learning f : ℒ → 𝒴.

Transfer learning occurs if parameterizing f as fθ̂, with θ̂ ⊆ θ, allows for more efficient learning of 𝒯 compared to initializing f with randomly sampled parameters θ′.

"Efficient" is measured in number of training samples and/or training iterations, ceteris paribus.

Section 02

CoVe — Contextualized Word Vectors

McCann et al., 2017

One of the earliest approaches to transfer learning with contextualized representations:

Step 1 — Train a seq2seq model for machine translation with a two-layer bidirectional LSTM encoder.
Step 2 — Reuse the trained encoder to produce context vectors for downstream tasks (sentiment analysis, question classification, textual entailment, QA).
CoVe(w) = MT-LSTM(GloVe(w)), i.e., the encoder maps a sentence's GloVe embeddings to one context vector per token: [… ; Ct−1 ; Ct ; Ct+1 ; …]

The transfer mechanism: concatenate GloVe embeddings with CoVe vectors:

w̃ = [GloVe(w) ; CoVe(w)]

This concatenated vector serves as the input embedding to downstream models.
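As a minimal sketch of this transfer mechanism (the vectors are random stand-ins; real GloVe vectors are 300-d, and McCann et al.'s two-layer biLSTM encoder yields 600-d CoVe vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-token vectors for a 5-token sentence.
glove = rng.standard_normal((5, 300))   # GloVe(w)
cove = rng.standard_normal((5, 600))    # CoVe(w) = MT-LSTM(GloVe(w))

# Transfer mechanism: concatenate along the feature axis.
w_tilde = np.concatenate([glove, cove], axis=-1)
print(w_tilde.shape)  # (5, 900)
```

The downstream model only needs to accept a 900-d input embedding; nothing else about it changes.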

Limitation

CoVe requires a parallel translation corpus for pretraining — expensive and limited compared to raw text. ELMo will solve this.

Section 03

ELMo — Embeddings from Language Models

Peters et al., 2018

ELMo switched the pretraining task from machine translation to language modeling (requiring only raw text) and became one of the first truly successful transfer learning models in NLP.

Architecture

ELMo trains two separate language models using stacked LSTM layers (L layers each):

Forward LM: →pLM(y_t | y_<t), predicting left-to-right
Backward LM: ←pLM(y_t | y_>t), predicting right-to-left

Forward layers produce hidden states →h_{t,l}^{LM} and backward layers produce ←h_{t,l}^{LM} for each layer l ∈ [0, L], where layer 0 is the context-independent token representation.

Training Objective

Parameters →θ, ←θ, and shared parameters θ′ are optimized jointly:

ℒ(→θ, ←θ, θ′) = Σ_{n=1}^{N} Σ_{t=1}^{T} [ log →pLM(y_t^{(n)} | y_{<t}^{(n)} ; →θ, θ′) + log ←pLM(y_t^{(n)} | y_{>t}^{(n)} ; ←θ, θ′) ]

Key Innovations Over CoVe

All layers used: Uses representations from every LSTM layer, not just the final one.
Task-specific layer combination: Learns a weighted combination of all layers per task.

The combined representation at each layer:

h_{t,l}^{LM} = [←h_{t,l}^{LM} ; →h_{t,l}^{LM}]

The ELMo representation — a scaled convex combination over all layers:

ELMo_t^{task} = γ^{task} · Σ_{l=0}^{L} s_l^{task} · h_{t,l}^{LM}

where:
• Σ_l s_l^{task} = 1 (softmax weights, learned per task)
• γ^{task} ∈ ℝ (task-specific scaling factor)
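The layer combination can be sketched with illustrative numbers (the layer representations, softmax weights, and scale are random or hand-picked stand-ins for learned task-specific values):

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, d = 2, 7, 1024                  # LSTM layers (plus layer 0), tokens, hidden size

# h[l] = [backward ; forward] representation at layer l (illustrative values)
h = rng.standard_normal((L + 1, T, d))

# Task-specific parameters (learned during downstream training; stand-ins here)
s_raw = rng.standard_normal(L + 1)
s = np.exp(s_raw) / np.exp(s_raw).sum()   # softmax -> convex weights summing to 1
gamma = 0.5                                # task-specific scale

# Scaled convex combination over all layers
elmo = gamma * np.einsum("l,ltd->td", s, h)
print(elmo.shape)  # (7, 1024)
```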

Downstream Usage

Simply concatenate ELMo representations with existing input representations:

input_t = [x_t ; ELMo_t^{task}]

Only the input dimension changes — the rest of the downstream network stays unchanged.

Results

ELMo beat SOTA on six benchmarks: SQuAD (QA), SNLI (NLI), SRL, Coreference Resolution, NER, and SST-5 (Sentiment).

Section 04

BERT — Bidirectional Encoder Representations from Transformers

Devlin et al., 2019

BERT replaced LSTMs with Transformers and introduced a new paradigm: pretrain a bidirectional Transformer encoder, then fine-tune all parameters end-to-end.

Architecture

BERT is a multi-layer bidirectional Transformer encoder that reads the entire sequence at once.

BERT-Base: L=12, H=768, A=12 attention heads, FFN=3072, 110M params
BERT-Large: L=24, H=1024, A=16 attention heads, FFN=4096, 340M params

Input Representation

Uses WordPiece tokenization (30K vocab). Each token's input is:

input = Token Embedding + Segment Embedding + Position Embedding

Special tokens: [CLS] (prepended; used for classification) and [SEP] (separates sentence pairs). Segment embeddings indicate Sentence A vs. B.
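A sketch of this input construction, with made-up token ids and randomly initialized embedding tables standing in for learned ones (real BERT uses a 30,522-entry WordPiece vocabulary; it is shrunk here for the toy example):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, max_len = 1000, 768, 512        # toy vocab size, hidden size, max positions

tok_emb = rng.standard_normal((V, H)) * 0.02
seg_emb = rng.standard_normal((2, H)) * 0.02   # sentence A vs. B
pos_emb = rng.standard_normal((max_len, H)) * 0.02

# "[CLS] sent-A tokens [SEP] sent-B tokens [SEP]" with hypothetical ids
# (here [CLS]=3, [SEP]=4)
token_ids = np.array([3, 17, 29, 4, 55, 63, 4])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])   # A A A A B B B
positions = np.arange(len(token_ids))

# Element-wise sum of the three embedding types
inputs = tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
print(inputs.shape)  # (7, 768)
```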

Pre-training Objectives

ℒ = ℒ_MLM + ℒ_NSP

A — Masked Language Modeling (MLM)

Predict randomly masked tokens using both left and right context, overcoming the unidirectional limitation:

ℒ_MLM(θ) = Σ_{n=1}^{N} Σ_{t=1}^{T} 𝟙{t ∈ M^{(n)}} · log p_MLM(y_t^{(n)} | y_{<t}^{(n)}, y_{>t}^{(n)} ; θ)

where M^{(n)} is the set of masked positions in sequence n. (The indicator selects the masked positions; the target y_t^{(n)} is always the original token, never [MASK].)

Masking strategy (to reduce pretrain–finetune mismatch):

80% → replace with [MASK]
10% → replace with a random token
10% → keep the original token
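A minimal sketch of this 80/10/10 strategy (the toy vocabulary and helper name are illustrative, not from BERT's codebase; the demo call uses a masking rate above BERT's 15% so that something visibly gets masked):

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "sat", "mat", "the", "on"]

def mask_tokens(tokens, p=0.15, seed=0):
    """BERT-style masking: select ~p of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= p:
            continue
        targets[i] = tok                      # this position must be predicted
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # random replacement
        # else: keep the original token (but still predict it)
    return corrupted, targets

corrupted, targets = mask_tokens(
    ["the", "cat", "sat", "on", "the", "mat"], p=0.5)
```

Note that a position kept unchanged (the final 10%) is still a prediction target, which is exactly what reduces the pretrain-finetune mismatch.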

B — Next Sentence Prediction (NSP)

Given sentences A and B, predict whether B actually follows A. Training data: 50% real next sentences (IsNext), 50% random sentences (NotNext). This helps learn inter-sentence relationships.
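Constructing NSP training pairs can be sketched as follows (a toy version; a real pipeline samples at the segment level and avoids accidentally drawing the true next sentence as a negative):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (A, B, label) pairs: 50% real next sentence (IsNext),
    50% a random sentence from the corpus (NotNext)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs

corpus = ["He went home.", "It was late.", "The dog barked.", "She laughed."]
pairs = make_nsp_pairs(corpus)
```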

Pre-training Data

English Wikipedia (2.5B words) + BookCorpus (800M words): 3.3B words, ~16 GB of text
Batch size: 256 sequences, trained for 1M steps

Fine-tuning

BERT's paradigm is simpler than ELMo's — no task-specific architecture needed. Just plug in task-specific inputs/outputs and fine-tune all parameters:

Sentence pairs (paraphrasing, entailment, QA): Feed both sentences with [SEP]; use [CLS] for classification or token representations for span extraction.
Single sentences (classification, NER): Use [CLS] for classification or per-token outputs for tagging.

BERT can also serve as a feature extractor (like ELMo) — extract embeddings without fine-tuning.

Results

BERT substantially outperformed both ELMo and GPT on the GLUE benchmark, confirming the power of bidirectional pretraining with Transformers.

Section 05

BERT Variants

5.1 — RoBERTa

Liu et al., 2019 — Robustly Optimized BERT

Same architecture as BERT, with improved pretraining:

Larger corpus: 160 GB (added CC-News + OpenWebText), ~2.5B word pieces
Larger batch: 8,000 sequences × 512 tokens for 500K steps
Dynamic masking: A fresh mask pattern is sampled each time a sequence is fed to the model (BERT fixed its masks once during preprocessing)
No NSP task: Dropped next sentence prediction entirely

5.2 — XLNet

Yang et al., 2019 — Permutation Language Modeling

Problem: BERT's [MASK] token appears during pretraining but never during fine-tuning.

Solution: Maximize expected log-likelihood over all possible permutations of the factorization order. The original sequence order is preserved — only the attention mask changes. Since all permutations share parameters θ, each token eventually "sees" every other token, achieving bidirectional context without [MASK].
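The attention-mask view of a factorization order can be sketched as follows (the helper is illustrative; XLNet's actual implementation realizes this via two-stream attention):

```python
import numpy as np

def permutation_mask(order):
    """Attention mask for one factorization order: position i may attend
    to position j iff j precedes i in the permutation. The sequence order
    itself is unchanged; only visibility changes."""
    T = len(order)
    rank = {pos: k for k, pos in enumerate(order)}   # position -> rank in order
    mask = np.zeros((T, T), dtype=bool)
    for i in range(T):
        for j in range(T):
            mask[i, j] = rank[j] < rank[i]
    return mask

# Factorization order 3 -> 0 -> 2 -> 1 over a 4-token sequence
m = permutation_mask([3, 0, 2, 1])
# Token 1 is predicted last in this order, so it sees every other token:
print(m[1])  # [ True False  True  True]
```

Averaged over many sampled orders, each token is predicted with every possible subset of the others as context, which is what yields bidirectionality without a [MASK] token.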

5.3 — ALBERT

Lan et al., 2019 — A Lite BERT

Factorized embeddings: Decompose V×H matrix into V×E and E×H (efficient when H ≫ E).
Cross-layer parameter sharing: All Transformer layers share the same weights. (Same FLOPs, far fewer parameters.)
Sentence Order Prediction (SOP): Replaces NSP. Negatives are consecutive segments with order swapped, focusing on coherence over topic.

Result: Reduces BERT-Base from 108M → 12M parameters with a moderate performance drop.
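The embedding factorization savings are easy to check with back-of-the-envelope numbers (V, H, E below are typical ALBERT-style values, used here purely for arithmetic):

```python
# Embedding parameter count: full V x H vs. factorized V x E plus E x H.
V, H, E = 30000, 768, 128   # vocab size, hidden size, embedding size

full = V * H                # BERT: one V x H embedding matrix
factorized = V * E + E * H  # ALBERT: V x E lookup, then a learned E x H projection

print(full, factorized)     # 23040000 3938304
print(full / factorized)    # ≈5.85x fewer embedding parameters
```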

5.4 — ELECTRA

Clark et al., 2020 — Replaced Token Detection

Uses a discriminative pretraining objective with a generator-discriminator setup:

Generator (small MLM): Creates corrupted input by replacing tokens with plausible alternatives.
Discriminator (ELECTRA model): Classifies each token as original or replaced.

After pretraining, the generator is discarded. The discriminator is fine-tuned on downstream tasks.
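A toy sketch of the replaced-token-detection setup, with uniform random sampling standing in for the small generator MLM (helper name and vocabulary are illustrative; the demo replacement rate is raised above ~15% so the toy example visibly corrupts something):

```python
import random

def corrupt_and_label(tokens, generator_vocab, p=0.15, seed=0):
    """Replace ~p of tokens with sampled alternatives, then label every
    position as original (0) or replaced (1) for the discriminator."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    for i in range(len(tokens)):
        if rng.random() < p:
            corrupted[i] = rng.choice(generator_vocab)
    # As in ELECTRA, a sample that happens to equal the original counts
    # as "original", so labels are derived by comparison.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels

corrupted, labels = corrupt_and_label(
    ["the", "chef", "cooked", "the", "meal"], ["ate", "ran", "meal"], p=0.5)
```

Unlike MLM, the label list covers every position, which is the source of ELECTRA's sample efficiency.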

Advantage

Every token provides a training signal (not just ~15% masked tokens), making training far more computationally efficient. Matches RoBERTa with fewer FLOPs.

Section 06

GPT Series — Decoder Models

Radford et al., 2018 / 2019; Brown et al., 2020

GPT models are decoder-only Transformers pretrained with standard (unidirectional, left-to-right) language modeling:

ℒ_LM(θ) = −(1/T) Σ_{t=1}^{T} log p(y_t | y_1, …, y_{t−1} ; θ)
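With illustrative per-step probabilities, the loss above works out as:

```python
import math

# Average next-token negative log-likelihood for one sequence, given the
# model's probability of each gold token at each step (made-up values).
p_gold = [0.5, 0.25, 0.125, 0.5]                 # p(y_t | y_<t)
loss = -sum(math.log(p) for p in p_gold) / len(p_gold)
print(round(loss, 4))  # 1.213
```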

GPT-1 (2018)

Pretrain with next-token prediction, then fine-tune by transforming task inputs into a single text sequence (using delimiter tokens). The final hidden state → fully connected layer → prediction.

GPT-2 (2019)

Scaled up with larger models and data. Key discovery: large-scale generative pretraining produces unsupervised multi-task learners with non-trivial zero-shot performance. The 1.5B parameter model showed this clearly.

GPT-3 (2020)

Scaled to 175 billion parameters (100× GPT-2). Demonstrated powerful in-context learning:

Zero-shot: Only a task description, no examples. No gradient updates.
One-shot: Task description + one example. No gradient updates.
Few-shot: Task description + a few examples. No gradient updates.

In-context learning is a form of transfer learning that requires no parameter updates at all.
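Few-shot prompting amounts to string construction; a sketch (the helper name and the `=>` delimiter format are illustrative, not GPT-3's exact prompt format):

```python
def build_prompt(description, examples, query):
    """Assemble a few-shot prompt: task description, then solved
    examples, then the query. No parameters are updated."""
    lines = [description]
    for x, y in examples:
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = build_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint")
print(prompt)
```

The model is then expected to continue the pattern after the final `=>`.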

Section 07

Seq2Seq Transformer LMs

Encoder-decoder models provide bidirectional understanding (encoder) + fluent generation (decoder), making them ideal for translation, summarization, and general seq2seq tasks.

T5 — Text-to-Text Transfer Transformer

Raffel et al., 2020

T5 frames all NLP tasks as text-to-text problems: encoder receives a task prefix + input; decoder generates the answer.

Pretraining — Text Infilling: Randomly mask contiguous spans and train the model to predict them. Special sentinel tokens (<X>, <Y>) mark masked span boundaries.
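A toy version of span corruption (T5's implementation names its sentinels <extra_id_0>, <extra_id_1>, …, written schematically as <X>, <Y> in the paper; spans here are hand-picked rather than randomly sampled):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel token; the
    decoder target reproduces the dropped spans, in order, each
    introduced by its sentinel."""
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans))]
    inp, tgt, cursor = [], [], 0
    for s, (start, length) in zip(sentinels, spans):
        inp += tokens[cursor:start] + [s]
        tgt += [s] + tokens[start:start + length]
        cursor = start + length
    inp += tokens[cursor:]
    return inp, tgt

inp, tgt = span_corrupt(
    ["Thank", "you", "for", "inviting", "me", "to", "your", "party"],
    [(1, 2), (5, 1)])
# inp: ['Thank', '<extra_id_0>', 'inviting', 'me', '<extra_id_1>', 'your', 'party']
# tgt: ['<extra_id_0>', 'you', 'for', '<extra_id_1>', 'to']
```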

Data: C4 corpus (750 GB of English text from Common Crawl)
Models: Up to 11 billion parameters

BART

Lewis et al., 2020

Investigated multiple noise functions: token masking, token deletion, text infilling, sentence permutation, and document rotation. Best combination: text infilling + sentence permutation. Achieved SOTA on several generation tasks.

Section 08

Multilingual Transformer LMs

The same pretraining recipes extend to many languages simultaneously (with a larger shared vocabulary):

BERT → mBERT
RoBERTa → XLM-R
T5 → mT5

These enable few-shot or zero-shot cross-lingual transfer — fine-tune on English, evaluate on any language.

Section 09

Architecture Comparison

Architecture | Type                                    | Examples                              | Objective                 | Best For
Encoder      | Bidirectional                           | BERT, RoBERTa, XLNet, ALBERT, ELECTRA | MLM, NSP, Perm. LM, RTD   | Classification, NER, QA
Decoder      | Unidirectional (autoregressive)         | GPT-1, GPT-2, GPT-3                   | Next-token prediction     | Generation, in-context learning
Enc-Dec      | Bidir. encoder + autoregressive decoder | T5, BART                              | Text infilling, denoising | Translation, summarization, seq2seq

Section 10

The Big Picture: Pretrain → Fine-tune

The transformative workflow that emerged:

Step 1 — Pretrain: Train a large neural LM on vast unlabeled text (self-supervised).
Step 2 — Transfer: Fine-tune (or use via in-context learning) on a specific downstream task with relatively little labeled data.

This paradigm eliminated the need for millions of task-specific labeled examples, removed the burden of designing task-specific architectures, and enabled a single pretrained model to be adapted to virtually any NLP task with minimal engineering.

References

Peters et al. (2018), McCann et al. (2017), Devlin et al. (2019), Liu et al. (2019), Yang et al. (2019), Lan et al. (2019), Clark et al. (2020), Radford et al. (2018, 2019), Brown et al. (2020), Raffel et al. (2020), Lewis et al. (2020).