In-context Learning, Prompting, Zero-shot & Instruction Tuning

Training, Fine-Tuning, Inference and Applications of LLMs — Lecture Summary
Section 01

What Is Prompting?

In a traditional supervised learning setting, a model predicts an output y for an input x by modeling the conditional probability P(y|x; θ), where θ are the model parameters. However, for many tasks, supervised data is simply unavailable, making conventional training difficult or impossible. Prompting offers an elegant alternative: instead of training a model on labeled data, we take a language model (LM) that has already been trained to model the probability of text P(x; θ) and use that probability to predict y, reducing or eliminating the need for large supervised datasets.

Key Insight

Prompting is non-invasive: it does not introduce additional parameters or require direct inspection of a model's internal representations. It can be thought of as a lower bound on what the model "knows" about a new task (x → y), and this information is simply extracted from the LM via prompting.

How to Prompt a Language Model

To prompt an LM, we map the input x to a prompt x′ using a prompting function fprompt(·). This function modifies the input text into a prompt x′ = fprompt(x). The prompt is defined by a template containing two slots: an input slot [X] for the input x, and an answer slot [Z] for a generated answer z that may or may not be mapped to the final output y.

The model then searches for the answer ẑ that maximizes the LM's score. Formally:

ẑ = search z∈Z P(ffill(x′, z); θ)

where:
• Z = the set of all potential answers (the full vocabulary for generation, or a restricted set for classification)
• ffill(x′, z) = the function that fills answer slot [Z] in prompt x′ with candidate answer z
• P(·; θ) = the pre-trained language model probability
• The search function can be argmax (greedy) or various sampling techniques

Prompting Terminology

• Input x: one or multiple texts, e.g., "I love this movie."
• Output y: the output label or text, e.g., "++ (very positive)"
• Prompting function fprompt(x): converts the input into a specific form with slots [X] and [Z], e.g., "[X] Overall, it was a [Z] movie."
• Prompt x′: the text where [X] is filled with input x but [Z] remains open, e.g., "I love this movie. Overall, it was a [Z] movie."
• Filled prompt ffill(x′, z): a prompt where [Z] is filled with any answer z, e.g., "…it was a bad movie."
• Answered prompt ffill(x′, z*): a prompt where [Z] is filled with the true answer z*, e.g., "…it was a good movie."

Example: Sentiment Classification via Prompting

Consider a sentiment classification task. Given an input such as "No reason to watch," we can formulate it as a masked LM problem using a template:

1. Formulate the task as a (Masked) LM problem using a template: [CLS] No reason to watch. It was [MASK]. [SEP]

2. Choose a label word mapping M(y) that maps task labels to individual words. For example, M(positive) = "great" and M(negative) = "terrible".

3. Use the LM to score each candidate label word at the [MASK] position and select the highest-probability one as the prediction.
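These steps can be sketched end to end with a toy scorer standing in for the masked LM; the cue list, scores, and `toy_lm_score` function below are invented for illustration, not a real model:

```python
# Toy sketch of prompting-based sentiment classification. `toy_lm_score`
# is a hypothetical stand-in for the masked-LM probability of a label word
# at the [MASK] position; the cues and probabilities are invented.

LABEL_WORDS = {"positive": "great", "negative": "terrible"}  # M(y)

def toy_lm_score(filled_prompt):
    negative_cues = ("No reason", "boring", "awful")
    is_negative = any(cue in filled_prompt for cue in negative_cues)
    if is_negative and "terrible" in filled_prompt:
        return 0.9
    if not is_negative and "great" in filled_prompt:
        return 0.9
    return 0.1

def classify(x):
    template = f"{x} It was [MASK]."                      # step 1: template
    scores = {label: toy_lm_score(template.replace("[MASK]", word))
              for label, word in LABEL_WORDS.items()}     # step 2: fill M(y)
    return max(scores, key=scores.get)                    # step 3: argmax

print(classify("No reason to watch."))  # -> negative
```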

Section 02

Prompt Engineering

Given a task, multiple prompts can work. However, finding the most effective prompt is crucial to unlocking the full potential of the LM. Prompt Engineering is the process of designing a prompting function fprompt(x) that results in the most effective performance for the given task.

Important

Prompting is often called a "dark art" because it requires domain expertise and extensive trial and error. The core challenge is to find a template T and label words M(y) that work well in conjunction — even slight variations in prompts can lead to significant performance differences.

Manual Prompts

The most straightforward approach is to design prompts for a given task by hand. Since the number of required prompts is typically small (1 to 8 depending on the task and available context window), manual design gives the user maximum control and flexibility. For instance, on the SST-2 sentiment analysis benchmark using the template <Input> It was [MASK]., choosing the label word pair "great/terrible" yields 92.7% accuracy, while "good/bad" achieves 92.5%. The sensitivity to such choices illustrates the fragility of manual prompt design.

Automated Prompts

While manual prompts are intuitive, they are error-prone and labor-intensive — even experienced prompt designers may fail to find optimal prompts. Automated approaches search for better prompts either in a discrete space (natural language tokens) or a continuous space (embedding vectors).

2.1 — Discrete (Hard) Prompts

Discrete prompts automatically search for prompts in the space of natural language strings. Several families of methods have been proposed:

1. Mining-based approaches (Jiang et al., 2020): Search a large text corpus (e.g., Wikipedia) for strings containing both the training input x and output y, then extract the middle words or dependency paths between them as templates of the form "[X] middle words [Z]".

2. Paraphrasing methods: Take an original prompt and generate variations — by translating the prompt into another language and back (Jiang et al., 2020), replacing words with a thesaurus (Yuan et al., 2021), or training a neural prompt rewriter whose objective is to improve the accuracy of systems using the prompt (Haviv et al., 2021).

3. Generation-based approaches: Treat prompt creation as a text generation task. Gao et al. (2021) used a pre-trained T5 model to search for effective templates.

4. Gradient-based search: Find token sequences that trigger the LM to produce desired outputs. AutoPrompt (Shin et al., 2020) combines the original input x with gradient-selected "trigger tokens" and marginalizes the model's predictions over a set of label tokens to produce class probabilities.
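The common skeleton behind these methods is a search loop over candidate templates scored on held-out data. Here is a minimal sketch; the dev set, templates, and the stand-in classifier are illustrative, not taken from any cited method:

```python
# Toy sketch of discrete prompt search: score candidate templates on a tiny
# labelled dev set and keep the best one. `toy_classify` is a hypothetical
# stand-in for running an LM with the given template.

DEV_SET = [("I love this movie.", "positive"),
           ("No reason to watch.", "negative")]
TEMPLATES = ["[X] It was [Z].",
             "[X] Overall, a [Z] movie.",
             "[X] The director is [Z]."]

def toy_classify(template, x):
    # Stand-in for an LM: only templates containing the cue "was" let the
    # "model" pick up the sentiment of x.
    if "was" not in template:
        return "positive"                  # degenerate template: fixed guess
    return "negative" if "No reason" in x else "positive"

def dev_accuracy(template):
    hits = sum(toy_classify(template, x) == y for x, y in DEV_SET)
    return hits / len(DEV_SET)

best_template = max(TEMPLATES, key=dev_accuracy)
print(best_template)  # -> [X] It was [Z].
```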

Experimental results on the LAMA knowledge probe showed that automated approaches (mining, paraphrasing, and their ensembles) can match or exceed manually designed prompts. For instance, on BERT-base, combining mined and manual prompts achieved an optimal ensemble score of 39.6, compared to 31.1 for the manually designed baseline.

2.2 — Continuous (Soft) Prompts

Since the purpose of prompts is to steer a model toward solving a task, there is no strict requirement that prompts be limited to human-readable natural language. Continuous prompts (or soft prompts) operate directly in the model's embedding space, allowing for more expressive optimization.

Prefix Tuning (Li and Liang, 2021): Prepends a sequence of continuous task-specific vectors to the input while keeping all LM parameters frozen. The trainable prefix matrix Mϕ is optimized using the following log-likelihood objective:
maxϕ log P(y|x; θ, ϕ) = maxϕ Σi log P(yi | h<i; θ, ϕ)

where:
• h<i = [h(1)<i ; … ; h(n)<i] is the concatenation of the activations of all n network layers at the time steps before i
• if time step i lies within the prefix, hi is copied directly from Mϕ[i]
• otherwise, hi is computed by the pre-trained LM
• θ = frozen pre-trained LM parameters
• ϕ = trainable prefix parameters
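The copy-or-compute rule for activations can be sketched as follows; `frozen_lm_step` and the dimensions are toy stand-ins, not a real transformer:

```python
import numpy as np

# Toy sketch of the prefix-tuning activation rule: within the prefix, the
# activation h_i is copied from the trainable matrix M_phi; elsewhere it is
# produced by the (frozen) pre-trained LM.

rng = np.random.default_rng(0)
prefix_len, hidden = 3, 4
M_phi = rng.normal(size=(prefix_len, hidden))   # trainable prefix parameters

def frozen_lm_step(prev_h):
    # stand-in for one step of the frozen LM (parameters theta)
    return np.tanh(prev_h)

def activations(seq_len):
    hs = []
    for i in range(seq_len):
        if i < prefix_len:
            hs.append(M_phi[i])                 # copied from M_phi[i]
        else:
            hs.append(frozen_lm_step(hs[-1]))   # computed by the frozen LM
    return np.stack(hs)

H = activations(6)
```

During training, only M_phi receives gradients; the frozen LM parameters never change.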
P-tuning (Liu et al., 2021a): Learns continuous prompts by inserting trainable variables directly into the embedded input sequence.
Hybrid approaches (Zhong et al., 2021): First define a template using a discrete search method like AutoPrompt, then initialize virtual tokens from the discovered prompt and fine-tune the embeddings to boost performance — combining the best of both discrete and continuous worlds.
Prompt Tuning with Rules (PTR) (Han et al., 2021): Uses logic rules to create templates alongside virtual tokens whose initial embeddings come from the pre-trained LM.
Section 03

Zero-Shot and Few-Shot Inference

Prompting methods can often be used without any explicit training of the language model for a downstream task. A pre-trained LM that has been trained to predict text probability P(x) can be applied as-is to fill in cloze or prefix prompts that define a task.

Definition — Zero-Shot

In the zero-shot setting, no training data is available for the task of interest. The model relies entirely on the task description in the prompt and its pre-trained knowledge to produce an answer.

Definition — Few-Shot

In the few-shot setting, a limited number of labeled input–output exemplars are included directly in the prompt as demonstrations of the desired behavior. For example, the standard prompt "France's capital is [X]." can be augmented by prepending: "Great Britain's capital is London. Germany's capital is Berlin. France's capital is [X]." Crucially, few-shot inference does not involve any parameter updates.
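Constructing such a prompt is pure string assembly; the exemplars become part of the input and no parameters change. A minimal sketch using the capitals example:

```python
# Minimal sketch: a few-shot prompt is built by prepending labeled
# demonstrations to the query; the model's weights are never updated.
demos = [("Great Britain's capital is", "London"),
         ("Germany's capital is", "Berlin")]
query = "France's capital is"
prompt = " ".join(f"{q} {a}." for q, a in demos) + f" {query}"
print(prompt)
```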

The introduction of GPT-3 (Brown et al., 2020) popularized this approach: the 175B-parameter model was shown to substantially benefit from including exemplars in the prompt across a wide range of tasks and model sizes. On the TriviaQA benchmark, for instance, accuracy increased monotonically with the number of in-context examples and with model size — zero-shot accuracy on the smallest model (0.1B) was below 10%, while few-shot (K=64) on the 175B model reached above 60%.

Practical Considerations for Few-Shot Prompting

Although the idea is simple, two aspects require careful attention:

Example selection: Different demonstrations can produce vastly different performance (Lu et al., 2022). One effective strategy is to use sentence embeddings to sample examples that are semantically close to the test input in the embedding space (Liu et al., 2022b; Drori et al., 2022).
Example ordering: The order of the labeled examples in the prompt also matters significantly. Methods have been proposed to score different candidate permutations and select the best ordering (Lu et al., 2022).
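The embedding-based selection strategy can be sketched with cosine similarity over toy vectors; the embeddings here are placeholders, not outputs of a real sentence encoder:

```python
import numpy as np

# Sketch of embedding-based demonstration selection: choose the k training
# examples whose (toy) embeddings are most similar to the test input.

def select_demos(test_vec, train_vecs, k=2):
    sims = train_vecs @ test_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(test_vec))
    return np.argsort(-sims)[:k].tolist()   # indices of nearest examples

train = np.array([[1.0, 0.0],    # toy embeddings of training examples
                  [0.0, 1.0],
                  [0.9, 0.1]])
test = np.array([1.0, 0.0])      # toy embedding of the test input
print(select_demos(test, train))  # -> [0, 2]
```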
Section 04

In-Context Learning

The idea of prompting with demonstrations was dubbed In-context learning (ICL) in the original GPT-3 paper (Brown et al., 2020). In-context learning is an emergent behavior displayed by large language models: the ability to perform previously unseen tasks in a few-shot setting, without any parameter optimization.

Key Characteristic — Emergent Behavior

In-context learning occurs suddenly as model size increases. While including a few exemplars in the prompt does not produce a significant improvement in small-scale models, it leads to dramatically improved behavior in large models (e.g., GPT-3, PaLM). This sudden capability gain at scale is what defines it as an emergent behavior — it is not predictable from smaller-scale experiments.

Why Does In-Context Learning Work?

The mechanism behind ICL remains an active area of research. Typically, large LMs are trained with a self-supervised language modeling objective, and the process by which they learn a previously unseen task (e.g., text classification) without any parameter optimization is not fully understood.

A leading hypothesis, proposed by Dai et al. (2022), is that in-context learning works as a process of meta-optimization over the few examples included in the prompt. Their experimental results support the idea that, during in-context learning, the model produces meta-gradients according to the demonstration exemplars through the forward computation pass. These meta-gradients are then applied to the original LM through the attention mechanism. Under this view, in-context learning can be understood as a kind of implicit fine-tuning on the few-shot exemplars — but one that occurs entirely within the forward pass, with no actual parameter updates.

Achievements and Applications

On many NLP benchmarks, in-context learning proved competitive with models trained on much more labeled data, and achieved state-of-the-art (at the time) results on LAMBADA (commonsense sentence completion) and TriviaQA. It enabled a variety of applications including writing code from natural language descriptions, designing app mockups, and generalizing spreadsheet functions.

Section 05

Chain of Thought Prompting

Multiple studies on the reasoning capabilities of large LMs highlighted that prompting the model to produce a step-by-step solution to a problem can lead to a more accurate final answer (Nye et al., 2021). Wei et al. (2022b) combined this insight with the idea of including demonstrations in the prompt, introducing the concept of Chain of Thought (CoT) prompting.

Definition

Given a question, a chain of thought is a coherent sequence of intermediate reasoning steps that leads to a final answer. The key intuition is that by allowing the model to generate a step-by-step solution, we let it apply computation proportional to the problem difficulty — more tokens are generated for problems requiring more steps.

How CoT Works

Consider a math word problem. With standard prompting, the demonstration shows only the final answer (e.g., "Q: Roger has 5 tennis balls… A: The answer is 11."). With CoT prompting, the demonstration includes intermediate reasoning steps (e.g., "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."). The model then mimics this step-by-step reasoning for new questions.

Zero-Shot CoT

Remarkably, CoT-style reasoning can also be elicited in a zero-shot setting without any demonstrations (Kojima et al., 2022). This was achieved simply by adding the phrase "Let's think step-by-step" to the prompt before the model's answer. Experiments showed that this simple addition dramatically improved accuracy on reasoning tasks compared to standard zero-shot prompting.

Emergent Property of Scale

The ability to generate useful chains of reasoning is an emergent property of model size: larger models are significantly better at producing correct step-by-step reasoning chains than smaller ones. This property has made CoT prompting the de facto method for querying modern LLMs on complex reasoning tasks. CoT prompting has achieved state-of-the-art accuracy (at the time) on the GSM8K benchmark of math word problems, using just eight chain-of-thought exemplars with a 540B-parameter model — surpassing even fine-tuned GPT-3 with a verifier.

Section 06

Least to Most Prompting (Problem Decomposition)

While CoT prompting shows impressive results, its effectiveness decreases when a given task is more challenging than the examples provided in the prompts. To address this limitation, problem decomposition was proposed both as a training strategy (Shridhar et al., 2022) and as a prompting strategy (Zhou et al., 2023).

Core Idea

In Least to Most (LtM) prompting, a complex problem is broken down into smaller, more manageable sub-problems that are tackled one at a time. The solution to each sub-problem is used to help solve the next one, enabling a sequential problem-solving process. Each sub-question is solved in a separate prompt call, with the previously computed sub-answers available as context.

For example, given a complex math word problem, the LtM approach first generates a series of sub-questions (e.g., "Q1: How many trees are there in the beginning? Q2: How many trees will there be after the grove workers plant trees?"). Each sub-question is answered in sequence, and the final answer is synthesized from the accumulated sub-answers.
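The sequential control flow can be sketched as a loop of prompt calls, with `ask_lm` a toy stand-in for the model (it just labels answers by how many came before):

```python
# Sketch of the least-to-most loop: each sub-question is answered in its
# own call, with earlier Q/A pairs accumulated as context.

def ask_lm(prompt):
    # toy stand-in for a model call
    return f"answer-{prompt.count('A:')}"

def least_to_most(subquestions):
    context, answers = "", []
    for q in subquestions:
        ans = ask_lm(context + "Q: " + q)
        answers.append(ans)
        context += f"Q: {q} A: {ans}\n"   # sub-answer feeds the next call
    return answers

print(least_to_most(["How many trees are there in the beginning?",
                     "How many trees after planting?"]))
# -> ['answer-0', 'answer-1']
```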

Section 07

Program of Thought Prompting (Structural Thought)

Approaches like Chain of Thought or Least to Most prompting often struggle with tasks that involve heavy numerical computation, because the language model must not only generate mathematical expressions but also execute the computations at each step. LLMs face several limitations here: they frequently make arithmetic errors (especially with large numbers), struggle with complex mathematical expressions (polynomial equations, differential equations), and are inefficient at handling iterative processes.

Core Idea

Leveraging the observation that LLMs are often better at producing code than doing math, Program of Thought (PoT) prompting (Chen et al., 2023) outsources computational steps to an external language interpreter. The LM formulates reasoning steps as Python programs, and a Python interpreter then executes the computation — enabling more accurate and complex mathematical problem-solving.

For example, instead of producing "We start with 15 trees. Later we have 21 trees. So they planted 21 − 15 = 6 trees.", the model produces:

trees_begin = 15
trees_end = 21
trees_today = trees_end - trees_begin
answer = trees_today

The Python interpreter then executes this code and returns: 6
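The outsourcing step itself can be sketched directly: the generated program is handed to the Python interpreter and the final variable is read back. (In practice, executing model-generated code is unsafe and would require sandboxing.)

```python
# Sketch of the PoT execution step: run the model-generated program with
# exec() and read back the bound `answer` variable.
program = (
    "trees_begin = 15\n"
    "trees_end = 21\n"
    "trees_today = trees_end - trees_begin\n"
    "answer = trees_today\n"
)
scope = {}
exec(program, scope)        # unsafe on untrusted code; sandbox in practice
print(scope["answer"])      # -> 6
```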

A variant called Faithful Chain of Thought (FCoT) combines both natural language reasoning and code execution, annotating each step with its dependencies and supporting evidence before generating a Python program for computation.

Section 08

Performance Comparison of Prompting Strategies

It is difficult to declare a single best prompting strategy, as each is designed for specific types of tasks. Lyu et al. (2024) conducted a comprehensive study comparing all of the above strategies across nine reasoning datasets using various open-source and closed models. The key finding is that prompting strategies involving intermediate reasoning steps consistently outperform direct-answer (standard) prompting.

Model                Standard   CoT     LtM     PoT     FCoT
Codex                57.1       81.3    74.3    80.0    83.4*
GPT-3.5-turbo        64.9       77.6*   77.6*   72.5    76.8
GPT-4                79.3       88.3    87.3    84.4    90.9*
LLaMA-7B             40.1       56.4*   46.0    47.4    50.2
LLaMA-13B            43.2       66.8*   58.2    56.9    62.2
LLaMA-70B            58.0       82.0*   73.3    73.6    77.0
Mistral-7B           49.9       73.5*   61.9    66.5    71.2
Mistral-7B-instruct  43.7       63.6    56.0    60.0    67.1*

Accuracy (%) averaged across all reasoning datasets; an asterisk (*) marks the best strategy for each model. Data from Lyu et al. (2024).

Takeaway

Chain of Thought and Faithful Chain of Thought tend to perform best overall, closely followed by Least to Most and Program of Thought. All four of these strategies involve intermediate step explanations. Standard prompting — which directly predicts the answer — consistently underperforms. The conclusion is clear: prompting strategies that involve intermediate reasoning steps are superior to direct prediction.

Section 09

Self-Consistency

Self-consistency (Wang et al., 2023) is a decoding strategy that can be layered on top of Chain of Thought prompting to further improve accuracy. The key argument is that for many tasks — especially reasoning tasks — there are usually several different lines of thought that converge on the same correct solution. In this case, the most frequently generated answer is more likely to be correct.

How It Works

Instead of using a single greedy decode, self-consistency samples n candidate reasoning paths from the model. Each path produces a final numerical answer. The final prediction is obtained by marginalizing over the reasoning paths and taking a majority vote over the answers:

â = arg maxa Σi=1..n 𝟙(ai = a)

where:
• n = number of sampled reasoning paths
• (ri, ai) = the i-th reasoning path ri and its corresponding final answer ai
• 𝟙(ai = a) = indicator function, equals 1 if ai matches candidate answer a
• The formula selects the answer that appears most frequently across all n samples

For example, if 3 reasoning paths are sampled and two produce the answer $18 while one produces $26, the self-consistency method selects $18 as the final answer. This is a significant improvement over standard greedy decoding, which might commit to a single incorrect reasoning chain.
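The aggregation step reduces to a majority vote over the sampled answers; a minimal sketch:

```python
from collections import Counter

# Sketch of self-consistency aggregation: a majority vote over the final
# answers extracted from n independently sampled reasoning paths.

def self_consistency(answers):
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(["$18", "$26", "$18"]))  # -> $18
```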

Section 10

Instruction Tuning

The pretraining–finetuning paradigm led to the hypothesis that language models are unsupervised multitask learners (Radford et al., 2019) — during pretraining, models implicitly learn a latent structure of language useful for many downstream tasks. Building on this, recent work proposed an explicit multi-task training paradigm that leverages natural language instructions.

Definition

Instruction tuning consists of finetuning language models on a collection of NLP datasets where each task is described via natural language instructions. The instructions are incorporated by simply prepending to the model's input a short description of the task (e.g., "Is the sentiment of this movie review positive or negative?" or "Translate the following sentence into Chinese"). This approach combines aspects of both the pretrain–finetune and prompting paradigms.

FLAN: Finetuned Language Net

A prominent example of instruction-tuned models is the FLAN family (Wei et al., 2022a). To evaluate FLAN's ability to perform a specific task T (e.g., natural language inference), the model is instruction-tuned on a range of other NLP tasks — such as commonsense reasoning, translation, and sentiment analysis — while ensuring that task T is explicitly excluded from instruction tuning. The model is then evaluated on task T in a zero-shot setting, testing its ability to generalize to unseen tasks.

Key Results

FLAN (137B) zero-shot surpasses GPT-3 (175B) zero-shot on 20 out of 25 evaluated datasets, despite being a smaller model.
FLAN zero-shot even outperforms GPT-3 few-shot by large margins on several benchmarks: ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
On Natural Language Inference tasks: GPT-3 zero-shot scored 42.9%, GPT-3 few-shot scored 53.2%, and FLAN zero-shot scored 56.2%.
On Reading Comprehension: GPT-3 zero-shot scored 63.7%, GPT-3 few-shot scored 72.6%, and FLAN zero-shot scored 77.4%.
On Closed-Book QA: GPT-3 zero-shot scored 49.8%, GPT-3 few-shot scored 55.7%, and FLAN zero-shot scored 56.6%.

Scaling Instruction Tuning

The approach was subsequently scaled to the 540B-parameter PaLM model (Chowdhery et al., 2022) and a much larger number of tasks (1,836) by Chung et al. (2022). This scaling resulted in models with better generalization capabilities, improved reasoning abilities, and better behavior in open-ended zero-shot generation compared to non-instruction-tuned models. Ablation studies revealed that three factors are key to the success of instruction tuning: the number of finetuning datasets, model scale, and the use of natural language instructions.

Section 11

Reinforcement Learning from Human Feedback (RLHF)

A different approach to incorporating human feedback is Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017). The central intuition is to fit a reward function to human preferences while simultaneously training a policy to optimize the predicted reward. In this context, the "agent" is a pre-trained language model that needs to be aligned with the user's intentions — incorporating both explicit intentions (following instructions) and implicit intentions (staying truthful, unbiased, non-toxic).

The RLHF Pipeline

1. Define a prompt distribution on which we want the model to produce aligned outputs.

2. Collect human demonstrations of desired behavior and use them to train a supervised fine-tuning (SFT) baseline model.

3. Train a reward model on a dataset D of human comparisons between model outputs. Labelers indicate which output they prefer for a given input. The reward model loss is:

ℒ(θRM) = − (1 / (K choose 2)) · E(x, yw, yl)~D [ log σ(rθRM(x, yw) − rθRM(x, yl)) ]

where:
• rθRM(x, y) = scalar output of the reward model for prompt x and completion y
• yw = the preferred (winner) completion
• yl = the dispreferred (loser) completion
• σ = sigmoid function
• K = number of ranked responses per prompt in D, yielding (K choose 2) comparisons

4. Fine-tune the SFT model using PPO (Proximal Policy Optimization) to maximize the reward model's output, with the following objective:

obj(ϕ) = E(x,y)~DπRLϕ [ rθRM(x, y) − β · log(πRLϕ(y|x) / πSFT(y|x)) ] + γ · Ex~Dpretrain [ log(πRLϕ(x)) ]

where:
• πRLϕ = the learned RL policy (the model being trained)
• πSFT = the supervised fine-tuned baseline model
• Dpretrain = the pretraining data distribution
• β = weight for the KL divergence penalty (prevents over-optimization of the reward)
• γ = weight for the pretraining loss term (preserves general capabilities)

The KL penalty from the SFT model is critical: it prevents the RL policy from drifting too far from the supervised baseline, which would risk "reward hacking" — exploiting the reward model to get high scores without actually producing better outputs. The pretraining loss term helps preserve the model's general language capabilities.
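The two training signals of steps 3 and 4 can be sketched with scalar stand-ins; real implementations operate on model log-probabilities and a learned reward network, not these floats:

```python
import numpy as np

# Scalar toy sketch of the two RLHF training signals: the pairwise
# reward-model loss and the KL-penalized reward optimized by PPO.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_model_loss(r_w, r_l):
    # pairwise comparison loss: -log sigma(r(x, y_w) - r(x, y_l))
    return -np.log(sigmoid(r_w - r_l))

def ppo_reward(r, logp_rl, logp_sft, beta=0.02):
    # per-sample reward with the KL penalty that keeps the policy
    # close to the SFT baseline (the reward-hacking guard)
    return r - beta * (logp_rl - logp_sft)

# Loss is small when the winner is already scored higher, large otherwise.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
```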

InstructGPT: RLHF in Practice

An important example of RLHF-trained models is the InstructGPT series (Ouyang et al., 2022). These models used GPT-3 (175B parameters) as the SFT baseline and a smaller 6B model as the reward model. Human evaluators rated InstructGPT outputs favorably compared to the non-instruction-trained baseline across multiple axes: compliance with prompt constraints, reduced hallucinations, and appropriate language use. Strikingly, even the 1.3B PPO-ptx model was preferred over the 175B GPT-3, demonstrating that alignment training can make a much smaller model more useful than a much larger unaligned one. The same procedure was used to train ChatGPT, with slight differences in data collection.

Section 12

Direct Preference Optimization (DPO)

While RLHF is effective, it has significant drawbacks: RL training is tricky, unstable, and highly sensitive to hyperparameters, and the reward model is typically as large as the LM itself, making it computationally expensive. Direct Preference Optimization (DPO) (Rafailov et al., 2024) offers a simpler alternative that does not require a reward model or any RL training loop.

How DPO Works

DPO models the preference data using the Bradley–Terry model:

p(yw ≻ yl) = σ(r(x, yw) − r(x, yl))

where:
• yw = the preferred (winning) response
• yl = the dispreferred (losing) response
• r(x, y) = the implicit reward for prompt x and completion y
• σ = sigmoid function

DPO exploits a key mathematical observation: it is possible to extract the optimal policy in closed form. The optimal policy π* satisfying the RLHF objective can be written as:

π*(y|x) = (1 / Z(x)) · πSFT(y|x) · exp((1/β) · r(x, y))

where:
• Z(x) = Σy πSFT(y|x) · exp((1/β) · r(x, y)) is an intractable normalization constant

By rearranging to express the reward as a function of the policy, and noting that Z(x) cancels out in the Bradley–Terry model (which depends on differences in rewards), we obtain the final DPO loss:

ℒDPO(θ) = −E(x, yw, yl)~D [ log σ( β · log(πθ(yw|x) / πSFT(yw|x)) − β · log(πθ(yl|x) / πSFT(yl|x)) ) ]

Intuition:
• The loss increases the likelihood of the preferred response yw
• The loss simultaneously decreases the likelihood of the dispreferred response yl
• β controls the strength of deviation from the SFT baseline
• The log-ratios πθ/πSFT implicitly regularize toward the base model
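Under this notation, the per-pair loss can be sketched with scalar log-probabilities standing in for full-sequence log-probs under the current policy (`lp_*`) and the frozen SFT reference (`ref_*`):

```python
import numpy as np

# Toy sketch of the DPO loss for a single preference pair. Scalars stand in
# for log pi_theta(y|x) (lp_*) and log pi_SFT(y|x) (ref_*).

def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    # margin = beta * (winner log-ratio minus loser log-ratio)
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# When the policy already prefers the winner more than the reference does,
# the margin is positive and the loss drops below log 2.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0), dpo_loss(-2.0, -2.0, -2.0, -2.0))
```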

Advantages of DPO over RLHF

No explicit reward model: DPO does not require training or storing a separate reward model, saving computation and memory.
No RL sampling loop: DPO does not need to sample from the model during training to compute rewards — it directly optimizes the LM weights using the preference data.
Better stability: in the reported experiments, DPO attains the highest reward at every level of KL divergence from the reference model, and it outperforms other approaches (PPO, Preferred-FT, SFT) on summarization tasks as evaluated by GPT-4 win rate against a baseline.