In a traditional supervised learning setting, a model predicts an output y for an input x by modeling the conditional probability P(y|x; θ), where θ are the model parameters. However, for many tasks, supervised data is simply unavailable, making conventional training difficult or impossible. Prompting offers an elegant alternative: instead of training a model on labeled data, we take a language model (LM) that has already been trained to model the probability of text P(x; θ) and use that probability to predict y, reducing or eliminating the need for large supervised datasets.
Prompting is non-invasive: it does not introduce additional parameters or require direct inspection of a model's internal representations. It can be thought of as a lower bound on what the model "knows" about a new task (x → y), and this information is simply extracted from the LM via prompting.
To prompt an LM, we map the input x to a prompt x′ = fprompt(x) using a prompting function fprompt(·). The prompt is defined by a template containing two slots: an input slot [X] for the input x, and an answer slot [Z] for a generated answer z that may or may not be mapped to the final output y.
The model then searches over a set of candidate answers Z for the highest-scoring answer ẑ that maximizes the LM's score. Formally:

ẑ = argmax_{z ∈ Z} P(f_fill(x′, z); θ)

where f_fill(x′, z) denotes the prompt x′ with its answer slot [Z] filled by the candidate z.
Consider a sentiment classification task. Given an input such as "No reason to watch," the procedure is:

1. Formulate the task as a (masked) LM problem using a template: [CLS] No reason to watch. It was [MASK]. [SEP]
2. Choose a label word mapping M(y) that maps task labels to individual words. For example, M(positive) = "great" and M(negative) = "terrible".
3. Use the LM to score each candidate label word at the [MASK] position and select the highest-probability one as the prediction.
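The steps above can be sketched in a few lines of Python. The masked-LM scorer is stubbed here so the example is self-contained; in practice it would be a real model such as BERT, and the stub's behavior is purely illustrative.

```python
def f_prompt(x: str) -> str:
    # Template with the input slot [X] filled and the answer slot [Z] open
    return f"{x} It was [Z]."

# Label word mapping M(y)
label_words = {"positive": "great", "negative": "terrible"}

def mlm_score(text: str) -> float:
    # Stand-in for a real masked-LM scorer (toy behavior, for illustration only)
    return 1.0 if "terrible" in text else 0.5

def predict(x: str) -> str:
    prompt = f_prompt(x)
    # Fill the [Z] slot with each candidate label word and score the result
    scores = {y: mlm_score(prompt.replace("[Z]", w))
              for y, w in label_words.items()}
    return max(scores, key=scores.get)

print(predict("No reason to watch."))  # → negative (with this toy scorer)
```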
Given a task, multiple prompts can work. However, finding the most effective prompt is crucial to unlocking the full potential of the LM. Prompt Engineering is the process of designing a prompting function fprompt(x) that results in the most effective performance for the given task.
Prompting is often called a "dark art" because it requires domain expertise and extensive trial and error. The core challenge is to find a template T and label words M(y) that work well in conjunction — even slight variations in prompts can lead to significant performance differences.
The most straightforward approach is to design prompts for a given task by hand. Since the number of required prompts is typically small (1 to 8 depending on the task and available context window), manual design gives the user maximum control and flexibility. For instance, on the SST-2 sentiment analysis benchmark using the template <Input> It was [MASK]., choosing the label word pair "great/terrible" yields 92.7% accuracy, while "good/bad" achieves 92.5%. The sensitivity to such choices illustrates the fragility of manual prompt design.
While manual prompts are intuitive, they are error-prone and labor-intensive — even experienced prompt designers may fail to find optimal prompts. Automated approaches search for better prompts either in a discrete space (natural language tokens) or a continuous space (embedding vectors).
2.1 — Discrete (Hard) Prompts

Discrete prompt methods automatically search for prompts in the space of natural language strings. Several families of methods have been proposed:
Mining-based approaches (Jiang et al., 2020): Search a large text corpus (e.g., Wikipedia) for strings containing both the training input x and output y, then extract the middle words or dependency paths between them as templates of the form "[X] middle words [Z]".
Paraphrasing methods: Take an original prompt and generate variations — by translating the prompt into another language and back (Jiang et al., 2020), replacing words with a thesaurus (Yuan et al., 2021), or training a neural prompt rewriter whose objective is to improve the accuracy of systems using the prompt (Haviv et al., 2021).
Generation-based approaches: Treat prompt creation as a text generation task. Gao et al. (2021) used a pre-trained T5 model to search for effective templates.
Gradient-based search: Find token sequences that trigger the LM to produce desired outputs. AutoPrompt (Shin et al., 2020) combines the original input x with gradient-selected "trigger tokens" and marginalizes the model's predictions over a set of label tokens to produce class probabilities.
Experimental results on the LAMA knowledge probe showed that automated approaches (mining, paraphrasing, and their ensembles) can match or exceed manually designed prompts. For instance, on BERT-base, combining mined and manual prompts achieved an optimal ensemble score of 39.6, compared to 31.1 for the manually designed baseline.
2.2 — Continuous (Soft) Prompts

Since the purpose of prompts is to steer a model toward solving a task, there is no strict requirement that prompts be limited to human-readable natural language. Continuous prompts (or soft prompts) operate directly in the model's embedding space, allowing for more expressive optimization.
Prompting methods can often be used without any explicit training of the language model for a downstream task. A pre-trained LM that has been trained to predict text probability P(x) can be applied as-is to fill in cloze or prefix prompts that define a task.
In the zero-shot setting, no training data is available for the task of interest. The model relies entirely on the task description in the prompt and its pre-trained knowledge to produce an answer.
In the few-shot setting, a limited number of labeled input–output exemplars are included directly in the prompt as demonstrations of the desired behavior. For example, the standard prompt "France's capital is [X]." can be augmented by prepending: "Great Britain's capital is London. Germany's capital is Berlin. France's capital is [X]." Crucially, few-shot inference does not involve any parameter updates.
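Assembling such a few-shot prompt is pure string construction; no training is involved. A minimal sketch (the helper name and formatting convention are illustrative):

```python
def few_shot_prompt(exemplars, query):
    """Prepend labeled input-output exemplars to a query prompt.
    No parameters are updated; the task is conveyed purely in-context."""
    demos = "\n".join(f"{x} {y}" for x, y in exemplars)
    return f"{demos}\n{query}"

exemplars = [("Great Britain's capital is", "London."),
             ("Germany's capital is", "Berlin.")]
prompt = few_shot_prompt(exemplars, "France's capital is")
print(prompt)
```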
The introduction of GPT-3 (Brown et al., 2020) popularized this approach: the 175B-parameter model was shown to substantially benefit from including exemplars in the prompt across a wide range of tasks and model sizes. On the TriviaQA benchmark, for instance, accuracy increased monotonically with the number of in-context examples and with model size — zero-shot accuracy on the smallest model (0.1B) was below 10%, while few-shot (K=64) on the 175B model reached above 60%.
Although the idea is simple, two aspects require careful attention: the choice of which exemplars to include and the order in which they appear in the prompt, both of which can significantly affect performance.
The idea of prompting with demonstrations was dubbed In-context learning (ICL) in the original GPT-3 paper (Brown et al., 2020). In-context learning is an emergent behavior displayed by large language models: the ability to perform previously unseen tasks in a few-shot setting, without any parameter optimization.
In-context learning occurs suddenly as model size increases. While including a few exemplars in the prompt does not produce a significant improvement in small-scale models, it leads to dramatically improved behavior in large models (e.g., GPT-3, PaLM). This sudden capability gain at scale is what defines it as an emergent behavior — it is not predictable from smaller-scale experiments.
The mechanism behind ICL remains an active area of research. Typically, large LMs are trained with a self-supervised language modeling objective, and the process by which they learn a previously unseen task (e.g., text classification) without any parameter optimization is not fully understood.
A leading hypothesis, proposed by Dai et al. (2022), is that in-context learning works as a process of meta-optimization over the few examples included in the prompt. Their experimental results support the idea that, during in-context learning, the model produces meta-gradients according to the demonstration exemplars through the forward computation pass. These meta-gradients are then applied to the original LM through the attention mechanism. Under this view, in-context learning can be understood as a kind of implicit fine-tuning on the few-shot exemplars — but one that occurs entirely within the forward pass, with no actual parameter updates.
On many NLP benchmarks, in-context learning proved competitive with models trained on much more labeled data, and achieved state-of-the-art (at the time) results on LAMBADA (commonsense sentence completion) and TriviaQA. It enabled a variety of applications including writing code from natural language descriptions, designing app mockups, and generalizing spreadsheet functions.
Multiple studies on the reasoning capabilities of large LMs highlighted that eliciting the model to produce a step-by-step solution of a problem can lead to a more accurate final answer (Nye et al., 2021). Wei et al. (2022b) combined this insight with the idea of including demonstrations in the prompt, introducing the concept of Chain of Thought (CoT) prompting.
Given a question, a chain of thought is a coherent sequence of intermediate reasoning steps that leads to a final answer. The key intuition is that by allowing the model to generate a step-by-step solution, we let it apply computation proportional to the problem difficulty — more tokens are generated for problems requiring more steps.
Consider a math word problem. With standard prompting, the demonstration shows only the final answer (e.g., "Q: Roger has 5 tennis balls… A: The answer is 11."). With CoT prompting, the demonstration includes intermediate reasoning steps (e.g., "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."). The model then mimics this step-by-step reasoning for new questions.
Remarkably, CoT-style reasoning can also be elicited in a zero-shot setting without any demonstrations (Kojima et al., 2022). This was achieved simply by adding the phrase "Let's think step-by-step" to the prompt before the model's answer. Experiments showed that this simple addition dramatically improved accuracy on reasoning tasks compared to standard zero-shot prompting.
The ability to generate useful chains of reasoning is an emergent property of model size: larger models are significantly better at producing correct step-by-step reasoning chains than smaller ones. This property has made CoT prompting the de facto method for querying modern LLMs on complex reasoning tasks. CoT prompting has achieved state-of-the-art accuracy (at the time) on the GSM8K benchmark of math word problems, using just eight chain-of-thought exemplars with a 540B-parameter model — surpassing even fine-tuned GPT-3 with a verifier.
While CoT prompting shows impressive results, its effectiveness decreases when a given task is more challenging than the examples provided in the prompts. To address this limitation, problem decomposition was proposed both as a training strategy (Shridhar et al., 2022) and as a prompting strategy (Zhou et al., 2023).
In Least to Most (LtM) prompting, a complex problem is broken down into smaller, more manageable sub-problems that are tackled one at a time. The solution to each sub-problem is used to help solve the next one, enabling a sequential problem-solving process. Each sub-question is solved in a separate prompt call, with the previously computed sub-answers available as context.
For example, given a complex math word problem, the LtM approach first generates a series of sub-questions (e.g., "Q1: How many trees are there in the beginning? Q2: How many trees will there be after the grove workers plant trees?"). Each sub-question is answered in sequence, and the final answer is synthesized from the accumulated sub-answers.
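The sequential loop described above can be sketched as follows. The `llm` callable stands in for a real model call and is stubbed with toy keyword-matching behavior, purely for illustration:

```python
def least_to_most(question, subquestions, llm):
    """Solve sub-questions one at a time, feeding earlier sub-answers
    back into the context of each subsequent prompt call."""
    context = question
    answers = []
    for sq in subquestions:
        ans = llm(context + "\n" + sq)
        answers.append(ans)
        context += f"\n{sq} {ans}"  # accumulated sub-answers become context
    return answers[-1]              # final answer comes from the last call

def toy_llm(prompt):
    # Toy stub: answers by keyword lookup on the latest sub-question
    if "beginning" in prompt.splitlines()[-1]:
        return "15 trees."
    return "21 trees, so 6 were planted."

final = least_to_most(
    "There are 15 trees. After planting there are 21. How many were planted?",
    ["Q1: How many trees are there in the beginning?",
     "Q2: How many trees will there be after the grove workers plant trees?"],
    toy_llm,
)
print(final)
```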
Approaches like Chain of Thought or Least to Most prompting often struggle with tasks that involve heavy numerical computation, because the language model must not only generate mathematical expressions but also execute the computations at each step. LLMs face several limitations here: they frequently make arithmetic errors (especially with large numbers), struggle with complex mathematical expressions (polynomial equations, differential equations), and are inefficient at handling iterative processes.
Leveraging the observation that LLMs are often better at producing code than doing math, Program of Thought (PoT) prompting (Chen et al., 2023) outsources computational steps to an external language interpreter. The LM formulates reasoning steps as Python programs, and a Python interpreter then executes the computation — enabling more accurate and complex mathematical problem-solving.
For example, instead of producing "We start with 15 trees. Later we have 21 trees. So they planted 21 − 15 = 6 trees.", the model produces:
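A PoT-style program for this example might look like the following sketch (variable names are illustrative; the exact model output may differ):

```python
# The LM emits code; the Python interpreter executes the arithmetic.
trees_begin = 15
trees_after = 21
trees_planted = trees_after - trees_begin
print(trees_planted)  # → 6
```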
A variant called Faithful Chain of Thought (FCoT) combines both natural language reasoning and code execution, annotating each step with its dependencies and supporting evidence before generating a Python program for computation.
It is difficult to declare a single best prompting strategy, as each is designed for specific types of tasks. Lyu et al. (2024) conducted a comprehensive study comparing all of the above strategies across nine reasoning datasets using various open-source and closed models. The key finding is that prompting strategies involving intermediate reasoning steps consistently outperform direct-answer (standard) prompting.
| Model | Standard | CoT | LtM | PoT | FCoT |
|---|---|---|---|---|---|
| Codex | 57.1 | 81.3 | 74.3 | 80.0 | **83.4** |
| GPT-3.5-turbo | 64.9 | **77.6** | **77.6** | 72.5 | 76.8 |
| GPT-4 | 79.3 | 88.3 | 87.3 | 84.4 | **90.9** |
| LLaMA-7B | 40.1 | **56.4** | 46.0 | 47.4 | 50.2 |
| LLaMA-13B | 43.2 | **66.8** | 58.2 | 56.9 | 62.2 |
| LLaMA-70B | 58.0 | **82.0** | 73.3 | 73.6 | 77.0 |
| Mistral-7B | 49.9 | **73.5** | 61.9 | 66.5 | 71.2 |
| Mistral-7B-instruct | 43.7 | 63.6 | 56.0 | 60.0 | **67.1** |

Accuracy (%) averaged across all reasoning datasets. **Bold** = best for each model. Data from Lyu et al. (2024).
Chain of Thought and Faithful Chain of Thought tend to perform best overall, closely followed by Least to Most and Program of Thought. All four of these strategies involve intermediate step explanations. Standard prompting — which directly predicts the answer — consistently underperforms. The conclusion is clear: prompting strategies that involve intermediate reasoning steps are superior to direct prediction.
Self-consistency (Wang et al., 2023) is a decoding strategy that can be layered on top of Chain of Thought prompting to further improve accuracy. The key argument is that for many tasks — especially reasoning tasks — there are usually several different lines of thought that converge on the same correct solution. In this case, the most frequently generated answer is more likely to be correct.
Instead of using a single greedy decode, self-consistency samples n candidate reasoning paths from the model. Each path i produces a final answer a_i. The final prediction is obtained by marginalizing over the reasoning paths and taking a majority vote over the answers:

ẑ = argmax_a Σ_{i=1}^{n} 1(a_i = a)
For example, if 3 reasoning paths are sampled and two produce the answer $18 while one produces $26, the self-consistency method selects $18 as the final answer. This is a significant improvement over standard greedy decoding, which might commit to a single incorrect reasoning chain.
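The majority vote itself is a one-liner over the sampled answers:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over the final answers of n sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]

# Three sampled paths: two converge on $18, one on $26
print(self_consistency(["$18", "$26", "$18"]))  # → $18
```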
The pretraining–finetuning paradigm led to the hypothesis that language models are unsupervised multitask learners (Radford et al., 2019) — during pretraining, models implicitly learn a latent structure of language useful for many downstream tasks. Building on this, recent work proposed an explicit multi-task training paradigm that leverages natural language instructions.
Instruction tuning consists of finetuning language models on a collection of NLP datasets where each task is described via natural language instructions. The instructions are incorporated by simply prepending to the model's input a short description of the task (e.g., "Is the sentiment of this movie review positive or negative?" or "Translate the following sentence into Chinese"). This approach combines aspects of both the pretrain–finetune and prompting paradigms.
A prominent example of instruction-tuned models is the FLAN family (Wei et al., 2022a). To evaluate FLAN's ability to perform a specific task T (e.g., natural language inference), the model is instruction-tuned on a range of other NLP tasks — such as commonsense reasoning, translation, and sentiment analysis — while ensuring that task T is explicitly excluded from instruction tuning. The model is then evaluated on task T in a zero-shot setting, testing its ability to generalize to unseen tasks.
The approach was subsequently scaled to the 540B-parameter PaLM model (Chowdhery et al., 2022) and a much larger number of tasks (1,836) by Chung et al. (2022). This scaling resulted in models with better generalization capabilities, improved reasoning abilities, and better behavior in open-ended zero-shot generation compared to non-instruction-tuned models. Ablation studies revealed that three factors are key to the success of instruction tuning: the number of finetuning datasets, model scale, and the use of natural language instructions.
A different approach to incorporating human feedback is Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017). The central intuition is to fit a reward function to human preferences while simultaneously training a policy to optimize the predicted reward. In this context, the "agent" is a pre-trained language model that needs to be aligned with the user's intentions — incorporating both explicit intentions (following instructions) and implicit intentions (staying truthful, unbiased, non-toxic).
1. Define a prompt distribution on which we want the model to produce aligned outputs.
2. Collect human demonstrations of desired behavior and use them to train a supervised fine-tuning (SFT) baseline model.
3. Train a reward model r_θ on a dataset D of human comparisons between model outputs. Labelers indicate which of two outputs they prefer for a given input. The reward model loss is:

   loss(θ) = −E_{(x, y_w, y_l) ∼ D} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]

   where y_w is the preferred output, y_l the rejected one, and σ the logistic sigmoid.
4. Fine-tune the SFT model using PPO (Proximal Policy Optimization) to maximize the reward model's output, with the following objective:

   objective(φ) = E_{(x, y) ∼ π_φ^RL} [ r_θ(x, y) − β log( π_φ^RL(y|x) / π^SFT(y|x) ) ] + γ E_{x ∼ D_pretrain} [ log π_φ^RL(x) ]
The KL penalty from the SFT model is critical: it prevents the RL policy from drifting too far from the supervised baseline, which would risk "reward hacking" — exploiting the reward model to get high scores without actually producing better outputs. The pretraining loss term helps preserve the model's general language capabilities.
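A toy numeric sketch of this KL-penalized reward makes the drift penalty concrete (all names and numbers here are illustrative, not from the source):

```python
# r_rm: reward-model score; logp_rl / logp_sft: sequence log-probs under
# the RL policy and the frozen SFT model; beta: KL coefficient.
def penalized_reward(r_rm, logp_rl, logp_sft, beta=0.1):
    return r_rm - beta * (logp_rl - logp_sft)

# If the policy drifts far from the SFT model, the growing log-ratio
# penalty eats into the reward-model score:
close = penalized_reward(1.0, logp_rl=-5.0, logp_sft=-5.2)
drifted = penalized_reward(1.0, logp_rl=-2.0, logp_sft=-5.2)
print(close > drifted)  # → True
```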
An important example of RLHF-trained models is the InstructGPT series (Ouyang et al., 2022). These models used GPT-3 (175B parameters) as the SFT baseline and a smaller 6B model as the reward model. Human evaluators rated InstructGPT outputs favorably compared to the non-instruction-trained baseline across multiple axes: compliance with prompt constraints, reduced hallucinations, and appropriate language use. Strikingly, the 1.3B PPO-ptx model was preferred over the 175B GPT-3, demonstrating that alignment training can make a much smaller model more useful than a much larger unaligned one. The same procedure was used to train ChatGPT, with slight differences in data collection.
While RLHF is effective, it has significant drawbacks: RL training is tricky, unstable, and highly sensitive to hyperparameters, and the reward model is typically as large as the LM itself, making it computationally expensive. Direct Preference Optimization (DPO) (Rafailov et al., 2024) offers a simpler alternative that does not require a reward model or any RL training loop.
DPO models the preference data using the Bradley–Terry model:

p(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) )

where y_w is the preferred output, y_l the rejected one, and σ the logistic sigmoid.
DPO exploits a key mathematical observation: it is possible to extract the optimal policy in closed form. The optimal policy π* satisfying the RLHF objective can be written as:

π*(y|x) = (1 / Z(x)) · π_ref(y|x) · exp( (1/β) · r(x, y) )

where π_ref is the reference (SFT) policy, β the KL coefficient, and Z(x) a partition function normalizing over y.
By rearranging to express the reward as a function of the policy, and noting that Z(x) cancels out in the Bradley–Terry model (which depends only on differences in rewards), we obtain the final DPO loss:

L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l) ∼ D} [ log σ( β log( π_θ(y_w|x) / π_ref(y_w|x) ) − β log( π_θ(y_l|x) / π_ref(y_l|x) ) ) ]
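The per-pair DPO loss is simple enough to compute directly. A minimal sketch with illustrative log-probabilities (names and values are assumptions, not from the source):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (log-ratio_w - log-ratio_l)).
    logp_* are sequence log-probs under the policy; ref_logp_* under the
    frozen reference (SFT) model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Preferred answer relatively more likely under the policy -> lower loss
low = dpo_loss(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
high = dpo_loss(logp_w=-6.0, logp_l=-4.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
print(low < high)  # → True
```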