Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
Authors: Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig
https://dl.acm.org/doi/10.1145/3560815
- Abstract
- 1 TWO SEA CHANGES IN NATURAL LANGUAGE PROCESSING
- 2 A FORMAL DESCRIPTION OF PROMPTING
- 3 PROMPT TEMPLATE ENGINEERING
- 4 PROMPT ANSWER ENGINEERING
- 5 MULTI-PROMPT LEARNING
- 6 TRAINING STRATEGIES FOR PROMPTING METHODS
- 7 APPLICATIONS
- 8 PROMPT-RELEVANT TOPICS
- 9 CHALLENGES
- 10 CONCLUSION
Prompt-Based Learning in Natural Language Processing: A Systematic Survey
Overview:
- New paradigm in NLP: prompt-based learning
- Instead of traditional supervised learning, which models P(y|x), prompt-based methods model the probability of text P(x) directly
- Use language models for prediction tasks with pre-trained models and prompts
Framework:
- Pre-train language model on massive raw text data
- Define a new prompting function to perform few-shot or zero-shot learning
- Modify input x into a textual string prompt x′, fill unfilled slots to obtain a final string x̂
- Derive the output y from x̂
Advantages:
- Allows pre-training on vast amounts of raw text
- Adapts to new scenarios with few or no labeled data
Article Contents:
- Introduction and background
- Mathematical notations for prompt-based learning
- Existing work organized by dimensions: pre-trained language models, prompts, tuning strategies
- Making the field more accessible through resources like NLPedia–Pretrain
- Conclusion
CCS Concepts: Computing methodologies → Natural language processing; Additional Key Words and Phrases: Pre-trained language models, prompting
Two Sea Changes in Natural Language Processing
Fully Supervised Learning:
- Has long played a central role in machine learning and natural language processing (NLP)
- Early NLP models relied heavily on feature engineering
Shift Toward Neural Network Models:
- Salient features learned jointly with training the model itself
- Focus shifted to architecture engineering, where inductive bias provided through network architecture design
Pre-Training Paradigm:
- Shifted from fully supervised learning
- Pre-train a language model (LM) on large datasets of text data
- Fine-tune the LM's parameters for specific downstream tasks using task-specific objective functions
- Focus turned to objective engineering
Pre-Training, Prompting, and Predicting:
- Current approach as of 2021: "pre-train, prompt, and predict"
- Reformulate downstream tasks to look like those solved during LM training with a textual prompt
- Pre-trained LM can be used to predict the desired output without additional task-specific training
- Advantage: A single unsupervised-trained LM can be used to solve many tasks
- Disadvantage: Requires prompt engineering to find appropriate prompts for the model
Organization of this Survey:
- Overview and formal definition of prompting methods (Section 2)
- Prompt template engineering (Section 3)
- Prompt answer engineering (Section 4)
- Multi-prompt learning methods (Section 5)
- Prompt-aware training methods (Section 6)
- Applications of prompt-based learning and their interaction with choice of prompting method (Section 7)
- Locating the current state of research within the research ecosystem (Section 8)
- Challenging problems for further research (Section 9)
- Systematic resources on prompt learning and pre-training, including:
- Frequent updates to this survey, related slides, and so on
- Figure 1: A typology of important concepts for prompt-based learning
- Tables 7 and 8: Comprehensive comparison among different prompting methods
- Table 5: Systematic organization of commonly used prompts
- Table 4: Timeline of prompt-based research works
- Table 1: Comprehensive comparison among different pre-trained LMs
Overview:
- In traditional supervised learning systems for Natural Language Processing (NLP), an input x (usually text) is used to predict an output y based on a model P(y|x; θ)
- y can be a label, text, or another variety of output
Examples:
- Text Classification: takes an input text x and predicts a label y from a fixed label set Y
- Example (Sentiment Analysis): input x = "I love this movie"; output y is a label from a sentiment scale such as {++, +, ~, -, --}
- Conditional Text Generation: takes an input text x and generates another text y as output
- Example (Machine Translation): input x = "Hyvää huomenta" (Finnish for "Good morning"); output y = "Good morning" (English)
Prompting Basics
- Supervised learning requires annotated data for training a model P(y|x;θ)
- Prompt-based learning methods attempt to circumvent this issue by learning an LM that models the probability P(x;θ) of text itself
- This allows predicting y using probabilities instead of requiring large labeled datasets
- Prompting refers to using these text probabilities, together with a textual template, to predict the highest-scoring answer z for an input x, which is then mapped to the output y
Basic Prompting Steps:
- Prompt Addition: Apply a prompting function f_prompt(·) that modifies the input text x into a prompt x' by inserting a slot [Z] for an intermediate generated answer text z, which will later be mapped to y.
- Answer Search: Search for the highest-scoring text z within permissible values Z using the language model (LM) P(·;θ).
- Answer Mapping: If necessary, map the highest-scoring answer z to the corresponding output value y.
Template Structure:
- Consists of two steps: applying a template and filling slot [X] with input text x
- Templates may include virtual words or even continuous vectors
- Number of [X] and [Z] slots can be flexible based on task needs
Examples:
- Sentiment Analysis: "[X] Overall, it was a [Z] movie."
- Machine Translation: "Finnish: [X] English: [Z]"
- Classification Tasks: "[X] The text is about [Z]."
- Generative Tasks (e.g., summarization): "[X] TL;DR: [Z]"
- Regression Tasks (e.g., textual similarity): "[X1] [Z], [X2]", where [Z] is answered with Yes/No
(A minimal end-to-end sketch of the three prompting steps follows below.)
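To make the three steps concrete, here is a minimal sketch using the HuggingFace `transformers` fill-mask pipeline; the template, the answer set Z, and the answer-to-label mapping are illustrative assumptions, not choices prescribed by the survey.

```python
from transformers import pipeline

# Illustrative template, answer space Z, and answer-to-label mapping (not from the survey).
template = "{x} Overall, it was a [Z] movie."
answer_map = {"fantastic": "++", "good": "+", "boring": "-", "terrible": "--"}

fill = pipeline("fill-mask", model="bert-base-uncased")

def prompt_predict(x):
    # 1) Prompt addition: fill slot [X] with the input; leave slot [Z] as the mask token.
    prompt = template.format(x=x).replace("[Z]", fill.tokenizer.mask_token)
    # 2) Answer search: score only the permissible answers z in Z with the LM.
    scored = fill(prompt, targets=list(answer_map))
    best = max(scored, key=lambda s: s["score"])
    # 3) Answer mapping: map the highest-scoring answer z to the output label y.
    return answer_map[best["token_str"]]

print(prompt_predict("I love this movie."))  # likely "++" or "+"
```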
Design Considerations for Prompting
Pre-trained Language Model (LM) Choice:
- Variety of pre-trained LMs to choose from
- Discussed in the Appendix, focusing on dimensions important for interpreting their utility in prompting methods
Prompt Template Engineering:
- Proper choice of the prompt template affects not only accuracy but also which task the model performs in the first place
- Discussed in Section 3: Choosing a Prompt Template
Prompt Answer Engineering:
- Designing Z differently, along with the mapping function, may be necessary depending on the task
- Discussed in Section 4: Prompt Answer Engineering
Expanding the Paradigm:
- Equations above represent a simple underlying framework for prompting
- Discussed ways to expand this paradigm to improve results or applicability in Section 5
Prompt-based Training Strategies:
- Different strategies to train: prompt, language model (LM), or both
- Summarized in Section 6, with relative advantages
Prompt Template Engineering
Process: creating a prompting function f_prompt(x) that results in the most effective performance on the downstream task.
Approaches to Prompt Creation:
- Manual approach: humans craft templates based on introspection and intuition
- Automated approach: algorithms search for templates automatically (see Automated Template Learning below)
Prompt Shapes:
- Cloze prompts: fill in the blanks of a textual string (e.g., "I love this movie, it is a [Z] movie")
- Prefix prompts: continue a string prefix (e.g., "I love this movie. What's the sentiment of the review? [Z]")
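As a side-by-side illustration of the two shapes (a sketch with made-up templates, assuming the `transformers` pipelines): a cloze prompt is completed by a masked LM, while a prefix prompt is simply continued by an autoregressive LM.

```python
from transformers import pipeline

x = "I love this movie."

# Cloze prompt: the answer slot sits in the middle of the string (masked LM).
cloze = pipeline("fill-mask", model="roberta-base")
cloze_prompt = f"{x} It is a {cloze.tokenizer.mask_token} movie."
print(cloze(cloze_prompt, top_k=3))

# Prefix prompt: the answer slot comes at the end, so the LM continues the string.
prefix = pipeline("text-generation", model="gpt2")
prefix_prompt = f"{x} What's the sentiment of the review?"
print(prefix(prefix_prompt, max_new_tokens=5)[0]["generated_text"])
```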
Factors Affecting Prompt Shape Selection:
- Task type: determines which prompt shape is more conducive
- Generation tasks, or tasks solved with standard auto-regressive LMs: prefix prompts tend to be more conducive; tasks solved with masked LMs: cloze prompts are a better fit
- Text pair classification tasks: require space for two inputs [X1] and [X2]
- Model type: full text reconstruction models can be used with either cloze or prefix prompts
Manually Crafted Prompts:
- Example from the LAMA dataset [100]: manually created cloze templates to probe knowledge in LMs
- Examples by Brown et al. [9]: manually crafted prefix prompts for question answering, translation, and common sense reasoning tasks
- Examples by Schick and Schütze [118, 117, 120]: predefined templates used in a few-shot learning setting for text classification and conditional text generation tasks
Automated Template Learning
Issues with Manual Template Design:
- Time-consuming and requires experience, especially for complex tasks like semantic parsing
- Experienced designers may still fail to find optimal prompts manually
Approaches to Automate Prompt Design:
- Discrete prompts: Described in a discrete space, often corresponding to natural language phrases
- Prompt Mining: Scrapes text corpus for strings containing input and output, finds middle words or dependency paths between them
- Prompt Paraphrasing: Takes an existing seed prompt, paraphrases it into a set of candidate prompts, selects the one with highest training accuracy
- Gradient-based Search: Steps through tokens in the prompt to find short sequences that trigger desired prediction from pre-trained LM
- Prompt Generation: Treats prompt generation as a text generation task using standard NLG models
- Prompt Scoring: Scores filled templates using unidirectional LM, selects highest scoring template
- Continuous prompts: Perform prompting directly in the embedding space of the model, removing constraints on template words and parameters
- Prefix Tuning: Prepends sequence of task-specific vectors to input, optimizes log-likelihood objective
- Tuning Initialized with Discrete Prompts: Initializes search for continuous prompt using a previously discovered discrete prompt
- Hard-Soft Prompt Hybrid: Inserts trainable embeddings into hard prompt template, fine-tunes them to increase task accuracy
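As a rough illustration of the continuous-prompt idea (a generic sketch, not the implementation of any particular paper above), the snippet below prepends a few trainable soft-prompt vectors to the input embeddings of a frozen masked LM; only those vectors would receive gradients during training.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # assumption: any masked LM with the same API would do
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForMaskedLM.from_pretrained(model_name)

for p in lm.parameters():         # fixed LM: freeze all pre-trained weights
    p.requires_grad = False

n_prompt = 5
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt, lm.config.hidden_size) * 0.02)

def forward(texts):
    batch = tok(texts, return_tensors="pt", padding=True)
    tok_emb = lm.get_input_embeddings()(batch["input_ids"])            # (B, L, H)
    prompt = soft_prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)  # (B, P, H)
    attn = torch.cat([torch.ones(tok_emb.size(0), n_prompt, dtype=torch.long),
                      batch["attention_mask"]], dim=1)
    out = lm(inputs_embeds=torch.cat([prompt, tok_emb], dim=1), attention_mask=attn)
    return out.logits  # mask-position logits would be read off and trained as usual

logits = forward(["I love this movie. It was [MASK]."])
```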
Prompt Answer Engineering
Two Dimensions of Answer Design:
- Answer Shape: characterizes the granularity of answers
- Tokens: one or a subset of tokens from the pre-trained LM's vocabulary
- Span: a short multi-token span, often used with cloze prompts
- Sentence/Document: longer phrasal or sentential answers, commonly used in language generation tasks
- Answer Space Design Methods:
- Manual Design:
- Unconstrained spaces: directly map the answer to the final output using an identity mapping (e.g., text classification, entity recognition, multiple-choice question answering)
- Constrained spaces: the answer space and its mapping to the underlying classes are crafted manually by the system or benchmark designer
- Discrete Answer Search:
- Answer paraphrasing: expand an initial answer space using paraphrasing methods (e.g., back-translation) and select final answers based on the probability of the output (Jiang et al., 2020)
- Prune-then-search: generate an initial pruned answer space and search over it to select a final set of answers, e.g., selecting the top-k tokens that achieve the highest probability score under a learned classifier (Shin et al., 2020)
- Label decomposition: decompose relation labels into their constituent words for relation extraction tasks
- Continuous Answer Search:
- Optimize token embeddings for each class label, e.g., assigning virtual tokens per class and optimizing their embeddings along with the prompt token embeddings (Hambardzumyan et al., 2021)
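A minimal sketch of a manually designed, constrained answer space (a verbalizer), assuming a masked LM from `transformers`; several answer tokens map to each class, and a class score aggregates the probabilities of its tokens at the mask position. The label words here are made up for illustration.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Constrained answer space Z and its mapping to output classes y (illustrative).
verbalizer = {"positive": ["great", "good", "wonderful"],
              "negative": ["bad", "terrible", "awful"]}

def classify(x):
    prompt = f"{x} It was {tok.mask_token}."
    enc = tok(prompt, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        probs = lm(**enc).logits[0, mask_pos].softmax(-1)
    # Aggregate the probability mass of each class's answer tokens, then map z -> y.
    scores = {label: sum(probs[tok.convert_tokens_to_ids(w)] for w in words)
              for label, words in verbalizer.items()}
    return max(scores, key=scores.get)

print(classify("I love this movie."))  # likely "positive"
```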
Multi-Prompt Learning
Overview:
- Use of multiple prompts to improve efficacy of learning methods
- Motivations: complementary advantages of different prompts, cost alleviation, performance stabilization
Prompt Ensembling:
- Use of multiple unanswered prompts for input at inference time to make predictions
- Discrete or continuous prompts
- Improve performance on downstream tasks by leveraging complementary advantages of different prompts
- Current research explores effective ways for prompt ensembling: uniform averaging, weighted averaging, majority voting, knowledge distillation, and text generation methods.
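The simplest of these, uniform averaging, can be sketched as follows (templates and answer words are illustrative assumptions): the same input is scored under several templates and the answer probabilities are averaged before picking a label.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

templates = ["{x} It was {mask}.",
             "{x} Overall, a {mask} movie.",
             "In summary: {x} The film is {mask}."]
answers = {"great": "positive", "terrible": "negative"}   # answer z -> label y
answer_ids = [tok.convert_tokens_to_ids(a) for a in answers]

def ensemble_predict(x):
    avg = torch.zeros(len(answers))
    for t in templates:
        enc = tok(t.format(x=x, mask=tok.mask_token), return_tensors="pt")
        mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero().item()
        with torch.no_grad():
            probs = lm(**enc).logits[0, mask_pos].softmax(-1)
        avg += probs[answer_ids]          # accumulate each answer's probability
    avg /= len(templates)                 # uniform average over prompts
    return list(answers.values())[int(avg.argmax())]

print(ensemble_predict("I love this movie."))  # likely "positive"
```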
Prompt Augmentation:
- Providing additional answered prompts to demonstrate how language model should answer actual prompt
- Few-shot demonstrations take advantage of strong language models' ability to learn repetitive patterns
- Challenges: sample selection and ordering
- Researchers use sentence embeddings for effective example selection (Gao et al., Liu et al.)
- Lu et al. propose entropy-based methods for scoring different candidate permutations; Kumar and Talukdar search for good prompt orderings
- Closely related to retrieval-based methods that provide more text context to improve performance (Mishra et al., Yoo et al.).
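Mechanically, prompt augmentation amounts to careful string construction: answered demonstrations are concatenated in front of the actual prompt. A minimal sketch with made-up demonstrations, assuming an autoregressive LM via `transformers`:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Answered prompts (demonstrations); both their selection and their order matter.
demos = [("This film is a masterpiece.", "positive"),
         ("A complete waste of two hours.", "negative")]

def augmented_prompt(x):
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    lines.append(f"Review: {x}\nSentiment:")   # the actual, unanswered prompt
    return "\n\n".join(lines)

out = generator(augmented_prompt("I love this movie."), max_new_tokens=2)
print(out[0]["generated_text"])
```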
Prompt Composition:
- Breaking down complex tasks into subtasks and defining a composite prompt based on those sub-prompts
- Identifying entities and classifying relationships in relation extraction (Han et al.)
Prompt Decomposition:
- For tasks where multiple predictions need to be performed for one sample, breaking down the holistic prompt into different sub-prompts and answering each sub-prompt separately.
- Converting input into set of text spans, then prompting model to predict entity type for each span (Cui et al.).
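A rough sketch of the decomposition idea (loosely in the spirit of the template-based NER approach, not its exact method): enumerate candidate spans and answer one sub-prompt per span, here scored by the average log-likelihood of the filled template under GPT-2.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def score_with_lm(text):
    # Average log-likelihood of the filled sub-prompt (higher is better).
    ids = tok(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return -loss.item()

sentence = "Mike went to New York yesterday."
candidate_spans = ["Mike", "New York", "yesterday"]   # normally enumerated automatically
entity_types = ["person", "location", "organization", "none"]

predictions = {}
for span in candidate_spans:
    # One sub-prompt per span; each sub-prompt is answered independently.
    scores = {t: score_with_lm(f"{sentence} {span} is a {t} entity.")
              for t in entity_types}
    predictions[span] = max(scores, key=scores.get)
print(predictions)
```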
Training Strategies for Prompting Methods
Different Tuning Strategies:
- Strategies differ along two axes: whether the pre-trained LM's parameters are tuned or kept fixed, and whether additional prompt parameters are introduced and tuned (a parameter-freezing sketch illustrating these choices follows this list)
- Promptless fine-tuning
- Traditional pre-training and fine-tuning, without prompts
- Parameters of the underlying LM are updated via gradients from downstream training samples
- Examples: BERT, RoBERTa
- Advantages: simple, powerful, widely used; fits larger training datasets
- Disadvantages: may overfit or learn unstably on small datasets; prone to catastrophic forgetting
- Tuning-free prompting
- Directly generates answers based on a prompt without changing LM parameters (zero-shot learning when no downstream training data is used)
- Can optionally augment the input with answered prompts (in-context learning)
- Examples: LAMA, GPT-3
- Advantages: efficient, no parameter-update process, applicable in zero-shot settings; no catastrophic forgetting since LM parameters remain fixed
- Disadvantages: heavy prompt engineering required for high accuracy; providing many answered prompts at inference time is inefficient
- Fixed-LM prompt tuning
- Updates only the prompts' parameters using supervision signals from downstream training samples, keeping the pre-trained LM unchanged
- Examples: Prefix-Tuning, Prompt-Tuning
- Advantages: retains knowledge in the LM, suitable for few-shot scenarios, often more accurate than tuning-free prompting
- Disadvantages: not applicable in zero-shot settings, limited representation power on large datasets; prompts are not human-interpretable or manipulable
- Fixed-prompt LM tuning
- Tunes the parameters of the LM while using prompts with fixed parameters to specify model behavior
- Examples: PET-TC, PET-Gen, LM-BFF
- Advantages: more efficient learning, particularly in few-shot scenarios, since template or answer engineering more completely specifies the task
- Disadvantages: prompt engineering is still required; an LM fine-tuned on one downstream task may not be effective on another
- Prompt+LM tuning
- Fine-tunes both the parameters of the pre-trained LM and the prompt-relevant parameters together
- Examples: PADA, P-Tuning
- Advantages: most expressive method, suitable for data-rich settings; prompt parameters provide additional bootstrapping at the start of training
- Disadvantages: requires training and storing all model parameters; may overfit to small datasets
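The strategies differ mainly in which parameter groups receive gradients, which is easy to see in code. A compressed, illustrative sketch (assuming a PyTorch LM plus a separate soft-prompt parameter, as in the continuous-prompt sketch earlier; the strategy names are ad hoc):

```python
def configure(strategy, lm, soft_prompt):
    """Toggle gradient flow to mimic the tuning strategies (illustrative names)."""
    lm_trainable = strategy in {"promptless_finetuning",
                                "fixed_prompt_lm_tuning",
                                "prompt_plus_lm_tuning"}
    prompt_trainable = strategy in {"fixed_lm_prompt_tuning",
                                    "prompt_plus_lm_tuning"}
    for p in lm.parameters():
        p.requires_grad = lm_trainable
    soft_prompt.requires_grad = prompt_trainable
    # "tuning_free_prompting" leaves everything frozen: no gradient updates at all.
```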
List of Applications and their Categories
- Knowledge Probing: Fact Retrieval, Linguistic Probing
- Structure Prediction: Semantic Parsing
- Classification-based Tasks: Text Classification, Natural Language Inference, Text Pair Classification
- Information Extraction: Relation Extraction, Named Entity Recognition
- Factual Probing (a.k.a. Fact Retrieval):
- Quantifying factual knowledge in pre-trained LMs
- Transform original input into cloze prompt
- Manually crafted or automatically discovered
- Focuses on finding effective templates and analyzing results of different models using those templates
- Techniques: Discrete template search, Continuous template learning, Prompt ensemble learning.
- Linguistic Probing:
- Handling linguistic phenomena such as analogies, negations, semantic role sensitivity, semantic similarity, cant understanding, rare word understanding
- Presenting tasks in natural language sentences for LM completion
- Previous work: [9], [25], [131], [116]
- Semantic Parsing:
- Generating structured meaning representation given a natural language input
- Framing as paraphrasing task or constraining decoding process
- Results demonstrate effectiveness using in-context learning and paraphrasing reformulation [124]
- Text Classification:
- Most work uses cloze prompts for template engineering and answer engineering
- Previous work explores fixed-prompt LM tuning strategies in few-shot setting [32,40,67,115,117]
- Text Pair Classification:
- Predict relationship of two given sentences (paraphrase identification, natural language inference, textual similarity prediction)
- Cloze prompts commonly used for template search with manually pre-selected answer space Z [117,120]
- Relation Extraction:
- Predict relation between two entities in a sentence
- Challenges: Larger label space and importance of tokens for entity mentions
- Proposed methods: Adaptive answer selection, task-oriented prompt template construction [13]
- Named Entity Recognition:
- Identifying named entities (person name, location) in a given sentence
- Difficulty of prompt-based learning application to tagging tasks like NER
- Recent work: Template-based NER model using BART [17]
Reasoning in NLP:
- Debate on deep neural networks' reasoning abilities vs. memorizing patterns [3, 89]
- Commonsense reasoning benchmarks [47, 72, 101, 107]
- Winograd Schemas: identifying ambiguous pronouns' antecedents in context
- Example: "The trophy doesn’t fit into the brown suitcase, because it is too large."
- Scoring generation probability of each candidate choice [134]
- Mathematical reasoning with pre-trained LMs [9, 88, 139]
Question Answering (QA):
- Various formats: extractive QA, multiple-choice QA, free-form QA
- Handled differently using different model frameworks
- QA can be treated as a text generation problem with pre-trained LMs, e.g., by fine-tuning seq2seq pre-trained LMs (such as T5) with appropriate prompts constructed from the context and question [55]
- Probabilities from pre-trained LMs on QA tasks not very predictive of correctness [51]
Text Generation:
- Conditioned on some other piece of information
- Prompting methods applied using prefix prompts with autoregressive pre-trained LMs [105]
- Brown et al. [9] perform in-context learning for text generation, with demonstrations for tasks such as text summarization and machine translation
- Fixed-prompt LM tuning for few-shot tasks by Schick and Schütze [118]
- Prompt+LM tuning strategy by Dou et al. [23]
Automatic Evaluation of Text Generation:
- Yuan et al. using prompt learning for automated evaluation of generated texts [147]
- Conceptualizing evaluation as a text generation problem
- Adding "such as" phrase to translated text leads to improvement in correlation on German–English machine translation evaluation.
Meta-Applications:
- Domain adaptation using self-generated DRFs and sequence tagging [5]
- Self-diagnosis and debiasing based on biased or debiased instructions [121]
- Generating text with specific templates for diagnosis and suppressing undesired attributes
- Dataset construction using pre-trained LMs to generate similar sentences [119]
- Multi-modal learning: text and vision [135]
- Fixed-LM prompt tuning strategy with prompt augmentation techniques.
Prompt-Based Learning vs Other Methods
Ensemble Learning:
- Technique to improve performance by combining results from multiple systems
- In prompt ensembling, different predictions come from varying prompt variants
Few-shot Learning:
- Aims to learn a machine learning system with few training samples
- Prompt augmentation can be regarded as a form of priming-based few-shot learning
Larger-Context Learning:
- Introduces additional context to aid the learning process
- Prompt augmentation adds relevant labeled (answered) samples into the input, whereas the additional context in larger-context learning is not necessarily labeled data
Query Reformulation:
- Commonly used in information retrieval and question answering tasks
- Both aim to elicit more knowledge by asking the right questions: of search engines and knowledge bases in query reformulation, and of pre-trained LMs in prompt learning
QA-based Task Reformulation:
- Aims to conceptualize various NLP tasks as a question-answering problem
- Methods similar to prompting, but focus on unifying tasks into a QA framework
Controlled Generation:
- Incorporates additional guidance beyond the input text into the generation model
- Some commonalities with prompt learning, such as adding extra information for better generation
Supervised Attention:
- Aims to provide supervision over a model's attention to extract important information
- Prompt learning and supervised attention share ideas in extracting salient information
Data Augmentation:
- Modifies existing data to increase the amount available for training
- Adding prompts can achieve similar accuracy improvements to adding hundreds of data points on average across classification tasks
Challenges
Selection of Pre-trained LMs:
- Choosing a pre-trained LM well suited to prompt-based learning remains difficult
- There are few systematic comparisons of the benefits different pre-trained LMs bring for this purpose
Prompt Design:
- Most existing work focuses on text classification and generation tasks; applying prompting methods to more complex tasks such as information extraction and text analysis is more challenging
- Expressing structured information (e.g., trees, graphs, tables) in prompts is a major open challenge
Prompt Answer Engineering:
- Many-class classification tasks: selecting an appropriate answer space is a difficult combinatorial optimization problem
- Long-answer classification tasks: it is unknown how best to decode multi-token answers with LMs
- Text generation tasks: answer engineering currently relies on a single reference; learning from multiple references is largely unexplored
Selection of Tuning Strategy:
- There is a variety of methods for tuning the parameters of prompts, LMs, or both
- A systematic understanding of the tradeoffs between these strategies is lacking
Multiple Prompt Learning:
- Prompt ensembling: space and time complexity grow as more prompts are considered; how to distill knowledge from multiple prompts remains under-explored
- Prompt ensembling for text generation has not been explored so far, due to the complexity of ensembling generation models
Theoretical and Empirical Analysis:
- Theoretical analyses and guarantees for prompt-based learning are still scarce
- Soft-prompt tuning relaxes non-degeneracy assumptions, making it easier to extract task-specific information [141]
- Text classification tasks can be reformulated as sentence completion tasks, making language modeling a meaningful pre-training step [113]
- Prompts are worth 100s of data points on average across classification tasks [114]
Transferability of Prompts:
- Prompts selected under a tuned few-shot learning scenario generalize well across similar models, while those selected under true few-shot learning do not [96]
- Transferability is poor when the model sizes differ substantially, in either scenario
Combination of Different Paradigms:
- Pre-training methods that are effective for fine-tuning may not be applicable to prompt-based learning as-is
- Re-thinking pre-training methods could improve accuracy and ease of applicability for prompt-based learning
Calibration of Prompting Methods:
- Pre-trained LMs' probability distributions are often not well calibrated for good probabilistic predictions [33]
- Jiang et al. observed that pre-trained LMs' (e.g., BART, T5, GPT-2) probabilities on QA tasks are well calibrated, but identified pitfalls that lead to bias toward certain answers [51, 151]
- One must be cautious when assuming a single gold answer for an input, since competing surface forms of the same object share probability mass; paraphrasing methods or calibrating word probabilities based on context can help
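One illustrative recipe for the "calibrating word probabilities" point, sketched under the assumption that answer probabilities obtained from a content-free input (e.g., the template filled with "N/A") capture the model's prior bias; this is a generic contextual-calibration-style correction, not a method the survey prescribes.

```python
import torch

def calibrated_scores(answer_probs, content_free_probs, eps=1e-9):
    # Divide each answer's probability by its probability on a content-free input,
    # then renormalize, so answers the LM favors regardless of input are down-weighted.
    p = torch.as_tensor(answer_probs, dtype=torch.float)
    q = torch.as_tensor(content_free_probs, dtype=torch.float)
    calibrated = p / (q + eps)
    return calibrated / calibrated.sum()

# Example: the LM prefers the first answer even on a content-free input.
print(calibrated_scores([0.7, 0.3], [0.8, 0.2]))  # roughly [0.37, 0.63]
```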
Prompt-Based Learning: A New Paradigm in NLP
Conclusion
- The article summarizes and analyzes several paradigms in the development of statistical natural language processing (NLP)
- Argues prompt-based learning is a promising new approach to NLP
- Aims to:
- Help researchers understand the prompt-based learning paradigm better
- Identify core challenges for scientifically meaningful advances
- Highlight commonalities and differences between various NLP paradigms
- Potentially inspire work toward next paradigm shift in NLP research.