by Lihu Chen, Gaël Varoquaux
Imperial College London, UK
https://arxiv.org/pdf/2409.06857
- Data Curation:
- Pre-training data curation: Selecting high-quality subsets from large datasets
- Instruction-tuning data curation: Selecting influential data for efficient instruction tuning
- Weak-to-Strong Paradigm:
- Acting as supervisors for larger models
- Enhancing alignment of LLMs with human values
- Efficient Inference:
- Model ensembling: Combining multiple models of varying sizes for cost-effective inference
- Model cascading: Sequential use of multiple models with different complexity levels
- Model routing: Directing input data to the most appropriate models based on performance
- Evaluating LLMs:
- Automatically assessing LLM performance (e.g., BERTSCORE, BARTSCORE)
- Estimating uncertainty of LLM responses
- Predicting LLM performance to reduce computational costs during model selection
- Domain Adaptation:
- White-box adaptation: Fine-tuning small models to adjust token distributions of frozen LLMs
- Black-box adaptation: Guiding LLMs toward target domains by providing relevant knowledge
- Retrieval-Augmented Generation (RAG):
- Acting as lightweight retrievers to access external knowledge bases or document collections
- Prompt-based Learning:
- Optimizing retrievers for zero-shot tasks
- Breaking down complex problems into subproblems
- Generating pseudo labels or verifying/rewriting outputs of LLMs
- Deficiency Repair:
- Addressing repeated, untruthful, or toxic content generated by LLMs
- Contrastive decoding: Choosing tokens that maximize log-likelihood difference between larger and smaller models
- Addressing out-of-vocabulary words and detecting hallucinations
- Knowledge Distillation:
- Acting as student models to replicate behavior of larger teacher models
- Data Synthesis:
- Training on LLM-generated datasets for specific tasks
- Augmenting existing data using LLM-generated modifications
- Computation-constrained Environments:
- Providing faster training and deployment times
- Reducing hardware and energy consumption requirements
- Task-specific Applications:
- Excelling in domains with limited training data
- Outperforming LLMs in specialized areas (e.g., biomedical, legal, tabular learning)
- Interpretability-required Environments:
- Offering better transparency and understanding of model workings
- Meeting human understanding requirements in industries like healthcare, finance, and law
- Large Language Models (LLMs) have revolutionized NLP through pre-training and fine-tuning paradigms
- LLMs have demonstrated exceptional performance across a range of tasks, including language generation, understanding, and domain-specific applications
- Theories suggest that certain reasoning capabilities emerge only as model size increases
- A shift toward smaller language models (SLMs) is underway, driven by the resource constraints of academic researchers and businesses
Comparison between LLMs and SMs:
- Accuracy: LLMs have superior performance, while SMs can achieve comparable results through techniques like knowledge distillation
- Generality: LLMs are highly generalizable, while SMs are more specialized and can outperform for specific tasks with domain-specific datasets
- Efficiency: LLMs require substantial resources, while SMs offer competitive performance while reducing resource demands
- Interpretability: SMs are more transparent and interpretable than larger models
Role of Small Models:
- Collaboration: SMs can strike a balance between power and efficiency, enabling systems that are cost-effective and scalable
- Competition: SMs have advantages like simplicity, lower cost, and greater interpretability; assessing trade-offs depends on task/application requirements
2.1 Small Models Enhance LLMs
SMs and LLMs Collaboration Framework:
- Small Models (SMs) Enhance LLMs:
- Data Curation:
- Pre-training data curation
- Use SMs to select high-quality subsets from large datasets
- Benefits: enhance model performance by focusing on data quality rather than quantity
- Techniques: simple classifiers trained for content assessment, perplexity scores from proxy language models, data reweighting using domain weights
- Instruction-tuning data curation
- Use SMs to select influential data for efficient instruction tuning
- Approaches: Model-oriented data selection (MoDS), LESS framework
Pre-training Data Curation:
- Less is more paradigm: prioritize quality over quantity
- Scale and complexity of raw text data make rule-based methods inadequate
- Importance of selecting high-quality subsets for efficient data curation
- Techniques use small models to assess content quality and to filter out noise, toxic content, and private information
- Data reweighting: adjust sampling probabilities based on domain weights trained by a proxy model
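To make the perplexity technique concrete, here is a minimal sketch that scores documents with a small proxy LM and keeps the lowest-perplexity fraction; the model choice (GPT-2) and keep ratio are illustrative assumptions, not the survey's prescription.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small proxy LM for scoring; gpt2 is an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small proxy model (lower = more fluent)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

corpus = [
    "The James Webb Space Telescope observes primarily in the infrared.",
    "click HERE!!! cheap cheap $$$ win now win now",
]
ranked = sorted(corpus, key=perplexity)
kept = ranked[: max(1, int(0.5 * len(ranked)))]  # keep lowest-perplexity half (illustrative ratio)
```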
Instruction-tuning Data Curation:
- Recent findings suggest that strong alignment can be achieved with fewer carefully curated instruction examples
- Importance of selecting high-quality data for efficient instruction tuning
- Approaches using small models to evaluate instruction data based on quality, coverage, and necessity.
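As a sketch of quality-based selection (quality scoring only; the coverage and necessity criteria used by MoDS are omitted), the snippet below ranks instruction-response pairs with a small public reward model; the checkpoint choice and keep ratio are assumptions.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# One publicly released small reward model; treat the checkpoint as an assumption.
NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
tok = AutoTokenizer.from_pretrained(NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(NAME).eval()

def quality(instruction: str, response: str) -> float:
    """Scalar quality score for an (instruction, response) pair."""
    enc = tok(instruction, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**enc).logits[0].item()

pairs = [
    ("Explain photosynthesis.", "Plants convert light into chemical energy stored as sugars."),
    ("Explain photosynthesis.", "idk google it"),
]
# Keep the top-scoring fraction for instruction tuning (ratio is illustrative).
selected = sorted(pairs, key=lambda p: quality(*p), reverse=True)[:1]
```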
Future Directions:
1. Develop more nuanced criteria for evaluating data quality: factuality, safety, diversity
2. Explore the potential of small models in curating synthetic data to supplement limited human-generated data
Weak-to-Strong Paradigm
Background:
- LLMs aligned with human values through RLHF
- As models approach superhuman capability, they take on tasks so complex that humans struggle to evaluate their outputs
- Introducing weak-to-strong generalization paradigm
- Using smaller models as supervisors for larger ones
Advantages:
- Enabling strong models to generalize beyond limitations of weaker ones
- Several variants proposed: diverse set of weak teachers, reliability estimation, collaboration during inference phase
Comparison with Data Labeling:
- Beyond labeling training data, weak models can also collaborate with large models during the inference phase to enhance alignment
Examples:
- Aligner: learn correctional residuals between preferred and dispreferred responses
- Weak-to-Strong Search: maximize log-likelihood difference between small tuned and untuned models
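A minimal sketch of the Weak-to-Strong Search idea, assuming placeholder checkpoints for the same small model before and after alignment tuning: candidate responses from the strong model are re-ranked by the tuned-minus-untuned log-likelihood gap (a simplification of the original method).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoints: the same small model before and after alignment tuning.
tok = AutoTokenizer.from_pretrained("path/to/small-base")            # placeholder
tuned = AutoModelForCausalLM.from_pretrained("path/to/small-tuned").eval()
untuned = AutoModelForCausalLM.from_pretrained("path/to/small-base").eval()

@torch.no_grad()
def total_logprob(model, text: str) -> float:
    enc = tok(text, return_tensors="pt")
    # .loss is the mean next-token NLL; multiply back to an (approximate) total.
    loss = model(**enc, labels=enc["input_ids"]).loss
    return -loss.item() * (enc["input_ids"].shape[1] - 1)

def alignment_gain(prompt: str, response: str) -> float:
    # Responses the tuned small model prefers over its untuned twin score higher.
    text = prompt + "\n" + response
    return total_logprob(tuned, text) - total_logprob(untuned, text)

candidates = ["Sure, here is a careful answer ...", "whatever."]
best = max(candidates, key=lambda r: alignment_gain("User: How do I stay safe online?", r))
```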
**Future Directions:**
1. Ensuring the strong model deeply understands the task, can correct the weak model's errors, and aligns naturally with the objectives
2. Developing a deeper understanding of the mechanisms governing the success or failure of alignment methods: theoretical analysis (Lang et al., 2024), errors in weak supervision (Guo and Yang, 2024), scaling laws for extrapolating generalization errors (Kaplan et al., 2020)
Model Ensembling
- Larger models are more powerful but have significant costs, including slower inference speed and higher API prices
- Smaller models offer advantages in terms of lower cost and faster inference, especially for simple queries
- Ensemble methods can be used to achieve cost-effective inference by combining multiple models of varying sizes
- Model cascading: sequential use of multiple models, where each model has a different level of complexity
- Output of one model triggers activation of next model in sequence
- Allows for collaboration between models and transferring tasks to larger models
- Critical step is determining when to escalate query to more complex model
- Techniques train small evaluator to assess correctness, confidence, or quality of model output
- Some LLMs can perform self-verification and provide confidence levels
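To make the escalation step concrete, here is a minimal cascade sketch; `small_lm`, `large_lm`, and `evaluator` are placeholder callables rather than a real API, and the threshold is illustrative.

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tuned on a validation set in practice

def cascade(query, small_lm, large_lm, evaluator):
    """Answer with the small model; escalate only when the evaluator is unsure.

    `small_lm` / `large_lm` map a query to an answer; `evaluator` returns an
    estimated probability that the draft answer is correct.
    """
    draft = small_lm(query)
    if evaluator(query, draft) >= CONFIDENCE_THRESHOLD:
        return draft          # cheap path: the small model suffices
    return large_lm(query)    # hard query: escalate to the larger model
```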
- Model routing: dynamically directs input data to most appropriate models based on performance
- Straightforward approach: select best-performing model from all models
- Does not significantly reduce inference costs
- Efficient, reward-based routers trained to select optimal models without accessing their outputs
- Retrieval-based dynamic router assigns instances with similar semantic embeddings to same expert
- RouteLLM uses human preference data and data augmentation to train small router model
- FORC proposes meta-model to assign queries to most suitable model without requiring execution of large models
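The sketch below illustrates a retrieval-style router in the spirit described above: a small encoder embeds the query and routes it to the model whose typical workload it most resembles. The encoder choice, route labels, and prototype construction are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a small embedding model

# Prototype per route: mean embedding of validation queries each model handles well.
prototypes = {
    "small-model": encoder.encode(["what is 2 + 2", "capital of France"]).mean(axis=0),
    "large-model": encoder.encode(["prove this lemma", "draft a legal brief"]).mean(axis=0),
}

def route(query: str) -> str:
    q = encoder.encode([query])[0]
    cosine = {name: float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p)))
              for name, p in prototypes.items()}
    return max(cosine, key=cosine.get)  # send the query to the closest expert

print(route("what is 3 + 5"))  # expected: "small-model"
```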
Speculative Decoding
- Aims to speed up decoding by pairing a smaller, faster auxiliary model with the larger main model
- The auxiliary model cheaply drafts multiple candidate tokens, which the larger model then verifies or refines in a single pass
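Below is a greedy simplification of the draft-then-verify loop (full speculative sampling uses rejection sampling rather than exact-match verification); `draft_model` and `target_model` are placeholder causal LMs sharing a tokenizer.

```python
import torch

@torch.no_grad()
def speculative_step(prefix_ids, draft_model, target_model, k=4):
    """One greedy draft-then-verify step; returns the extended sequence."""
    # 1) Draft k tokens autoregressively with the cheap model (greedy).
    ids = prefix_ids
    for _ in range(k):
        nxt = draft_model(ids).logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=-1)
    # 2) Verify all drafted positions with a single pass of the large model.
    #    target_pred[:, i] is the target's greedy choice for position i + 1.
    target_pred = target_model(ids).logits.argmax(-1)
    n = prefix_ids.shape[1]
    accepted = n
    while accepted < ids.shape[1] and ids[0, accepted] == target_pred[0, accepted - 1]:
        accepted += 1
    # 3) Keep the accepted prefix and append the target model's own next token,
    #    so every step grows the sequence by at least one verified token.
    bonus = target_pred[:, accepted - 1 : accepted]
    return torch.cat([ids[:, :accepted], bonus], dim=-1)
```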
Evaluating Large Language Models (LLMs)
- Traditional evaluation methods like BLEU and ROUGE struggle to capture the nuanced semantic meaning and compositional diversity of generated text
- Model-based approaches use smaller models to automatically assess performance
- Examples: BERTSCORE for machine translation, image captioning (Zhang et al., 2020b)
- BARTSCORE for various perspectives including informativeness, fluency, and factuality (Yuan et al., 2021)
- Some methods use small natural language inference (NLI) models to estimate uncertainty of LLM responses
- Proxy models can be employed to predict LLM performance, reducing computational costs during model selection.
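For instance, BERTSCORE can be computed with the bert-score package; the snippet below is an illustration of the library's usage, not the survey's own code.

```python
# pip install bert-score
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]
# A small pretrained encoder (chosen by the library for the language) computes
# token-level similarity; F1 aggregates precision and recall.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean():.3f}")
```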
Future Directions
- As large models generate longer and more complex texts, it becomes essential to develop efficient evaluators for assessing various aspects:
- Factuality (Min et al., 2023b)
- Safety (Zhang et al., 2024b)
- Uncertainty (Huang et al., 2023).
Domain Adaptation for Large Language Models (LLMs)
Background:
- LLMs require further customization for optimal performance in specific use cases and domains
- Fine-tuning on specialized data is resource-intensive and not always feasible
- Recent research explores adapting LLMs using smaller models
**Two Approaches:**
1. White-Box Adaptation:
- Involves fine-tuning a small model to adjust token distributions of frozen LLMs for specific domains
- Examples: CombLM (Ormazabal et al., 2023), IPA (Lu et al., 2023b), Proxy-tuning (Liu et al., 2024a)
- Only the small domain-specific expert's parameters are modified, allowing frozen LLMs to be adapted to specific tasks (a decoding-time sketch follows this list)
2. Black-Box Adaptation:
- Involves using a small domain-specific model to guide LLMs toward target domains by providing relevant textual knowledge
- Examples: Retrieval Augmented Generation (RAG)
- Enhances base LLM's performance without requiring access to internal model parameters
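A decoding-time sketch of white-box adaptation in the style of Proxy-tuning, assuming three placeholder causal LMs that share a vocabulary: the frozen base LLM's logits are shifted by the difference between a tuned and an untuned small model.

```python
import torch

@torch.no_grad()
def proxy_logits(input_ids, base_lm, small_tuned, small_untuned):
    base = base_lm(input_ids).logits[:, -1]           # frozen LLM
    tuned = small_tuned(input_ids).logits[:, -1]      # small domain expert
    untuned = small_untuned(input_ids).logits[:, -1]  # same small model, pre-tuning
    # The small pair's logit difference encodes the domain shift; apply it to the LLM.
    return base + (tuned - untuned)

def next_token(input_ids, base_lm, small_tuned, small_untuned):
    return proxy_logits(input_ids, base_lm, small_tuned, small_untuned).argmax(-1)
```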
Summary:
- Fine-tuning large models for specific domains is resource-intensive
- Adapting LLMs using smaller domain-specific models offers a cost-effective solution
- White-Box Adaptation: fine-tunes small models and guides base models during decoding
- Black-Box Adaptation: uses small expert models to provide relevant textual knowledge, enhancing the base model's understanding of domain-specific knowledge
**Future Directions:**
1. Develop techniques for adapting LLMs using a broader range of diverse models
2. Investigate methods to adapt LLMs using a limited number of samples (Sun et al., 2024)
Retrieval-Augmented Generation (RAG)
Background:
- LLMs exhibit strong reasoning capabilities but have limited memory
- Struggle with domain expertise and up-to-date information
- RAG enhances LLMs by using a lightweight retriever to access external knowledge bases, document collections, or other tools
Advantages:
- Mitigates factually inaccurate content (hallucinations)
- Categories of Retrieval Sources: Textual Documents, Structured Knowledge, Other Sources
Textual Document Sources:
- Most commonly used in RAG methods
- Encompasses resources such as Wikipedia, cross-lingual text, and domain-specific corpora
- Lightweight retrievers like sparse BM25 or dense BERT-based models are employed
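A minimal retrieval sketch using a sparse BM25 retriever via the rank_bm25 package; the toy corpus and query are illustrative.

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "The mitochondrion is the powerhouse of the cell.",
    "BM25 is a sparse lexical retrieval function.",
    "Paris is the capital of France.",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "what does BM25 do"
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=1)
# Prepend `top_docs` to the LLM prompt so generation is grounded in retrieved text.
print(top_docs)
```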
Structured Knowledge Sources:
- Verified information from knowledge bases and databases
- Enhances answers by concatenating retrieved tables with queries (KnowledgeGPT, T-RAG)
- Retriever can be a lightweight entity linker, query executor, or API
Other Sources:
- Codes, tools, images enabling LLMs to leverage external information for enhanced reasoning (DocPrompting, Toolformer)
**Future Directions:**
1. Develop robust approaches to integrate noisy retrieved texts
2. Extend RAG to multimodal scenarios beyond text-only information (images, audio)
Prompt-Based Learning
Definition: Paradigm for few-shot or zero-shot learning where prompts are crafted to facilitate adaptation to new scenarios with minimal labeled data.
Advantages: Leverages In-Context Learning (ICL), operates without parameter updates, and can use small models to enhance performance.
Techniques: Uprise - optimizes a lightweight retriever for zero-shot tasks; DaSLaM - uses a small model to break down complex problems into subproblems; generating pseudo labels or verifying/rewriting outputs of LLMs using small models.
Summary: Efficient process handling complex tasks without the need for parameter updates, using prompts embedded in natural language templates and small models to augment performance.
Future Directions: Exploring ways to develop trustworthy, safe, and fair LLMs within the prompt-based learning paradigm by leveraging small models.
Deficiency Repair for LLMs
Powerful LLMs vs. Small Models:
- Powerful LLMs may generate repeated, untruthful, and toxic content
- Small models can be used to repair these defects
Two Ways to Achieve Deficiency Repair:
1. Contrastive Decoding: Choose tokens that maximize the log-likelihood difference between a larger model (expert) and a smaller model (amateur)
2. Small Model Plugins: Fine-tune a specialized small model to address the shortcomings of a larger model
- Address unseen words (Out-Of-Vocabulary) by training a small model to mimic the behavior of the large model
- Detect hallucinations or calibrate confidence scores
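A sketch of the contrastive decoding step, assuming placeholder expert/amateur causal LMs with a shared vocabulary; common formulations add a plausibility cutoff like the one below, with an illustrative value for alpha.

```python
import torch

@torch.no_grad()
def contrastive_next_token(input_ids, expert, amateur, alpha=0.1):
    """Pick the token maximizing expert-minus-amateur log-probability."""
    exp_logp = torch.log_softmax(expert(input_ids).logits[:, -1], dim=-1)
    ama_logp = torch.log_softmax(amateur(input_ids).logits[:, -1], dim=-1)
    # Only consider tokens the expert itself finds plausible, then maximize the gap.
    cutoff = exp_logp.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(alpha))
    scores = exp_logp - ama_logp
    scores = scores.masked_fill(exp_logp < cutoff, float("-inf"))
    return scores.argmax(dim=-1)
```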
Summary and Future Directions:
- Existing work explores synergistic use of logits from both LLMs and SMs to reduce repeated text, mitigate hallucinations, augment reasoning capabilities, and safeguard user privacy
- Proxy Tuning: fine-tunes a small model and applies the logit difference between its tuned and untuned versions to steer the original LLM toward the target task
- Future Directions:
- Extend the use of small models to fix flaws of large models in other problems, e.g., mathematical reasoning
- Expand the range of knowledge transferred from the teacher model, including feedback on the student's outputs and feature knowledge
- Address trustworthiness issues (helpfulness, honesty, harmlessness) when transferring skills from LLMs to small models
Knowledge Distillation:
- Scaling models to larger sizes is computationally expensive
- Knowledge Distillation offers an effective solution by training a smaller student model to replicate the behavior of a larger teacher model
- White-box distillation involves using internal states of the teacher model, providing transparency in the training process
- Black-box knowledge distillation typically involves generating a distillation dataset through the teacher LLM and using it for fine-tuning the student model
- Recent advancements include Chain-of-Thought distillation and Instruction Following Distillation to enhance reasoning abilities of smaller models
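The standard white-box distillation objective can be sketched as follows; the temperature, mixing weight, and rescaling follow the common recipe from Hinton et al. (2015), with illustrative values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Student matches the teacher's softened distribution plus the gold labels."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale gradients, as in Hinton et al. (2015)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```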
Data Synthesis:
- Human-created data is finite, and large models are not always necessary for specific tasks
- Using LLMs to generate training data or augment existing data can be efficient and feasible
- Training Data Generation: Generating a dataset from scratch using LLMs, followed by training a small task-specific model
- Data Augmentation: Modifying existing data points using LLMs to increase diversity and train smaller models
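A toy sketch of training-data generation: prompt a small generative model for labeled examples and collect (text, label) pairs for a task-specific model. The generator (GPT-2 via a text-generation pipeline), prompt, and task are illustrative assumptions; in practice a much stronger LLM would be used.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def synthesize(label: str, n: int = 5):
    prompt = f"Write a short {label} movie review:"
    outs = generator(prompt, num_return_sequences=n, max_new_tokens=40,
                     do_sample=True)
    return [(o["generated_text"][len(prompt):].strip(), label) for o in outs]

dataset = synthesize("positive") + synthesize("negative")
# `dataset` now holds (text, label) pairs for fine-tuning a small classifier.
```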
Preferability of Smaller Models: Competition vs. Collaboration
Background:
- LLMs' impressive capabilities come with substantial computational demands
- Scaling model size leads to an exponential increase in training time and higher inference latency
- High computational overhead prevents application in computation-constrained environments
Computation-constrained Environment (§3.1)
- Smaller models are preferable due to lower resource requirements
- Faster training and deployment times
- Lower hardware needs
- Reduced energy consumption
- Examples: Phi-3-mini (3.8B), MiniCPM, Gemma
- Techniques like knowledge distillation enable transfer of knowledge from LLMs
Task-specific Environment (§3.2)
- Not all tasks require large models
- Diminishing returns observed in certain tasks
- Tasks like information retrieval have critical inference speed requirements
- Domain-specific tasks: biomedical, legal, tabular learning, short text tasks, and other specialized areas
- Small models can outperform LLMs due to domain expertise
- Fewer training tokens available for some domains
Interpretability-required Environment (§3.3)
- Smaller, simpler models offer better interpretability compared to larger, more complex ones
- Transparency: understanding how the model works
- Industries favor small models due to human understanding requirements
- Healthcare, finance, law examples
- Collaboration between LLMs and SMs in balancing performance and efficiency
- Competition under specific conditions (computation-constrained environments, task-specific applications, interpretability)
- Careful evaluation of trade-offs essential when selecting models for tasks or applications.