What is the Role of Small Models in the LLM Era: A Survey

by Lihu Chen, Gaël Varoquaux

Imperial College London, UK

https://arxiv.org/pdf/2409.06857

TL;DR

  1. Data Curation:
    • Pre-training data curation: Selecting high-quality subsets from large datasets
    • Instruction-tuning data curation: Selecting influential data for efficient instruction tuning
  2. Weak-to-Strong Paradigm:
    • Acting as supervisors for larger models
    • Enhancing alignment of LLMs with human values
  3. Efficient Inference:
    • Model ensembling: Combining multiple models of varying sizes for cost-effective inference
    • Model cascading: Sequential use of multiple models with different complexity levels
    • Model routing: Directing input data to the most appropriate models based on performance
  4. Evaluating LLMs:
    • Automatically assessing LLM performance (e.g., BERTSCORE, BARTSCORE)
    • Estimating uncertainty of LLM responses
    • Predicting LLM performance to reduce computational costs during model selection
  5. Domain Adaptation:
    • White-box adaptation: Fine-tuning small models to adjust token distributions of frozen LLMs
    • Black-box adaptation: Guiding LLMs toward target domains by providing relevant knowledge
  6. Retrieval-Augmented Generation (RAG):
    • Acting as lightweight retrievers to access external knowledge bases or document collections
  7. Prompt-based Learning:
    • Optimizing retrievers for zero-shot tasks
    • Breaking down complex problems into subproblems
    • Generating pseudo labels or verifying/rewriting outputs of LLMs
  8. Deficiency Repair:
    • Addressing repeated, untruthful, or toxic content generated by LLMs
    • Contrastive decoding: Choosing tokens that maximize log-likelihood difference between larger and smaller models
    • Addressing out-of-vocabulary words and detecting hallucinations
  9. Knowledge Distillation:
    • Acting as student models to replicate behavior of larger teacher models
  10. Data Synthesis:
    • Training on LLM-generated datasets for specific tasks
    • Augmenting existing data using LLM-generated modifications
  11. Computation-constrained Environments:
    • Providing faster training and deployment times
    • Reducing hardware and energy consumption requirements
  12. Task-specific Applications:
    • Excelling in domains with limited training data
    • Outperforming LLMs in specialized areas (e.g., biomedical, legal, tabular learning)
  13. Interpretability-required Environments:
    • Offering better transparency and understanding of model workings
    • Meeting human understanding requirements in industries like healthcare, finance, and law


Introduction

  • Large Language Models (LLMs) have revolutionized NLP through pre-training and fine-tuning paradigms
  • LLMs have demonstrated exceptional performance across a range of tasks, including language generation, understanding, and domain-specific applications
  • Theories suggest that certain reasoning capabilities emerge only as model size grows
  • A shift toward small models (SMs) is underway, driven by the resource constraints of academic researchers and businesses

Comparison between LLMs and SMs:

  • Accuracy: LLMs deliver superior performance, while SMs can achieve comparable results through techniques like knowledge distillation
  • Generality: LLMs generalize broadly, while SMs are more specialized and can outperform LLMs on specific tasks when trained on domain-specific datasets
  • Efficiency: LLMs require substantial resources, whereas SMs offer competitive performance at far lower resource cost
  • Interpretability: SMs are more transparent and interpretable than larger models

Role of Small Models:

  • Collaboration: SMs can strike a balance between power and efficiency, enabling systems that are cost-effective and scalable
  • Competition: SMs have advantages like simplicity, lower cost, and greater interpretability; assessing trade-offs depends on task/application requirements

2 Collaboration

2.1 Small Models Enhance LLMs

2.1.1 Data Curation

SMs and LLMs Collaboration Framework:

  • Small Models (SMs) Enhance LLMs:
    • Data Curation:
      • Pre-training data curation
        • Use SMs to select high-quality subsets from large datasets
        • Benefits: enhances model performance by focusing on data quality rather than quantity
        • Techniques: simple classifiers trained for content assessment, perplexity scores from proxy language models, data reweighting using domain weights
      • Instruction-tuning data curation
        • Use SMs to select influential data for efficient instruction tuning
        • Approaches: Model-oriented data selection (MoDS), the LESS framework

Pre-training Data Curation:

  • Less is more paradigm: prioritize quality over quantity
    • Scale and complexity of raw text data make rule-based methods inadequate
  • Importance of selecting high-quality subsets for efficient data curation
    • Techniques using small models to assess content quality, remove noise, toxicity, and private information
  • Data reweighting: adjust sampling probabilities based on domain weights trained by a proxy model
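
A minimal sketch of the perplexity-based filtering described above, assuming a small proxy LM from Hugging Face transformers (gpt2 here) and an illustrative threshold:

```python
# Score each raw document by its perplexity under a small proxy LM and
# keep only the fluent ones. Model choice and threshold are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Exponentiated cross-entropy of the document under the proxy LM.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def filter_corpus(docs, max_ppl=100.0):
    # Keep documents the proxy model finds fluent (low perplexity).
    return [d for d in docs if perplexity(d) < max_ppl]

docs = ["The cat sat on the mat.", "asdf qwer zxcv uiop 1234 !!@@"]
print(filter_corpus(docs))
```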

Instruction-tuning Data Curation:

  • Recent findings suggest that strong alignment can be achieved with fewer carefully curated instruction examples
    • Importance of selecting high-quality data for efficient instruction tuning
  • Approaches using small models to evaluate instruction data based on quality, coverage, and necessity.

Future Directions:

  1. Develop more nuanced criteria for evaluating data quality: factuality, safety, diversity
  2. Explore the potential of small models in curating synthetic data to supplement limited human-generated data

2.1.2 Weak-to-Strong Paradigm

Weak-to-Strong Paradigm

Background:

  • LLMs aligned with human values through RLHF
  • As models approach superhuman capability, their behavior on complex tasks becomes difficult for humans to evaluate
  • Introducing weak-to-strong generalization paradigm
    • Using smaller models as supervisors for larger ones

Advantages:

  • Enabling strong models to generalize beyond limitations of weaker ones
  • Several variants proposed: diverse set of weak teachers, reliability estimation, collaboration during inference phase

Comparison with Data Labeling:

  • Weak models can collaborate with large models during inference phase for alignment enhancement

Examples:

  • Aligner: learn correctional residuals between preferred and dispreferred responses
  • Weak-to-Strong Search: maximize log-likelihood difference between small tuned and untuned models
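
A minimal sequence-level sketch of the Weak-to-Strong Search idea, assuming stand-in gpt2 checkpoints for the tuned/untuned pair (in practice these would be a small aligned model and its base):

```python
# Rank candidate responses from a large model by the log-likelihood gap
# between a small tuned (aligned) model and its untuned base.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tuned = AutoModelForCausalLM.from_pretrained("gpt2")    # stand-in: small tuned model
untuned = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in: untuned base

def logprob(model, text: str) -> float:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean NLL per token
    # Approximate total log-likelihood of the sequence.
    return -loss.item() * enc["input_ids"].size(1)

def rerank(candidates):
    # Prefer responses the tuned model likes far more than the untuned one.
    return max(candidates, key=lambda c: logprob(tuned, c) - logprob(untuned, c))
```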

**Future Directions:**

  1. Ensure the strong model deeply understands the task, can correct the weak model's errors, and aligns naturally with the objectives
  2. Develop a deeper understanding of the mechanisms governing the success or failure of alignment methods: theoretical analysis (Lang et al., 2024), errors in weak supervision (Guo and Yang, 2024), scaling laws for extrapolating generalization errors (Kaplan et al., 2020)

2.1.3 Efficient Inference

Model Ensembling

  • Larger models are more powerful but have significant costs, including slower inference speed and higher API prices
  • Smaller models offer advantages in terms of lower cost and faster inference, especially for simple queries
  • Ensemble methods can be used to achieve cost-effective inference by combining multiple models of varying sizes
    • Model cascading: sequential use of multiple models, where each model has a different level of complexity
      • Output of one model triggers activation of next model in sequence
      • Allows for collaboration between models and transferring tasks to larger models
      • Critical step is determining when to escalate query to more complex model
        • Techniques train small evaluator to assess correctness, confidence, or quality of model output
          • Some LLMs can perform self-verification and provide confidence levels
    • Model routing: dynamically directs input data to most appropriate models based on performance
      • Straightforward approach: select best-performing model from all models
        • Does not significantly reduce inference costs
      • Efficient, reward-based routers trained to select optimal models without accessing their outputs
        • Retrieval-based dynamic router assigns instances with similar semantic embeddings to same expert
          • RouteLLM uses human preference data and data augmentation to train small router model
        • FORC proposes meta-model to assign queries to most suitable model without requiring execution of large models
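
A minimal sketch of the cascading strategy above; the two answer functions and the confidence threshold are hypothetical placeholders for a real small model, a large-model API, and a tuned operating point:

```python
# Answer with a small model first; escalate to a large model only when
# the small model's confidence falls below a threshold.
def small_model_answer(query: str):
    # Placeholder: a real system returns the small LM's answer plus a
    # confidence signal (e.g., mean token probability or a verifier score).
    return "small-model answer", 0.6

def large_model_answer(query: str):
    # Placeholder for an expensive large-model call (e.g., an LLM API).
    return "large-model answer"

def cascade(query: str, threshold: float = 0.8) -> str:
    answer, confidence = small_model_answer(query)
    if confidence >= threshold:
        return answer                 # cheap path: small model suffices
    return large_model_answer(query)  # escalate only the hard queries

print(cascade("What year did the French Revolution begin?"))
```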

Speculative Decoding

  • Aims to speed up the decoding of a generative model by running a smaller, faster auxiliary model alongside the larger main model
  • The auxiliary model drafts multiple token candidates, which the larger model then verifies or corrects
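
A minimal sketch using Hugging Face transformers' assisted generation, one concrete implementation of this idea; model choices are illustrative (the draft model must share the main model's tokenizer):

```python
# A small draft model proposes tokens in parallel; the large model verifies
# them, accepting correct prefixes and resampling where they diverge.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-large")
main = AutoModelForCausalLM.from_pretrained("gpt2-large")  # verifier
draft = AutoModelForCausalLM.from_pretrained("gpt2")       # fast drafter

inputs = tok("Small models can", return_tensors="pt")
out = main.generate(**inputs, assistant_model=draft, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```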

2.1.4 Evaluating LLMs Effectively

Evaluating Large Language Models (LLMs)

  • Traditional evaluation methods like BLEU and ROUGE have limitations for capturing nuanced semantic meaning and compositional diversity of generated text
  • Model-based approaches use smaller models to automatically assess performance
    • Examples: BERTSCORE for machine translation, image captioning (Zhang et al., 2020b)
    • BARTSCORE for various perspectives including informativeness, fluency, and factuality (Yuan et al., 2021)
  • Some methods use small natural language inference (NLI) models to estimate the uncertainty of LLM responses
  • Proxy models can be employed to predict LLM performance, reducing computational costs during model selection.
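
A minimal sketch of model-based evaluation with the bert-score package (pip install bert-score); the example strings are illustrative:

```python
# Score generated text against references with a small encoder model.
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# P/R/F1 are tensors of embedding-matching precision, recall, and F1.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```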

Future Directions

  • As large models generate longer and more complex texts, it becomes essential to develop efficient evaluators for assessing various aspects:
    • Factuality (Min et al., 2023b)
    • Safety (Zhang et al., 2024b)
    • Uncertainty (Huang et al., 2023).

2.1.5 Domain Adaptation

Domain Adaptation for Large Language Models (LLMs)

Background:

  • LLMs require further customization for optimal performance in specific use cases and domains
  • Fine-tuning on specialized data is resource-intensive and not always feasible
  • Recent research explores adapting LLMs using smaller models

**Two Approaches:**

  1. White-Box Adaptation:
    • Fine-tunes a small model to adjust the token distributions of frozen LLMs for specific domains
    • Examples: CombLM (Ormazabal et al., 2023), IPA (Lu et al., 2023b), Proxy-tuning (Liu et al., 2024a); see the sketch after this list
    • Only the small domain-specific expert's parameters are modified, allowing LLMs to be adapted to specific tasks
  2. Black-Box Adaptation:
    • Uses a small domain-specific model to guide LLMs toward target domains by providing relevant textual knowledge
    • Example: Retrieval-Augmented Generation (RAG)
    • Enhances the base LLM's performance without requiring access to internal model parameters
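
A minimal sketch of the white-box (proxy-tuning-style) logit adjustment, using stand-in gpt2 checkpoints; in Proxy-tuning the small tuned/untuned pair shares the large model's vocabulary:

```python
# Shift the frozen large model's next-token logits by the difference
# between a small tuned expert and its untuned base.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
large = AutoModelForCausalLM.from_pretrained("gpt2-large")  # frozen base
expert = AutoModelForCausalLM.from_pretrained("gpt2")       # stand-in: tuned small model
anti = AutoModelForCausalLM.from_pretrained("gpt2")         # stand-in: untuned small model

def next_token(prompt: str) -> str:
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        l = large(**enc).logits[0, -1]
        e = expert(**enc).logits[0, -1]
        a = anti(**enc).logits[0, -1]
    # Proxy-tuned distribution: base logits plus the small models' tuning delta.
    return tok.decode((l + (e - a)).argmax())
```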

Summary:

  • Fine-tuning large models for specific domains is resource-intensive
  • Adapting LLMs using smaller domain-specific models offers a cost-effective solution
  • White-Box Adaptation: fine-tunes small models and guides base models during decoding
  • Black-Box Adaptation: uses small expert models to provide relevant textual knowledge, strengthening the base model's grasp of the target domain

**Future Directions:**

  1. Develop techniques for adapting LLMs using a broader range of diverse models
  2. Investigate methods to adapt LLMs using a limited number of samples (Sun et al., 2024)

2.1.6 Retrieval Augmented Generation

Retrieval-Augmented Generation (RAG)

Background:

  • LLMs exhibit reasoning capabilities but limited memory
  • Struggle with domain expertise and up-to-date information
  • RAG enhances LLMs by using a lightweight retriever to access external knowledge bases, document collections, or other tools

Advantages:

  • Mitigates factually inaccurate content (hallucinations)
  • Categories of Retrieval Sources: Textual Documents, Structured Knowledge, Other Sources

Textual Document Sources:

  • Most commonly used in RAG methods
  • Encompasses resources such as Wikipedia, cross-lingual text, and domain-specific corpora
  • Lightweight retrievers like sparse BM25 or dense BERT-based models are employed
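
A minimal sketch of a lightweight sparse retriever with the rank-bm25 package (pip install rank-bm25); the corpus and query are illustrative:

```python
# Retrieve the best-matching passage with BM25 and prepend it to the prompt.
from rank_bm25 import BM25Okapi

corpus = [
    "Paris is the capital of France.",
    "BM25 is a sparse lexical retrieval function.",
    "Small models can serve as retrievers in RAG pipelines.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "What is the capital of France?"
top = bm25.get_top_n(query.lower().split(), corpus, n=1)

# The retrieved passage is concatenated with the query for the LLM.
prompt = f"Context: {top[0]}\nQuestion: {query}\nAnswer:"
print(prompt)
```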

Structured Knowledge Sources:

  • Verified information from knowledge bases and databases
  • Enhances answers by concatenating retrieved tables with queries (KnowledgeGPT, T-RAG)
  • Retriever can be a lightweight entity linker, query executor, or API

Other Sources:

  • Codes, tools, images enabling LLMs to leverage external information for enhanced reasoning (DocPrompting, Toolformer)

**Future Directions:**

  1. Develop robust approaches to integrating noisy retrieved texts
  2. Extend RAG to multimodal scenarios beyond text-only information (images, audio)

2.1.7 Prompt-based Learning

Prompt-Based Learning

Definition: Paradigm for few-shot or zero-shot learning where prompts are crafted to facilitate adaptation to new scenarios with minimal labeled data.

Advantages: Leverages In-Context Learning (ICL), operates without parameter updates, and can use small models to enhance performance.

Techniques: Uprise - optimizes a lightweight retriever for zero-shot tasks; DaSLaM - uses a small model to break down complex problems into subproblems; generating pseudo labels or verifying/rewriting outputs of LLMs using small models.

Summary: Efficient process handling complex tasks without the need for parameter updates, using prompts embedded in natural language templates and small models to augment performance.

Future Directions: Exploring ways to develop trustworthy, safe, and fair LLMs within the prompt-based learning paradigm by leveraging small models.

2.1.8 Deficiency Repair

Deficiency Repair for LLMs

Powerful LLMs vs. Small Models:

  • Powerful LLMs may generate repeated, untruthful, and toxic content
  • Small models can be used to repair these defects

Two Ways to Achieve Deficiency Repair:

  1. Contrastive Decoding: chooses tokens that maximize the log-likelihood difference between a larger model (expert) and a smaller model (amateur); see the sketch after this list
  2. Small Model Plugins: fine-tune a specialized small model to address the shortcomings of a larger model

  • Address unseen words (Out-Of-Vocabulary) by training a small model to mimic the behavior of the large model
  • Detect hallucinations or calibrate confidence scores
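
A minimal sketch of contrastive decoding for a single next token, assuming gpt2-large as the expert and gpt2 as the amateur; real implementations apply this inside a full decoding loop:

```python
# Pick the token that maximizes the log-probability gap between the
# expert and the amateur, restricted to tokens the expert finds plausible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("gpt2-large")
amateur = AutoModelForCausalLM.from_pretrained("gpt2")

def contrastive_next_token(prompt: str, alpha: float = 0.1) -> str:
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        log_e = expert(**enc).logits[0, -1].log_softmax(-1)
        log_a = amateur(**enc).logits[0, -1].log_softmax(-1)
    # Plausibility constraint: only tokens near the expert's top choice.
    mask = log_e >= log_e.max() + torch.log(torch.tensor(alpha))
    scores = torch.where(mask, log_e - log_a, torch.tensor(float("-inf")))
    return tok.decode(scores.argmax())
```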

Summary and Future Directions:

  • Existing work explores synergistic use of logits from both LLMs and SMs to reduce repeated text, mitigate hallucinations, augment reasoning capabilities, and safeguard user privacy
  • Proxy Tuning: Fine-tunes a small model and contrasts the difference between the original LLMs and small models to adapt to the target task
  • Future Directions:
    • Extend the use of small models to fix flaws of large models in other problems, e.g., mathematical reasoning
    • Expand the range of knowledge transferred from the teacher model, including feedback on the student's outputs and feature knowledge
    • Address trustworthiness issues (helpfulness, honesty, harmlessness) when transferring skills from LLMs to small models

2.2 LLMs Enhance Small Models

Knowledge Distillation:

  • Scaling models to larger sizes is computationally expensive
  • Knowledge Distillation offers an effective solution by training a smaller student model to replicate the behavior of a larger teacher model
    • White-box distillation involves using internal states of the teacher model, providing transparency in the training process
    • Black-box knowledge distillation typically involves generating a distillation dataset through the teacher LLM and using it for fine-tuning the student model
  • Recent advancements include Chain-of-Thought distillation and Instruction Following Distillation to enhance reasoning abilities of smaller models
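
A minimal sketch of the classic white-box distillation objective: a temperature-softened KL term between teacher and student logits combined with cross-entropy on gold labels (temperature and mixing weight are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: student matches the teacher's softened distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the true labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits over 10 classes.
s, t = torch.randn(4, 10), torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(distillation_loss(s, t, y))
```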

Data Synthesis:

  • Human-created data is finite, and large models are not always necessary for specific tasks
  • Using LLMs to generate training data or augment existing data can be efficient and feasible
    • Training Data Generation: Generating a dataset from scratch using LLMs, followed by training a small task-specific model
    • Data Augmentation: Modifying existing data points using LLMs to increase diversity and train smaller models
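
A minimal sketch of training-data generation: an LLM (here a hypothetical `generate_with_llm` stub) produces labeled examples, and a small scikit-learn classifier is trained on them:

```python
# Generate a synthetic labeled dataset with an LLM, then fit a small
# task-specific model on it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def generate_with_llm(label: str, n: int) -> list[str]:
    # Placeholder: a real system would prompt an LLM, e.g.
    # "Write a {label} movie review.", and collect n completions.
    canned = {"positive": "A wonderful, moving film.",
              "negative": "A dull, lifeless mess."}
    return [canned[label]] * n

texts, labels = [], []
for label in ("positive", "negative"):
    for text in generate_with_llm(label, n=50):
        texts.append(text)
        labels.append(label)

# Train a small model on the synthetic dataset.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["An absolutely wonderful film."]))
```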

3 Competition

Preferability of Smaller Models: Competition vs. Collaboration

Background:

  • LLMs' impressive capabilities come with substantial computational demands
  • Scaling model size leads to steep increases in training cost and higher inference latency
  • High computational overhead prevents application in computation-constrained environments

Computation-constrained Environment (§3.1)

  • Smaller models are preferable due to lower resource requirements
    • Faster training and deployment times
    • Lower hardware needs
    • Reduced energy consumption
  • Examples: Phi-3.8B, MiniCPM, Gemma
  • Techniques like knowledge distillation enable transfer of knowledge from LLMs

Task-specific Environment (§3.2)

  • Not all tasks require large models
    • Diminishing returns observed in certain tasks
    • Tasks like information retrieval have critical inference speed requirements
  • Domain-specific tasks: biomedical, legal, tabular learning, short text tasks, and other specialized areas
    • Small models can outperform LLMs due to domain expertise
    • Fewer training tokens available for some domains

Interpretability-required Environment (§3.3)

  • Smaller, simpler models offer better interpretability compared to larger, more complex ones
    • Transparency: understanding how the model works
  • Industries favor small models due to human understanding requirements
    • Healthcare, finance, law examples

Conclusion

  • Collaboration between LLMs and SMs in balancing performance and efficiency
  • Competition under specific conditions (computation-constrained environments, task-specific applications, interpretability)
  • Careful evaluation of trade-offs is essential when selecting models for tasks or applications