What is the Role of Small Models in the LLM Era: A Survey

by Lihu Chen, Gaël Varoquaux

Imperial College London, UK

https://arxiv.org/pdf/2409.06857

TL;DR

  1. Data Curation:
    • Pre-training data curation: Selecting high-quality subsets from large datasets
    • Instruction-tuning data curation: Selecting influential data for efficient instruction tuning
  2. Weak-to-Strong Paradigm:
    • Acting as supervisors for larger models
    • Enhancing alignment of LLMs with human values
  3. Efficient Inference:
    • Model ensembling: Combining multiple models of varying sizes for cost-effective inference
    • Model cascading: Sequential use of multiple models with different complexity levels
    • Model routing: Directing input data to the most appropriate models based on performance
  4. Evaluating LLMs:
    • Automatically assessing LLM performance (e.g., BERTSCORE, BARTSCORE)
    • Estimating uncertainty of LLM responses
    • Predicting LLM performance to reduce computational costs during model selection
  5. Domain Adaptation:
    • White-box adaptation: Fine-tuning small models to adjust token distributions of frozen LLMs
    • Black-box adaptation: Guiding LLMs toward target domains by providing relevant knowledge
  6. Retrieval-Augmented Generation (RAG):
    • Acting as lightweight retrievers to access external knowledge bases or document collections
  7. Prompt-based Learning:
    • Optimizing retrievers for zero-shot tasks
    • Breaking down complex problems into subproblems
    • Generating pseudo labels or verifying/rewriting outputs of LLMs
  8. Deficiency Repair:
    • Addressing repeated, untruthful, or toxic content generated by LLMs
    • Contrastive decoding: Choosing tokens that maximize log-likelihood difference between larger and smaller models
    • Addressing out-of-vocabulary words and detecting hallucinations
  9. Knowledge Distillation:
    • Acting as student models to replicate behavior of larger teacher models
  10. Data Synthesis:
    • Training on LLM-generated datasets for specific tasks
    • Augmenting existing data using LLM-generated modifications
  11. Computation-constrained Environments:
    • Providing faster training and deployment times
    • Reducing hardware and energy consumption requirements
  12. Task-specific Applications:
    • Excelling in domains with limited training data
    • Outperforming LLMs in specialized areas (e.g., biomedical, legal, tabular learning)
  13. Interpretability-required Environments:
    • Offering better transparency and understanding of model workings
    • Meeting human understanding requirements in industries like healthcare, finance, and law


Introduction

  • Large Language Models (LLMs) have revolutionized NLP through pre-training and fine-tuning paradigms
  • LLMs have demonstrated exceptional performance across a range of tasks, including language generation, understanding, and domain-specific applications
  • Theories suggest that certain reasoning capabilities emerge only as model size grows
  • A shift toward small models (SMs) is underway, driven by the resource constraints of academic researchers and businesses

Comparison between LLMs and SMs:

  • Accuracy: LLMs deliver superior performance, while SMs can achieve comparable results through techniques like knowledge distillation
  • Generality: LLMs generalize broadly, while SMs are more specialized and can outperform LLMs on specific tasks when trained on domain-specific datasets
  • Efficiency: LLMs require substantial resources, whereas SMs offer competitive performance at far lower resource cost
  • Interpretability: SMs are more transparent and interpretable than larger models

Role of Small Models:

  • Collaboration: SMs can strike a balance between power and efficiency, enabling systems that are cost-effective and scalable
  • Competition: SMs have advantages like simplicity, lower cost, and greater interpretability; assessing trade-offs depends on task/application requirements

2 Collaboration

2.1 Small Models Enhance LLMs

2.1.1 Data Curation

SMs and LLMs Collaboration Framework:

  • Small Models (SMs) Enhance LLMs:
    • Data Curation:
      • Pre-training data curation
        • Use SMs to select high-quality subsets from large datasets
        • Benefits: enhances model performance by focusing on data quality rather than quantity
        • Techniques: simple classifiers trained for content assessment, perplexity scores from proxy language models, data reweighting using domain weights
      • Instruction-tuning data curation
        • Use SMs to select influential data for efficient instruction tuning
        • Approaches: Model-oriented data selection (MoDS), the LESS framework

Pre-training Data Curation:

  • Less is more paradigm: prioritize quality over quantity
    • Scale and complexity of raw text data make rule-based methods inadequate
  • Importance of selecting high-quality subsets for efficient data curation
    • Techniques using small models to assess content quality, remove noise, toxicity, and private information
  • Data reweighting: adjust sampling probabilities based on domain weights trained by a proxy model
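
A minimal sketch of the perplexity-based filtering described above, assuming a small proxy LM from Hugging Face transformers (gpt2 here) and an illustrative threshold:

```python
# Score each raw document by its perplexity under a small proxy LM and
# keep only the fluent ones. Model choice and threshold are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Exponentiated cross-entropy of the document under the proxy LM.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def filter_corpus(docs, max_ppl=100.0):
    # Keep documents the proxy model finds fluent (low perplexity).
    return [d for d in docs if perplexity(d) < max_ppl]

docs = ["The cat sat on the mat.", "asdf qwer zxcv uiop 1234 !!@@"]
print(filter_corpus(docs))
```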

Instruction-tuning Data Curation:

  • Recent findings suggest that strong alignment can be achieved with fewer carefully curated instruction examples
    • Importance of selecting high-quality data for efficient instruction tuning
  • Approaches using small models to evaluate instruction data based on quality, coverage, and necessity.

Future Directions:

  1. Develop more nuanced criteria for evaluating data quality: factuality, safety, diversity
  2. Explore the potential of small models in curating synthetic data to supplement limited human-generated data

2.1.2 Weak-to-Strong Paradigm

Weak-to-Strong Paradigm

Background:

  • LLMs aligned with human values through RLHF
  • As models approach superhuman capability, their behavior on complex tasks becomes difficult for humans to evaluate
  • Introducing weak-to-strong generalization paradigm
    • Using smaller models as supervisors for larger ones

Advantages:

  • Enabling strong models to generalize beyond limitations of weaker ones
  • Several variants proposed: diverse set of weak teachers, reliability estimation, collaboration during inference phase

Comparison with Data Labeling:

  • Weak models can collaborate with large models during inference phase for alignment enhancement

Examples:

  • Aligner: learn correctional residuals between preferred and dispreferred responses
  • Weak-to-Strong Search: maximize log-likelihood difference between small tuned and untuned models
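
A minimal sequence-level sketch of the Weak-to-Strong Search idea, assuming stand-in gpt2 checkpoints for the tuned/untuned pair (in practice these would be a small aligned model and its base):

```python
# Rank candidate responses from a large model by the log-likelihood gap
# between a small tuned (aligned) model and its untuned base.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tuned = AutoModelForCausalLM.from_pretrained("gpt2")    # stand-in: small tuned model
untuned = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in: untuned base

def logprob(model, text: str) -> float:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean NLL per token
    # Approximate total log-likelihood of the sequence.
    return -loss.item() * enc["input_ids"].size(1)

def rerank(candidates):
    # Prefer responses the tuned model likes far more than the untuned one.
    return max(candidates, key=lambda c: logprob(tuned, c) - logprob(untuned, c))
```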

**Future Directions:**

  1. Ensure the strong model deeply understands the task, can correct the weak model's errors, and aligns naturally with the objectives
  2. Develop a deeper understanding of the mechanisms governing the success or failure of alignment methods: theoretical analysis (Lang et al., 2024), errors in weak supervision (Guo and Yang, 2024), scaling laws for extrapolating generalization errors (Kaplan et al., 2020)

2.1.3 Efficient Inference

Model Ensembling

  • Larger models are more powerful but have significant costs, including slower inference speed and higher API prices
  • Smaller models offer advantages in terms of lower cost and faster inference, especially for simple queries
  • Ensemble methods can be used to achieve cost-effective inference by combining multiple models of varying sizes
    • Model cascading: sequential use of multiple models, where each model has a different level of complexity
      • Output of one model triggers activation of next model in sequence
      • Allows for collaboration between models and transferring tasks to larger models
      • Critical step is determining when to escalate query to more complex model
        • Techniques train small evaluator to assess correctness, confidence, or quality of model output
          • Some LLMs can perform self-verification and provide confidence levels
    • Model routing: dynamically directs input data to most appropriate models based on performance
      • Straightforward approach: select best-performing model from all models
        • Does not significantly reduce inference costs
      • Efficient, reward-based routers trained to select optimal models without accessing their outputs
        • Retrieval-based dynamic router assigns instances with similar semantic embeddings to same expert
          • RouteLLM uses human preference data and data augmentation to train small router model
        • FORC proposes meta-model to assign queries to most suitable model without requiring execution of large models
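
A minimal sketch of the cascading strategy above; the two answer functions and the confidence threshold are hypothetical placeholders for a real small model, a large-model API, and a tuned operating point:

```python
# Answer with a small model first; escalate to a large model only when
# the small model's confidence falls below a threshold.
def small_model_answer(query: str):
    # Placeholder: a real system returns the small LM's answer plus a
    # confidence signal (e.g., mean token probability or a verifier score).
    return "small-model answer", 0.6

def large_model_answer(query: str):
    # Placeholder for an expensive large-model call (e.g., an LLM API).
    return "large-model answer"

def cascade(query: str, threshold: float = 0.8) -> str:
    answer, confidence = small_model_answer(query)
    if confidence >= threshold:
        return answer                 # cheap path: small model suffices
    return large_model_answer(query)  # escalate only the hard queries

print(cascade("What year did the French Revolution begin?"))
```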

Speculative Decoding

  • Aims to speed up the decoding of a generative model by running a smaller, faster auxiliary model alongside the larger main model
  • The auxiliary model drafts multiple token candidates, which the larger model then verifies or corrects
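
A minimal sketch using Hugging Face transformers' assisted generation, one concrete implementation of this idea; model choices are illustrative (the draft model must share the main model's tokenizer):

```python
# A small draft model proposes tokens in parallel; the large model verifies
# them, accepting correct prefixes and resampling where they diverge.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-large")
main = AutoModelForCausalLM.from_pretrained("gpt2-large")  # verifier
draft = AutoModelForCausalLM.from_pretrained("gpt2")       # fast drafter

inputs = tok("Small models can", return_tensors="pt")
out = main.generate(**inputs, assistant_model=draft, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```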

2.1.4 Evaluating LLMs Effectively

Evaluating Large Language Models (LLMs)

  • Traditional evaluation methods like BLEU and ROUGE have limitations for capturing nuanced semantic meaning and compositional diversity of generated text
  • Model-based approaches use smaller models to automatically assess performance
    • Examples: BERTSCORE for machine translation, image captioning (Zhang et al., 2020b)
    • BARTSCORE for various perspectives including informativeness, fluency, and factuality (Yuan et al., 2021)
  • Some methods use small natural language inference (NLI) models to estimate the uncertainty of LLM responses
  • Proxy models can be employed to predict LLM performance, reducing computational costs during model selection.
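
A minimal sketch of model-based evaluation with the bert-score package (pip install bert-score); the example strings are illustrative:

```python
# Score generated text against references with a small encoder model.
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# P/R/F1 are tensors of embedding-matching precision, recall, and F1.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```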

Future Directions

  • As large models generate longer and more complex texts, it becomes essential to develop efficient evaluators for assessing various aspects:
    • Factuality (Min et al., 2023b)
    • Safety (Zhang et al., 2024b)
    • Uncertainty (Huang et al., 2023).

2.1.5 Domain Adaptation

Domain Adaptation for Large Language Models (LLMs)

Background:

  • LLMs require further customization for optimal performance in specific use cases and domains
  • Fine-tuning on specialized data is resource-intensive and not always feasible
  • Recent research explores adapting LLMs using smaller models

**Two Approaches:**

  1. White-Box Adaptation:
    • Fine-tunes a small model to adjust the token distributions of frozen LLMs for specific domains
    • Examples: CombLM (Ormazabal et al., 2023), IPA (Lu et al., 2023b), Proxy-tuning (Liu et al., 2024a); see the sketch after this list
    • Only the small domain-specific expert's parameters are modified, allowing LLMs to be adapted to specific tasks
  2. Black-Box Adaptation:
    • Uses a small domain-specific model to guide LLMs toward target domains by providing relevant textual knowledge
    • Example: Retrieval-Augmented Generation (RAG)
    • Enhances the base LLM's performance without requiring access to internal model parameters
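
A minimal sketch of the white-box (proxy-tuning-style) logit adjustment, using stand-in gpt2 checkpoints; in Proxy-tuning the small tuned/untuned pair shares the large model's vocabulary:

```python
# Shift the frozen large model's next-token logits by the difference
# between a small tuned expert and its untuned base.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
large = AutoModelForCausalLM.from_pretrained("gpt2-large")  # frozen base
expert = AutoModelForCausalLM.from_pretrained("gpt2")       # stand-in: tuned small model
anti = AutoModelForCausalLM.from_pretrained("gpt2")         # stand-in: untuned small model

def next_token(prompt: str) -> str:
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        l = large(**enc).logits[0, -1]
        e = expert(**enc).logits[0, -1]
        a = anti(**enc).logits[0, -1]
    # Proxy-tuned distribution: base logits plus the small models' tuning delta.
    return tok.decode((l + (e - a)).argmax())
```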

Summary:

  • Fine-tuning large models for specific domains is resource-intensive
  • Adapting LLMs using smaller domain-specific models offers a cost-effective solution
  • White-Box Adaptation: fine-tunes small models and guides base models during decoding
  • Black-Box Adaptation: uses small expert models to provide relevant textual knowledge, strengthening the base model's grasp of the target domain

**Future Directions:**

  1. Develop techniques for adapting LLMs using a broader range of diverse models
  2. Investigate methods to adapt LLMs using a limited number of samples (Sun et al., 2024)

2.1.6 Retrieval Augmented Generation

Retrieval-Augmented Generation (RAG)

Background:

  • LLMs exhibit reasoning capabilities but limited memory
  • Struggle with domain expertise and up-to-date information
  • RAG enhances LLMs by using a lightweight retriever to access external knowledge bases, document collections, or other tools

Advantages:

  • Mitigates factually inaccurate content (hallucinations)
  • Categories of Retrieval Sources: Textual Documents, Structured Knowledge, Other Sources

Textual Document Sources:

  • Most commonly used in RAG methods
  • Encompasses resources such as Wikipedia, cross-lingual text, and domain-specific corpora
  • Lightweight retrievers like sparse BM25 or dense BERT-based models are employed
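
A minimal sketch of a lightweight sparse retriever with the rank-bm25 package (pip install rank-bm25); the corpus and query are illustrative:

```python
# Retrieve the best-matching passage with BM25 and prepend it to the prompt.
from rank_bm25 import BM25Okapi

corpus = [
    "Paris is the capital of France.",
    "BM25 is a sparse lexical retrieval function.",
    "Small models can serve as retrievers in RAG pipelines.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "What is the capital of France?"
top = bm25.get_top_n(query.lower().split(), corpus, n=1)

# The retrieved passage is concatenated with the query for the LLM.
prompt = f"Context: {top[0]}\nQuestion: {query}\nAnswer:"
print(prompt)
```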

Structured Knowledge Sources:

  • Verified information from knowledge bases and databases
  • Enhances answers by concatenating retrieved tables with queries (KnowledgeGPT, T-RAG)
  • Retriever can be a lightweight entity linker, query executor, or API

Other Sources:

  • Codes, tools, images enabling LLMs to leverage external information for enhanced reasoning (DocPrompting, Toolformer)

**Future Directions:**

  1. Develop robust approaches to integrating noisy retrieved texts
  2. Extend RAG to multimodal scenarios beyond text-only information (images, audio)

2.1.7 Prompt-based Learning

Prompt-Based Learning

Definition: Paradigm for few-shot or zero-shot learning where prompts are crafted to facilitate adaptation to new scenarios with minimal labeled data.

Advantages: Leverages In-Context Learning (ICL), operates without parameter updates, and can use small models to enhance performance.

Techniques: Uprise - optimizes a lightweight retriever for zero-shot tasks; DaSLaM - uses a small model to break down complex problems into subproblems; generating pseudo labels or verifying/rewriting outputs of LLMs using small models.

Summary: Efficient process handling complex tasks without the need for parameter updates, using prompts embedded in natural language templates and small models to augment performance.

Future Directions: Exploring ways to develop trustworthy, safe, and fair LLMs within the prompt-based learning paradigm by leveraging small models.

2.1.8 Deficiency Repair

Deficiency Repair for LLMs

Powerful LLMs vs. Small Models:

  • Powerful LLMs may generate repeated, untruthful, and toxic content
  • Small models can be used to repair these defects

Two Ways to Achieve Deficiency Repair:

  1. Contrastive Decoding: chooses tokens that maximize the log-likelihood difference between a larger model (expert) and a smaller model (amateur); see the sketch after this list
  2. Small Model Plugins: fine-tune a specialized small model to address the shortcomings of a larger model

  • Address unseen words (Out-Of-Vocabulary) by training a small model to mimic the behavior of the large model
  • Detect hallucinations or calibrate confidence scores
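
A minimal sketch of contrastive decoding for a single next token, assuming gpt2-large as the expert and gpt2 as the amateur; real implementations apply this inside a full decoding loop:

```python
# Pick the token that maximizes the log-probability gap between the
# expert and the amateur, restricted to tokens the expert finds plausible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("gpt2-large")
amateur = AutoModelForCausalLM.from_pretrained("gpt2")

def contrastive_next_token(prompt: str, alpha: float = 0.1) -> str:
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        log_e = expert(**enc).logits[0, -1].log_softmax(-1)
        log_a = amateur(**enc).logits[0, -1].log_softmax(-1)
    # Plausibility constraint: only tokens near the expert's top choice.
    mask = log_e >= log_e.max() + torch.log(torch.tensor(alpha))
    scores = torch.where(mask, log_e - log_a, torch.tensor(float("-inf")))
    return tok.decode(scores.argmax())
```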

Summary and Future Directions:

  • Existing work explores synergistic use of logits from both LLMs and SMs to reduce repeated text, mitigate hallucinations, augment reasoning capabilities, and safeguard user privacy
  • Proxy Tuning: Fine-tunes a small model and contrasts the difference between the original LLMs and small models to adapt to the target task
  • Future Directions:
    • Extend the use of small models to fix flaws of large models in other problems, e.g., mathematical reasoning
    • Expand the range of knowledge transferred from the teacher model, including feedback on the student's outputs and feature knowledge
    • Address trustworthiness issues (helpfulness, honesty, harmlessness) when transferring skills from LLMs to small models

2.2 LLMs Enhance Small Models

Knowledge Distillation:

  • Scaling models to larger sizes is computationally expensive
  • Knowledge Distillation offers an effective solution by training a smaller student model to replicate the behavior of a larger teacher model
    • White-box distillation involves using internal states of the teacher model, providing transparency in the training process
    • Black-box knowledge distillation typically involves generating a distillation dataset through the teacher LLM and using it for fine-tuning the student model
  • Recent advancements include Chain-of-Thought distillation and Instruction Following Distillation to enhance reasoning abilities of smaller models
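
A minimal sketch of the classic white-box distillation objective: a temperature-softened KL term between teacher and student logits combined with cross-entropy on gold labels (temperature and mixing weight are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: student matches the teacher's softened distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the true labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits over 10 classes.
s, t = torch.randn(4, 10), torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(distillation_loss(s, t, y))
```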

Data Synthesis:

  • Human-created data is finite, and large models are not always necessary for specific tasks
  • Using LLMs to generate training data or augment existing data can be efficient and feasible
    • Training Data Generation: Generating a dataset from scratch using LLMs, followed by training a small task-specific model
    • Data Augmentation: Modifying existing data points using LLMs to increase diversity and train smaller models
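
A minimal sketch of training-data generation: an LLM (here a hypothetical `generate_with_llm` stub) produces labeled examples, and a small scikit-learn classifier is trained on them:

```python
# Generate a synthetic labeled dataset with an LLM, then fit a small
# task-specific model on it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def generate_with_llm(label: str, n: int) -> list[str]:
    # Placeholder: a real system would prompt an LLM, e.g.
    # "Write a {label} movie review.", and collect n completions.
    canned = {"positive": "A wonderful, moving film.",
              "negative": "A dull, lifeless mess."}
    return [canned[label]] * n

texts, labels = [], []
for label in ("positive", "negative"):
    for text in generate_with_llm(label, n=50):
        texts.append(text)
        labels.append(label)

# Train a small model on the synthetic dataset.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["An absolutely wonderful film."]))
```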

3 Competition

Preferability of Smaller Models: Competition vs. Collaboration

Background:

  • LLMs' impressive capabilities come with substantial computational demands
  • Scaling model size leads to steep increases in training cost and higher inference latency
  • High computational overhead prevents application in computation-constrained environments

Computation-constrained Environment (§3.1)

  • Smaller models are preferable due to lower resource requirements
    • Faster training and deployment times
    • Lower hardware needs
    • Reduced energy consumption
  • Examples: Phi-3.8B, MiniCPM, Gemma
  • Techniques like knowledge distillation enable transfer of knowledge from LLMs

Task-specific Environment (§3.2)

  • Not all tasks require large models
    • Diminishing returns observed in certain tasks
    • Tasks like information retrieval have critical inference speed requirements
  • Domain-specific tasks: biomedical, legal, tabular learning, short text tasks, and other specialized areas
    • Small models can outperform LLMs due to domain expertise
    • Fewer training tokens available for some domains

Interpretability-required Environment (§3.3)

  • Smaller, simpler models offer better interpretability compared to larger, more complex ones
    • Transparency: understanding how the model works
  • Industries favor small models due to human understanding requirements
    • Healthcare, finance, law examples

Conclusion

  • Collaboration between LLMs and SMs in balancing performance and efficiency
  • Competition under specific conditions (computation-constrained environments, task-specific applications, interpretability)
  • Careful evaluation of trade-offs is essential when selecting models for tasks or applications