This repository contains the source code for reproducing the results of the paper *Shakespearean Sparks: The Dance of Hallucination and Creativity in LLMs' Decoding Layers*.
Authors: Zicong He*, Boxuan Zhang*, Lu Cheng (* equal contribution).
This project explores the intricate relationship between hallucination and creativity in Large Language Models (LLMs) at different decoding layers. While hallucination is often considered a flaw, our study provides a quantitative perspective that reveals its potential contribution to creative outputs.
To systematically investigate this relationship, we propose HCL (Hallucination-Creativity Layerwise framework), which:
- Quantifies hallucination and creativity across different model layers.
- Identifies the tradeoff between creativity and factual accuracy.
- Determines the optimal decoding layer that balances both aspects.
Our findings suggest that earlier layers in LLMs tend to produce more creative outputs, while deeper layers prioritize factual accuracy. Leveraging this insight, we introduce a layer-wise early-exit strategy to enhance computational efficiency without sacrificing quality.
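The early-exit idea can be sketched in a few lines. This is a minimal toy illustration of decoding from an intermediate layer, not the repository's implementation: it assumes per-layer hidden states and a single output projection shared across layers (as in Llama-style models, where the unembedding head can in principle be applied to any layer's hidden state); all dimensions and weights below are made up.

```python
import numpy as np

def early_exit_logits(hidden_states, lm_head, exit_layer):
    """Next-token scores decoded from an intermediate layer's hidden state.

    hidden_states: list of (hidden_dim,) arrays, one per layer.
    lm_head: (vocab_size, hidden_dim) output projection shared across layers.
    exit_layer: 0-based index of the layer to exit at.
    """
    h = hidden_states[exit_layer]  # skip all layers above exit_layer
    return lm_head @ h             # unnormalized next-token scores

def greedy_early_exit(hidden_states, lm_head, exit_layer):
    """Greedy token choice at the chosen exit layer."""
    return int(np.argmax(early_exit_logits(hidden_states, lm_head, exit_layer)))

# Toy example with random weights (hypothetical dimensions).
rng = np.random.default_rng(0)
num_layers, hidden_dim, vocab = 4, 8, 16
states = [rng.normal(size=hidden_dim) for _ in range(num_layers)]
W = rng.normal(size=(vocab, hidden_dim))

token_early = greedy_early_exit(states, W, exit_layer=1)
token_final = greedy_early_exit(states, W, exit_layer=num_layers - 1)
```

Exiting at layer 1 skips the remaining layers entirely, which is where the computational savings come from; the two calls above may of course pick different tokens, which is exactly the layer-wise behavior the framework measures.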
```shell
git clone [email protected]:ZicongHe2002/Shakespearean-Sparks-The-Dance-of-Hallucination-and-Creativity-in-LLMs-Decoding-Layers.git
cd code
conda create --name hcl_spark python=3.10
conda activate hcl_spark
pip install -r requirements.txt
./generate0.sh
```
Our research is built upon a three-stage evaluation process:
- Layer-wise Response Sampling: Using an early-exit strategy to extract responses at different layers.
- Evaluation Metrics: Creativity is measured as the semantic diversity of correct responses, while hallucination is assessed by error rates.
- HCB Calculation: We introduce the Hallucination-Creativity Balanced (HCB) score, which helps identify the optimal decoding layer for improved model performance.
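The metrics above can be made concrete with a small sketch. In this toy version (our own simplification, not the paper's exact formulas), creativity is the mean pairwise cosine distance among embeddings of correct responses, hallucination is the error rate over sampled responses, and HCB is an assumed harmonic-mean-style balance between creativity and factual accuracy; consult the paper for the actual HCB definition.

```python
import numpy as np

def creativity_score(correct_embeddings):
    """Semantic diversity: mean pairwise cosine distance among
    embeddings of the correct responses."""
    E = np.asarray(correct_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    iu = np.triu_indices(len(E), k=1)  # unique pairs only
    return float(np.mean(1.0 - sims[iu]))

def hallucination_score(is_correct):
    """Hallucination as the error rate over sampled responses."""
    return 1.0 - float(np.mean(is_correct))

def hcb_score(creativity, hallucination):
    """Assumed harmonic-mean-style balance of creativity and
    factual accuracy (not the paper's exact formula)."""
    accuracy = 1.0 - hallucination
    if creativity + accuracy == 0:
        return 0.0
    return 2 * creativity * accuracy / (creativity + accuracy)

# Toy layer-wise evaluation with made-up numbers.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))             # embeddings of 5 correct responses
c = creativity_score(emb)
h = hallucination_score([1, 1, 1, 0, 1])  # 4/5 correct -> error rate 0.2
score = hcb_score(c, h)
```

Computing such a score per layer and taking the argmax is one natural way to pick the optimal decoding layer that the framework identifies.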
- Creativity comes with hallucination: Models with higher creativity scores also exhibit a greater tendency for hallucination.
- Stronger models generate more creative, yet more hallucinatory, responses: Larger LLMs tend to balance this tradeoff better at intermediate layers.
- Final layer decoding isn’t always optimal: Selecting outputs from earlier layers can yield a better balance between diversity and factuality.
- Optimal layers are model-dependent but consistent across tasks: Our results generalize across different LLM architectures and datasets.
Access models: to observe a speedup from early exit, you need LLMs that have been continually pretrained with the LayerSkip recipe. We conduct experiments using four such open-weight checkpoints available on HuggingFace:
- LLaMA 2-7B
- LLaMA 2-13B
- LLaMA 3.2-1B
- LLaMA 3-8B
Our dataset sources include TriviaQA and Natural Questions (NQ), ensuring a diverse benchmark for creativity and hallucination evaluation.
We sincerely thank the following authors; HCL builds on their excellent open-source projects and ideas.
LayerSkip: https://github.com/facebookresearch/LayerSkip