Welcome to the Large Language Model Evaluation Benchmarks repository! This repo collects the benchmarks that LLM developers use to evaluate model performance, and it serves as a centralized resource for researchers, developers, and enthusiasts. Enthusiasts can learn which benchmarks major companies use to evaluate their models, experts can draw insights into how new LLMs (large language models) could be evaluated better, and anyone can compare the evaluation choices made across models.
Large language models have revolutionized natural language processing, demonstrating impressive capabilities in text generation, sentiment analysis, machine translation, and more. Evaluating these models, however, requires robust benchmarks that cover a diverse range of linguistic phenomena and real-world scenarios. This repository compiles such benchmarks, providing researchers and practitioners with a single reference for assessing and comparing large language models.
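To make the table below concrete, here is a minimal sketch of how one of the listed benchmarks might be scored. It assumes the Hugging Face `datasets` library for loading data; `model_answer` is a hypothetical placeholder for whatever model you are evaluating, not part of any real API.

```python
# Minimal sketch: scoring a model on BoolQ (yes/no question answering).
# Assumes the Hugging Face `datasets` library; `model_answer` is a
# hypothetical placeholder, not a real API.
from datasets import load_dataset

def model_answer(question: str, passage: str) -> bool:
    # Replace this stub with a real model call that returns True/False.
    return True

dataset = load_dataset("boolq", split="validation")

correct = 0
for example in dataset:
    prediction = model_answer(example["question"], example["passage"])
    correct += int(prediction == example["answer"])

print(f"BoolQ accuracy: {correct / len(dataset):.3f}")
```

The same accuracy loop generalizes to most of the question-answering benchmarks below; only the dataset fields and the answer-matching rule change.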
| Capability | Benchmarks | Publication | Publication URL | Used By (Tags) |
| --- | --- | --- | --- | --- |
| Factuality | BoolQ | BoolQ: Exploring the surprising difficulty of natural yes/no questions | https://aclanthology.org/N19-1300 | Gemini |
| | NaturalQuestions-Closed | Natural Questions: A benchmark for question answering research | https://aclanthology.org/Q19-1026 | Gemini |
| | NaturalQuestions-Retrieved | | | Gemini |
| | RealtimeQA | RealTime QA: What's the answer right now? | https://arxiv.org/abs/2207.13332 | Gemini |
| | TydiQA-noContext and TydiQA-goldP | TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages | https://storage.googleapis.com/tydiqa/tydiqa.pdf | Gemini |
| Long context | NarrativeQA | The NarrativeQA reading comprehension challenge | https://aclanthology.org/Q18-1023 | Gemini |
| | Scrolls-Qasper, Scrolls-Quality | SCROLLS: Standardized CompaRison Over Long Language Sequences | https://aclanthology.org/2022.emnlp-main.823 | Gemini |
| | XLSum (English) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | XLSum (non-English languages) | | | Gemini |
| | One internal benchmark | | | Gemini |
| Math/Science | GSM8K (with CoT) | Training verifiers to solve math word problems | https://arxiv.org/abs/2110.14168 | Gemini |
| | Hendrycks MATH (pass@1) | Measuring mathematical problem solving with the MATH dataset | https://arxiv.org/abs/2103.03874 | Gemini |
| | MMLU | Measuring massive multitask language understanding | https://arxiv.org/abs/2009.03300 | Gemini |
| | Math-StackExchange | | | Gemini |
| | Math-AMC 2022-2023 problems | | | Gemini |
| | Three internal benchmarks | | | Gemini |
| Reasoning | BIG-Bench Hard (with CoT) | Beyond the imitation game: Quantifying and extrapolating the capabilities of language models | https://arxiv.org/abs/2206.04615 | Gemini |
| | | Challenging BIG-Bench tasks and whether chain-of-thought can solve them | https://arxiv.org/abs/2210.09261 | Gemini |
| | CLRS | The CLRS algorithmic reasoning benchmark | https://arxiv.org/abs/2205.15659 | Gemini |
| | ProofWriter | ProofWriter: Generating implications, proofs, and abductive statements over natural language | https://api.semanticscholar.org/CorpusID:229371222 | Gemini |
| | Reasoning-Fermi problems | How much coffee was consumed during EMNLP 2019? Fermi problems: A new reasoning challenge for AI | https://arxiv.org/abs/2110.14207 | Gemini |
| | LAMBADA | The LAMBADA dataset: Word prediction requiring a broad discourse context | https://arxiv.org/abs/1606.06031 | Gemini |
| | HellaSwag | HellaSwag: Can a machine really finish your sentence? | https://arxiv.org/abs/1905.07830 | Gemini |
| | DROP | DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs | https://aclanthology.org/N19-1246 | Gemini |
| Summarization | XLSum (English) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | XLSum (non-English languages) | | | Gemini |
| | WikiLingua (non-English languages) | WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization | https://www.aclweb.org/anthology/2020.findings-emnlp.360 | Gemini |
| | XSum | Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization | https://aclanthology.org/D18-1206 | Gemini |
| | WikiLingua (English) | | | Gemini |
| Multilinguality | XLSum (non-English languages) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | WMT22 | Findings of the 2022 conference on machine translation (WMT22) | https://aclanthology.org/2022.wmt-1.1 | Gemini |
| | WMT23 | Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet | https://aclanthology.org/2023.wmt-1.1/ | Gemini |
| | FRMT | FRMT: A benchmark for few-shot region-aware machine translation | https://arxiv.org/abs/2210.00193 | Gemini |
| | WikiLingua (non-English languages) | WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization | https://www.aclweb.org/anthology/2020.findings-emnlp.360 | Gemini |
| | TydiQA (no context) | TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages | https://storage.googleapis.com/tydiqa/tydiqa.pdf | Gemini |
| | TydiQA (GoldP) | | | Gemini |
| | MGSM | Language models are multilingual chain-of-thought reasoners | https://arxiv.org/abs/2210.03057 | Gemini |
| | Translated MMLU | Measuring massive multitask language understanding | https://arxiv.org/abs/2009.03300 | Gemini |
| | NTREX | NTREX-128 – News test references for MT evaluation of 128 languages | https://aclanthology.org/2022.sumeval-1.4.pdf | Gemini |
| | FLORES-200 | No language left behind: Scaling human-centered machine translation | https://arxiv.org/abs/2207.04672 | Gemini |
| Image understanding | MMMU | MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI | https://arxiv.org/abs/2311.16502 | Gemini |
| | TextVQA | Towards VQA models that can read | https://arxiv.org/abs/1904.08920 | Gemini |
| | DocVQA | DocVQA: A dataset for VQA on document images | https://arxiv.org/abs/2007.00398 | Gemini |
| | ChartQA | ChartQA: A benchmark for question answering about charts with visual and logical reasoning | https://arxiv.org/abs/2203.10244 | Gemini |
| | InfographicVQA | InfographicVQA | https://arxiv.org/abs/2104.12756 | Gemini |
| | MathVista | MathVista: Evaluating mathematical reasoning of foundation models in visual contexts | https://arxiv.org/abs/2310.02255 | Gemini |
| | AI2D | A diagram is worth a dozen images | https://arxiv.org/abs/1603.07396 | Gemini |
| | VQAv2 | Making the V in VQA matter: Elevating the role of image understanding in visual question answering | https://arxiv.org/abs/1612.00837 | Gemini |
| | XM3600 | Crossmodal-3600: A massively multilingual multimodal evaluation dataset | https://arxiv.org/abs/2205.12522 | Gemini |
| Video understanding | VATEX | VATEX: A large-scale, high-quality multilingual dataset for video-and-language research | https://arxiv.org/abs/1904.03493 | Gemini |
| | YouCook2 | Towards automatic learning of procedures from web instructional videos | https://arxiv.org/abs/1703.09788 | Gemini |
| | NExT-QA | NExT-QA: Next phase of question-answering to explaining temporal actions | https://arxiv.org/abs/2105.08276 | Gemini |
| | ActivityNet-QA | ActivityNet-QA: A dataset for understanding complex web videos via question answering | https://arxiv.org/abs/1906.02467 | Gemini |
| | Perception Test MCQA | Perception Test: A diagnostic benchmark for multimodal video models | https://arxiv.org/abs/2305.13786 | Gemini |
| Audio | FLEURS | FLEURS: Few-shot learning evaluation of universal representations of speech | https://arxiv.org/abs/2205.12446 | Gemini |
| | VoxPopuli | VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation | https://arxiv.org/abs/2101.00390 | Gemini |
| | Multilingual LibriSpeech | MLS: A large-scale multilingual dataset for speech research | https://arxiv.org/abs/2012.03411 | Gemini |
| | CoVoST 2 | CoVoST 2 and massively multilingual speech-to-text translation | https://arxiv.org/abs/2007.10310 | Gemini |
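A note on metrics: some entries above are reported as pass@1 (e.g., Hendrycks MATH), the fraction of problems solved by the model's first sampled answer. For reference, the standard unbiased pass@k estimator popularized by code-generation evaluations can be computed as in the sketch below; this is generic reference code, not taken from any of the cited papers.

```python
# Minimal sketch of the standard unbiased pass@k estimator:
#   pass@k = 1 - C(n - c, k) / C(n, k)
# where n is the number of samples drawn per problem and c is the
# number of those samples that were correct.

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n drawn) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i  # numerically stable running product
    return 1.0 - prod

# With k=1 this reduces to the plain fraction of correct samples, c/n:
print(pass_at_k(n=5, c=2, k=1))  # ≈ 0.4
```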
Contributions to this repository are highly encouraged! If you know of benchmarks that are not listed here, or have suggestions for improvements, please feel free to submit a pull request. Your help in keeping this collection up to date is greatly appreciated; together, we can make it a comprehensive resource for evaluating large language models.
This repository is licensed under the MIT License.