Large-Language-Models-Evaluation-Benchmarks-Collection-

Welcome to the Large Language Model Evaluation Benchmarks repository! I created this repo to collect the benchmarks that LLM developers use to evaluate their models' performance. It serves as a centralized resource for researchers, developers, and enthusiasts interested in evaluating large language models (LLMs). Enthusiasts can learn what big companies use to evaluate their models, while experts can draw insights on how to better evaluate new LLMs, compare how evaluation differs between models, and more.

Table of Contents

- Introduction
- Benchmarks
- Contributing
- License

Introduction

Large language models have revolutionized natural language processing, demonstrating impressive capabilities in text generation, sentiment analysis, language translation, and more. However, evaluating these models requires robust benchmarks that cover a diverse range of linguistic phenomena and real-world scenarios. This repository compiles such benchmarks, giving researchers and practitioners a resource for assessing and comparing large language models.

Benchmarks

| Capability | Benchmark | Publication | Publication URL | Used By (Tags) |
| --- | --- | --- | --- | --- |
| Factuality | BoolQ | BoolQ: Exploring the surprising difficulty of natural yes/no questions | https://aclanthology.org/N19-1300 | Gemini |
| | NaturalQuestions-Closed | Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics | https://aclanthology.org/Q19-1026 | Gemini |
| | NaturalQuestions-Retrieved | | | Gemini |
| | RealtimeQA | RealTime QA: What’s the answer right now? | https://arxiv.org/abs/2207.13332 | Gemini |
| | TydiQA-noContext and TydiQA-goldP | TydiQA: A benchmark for information-seeking question answering in typologically diverse languages | https://storage.googleapis.com/tydiqa/tydiqa.pdf | Gemini |
| Long context | NarrativeQA | The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics | https://aclanthology.org/Q18-1023 | Gemini |
| | Scrolls-Qasper, Scrolls-Quality | SCROLLS: Standardized CompaRison Over Long Language Sequences | https://aclanthology.org/2022.emnlp-main.823 | Gemini |
| | XLSum (En) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | XLSum (non-English languages) | | | Gemini |
| | One internal benchmark | | | Gemini |
| Math/Science | GSM8k (with CoT) | Training verifiers to solve math word problems | https://arxiv.org/abs/2110.14168 | Gemini |
| | Hendrycks' MATH pass@1 | Measuring mathematical problem solving with the MATH dataset | https://arxiv.org/abs/2103.03874 | Gemini |
| | MMLU | Measuring massive multitask language understanding | https://arxiv.org/abs/2009.03300 | Gemini |
| | Math-StackExchange | | | Gemini |
| | Math-AMC 2022-2023 problems | | | Gemini |
| | Three internal benchmarks | | | Gemini |
| Reasoning | BigBench Hard (with CoT) | Beyond the imitation game: Quantifying and extrapolating the capabilities of language models | https://arxiv.org/abs/2206.04615 | Gemini |
| | BigBench Hard (with CoT) | Challenging BIG-Bench tasks and whether chain-of-thought can solve them | https://arxiv.org/abs/2210.09261 | Gemini |
| | CLRS | The CLRS algorithmic reasoning benchmark | https://arxiv.org/abs/2205.15659 | Gemini |
| | ProofWriter | ProofWriter: Generating implications, proofs, and abductive statements over natural language | https://api.semanticscholar.org/CorpusID:229371222 | Gemini |
| | Reasoning-Fermi problems | How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI | https://arxiv.org/abs/2110.14207 | Gemini |
| | Lambada | The LAMBADA dataset: Word prediction requiring a broad discourse context | https://arxiv.org/abs/1606.06031 | Gemini |
| | HellaSwag | HellaSwag: Can a machine really finish your sentence? | https://arxiv.org/abs/1905.07830 | Gemini |
| | DROP | DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs | https://aclanthology.org/N19-1246 | Gemini |
| Summarization | XL Sum (English) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | XL Sum (non-English languages) | | | Gemini |
| | WikiLingua (non-English languages) | WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization | https://www.aclweb.org/anthology/2020.findings-emnlp.360 | Gemini |
| | XSum | Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization | https://aclanthology.org/D18-1206 | Gemini |
| | WikiLingua (English) | | | Gemini |
| Multilinguality | XLSum (non-English languages) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | WMT22 | Findings of the 2022 conference on machine translation (WMT22) | https://aclanthology.org/2022.wmt-1.1 | Gemini |
| | WMT23 | Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet | https://aclanthology.org/2023.wmt-1.1/ | Gemini |
| | FRMT | FRMT: A benchmark for few-shot region-aware machine translation | https://arxiv.org/abs/2210.00193 | Gemini |
| | WikiLingua (non-English languages) | WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization | https://www.aclweb.org/anthology/2020.findings-emnlp.360 | Gemini |
| | TydiQA (no context) | TydiQA: A benchmark for information-seeking question answering in typologically diverse languages | https://storage.googleapis.com/tydiqa/tydiqa.pdf | Gemini |
| | TydiQA (GoldP) | | | Gemini |
| | MGSM | Language models are multilingual chain-of-thought reasoners | https://arxiv.org/abs/2210.03057 | Gemini |
| | Translated MMLU | Measuring massive multitask language understanding | https://arxiv.org/abs/2009.03300 | Gemini |
| | NTREX | NTREX-128 – news test references for MT evaluation of 128 languages | https://aclanthology.org/2022.sumeval-1.4.pdf | Gemini |
| | FLORES-200 | No language left behind: Scaling human-centered machine translation | https://arxiv.org/abs/2207.04672 | Gemini |
| Image understanding | MMMU | MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI | https://arxiv.org/abs/2311.16502 | Gemini |
| | TextVQA | Towards VQA models that can read | https://arxiv.org/abs/1904.08920 | Gemini |
| | DocVQA | DocVQA: A dataset for VQA on document images | https://arxiv.org/abs/2007.00398 | Gemini |
| | ChartQA | ChartQA: A benchmark for question answering about charts with visual and logical reasoning | https://arxiv.org/abs/2203.10244 | Gemini |
| | InfographicVQA | InfographicVQA | https://arxiv.org/abs/2104.12756 | Gemini |
| | MathVista | MathVista: Evaluating mathematical reasoning of foundation models in visual contexts | https://arxiv.org/abs/2310.02255 | Gemini |
| | AI2D | A diagram is worth a dozen images | https://arxiv.org/abs/1603.07396 | Gemini |
| | VQAv2 | Making the V in VQA matter: Elevating the role of image understanding in visual question answering | https://arxiv.org/abs/1612.00837 | Gemini |
| | XM3600 | Crossmodal-3600: A massively multilingual multimodal evaluation dataset | https://arxiv.org/abs/2205.12522 | Gemini |
| Video understanding | VATEX | VATEX: A large-scale, high-quality multilingual dataset for video-and-language research | https://arxiv.org/abs/1904.03493 | Gemini |
| | YouCook2 | Towards automatic learning of procedures from web instructional videos | https://arxiv.org/abs/1703.09788 | Gemini |
| | NextQA | NExT-QA: Next phase of question-answering to explaining temporal actions | https://arxiv.org/abs/2105.08276 | Gemini |
| | ActivityNet-QA | ActivityNet-QA: A dataset for understanding complex web videos via question answering | https://arxiv.org/abs/1906.02467 | Gemini |
| | Perception Test MCQA | Perception Test: A diagnostic benchmark for multimodal video models | https://arxiv.org/abs/2305.13786 | Gemini |
| Audio | FLEURS | FLEURS: Few-shot learning evaluation of universal representations of speech | https://arxiv.org/abs/2205.12446 | Gemini |
| | VoxPopuli | VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation | https://arxiv.org/abs/2101.00390 | Gemini |
| | Multi-lingual Librispeech | MLS: A large-scale multilingual dataset for speech research | https://arxiv.org/abs/2012.03411 | Gemini |
| | CoVoST 2 | CoVoST 2 and massively multilingual speech-to-text translation | https://arxiv.org/abs/2007.10310 | Gemini |

Contributing

Contributions to this repository are highly encouraged! If you know of any benchmarks that are not listed here or have suggestions for improvements, please feel free to submit a pull request. I would especially appreciate help finding new benchmarks so that this collection stays as up to date as possible. Together, we can make this repository a comprehensive resource for the evaluation of large language models.
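
Before opening a pull request, it may help to check whether a benchmark is already listed under a slightly different spelling (for example "XLSum" vs. "XL Sum"). The following is only a minimal sketch of such a check, not tooling that ships with this repository: the `BenchmarkEntry` class, the `normalize`, `is_listed`, and `format_row` helpers, the pipe-delimited row layout, and the candidate name "ExampleBench" are all hypothetical and chosen purely for illustration. The five fields simply mirror the columns of the Benchmarks table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkEntry:
    """One row of the benchmark table (hypothetical field names)."""
    capability: str
    benchmark: str
    publication: str = ""
    url: str = ""
    used_by: str = ""


def normalize(name: str) -> str:
    """Lowercase and drop spaces, hyphens, and underscores so that
    'XL-Sum', 'XLSum', and 'XL Sum' compare as equal."""
    return "".join(ch for ch in name.lower() if ch not in " -_")


def is_listed(entries, name: str) -> bool:
    """Return True if a benchmark with an equivalent name is already present."""
    target = normalize(name)
    return any(normalize(e.benchmark) == target for e in entries)


def format_row(entry: BenchmarkEntry) -> str:
    """Render an entry as a pipe-delimited row in the table's column order
    (the pipe layout is an assumption about how the table is stored)."""
    cells = [entry.capability, entry.benchmark, entry.publication,
             entry.url, entry.used_by]
    return "| " + " | ".join(cells) + " |"


# Two rows copied from the table above, used as sample data.
existing = [
    BenchmarkEntry(
        "Factuality", "BoolQ",
        "BoolQ: Exploring the surprising difficulty of natural yes/no questions",
        "https://aclanthology.org/N19-1300", "Gemini"),
    BenchmarkEntry("Summarization", "XL Sum (English)", used_by="Gemini"),
]

# "ExampleBench" is a made-up candidate name, not a real benchmark.
for candidate in ("boolq", "ExampleBench"):
    status = "already listed" if is_listed(existing, candidate) else "new"
    print(f"{candidate}: {status}")
```

The normalization is deliberately simple: it ignores case and separators but does not strip qualifiers such as "(English)", so variants that differ in those still need a manual look.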

License

This repository is licensed under the MIT License.