Welcome to the Large Language Model Evaluation Benchmarks repository! This repo collects the benchmarks that LLM developers use to evaluate model performance, and it serves as a centralized resource for researchers, developers, and enthusiasts. Enthusiasts can learn which benchmarks major companies use to evaluate their models, experts can draw insights into how new LLMs (large language models) could be evaluated better, and anyone can compare the evaluation choices made across models.
Large language models have revolutionized natural language processing, demonstrating impressive capabilities in text generation, sentiment analysis, machine translation, and more. Evaluating these models, however, requires robust benchmarks that cover a diverse range of linguistic phenomena and real-world scenarios. This repository compiles such benchmarks, providing researchers and practitioners with a single reference for assessing and comparing large language models.
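To make the table below concrete, here is a minimal sketch of how one of the listed benchmarks might be scored. It assumes the Hugging Face `datasets` library for loading data; `model_answer` is a hypothetical placeholder for whatever model you are evaluating, not part of any real API.

```python
# Minimal sketch: scoring a model on BoolQ (yes/no question answering).
# Assumes the Hugging Face `datasets` library; `model_answer` is a
# hypothetical placeholder, not a real API.
from datasets import load_dataset

def model_answer(question: str, passage: str) -> bool:
    # Replace this stub with a real model call that returns True/False.
    return True

dataset = load_dataset("boolq", split="validation")

correct = 0
for example in dataset:
    prediction = model_answer(example["question"], example["passage"])
    correct += int(prediction == example["answer"])

print(f"BoolQ accuracy: {correct / len(dataset):.3f}")
```

The same accuracy loop generalizes to most of the question-answering benchmarks below; only the dataset fields and the answer-matching rule change.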
| Capability | Benchmarks | Publication | Publication URL | Used By (Tags) |
| --- | --- | --- | --- | --- |
| Factuality | BoolQ | BoolQ: Exploring the surprising difficulty of natural yes/no questions | https://aclanthology.org/N19-1300 | Gemini |
| | NaturalQuestions-Closed | Natural Questions: A benchmark for question answering research | https://aclanthology.org/Q19-1026 | Gemini |
| | NaturalQuestions-Retrieved | | | Gemini |
| | RealtimeQA | RealTime QA: What's the answer right now? | https://arxiv.org/abs/2207.13332 | Gemini |
| | TydiQA-noContext and TydiQA-goldP | TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages | https://storage.googleapis.com/tydiqa/tydiqa.pdf | Gemini |
| Long context | NarrativeQA | The NarrativeQA reading comprehension challenge | https://aclanthology.org/Q18-1023 | Gemini |
| | Scrolls-Qasper, Scrolls-Quality | SCROLLS: Standardized CompaRison Over Long Language Sequences | https://aclanthology.org/2022.emnlp-main.823 | Gemini |
| | XLSum (English) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | XLSum (non-English languages) | | | Gemini |
| | One internal benchmark | | | Gemini |
| Math/Science | GSM8K (with CoT) | Training verifiers to solve math word problems | https://arxiv.org/abs/2110.14168 | Gemini |
| | Hendrycks MATH (pass@1) | Measuring mathematical problem solving with the MATH dataset | https://arxiv.org/abs/2103.03874 | Gemini |
| | MMLU | Measuring massive multitask language understanding | https://arxiv.org/abs/2009.03300 | Gemini |
| | Math-StackExchange | | | Gemini |
| | Math-AMC 2022-2023 problems | | | Gemini |
| | Three internal benchmarks | | | Gemini |
| Reasoning | BIG-Bench Hard (with CoT) | Beyond the imitation game: Quantifying and extrapolating the capabilities of language models | https://arxiv.org/abs/2206.04615 | Gemini |
| | | Challenging BIG-Bench tasks and whether chain-of-thought can solve them | https://arxiv.org/abs/2210.09261 | Gemini |
| | CLRS | The CLRS algorithmic reasoning benchmark | https://arxiv.org/abs/2205.15659 | Gemini |
| | ProofWriter | ProofWriter: Generating implications, proofs, and abductive statements over natural language | https://api.semanticscholar.org/CorpusID:229371222 | Gemini |
| | Reasoning-Fermi problems | How much coffee was consumed during EMNLP 2019? Fermi problems: A new reasoning challenge for AI | https://arxiv.org/abs/2110.14207 | Gemini |
| | LAMBADA | The LAMBADA dataset: Word prediction requiring a broad discourse context | https://arxiv.org/abs/1606.06031 | Gemini |
| | HellaSwag | HellaSwag: Can a machine really finish your sentence? | https://arxiv.org/abs/1905.07830 | Gemini |
| | DROP | DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs | https://aclanthology.org/N19-1246 | Gemini |
| Summarization | XLSum (English) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | XLSum (non-English languages) | | | Gemini |
| | WikiLingua (non-English languages) | WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization | https://www.aclweb.org/anthology/2020.findings-emnlp.360 | Gemini |
| | XSum | Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization | https://aclanthology.org/D18-1206 | Gemini |
| | WikiLingua (English) | | | Gemini |
| Multilinguality | XLSum (non-English languages) | XL-Sum: Large-scale multilingual abstractive summarization for 44 languages | https://aclanthology.org/2021.findings-acl.413 | Gemini |
| | WMT22 | Findings of the 2022 conference on machine translation (WMT22) | https://aclanthology.org/2022.wmt-1.1 | Gemini |
| | WMT23 | Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet | https://aclanthology.org/2023.wmt-1.1/ | Gemini |
| | FRMT | FRMT: A benchmark for few-shot region-aware machine translation | https://arxiv.org/abs/2210.00193 | Gemini |
| | WikiLingua (non-English languages) | WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization | https://www.aclweb.org/anthology/2020.findings-emnlp.360 | Gemini |
| | TydiQA (no context) | TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages | https://storage.googleapis.com/tydiqa/tydiqa.pdf | Gemini |
| | TydiQA (GoldP) | | | Gemini |
| | MGSM | Language models are multilingual chain-of-thought reasoners | https://arxiv.org/abs/2210.03057 | Gemini |
| | Translated MMLU | Measuring massive multitask language understanding | https://arxiv.org/abs/2009.03300 | Gemini |
| | NTREX | NTREX-128 – News test references for MT evaluation of 128 languages | https://aclanthology.org/2022.sumeval-1.4.pdf | Gemini |
| | FLORES-200 | No language left behind: Scaling human-centered machine translation | https://arxiv.org/abs/2207.04672 | Gemini |
| Image understanding | MMMU | MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI | https://arxiv.org/abs/2311.16502 | Gemini |
| | TextVQA | Towards VQA models that can read | https://arxiv.org/abs/1904.08920 | Gemini |
| | DocVQA | DocVQA: A dataset for VQA on document images | https://arxiv.org/abs/2007.00398 | Gemini |
| | ChartQA | ChartQA: A benchmark for question answering about charts with visual and logical reasoning | https://arxiv.org/abs/2203.10244 | Gemini |
| | InfographicVQA | InfographicVQA | https://arxiv.org/abs/2104.12756 | Gemini |
| | MathVista | MathVista: Evaluating mathematical reasoning of foundation models in visual contexts | https://arxiv.org/abs/2310.02255 | Gemini |
| | AI2D | A diagram is worth a dozen images | https://arxiv.org/abs/1603.07396 | Gemini |
| | VQAv2 | Making the V in VQA matter: Elevating the role of image understanding in visual question answering | https://arxiv.org/abs/1612.00837 | Gemini |
| | XM3600 | Crossmodal-3600: A massively multilingual multimodal evaluation dataset | https://arxiv.org/abs/2205.12522 | Gemini |
| Video understanding | VATEX | VATEX: A large-scale, high-quality multilingual dataset for video-and-language research | https://arxiv.org/abs/1904.03493 | Gemini |
| | YouCook2 | Towards automatic learning of procedures from web instructional videos | https://arxiv.org/abs/1703.09788 | Gemini |
| | NExT-QA | NExT-QA: Next phase of question-answering to explaining temporal actions | https://arxiv.org/abs/2105.08276 | Gemini |
| | ActivityNet-QA | ActivityNet-QA: A dataset for understanding complex web videos via question answering | https://arxiv.org/abs/1906.02467 | Gemini |
| | Perception Test MCQA | Perception Test: A diagnostic benchmark for multimodal video models | https://arxiv.org/abs/2305.13786 | Gemini |
| Audio | FLEURS | FLEURS: Few-shot learning evaluation of universal representations of speech | https://arxiv.org/abs/2205.12446 | Gemini |
| | VoxPopuli | VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation | https://arxiv.org/abs/2101.00390 | Gemini |
| | Multilingual LibriSpeech | MLS: A large-scale multilingual dataset for speech research | https://arxiv.org/abs/2012.03411 | Gemini |
| | CoVoST 2 | CoVoST 2 and massively multilingual speech-to-text translation | https://arxiv.org/abs/2007.10310 | Gemini |
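A note on metrics: some entries above are reported as pass@1 (e.g., Hendrycks MATH), the fraction of problems solved by the model's first sampled answer. For reference, the standard unbiased pass@k estimator popularized by code-generation evaluations can be computed as in the sketch below; this is generic reference code, not taken from any of the cited papers.

```python
# Minimal sketch of the standard unbiased pass@k estimator:
#   pass@k = 1 - C(n - c, k) / C(n, k)
# where n is the number of samples drawn per problem and c is the
# number of those samples that were correct.

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n drawn) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i  # numerically stable running product
    return 1.0 - prod

# With k=1 this reduces to the plain fraction of correct samples, c/n:
print(pass_at_k(n=5, c=2, k=1))  # ≈ 0.4
```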
Contributions to this repository are highly encouraged! If you know of benchmarks that are not listed here, or have suggestions for improvements, please feel free to submit a pull request. Your help in keeping this collection up to date is greatly appreciated; together, we can make it a comprehensive resource for evaluating large language models.
This repository is licensed under the MIT License.