This folder contains notebooks used to evaluate popular hallucination detection models on various RAG (Context, Question, LLM Response) datasets.
The datasets used in the benchmark include ELI5 and the HaluBench suite (see the notebooks below).
The following table lists the notebooks and the models they evaluate:
| Notebook | Description |
|---|---|
| Patronus Lynx | Evaluates the Patronus Lynx 70B model |
| Vectara HHEM | Evaluates Vectara's HHEM v2.1 model |
| Prometheus 2 | Evaluates the Prometheus 2 8x7B model |
| LLM as judge and TLM | Evaluates LLM-as-judge and the Trustworthy Language Model (TLM) on the ELI5 dataset |
| LLM as judge and TLM | Evaluates LLM-as-judge and the Trustworthy Language Model (TLM) on the HaluBench datasets |
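Each notebook follows the same pattern: run a detector over (Context, Question, LLM Response) triples and compare its scores against ground-truth hallucination labels. The sketch below illustrates that loop under stated assumptions; `toy_detector`, the two examples, and the manual AUROC helper are illustrative stand-ins, not code from the notebooks, and a real detector (Lynx, HHEM, Prometheus 2, TLM) would replace the lexical-overlap placeholder.

```python
# Hedged sketch of the benchmark loop, assuming each detector exposes a
# score(context, question, response) -> float in [0, 1], higher = more grounded.

def toy_detector(context: str, question: str, response: str) -> float:
    # Placeholder detector: lexical overlap between response and context.
    # Real detectors in this benchmark are model-based.
    resp = response.split()
    return len(set(resp) & set(context.split())) / max(len(resp), 1)

def auroc(labels, scores):
    # Probability that a randomly chosen grounded example outscores a
    # randomly chosen hallucinated one (ties count half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)

# (context, question, response, label); label 1 = grounded, 0 = hallucinated
examples = [
    ("Paris is the capital of France.",
     "What is the capital of France?",
     "Paris is the capital of France.", 1),
    ("Paris is the capital of France.",
     "What is the capital of France?",
     "The capital is Berlin.", 0),
]

scores = [toy_detector(c, q, r) for c, q, r, _ in examples]
labels = [y for *_, y in examples]
print(f"AUROC: {auroc(labels, scores):.3f}")  # → AUROC: 1.000
```

Swapping in a real detector only changes the scoring function; the label comparison and AUROC computation stay the same across notebooks.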