This folder contains notebooks used to evaluate popular hallucination detection models on various RAG (Context, Question, LLM Response) datasets.
The datasets used in the benchmark include ELI5 and the HaluBench suite (see the notebooks below).
The following table lists the notebooks and the models they evaluate:
| Notebook | Description |
|---|---|
| Patronus Lynx | Evaluates the Patronus Lynx 70B model |
| Vectara HHEM | Evaluates Vectara's HHEM v2.1 model |
| Prometheus 2 | Evaluates the Prometheus 2 8x7B model |
| LLM as judge and TLM | Evaluates LLM-as-judge and the Trustworthy Language Model (TLM) on the ELI5 dataset |
| LLM as judge and TLM | Evaluates LLM-as-judge and the Trustworthy Language Model (TLM) on the HaluBench datasets |
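Each notebook follows the same pattern: run a detector over (Context, Question, LLM Response) triples and compare its scores against ground-truth hallucination labels. The sketch below illustrates that loop under stated assumptions; `toy_detector`, the two examples, and the manual AUROC helper are illustrative stand-ins, not code from the notebooks, and a real detector (Lynx, HHEM, Prometheus 2, TLM) would replace the lexical-overlap placeholder.

```python
# Hedged sketch of the benchmark loop, assuming each detector exposes a
# score(context, question, response) -> float in [0, 1], higher = more grounded.

def toy_detector(context: str, question: str, response: str) -> float:
    # Placeholder detector: lexical overlap between response and context.
    # Real detectors in this benchmark are model-based.
    resp = response.split()
    return len(set(resp) & set(context.split())) / max(len(resp), 1)

def auroc(labels, scores):
    # Probability that a randomly chosen grounded example outscores a
    # randomly chosen hallucinated one (ties count half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)

# (context, question, response, label); label 1 = grounded, 0 = hallucinated
examples = [
    ("Paris is the capital of France.",
     "What is the capital of France?",
     "Paris is the capital of France.", 1),
    ("Paris is the capital of France.",
     "What is the capital of France?",
     "The capital is Berlin.", 0),
]

scores = [toy_detector(c, q, r) for c, q, r, _ in examples]
labels = [y for *_, y in examples]
print(f"AUROC: {auroc(labels, scores):.3f}")  # → AUROC: 1.000
```

Swapping in a real detector only changes the scoring function; the label comparison and AUROC computation stay the same across notebooks.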