feat: use chunk data in NIAH and QA evals #1176

Merged
merged 11 commits into from
Oct 7, 2024

Conversation

jalling97
Contributor

@jalling97 jalling97 commented Oct 1, 2024

Description

This PR adds chunk data to the Needle in a Haystack (NIAH) and QA evaluations so that they better evaluate the retrieval stage of RAG.

  • Bases the NIAH retrieval metric on the actual chunk data rather than on the annotations alone
  • Adds a chunk_rank metric to NIAH evals to evaluate how well the retrieved chunks are ranked
  • Adds 2 new LLM-as-judge metrics to QA evals: Contextual Relevancy and Faithfulness
  • Updates package versions
  • Fixes a bug that occurred when NIAH evals were run without padding data
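
To make the first two bullets concrete, here is a minimal sketch of what a chunk-based retrieval check and a chunk rank score could look like. This is not the code from this PR: the function names, the substring match, and the linear rank normalization are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def needle_rank(needle: str, chunks: Sequence[str]) -> Optional[int]:
    """Return the 0-based position of the first chunk containing the needle, or None."""
    for position, chunk in enumerate(chunks):
        if needle in chunk:
            return position
    return None


def retrieval_score(needle: str, chunks: Sequence[str]) -> float:
    """Score 1.0 if any retrieved chunk contains the needle text, else 0.0."""
    return 1.0 if needle_rank(needle, chunks) is not None else 0.0


def chunk_rank_score(needle: str, chunks: Sequence[str]) -> float:
    """Hypothetical normalization: 1.0 when the needle is in the top-ranked chunk,
    decreasing linearly with its position; 0.0 if the needle was not retrieved."""
    position = needle_rank(needle, chunks)
    if position is None:
        return 0.0
    return 1.0 - position / len(chunks)


if __name__ == "__main__":
    # Chunks are assumed to be ordered from highest to lowest retrieval rank.
    retrieved = ["padding text", "the secret code is 42", "more padding"]
    print(retrieval_score("secret code", retrieved))
    print(chunk_rank_score("secret code", retrieved))
```

The point of the sketch is the input: both scores read the retrieved chunk text itself, so they do not depend on which annotations the response happens to cite.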

BREAKING CHANGES

The NIAH retrieval metric is now computed differently, so NIAH retrieval values from before this change should not be compared with the new values.

CHANGES

Replaces the current annotation-based proxy measure for NIAH retrieval with one computed directly from the chunk data.

Related Issue

Relates to #1067

Checklist before merging


@jalling97 jalling97 linked an issue Oct 1, 2024 that may be closed by this pull request

netlify bot commented Oct 1, 2024

Deploy Preview for leapfrogai-docs canceled.

Name Link
🔨 Latest commit 855cec5
🔍 Latest deploy log https://app.netlify.com/sites/leapfrogai-docs/deploys/67000ac85a9ad60008a999e9

@jalling97 jalling97 self-assigned this Oct 2, 2024
@jalling97
Contributor Author

Evaluation results from most recent run using new metrics:

Final Results:
INFO:root:Average Needle in a Haystack (NIAH) Retrieval: 1.0
INFO:root:Average Needle in a Haystack (NIAH) Response: 1.0
INFO:root:Average Needle in a Haystack (NIAH) Chunk Rank: 0.9600000000000001
INFO:root:Average Correctness (GEval): 0.82
INFO:root:Average Answer Relevancy: 0.9583333333333335
INFO:root:Average Contextual Relevancy: 0.504
INFO:root:Average Faithfulness: 0.9278174603174603
INFO:root:Average Annotation Relevancy: 0.9359999999999999
INFO:root:MMLU: 0.696969696969697
INFO:root:HumanEval: 0.95
INFO:root:Eval Execution Runtime (seconds): 1655.2433378696442
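
For readers unfamiliar with how averages such as Contextual Relevancy and Faithfulness are typically produced, the sketch below shows one common ratio-style scheme. It is an assumption, not something this PR states: each sample is taken to receive a list of pass/fail verdicts from the judge model, the per-sample score is the fraction that pass, and the reported figure is the mean over samples. The names `ratio_score` and `average_score` are hypothetical.

```python
from typing import Sequence


def ratio_score(verdicts: Sequence[bool]) -> float:
    """Fraction of judge verdicts that passed; 0.0 when there is nothing to judge."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def average_score(per_sample_verdicts: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-sample ratio scores across an evaluation run."""
    if not per_sample_verdicts:
        return 0.0
    scores = [ratio_score(verdicts) for verdicts in per_sample_verdicts]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative verdicts only; these are not results from this PR.
    run = [
        [True, True, False, False],  # half of the judged statements passed
        [True, True, True, True],    # every judged statement passed
    ]
    print(average_score(run))
```

Under a scheme like this, a low Contextual Relevancy alongside a high Faithfulness would mean that much of the retrieved context is judged off-topic even though the answers stay consistent with it.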

@jalling97 jalling97 added chore enhancement New feature or request and removed chore labels Oct 3, 2024
@jalling97 jalling97 changed the title chore: use chunk data in NIAH and QA evals feat: use chunk data in NIAH and QA evals Oct 3, 2024
@jalling97 jalling97 marked this pull request as ready for review October 4, 2024 15:49
@jalling97 jalling97 requested a review from a team as a code owner October 4, 2024 15:49
@jalling97 jalling97 merged commit ad697cd into main Oct 7, 2024
37 of 39 checks passed
@jalling97 jalling97 deleted the 1067-chore-use-chunk-data-in-niah-and-qa-evals branch October 7, 2024 19:30