
[rag evals][2/n] add more braintrust scoring fns for RAG eval #666

Merged

yanxi0830 merged 7 commits into rag_scoring_fn_1 from rag_scoring_fn_2 on Jan 2, 2025

Conversation

@yanxi0830 (Contributor) commented Dec 20, 2024

What does this PR do?

  • Add more braintrust scoring functions for RAG eval (a usage sketch follows below)
  • Add tests for evaluating against context
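
The new scoring functions wrap Braintrust's autoevals metrics. Below is a minimal sketch of invoking such context-aware metrics directly, assuming the autoevals package and its RAGAS-style scorers; exactly which metrics this PR registers as llama-stack scoring fns is an assumption, and the LLM-backed scorers need a judge model configured (e.g. via OPENAI_API_KEY):

```python
# Sketch only: the metric classes come from Braintrust's autoevals package;
# which of them this PR exposes as llama-stack scoring fns is an assumption.
from autoevals.ragas import ContextRelevancy, Faithfulness

row = {
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
    "context": "France is a country in Europe. Its capital city is Paris.",
}

# Each autoevals scorer is a callable returning a Score whose `score` field
# is in [0, 1]; `context` is the retrieved text being judged.
relevancy = ContextRelevancy()(
    row["output"], None, input=row["input"], context=row["context"]
)
faithfulness = Faithfulness()(
    row["output"], None, input=row["input"], context=row["context"]
)
print(relevancy.score, faithfulness.score)
```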

Test Plan

pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py
(test run screenshot omitted)

Example Output

(example output screenshots omitted)

Sources

Please link relevant resources if necessary.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Ran pre-commit to handle lint / formatting issues.
  • Read the contributor guideline,
    Pull Request section?
  • Updated relevant documentation.
  • Wrote necessary unit or integration tests.

@facebook-github-bot added the CLA Signed label Dec 20, 2024
@yanxi0830 changed the base branch from main to rag_scoring_fn_1 December 20, 2024 00:57
@yanxi0830 marked this pull request as ready for review December 20, 2024 01:11
…c eval pipeline (#668)

# What does this PR do?

- This PR adds the ability for users to evaluate retrieval and
generation separately, as well as end to end, by passing an AgentConfig
to the /eval API (a call sketch follows below)
- The memory_retrieval context will be stored in the "context" column,
which is used by scoring functions that can evaluate the retrieved
context.
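
A minimal call sketch of this flow; all identifiers below (client class, eval method, task/candidate fields, scoring fn ids) are illustrative assumptions about the API shape, not verbatim llama-stack signatures:

```python
# Sketch only: names are assumptions, not the exact llama-stack client API.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# Hypothetical agent config with a memory (retrieval) tool; the retrieval
# step's output is what lands in the "context" column for scoring.
agent_config = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "instructions": "Answer using only the retrieved context.",
    "tools": [{"type": "memory"}],
}

response = client.eval.evaluate_rows(
    task_id="rag-eval",
    input_rows=[{"input_query": "What is the capital of France?"}],
    scoring_functions=[
        "braintrust::context-relevancy",  # judges retrieval via "context"
        "braintrust::factuality",         # judges the generated answer
    ],
    task_config={
        "type": "app",
        "eval_candidate": {"type": "agent", "config": agent_config},
    },
)
```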

## Test Plan
- E2E Test RAG Agent Notebook:
https://gist.github.com/yanxi0830/0377594d29958f9b6f9317ab049fa836

<img width="758" alt="image"
src="https://github.com/user-attachments/assets/58ed9db7-f07b-400a-931b-923b0d612902"
/>

<img width="682" alt="image"
src="https://github.com/user-attachments/assets/9ebd7fbd-2a6d-4c93-92fa-a9456fae2378"
/>



## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
@yanxi0830 merged commit 2da455f into rag_scoring_fn_1 Jan 2, 2025
2 checks passed
@yanxi0830 deleted the rag_scoring_fn_2 branch January 2, 2025 19:19
yanxi0830 added a commit that referenced this pull request Jan 2, 2025
…agentic eval pipeline (#664)

# What does this PR do?

- See #666 & #668

- Refactor BaseScoringFn to be just a minimal interface; add a new
RegistrableBaseScoring (a sketch of the split follows below)
- Refactor the data schema check
- To evaluate the retrieval component of RAG separately, some scoring
functions additionally require a "context" column
- Refactor braintrust eval (more scoring fns added & tested in a
follow-up PR)
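
A minimal sketch of the interface split described above; method names and types are illustrative of the BaseScoringFn / RegistrableBaseScoring split, not the exact llama-stack definitions:

```python
# Sketch only: signatures are assumptions, not verbatim llama-stack code.
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseScoringFn(ABC):
    """Minimal interface: score one row (a dict of columns such as
    "input_query", "generated_answer", and now optionally "context")."""

    @abstractmethod
    async def score_row(
        self,
        input_row: Dict[str, Any],
        scoring_fn_identifier: Optional[str] = None,
    ) -> Dict[str, Any]: ...


class RegistrableBaseScoring(BaseScoringFn):
    """Adds a registry of scoring fn defs on top of the minimal interface,
    so a provider can register and look up fns (e.g. context-based ones)."""

    def __init__(self) -> None:
        self.supported_fn_defs_registry: Dict[str, Any] = {}

    def register_scoring_fn_def(self, fn_def: Any) -> None:
        self.supported_fn_defs_registry[fn_def.identifier] = fn_def

    async def score(
        self,
        input_rows: List[Dict[str, Any]],
        scoring_fn_identifier: Optional[str] = None,
    ) -> List[Dict[str, Any]]:
        # Score each row with the minimal per-row interface.
        return [await self.score_row(r, scoring_fn_identifier) for r in input_rows]
```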

## Test Plan

```
pytest -v -s -m llm_as_judge_scoring_together_inference scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py
```

<img width="847" alt="image"
src="https://github.com/user-attachments/assets/d099cb2d-6f9c-4bdf-9d0d-f388cf758c0f"
/>

```
pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio eval/test_eval.py
```
<img width="850" alt="image"
src="https://github.com/user-attachments/assets/dce28fc3-0493-4d34-820a-567260873cc8"
/>



## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.