
Evals API MVP #235

Open · wants to merge 28 commits into main

Conversation

@yanxi0830 (Contributor) commented Oct 10, 2024

DevX Flow

Step 1. Register Eval Dataset
python -m llama_stack.apis.datasets.client
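For context, a registration request against the new /datasets/create endpoint (described under High-Level Changes below) might look roughly like the following. This is a minimal sketch: the base URL and the payload field names ("uuid", "dataset", "type", "identifier") are illustrative assumptions, not the confirmed request schema.

import requests

# Hypothetical sketch: register a HuggingFace-hosted eval dataset via the
# /datasets/create endpoint added in this PR. Field names and the server
# address are assumptions for illustration only.
resp = requests.post(
    "http://localhost:5000/datasets/create",
    json={
        "uuid": "mmlu-eval",              # hypothetical dataset identifier
        "dataset": {
            "type": "huggingface",        # custom upload / url / huggingface are supported
            "identifier": "cais/mmlu",    # hypothetical HF dataset path
        },
    },
)
resp.raise_for_status()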
Step 2. Run Eval Scorer
python -m llama_stack.apis.evals.client
  • (benchmark) runs the full preprocess -> generation -> postprocess -> score eval task flow
  • (evaluate score only) runs the scorer only on a prepared eval dataset with the columns expected_answer and generated_answer (see the sketch below)
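As a rough illustration of the score-only mode, a call to the new /evals/run_scorer endpoint could look like this. The request field names and the scorer identifier are assumptions, not the confirmed API shape.

import requests

# Hypothetical sketch: score an already-generated dataset. The dataset is
# assumed to have been registered in Step 1 and to already contain the
# expected_answer and generated_answer columns.
resp = requests.post(
    "http://localhost:5000/evals/run_scorer",
    json={
        "dataset": "mmlu-eval-with-generations",   # hypothetical prepared dataset
        "scorers": ["accuracy"],                    # hypothetical scorer name
    },
)
print(resp.json())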
LLM As Judge
  • Scorer wrapping Braintrust AnswerCorrectness()
  • Scorer using a judge model hosted by a Llama Stack distribution (via inference_api); a sketch follows below
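A minimal sketch of what an LLM-as-judge scorer built on the BaseScorer abstraction (introduced under High-Level Changes) could look like. The score() method name, its signature, and the inference_api call are assumptions for illustration; the sample follows the ScorerInputSample schema shown in the next section.

from typing import Any, Dict


class LLMJudgeScorer:  # would subclass the PR's BaseScorer; kept standalone here
    """Hypothetical judge scorer: asks a hosted model to grade an answer."""

    def __init__(self, inference_api: Any, judge_model: str):
        self.inference_api = inference_api   # Llama Stack inference handle (assumed)
        self.judge_model = judge_model

    def score(self, sample) -> Dict[str, float]:
        prompt = (
            f"Question: {sample.input_query}\n"
            f"Expected answer: {sample.expected_answer}\n"
            f"Generated answer: {sample.generated_answer}\n"
            "Reply with 1 if the generated answer matches the expected answer, else 0."
        )
        # The exact inference_api call is an assumption; any chat-completion style
        # call that returns text would serve the same purpose.
        reply = self.inference_api.complete(model=self.judge_model, prompt=prompt)
        return {"judge_score": 1.0 if reply.strip().startswith("1") else 0.0}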
Eval Dataset Template Schema
@json_schema_type
class PostprocessedGeneration(BaseModel):
    completion_message: str
    logprobs: Optional[List[TokenLogProbs]] = None

@json_schema_type
class ScorerInputSample(DatasetSample):
    """
    A dataset is required to have the following columns to be used for scoring:
    - generated_answer: str
    - expected_answer: Union[str, List[str]]
    - (optional) input_query: str
    - (optional) generation_output: PostprocessedGeneration
    """

    generated_answer: str
    expected_answer: Union[str, List[str]]
    input_query: Optional[str] = None
    generation_output: Optional[PostprocessedGeneration] = None
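For example, a single score-only row built from this schema might look like the following; any fields inherited from DatasetSample are assumed to have defaults and are omitted here.

sample = ScorerInputSample(
    input_query="What is the capital of France?",
    expected_answer="Paris",
    generated_answer="The capital of France is Paris.",
)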

High-Level Changes

1. New API endpoints added
  • /datasets for registering/deleting datasets

    • /datasets/create --> add new datasets to the distribution; supports custom file upload / URL / HuggingFace datasets
    • /datasets/get
    • /datasets/delete
    • /datasets/list
  • /evals for running evaluation tasks

    • /evals/run_eval_task --> run the full eval flow: preprocessing -> generation -> postprocessing -> scoring
    • /evals/run_scorer --> run scoring only
2. New data structures added (a sketch of the registry pattern follows this list)
  • Registry: maintains datasets, scorers, and processors
    • BaseScorer: evaluation methods for scoring
    • BaseDataset: supports custom datasets / HuggingFace datasets
    • BaseGeneratorProcessor: performs preprocessing/postprocessing before/after generation
    • BaseGenerator: performs inference generation
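A rough sketch of the registry pattern described above, under the assumption that it is essentially a name-to-object mapping kept per resource type; the class and method names here are illustrative, not the PR's actual interface.

from typing import Dict, Generic, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Hypothetical minimal registry: maps string names to registered objects."""

    def __init__(self) -> None:
        self._entries: Dict[str, T] = {}

    def register(self, name: str, obj: T) -> None:
        self._entries[name] = obj

    def get(self, name: str) -> T:
        return self._entries[name]

    def list_all(self) -> Dict[str, T]:
        return dict(self._entries)


# One registry per resource type, e.g.:
# scorer_registry: Registry[BaseScorer] = Registry()
# dataset_registry: Registry[BaseDataset] = Registry()
# processor_registry: Registry[BaseGeneratorProcessor] = Registry()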

(experimental) eleuther harness

  • integrate w/ Eleuther Eval Harness
  • add custom task for eleuther eval harness on mmlu_pro (e.g. recipe)
  • dummy loglikelihood outputs for inference (see the sketch below)
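A minimal sketch of the "dummy loglikelihood" idea: an adapter that satisfies the Eleuther harness LM interface but returns placeholder scores, so the task plumbing (e.g. the mmlu_pro recipe) can be exercised without a real inference backend. The lm_eval import path and method signatures follow lm-evaluation-harness v0.4.x and are an assumption about the version used here.

from typing import List, Tuple

from lm_eval.api.model import LM


class DummyLM(LM):
    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
        # Return a fixed (logprob, is_greedy) pair for every request.
        return [(0.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests) -> List[float]:
        return [0.0 for _ in requests]

    def generate_until(self, requests) -> List[str]:
        return ["" for _ in requests]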

-- Not in this PR
🚧 jobs API for background eval job scheduling
🚧 batch inference
🚧 persist intermediate datasets during run_eval_task

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 10, 2024
@yanxi0830 yanxi0830 changed the title [WIP] Evals API MVP Evals API MVP Oct 15, 2024