
Evals API MVP #235

Open · wants to merge 28 commits into main

Conversation

@yanxi0830 (Contributor) commented Oct 10, 2024

DevX Flow

Step 1. Register Eval Dataset
python -m llama_stack.apis.datasets.client
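For context, a registration request against the new /datasets/create endpoint (described under High-Level Changes below) might look roughly like the following. This is a minimal sketch: the base URL and the payload field names ("uuid", "dataset", "type", "identifier") are illustrative assumptions, not the confirmed request schema.

import requests

# Hypothetical sketch: register a HuggingFace-hosted eval dataset via the
# /datasets/create endpoint added in this PR. Field names and the server
# address are assumptions for illustration only.
resp = requests.post(
    "http://localhost:5000/datasets/create",
    json={
        "uuid": "mmlu-eval",              # hypothetical dataset identifier
        "dataset": {
            "type": "huggingface",        # custom upload / url / huggingface are supported
            "identifier": "cais/mmlu",    # hypothetical HF dataset path
        },
    },
)
resp.raise_for_status()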
Step 2. Run Eval Scorer
python -m llama_stack.apis.evals.client
  • (benchmark) runs the full preprocess -> generation -> postprocess -> score eval task flow
  • (evaluate score only) runs the scorer only on a prepared eval dataset with the columns expected_answer and generated_answer (see the sketch below)
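As a rough illustration of the score-only mode, a call to the new /evals/run_scorer endpoint could look like this. The request field names and the scorer identifier are assumptions, not the confirmed API shape.

import requests

# Hypothetical sketch: score an already-generated dataset. The dataset is
# assumed to have been registered in Step 1 and to already contain the
# expected_answer and generated_answer columns.
resp = requests.post(
    "http://localhost:5000/evals/run_scorer",
    json={
        "dataset": "mmlu-eval-with-generations",   # hypothetical prepared dataset
        "scorers": ["accuracy"],                    # hypothetical scorer name
    },
)
print(resp.json())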
LLM As Judge
  • Scorer wrapping Braintrust AnswerCorrectness()
  • Scorer using a judge model hosted by a Llama Stack distribution (via inference_api); a sketch follows below
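A minimal sketch of what an LLM-as-judge scorer built on the BaseScorer abstraction (introduced under High-Level Changes) could look like. The score() method name, its signature, and the inference_api call are assumptions for illustration; the sample follows the ScorerInputSample schema shown in the next section.

from typing import Any, Dict


class LLMJudgeScorer:  # would subclass the PR's BaseScorer; kept standalone here
    """Hypothetical judge scorer: asks a hosted model to grade an answer."""

    def __init__(self, inference_api: Any, judge_model: str):
        self.inference_api = inference_api   # Llama Stack inference handle (assumed)
        self.judge_model = judge_model

    def score(self, sample) -> Dict[str, float]:
        prompt = (
            f"Question: {sample.input_query}\n"
            f"Expected answer: {sample.expected_answer}\n"
            f"Generated answer: {sample.generated_answer}\n"
            "Reply with 1 if the generated answer matches the expected answer, else 0."
        )
        # The exact inference_api call is an assumption; any chat-completion style
        # call that returns text would serve the same purpose.
        reply = self.inference_api.complete(model=self.judge_model, prompt=prompt)
        return {"judge_score": 1.0 if reply.strip().startswith("1") else 0.0}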
Eval Dataset Template Schema
@json_schema_type
class PostprocessedGeneration(BaseModel):
    completion_message: str
    logprobs: Optional[List[TokenLogProbs]] = None

@json_schema_type
class ScorerInputSample(DatasetSample):
    """
    A dataset is required to have the following columns to be used for scoring:
    - generated_answer: str
    - expected_answer: Union[str, List[str]]
    - (optional) input_query: str
    - (optional) generation_output: PostprocessedGeneration
    """

    generated_answer: str
    expected_answer: Union[str, List[str]]
    input_query: Optional[str] = None
    generation_output: Optional[PostprocessedGeneration] = None
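For example, a single score-only row built from this schema might look like the following; any fields inherited from DatasetSample are assumed to have defaults and are omitted here.

sample = ScorerInputSample(
    input_query="What is the capital of France?",
    expected_answer="Paris",
    generated_answer="The capital of France is Paris.",
)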

High-Level Changes

1. New API endpoints added
  • /datasets for registering/deleting datasets

    • /datasets/create --> add new datasets to the distribution; supports custom file upload / URL / HuggingFace datasets
    • /datasets/get
    • /datasets/delete
    • /datasets/list
  • /evals for running evaluation tasks

    • /evals/run_eval_task --> run the full eval flow: preprocessing -> generation -> postprocessing -> scoring
    • /evals/run_scorer --> run scoring only
2. New data structures added (a sketch of the registry pattern follows this list)
  • Registry: maintains datasets, scorers, and processors
    • BaseScorer: evaluation methods for scoring
    • BaseDataset: supports custom datasets / HuggingFace datasets
    • BaseGeneratorProcessor: performs preprocessing/postprocessing before/after generation
    • BaseGenerator: performs inference generation
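A rough sketch of the registry pattern described above, under the assumption that it is essentially a name-to-object mapping kept per resource type; the class and method names here are illustrative, not the PR's actual interface.

from typing import Dict, Generic, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Hypothetical minimal registry: maps string names to registered objects."""

    def __init__(self) -> None:
        self._entries: Dict[str, T] = {}

    def register(self, name: str, obj: T) -> None:
        self._entries[name] = obj

    def get(self, name: str) -> T:
        return self._entries[name]

    def list_all(self) -> Dict[str, T]:
        return dict(self._entries)


# One registry per resource type, e.g.:
# scorer_registry: Registry[BaseScorer] = Registry()
# dataset_registry: Registry[BaseDataset] = Registry()
# processor_registry: Registry[BaseGeneratorProcessor] = Registry()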

(experimental) eleuther harness

  • integrate w/ Eleuther Eval Harness
  • add custom task for eleuther eval harness on mmlu_pro (e.g. recipe)
  • dummy loglikelihood outputs for inference (see the sketch below)
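A minimal sketch of the "dummy loglikelihood" idea: an adapter that satisfies the Eleuther harness LM interface but returns placeholder scores, so the task plumbing (e.g. the mmlu_pro recipe) can be exercised without a real inference backend. The lm_eval import path and method signatures follow lm-evaluation-harness v0.4.x and are an assumption about the version used here.

from typing import List, Tuple

from lm_eval.api.model import LM


class DummyLM(LM):
    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
        # Return a fixed (logprob, is_greedy) pair for every request.
        return [(0.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests) -> List[float]:
        return [0.0 for _ in requests]

    def generate_until(self, requests) -> List[str]:
        return ["" for _ in requests]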

-- Not in this PR
🚧 jobs API for background eval job scheduling
🚧 batch inference
🚧 persist intermediate datasets during run_eval_task

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 10, 2024
@yanxi0830 yanxi0830 changed the title [WIP] Evals API MVP Evals API MVP Oct 15, 2024