feat(benchmarks) Add LLM evaluation pipeline for general NLP challenge (#3767)
Co-authored-by: jafermarq <[email protected]>
Co-authored-by: Daniel J. Beutel <[email protected]>
1 parent 0f7c64e, commit 24e9af9
Showing 9 changed files with 446 additions and 24 deletions.
# FlowerTune LLM Evaluation

This directory provides various evaluation metrics to assess the quality of your fine-tuned LLMs.
If you are participating in the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard), evaluating your fine-tuned LLM is the final step before your submission is added to the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate). The evaluation scores generated here will be displayed as the definitive values on the LLM Leaderboard.

## How to run

Navigate to the directory corresponding to your selected challenge (`general NLP`, `finance`, `medical`, or `code`) and follow the instructions there to execute the evaluation.

> [!NOTE]
> If you wish to participate in the LLM Leaderboard, you must not modify the evaluation code and should use the exact command provided in the respective directory to run the evaluation.

## Baseline results

The default template generated by `flwr new` (see the [Project Creation Instructions](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm#create-a-new-project)) for each challenge will produce results as follows, which serve as the lower bound on the LLM Leaderboard.

### General NLP

|          | MT-1 | MT-2 | MT-Avg |
|:--------:|:----:|:----:|:------:|
| MT Score | 5.54 | 5.52 | 5.53   |

### Finance

|         | FPB   | FIQA  | TFNS  | Avg   |
|:-------:|:-----:|:-----:|:-----:|:-----:|
| Acc (%) | 44.55 | 62.50 | 28.77 | 45.27 |

### Medical

|         | PubMedQA | MedMCQA | MedQA | Avg   |
|:-------:|:--------:|:-------:|:-----:|:-----:|
| Acc (%) | 59.00    | 23.69   | 27.10 | 36.60 |

### Code

|            | MBPP  | HumanEval | MultiPL-E (JS) | MultiPL-E (C++) | Avg   |
|:----------:|:-----:|:---------:|:--------------:|:---------------:|:-----:|
| Pass@1 (%) | 32.60 | 26.83     | 29.81          | 24.22           | 28.37 |

## Make submission on FlowerTune LLM Leaderboard

If your LLM outperforms the listed benchmarks in any challenge, we encourage you to submit your code and model to the FlowerTune LLM Leaderboard without hesitation (see the [How-to-participate Instructions](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate)).
63 changes: 63 additions & 0 deletions
benchmarks/flowertune-llm/evaluation/general-nlp/README.md
# Evaluation for General NLP challenge

We leverage the MT-Bench metric provided by [FastChat](https://github.com/lm-sys/FastChat) to evaluate fine-tuned LLMs.
[MT-Bench](https://arxiv.org/abs/2306.05685) is a comprehensive suite of multi-turn, open-ended questions designed to evaluate chat assistants.
Strong LLMs, such as GPT-4, serve as judges to assess the quality of the responses provided by the chat assistants under examination.

## Environment Setup

```shell
git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/general-nlp ./flowertune-eval-general-nlp && rm -rf flower && cd flowertune-eval-general-nlp
```

Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:

```shell
# From a new python environment, run:
pip install -r requirements.txt

# Log in to your Hugging Face account
huggingface-cli login
```

Download data from [FastChat](https://github.com/lm-sys/FastChat):

```shell
git clone --depth=1 https://github.com/lm-sys/FastChat.git && cd FastChat && git checkout d561f87b24de197e25e3ddf7e09af93ced8dfe36 && mv fastchat/llm_judge/data ../data && cd .. && rm -rf FastChat
```
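
If you want to sanity-check the downloaded data before running the evaluation, a minimal Python sketch like the one below can help. It assumes the standard FastChat layout, where each line of `data/mt_bench/question.jsonl` is a JSON object with `question_id`, `category`, and `turns` fields; treat those field names as assumptions rather than a guaranteed format.

```python
# Illustrative sanity check for the downloaded MT-Bench questions.
# Field names (question_id, category, turns) are assumed from the FastChat format.
import json

with open("data/mt_bench/question.jsonl", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(questions)} questions")
print("Categories:", sorted({q.get("category", "unknown") for q in questions}))
sample = questions[0]
print("Example question_id:", sample.get("question_id"))
print("First turn:", sample.get("turns", [""])[0])
```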

## Generate model answers from MT-bench questions

```bash
python gen_model_answer.py --peft-path=/path/to/fine-tuned-peft-model-dir/ # e.g., ./peft_1
```

The answers will be saved to `data/mt_bench/model_answer/[base_model_name].jsonl` by default.
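
To quickly verify that an answer was generated for every question, you can run a short check such as the sketch below. The answer filename (`Mistral-7B-v0.3.jsonl`) and the `question_id` field are assumptions based on the default base model and the FastChat answer format; adjust them to match your setup.

```python
# Illustrative coverage check: is there one generated answer per MT-Bench question?
# The answer filename is an assumption based on the default base model.
import json


def load_ids(path):
    """Collect the question_id of every JSON line in the given file."""
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["question_id"] for line in f if line.strip()}


question_ids = load_ids("data/mt_bench/question.jsonl")
answer_ids = load_ids("data/mt_bench/model_answer/Mistral-7B-v0.3.jsonl")

missing = question_ids - answer_ids
print(f"{len(answer_ids & question_ids)}/{len(question_ids)} questions answered")
if missing:
    print("Missing question_ids:", sorted(missing))
```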

## Generate judgments using GPT-4

Please follow these [instructions](https://platform.openai.com/docs/quickstart/developer-quickstart) to create an OpenAI API key.
The estimated cost of running this evaluation is approximately USD 10.

> [!NOTE]
> If you changed the base model of your LLM project, specify it in the command below via `--model-list`.

```bash
export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
python gen_judgement.py --model-list Mistral-7B-v0.3
```

The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_single.jsonl` by default.

## Show MT-bench scores

```bash
python show_result.py --model-list Mistral-7B-v0.3
```

GPT-4 gives each conversation a score out of 10 for the first turn (MT-1) and the second turn (MT-2), along with an average value as the third score (MT-Avg).

> [!NOTE]
> Please ensure that you provide all **three scores** when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
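
As an optional cross-check before submitting, the sketch below recomputes the three values directly from the raw judgment file. It assumes the single-mode judgment lines carry `model`, `score`, and `turn` fields (with invalid judgments marked by a negative score), which is how FastChat's single-answer grading is commonly structured; treat it as an illustrative sanity check rather than a replacement for `show_result.py`.

```python
# Illustrative recomputation of MT-1, MT-2 and MT-Avg from the judgment file.
# Field names (model, score, turn) are assumed from FastChat's single mode.
import json
from collections import defaultdict

MODEL_NAME = "Mistral-7B-v0.3"  # adjust if you changed the base model
scores = defaultdict(list)  # turn number -> list of scores

with open("data/mt_bench/model_judgment/gpt-4_single.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record["model"] == MODEL_NAME and record["score"] >= 0:
            scores[record["turn"]].append(record["score"])

mt1 = sum(scores[1]) / len(scores[1])
mt2 = sum(scores[2]) / len(scores[2])
all_scores = scores[1] + scores[2]
mt_avg = sum(all_scores) / len(all_scores)
print(f"MT-1: {mt1:.2f}  MT-2: {mt2:.2f}  MT-Avg: {mt_avg:.2f}")
```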
130 changes: 130 additions & 0 deletions
benchmarks/flowertune-llm/evaluation/general-nlp/gen_judgement.py
""" | ||
This python file is adapted from https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_judgment.py | ||
FastChat (https://github.com/lm-sys/FastChat) is licensed under the Apache License, Version 2.0. | ||
Citation: | ||
@misc{zheng2023judging, | ||
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena}, | ||
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu | ||
and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang | ||
and Joseph E. Gonzalez and Ion Stoica}, | ||
year={2023}, | ||
eprint={2306.05685}, | ||
archivePrefix={arXiv}, | ||
primaryClass={cs.CL} | ||
} | ||
""" | ||
|
||
import argparse | ||
import json | ||
|
||
from fastchat.llm_judge.common import ( | ||
NEED_REF_CATS, | ||
check_data, | ||
get_model_list, | ||
load_judge_prompts, | ||
load_model_answers, | ||
load_questions, | ||
play_a_match_single, | ||
) | ||
from fastchat.llm_judge.gen_judgment import make_judge_single, make_match_single | ||
from tqdm import tqdm | ||
|
||
if __name__ == "__main__": | ||
parser = argparse.ArgumentParser() | ||
parser.add_argument( | ||
"--judge-file", | ||
type=str, | ||
default="data/judge_prompts.jsonl", | ||
help="The file of judge prompts.", | ||
) | ||
parser.add_argument("--judge-model", type=str, default="gpt-4") | ||
parser.add_argument( | ||
"--model-list", | ||
type=str, | ||
nargs="+", | ||
default=None, | ||
help="A list of models to be evaluated", | ||
) | ||
args = parser.parse_args() | ||
|
||
question_file = "data/mt_bench/question.jsonl" | ||
answer_dir = "data/mt_bench/model_answer" | ||
ref_answer_dir = "data/mt_bench/reference_answer" | ||
|
||
# Load questions | ||
questions = load_questions(question_file, None, None) | ||
|
||
# Load answers | ||
model_answers = load_model_answers(answer_dir) | ||
ref_answers = load_model_answers(ref_answer_dir) | ||
|
||
# Load judge | ||
judge_prompts = load_judge_prompts(args.judge_file) | ||
|
||
if args.model_list is None: | ||
models = get_model_list(answer_dir) | ||
else: | ||
models = args.model_list | ||
|
||
judges = make_judge_single(args.judge_model, judge_prompts) | ||
play_a_match_func = play_a_match_single | ||
output_file = f"data/mt_bench/model_judgment/{args.judge_model}_single.jsonl" | ||
make_match_func = make_match_single | ||
baseline_model = None | ||
|
||
check_data(questions, model_answers, ref_answers, models, judges) | ||
|
||
question_math = [q for q in questions if q["category"] in NEED_REF_CATS] | ||
question_default = [q for q in questions if q["category"] not in NEED_REF_CATS] | ||
|
||
# Make matches | ||
matches = [] | ||
matches += make_match_func( | ||
question_default, models, model_answers, judges["default"], baseline_model | ||
) | ||
matches += make_match_func( | ||
question_math, | ||
models, | ||
model_answers, | ||
judges["math"], | ||
baseline_model, | ||
ref_answers, | ||
) | ||
matches += make_match_func( | ||
question_default, | ||
models, | ||
model_answers, | ||
judges["default-mt"], | ||
baseline_model, | ||
multi_turn=True, | ||
) | ||
matches += make_match_func( | ||
question_math, | ||
models, | ||
model_answers, | ||
judges["math-mt"], | ||
baseline_model, | ||
ref_answers, | ||
multi_turn=True, | ||
) | ||
|
||
match_stat = {} | ||
match_stat["bench_name"] = "mt_bench" | ||
match_stat["mode"] = "single" | ||
match_stat["judge"] = args.judge_model | ||
match_stat["baseline"] = baseline_model | ||
match_stat["model_list"] = models | ||
match_stat["total_num_questions"] = len(questions) | ||
match_stat["total_num_matches"] = len(matches) | ||
match_stat["output_path"] = output_file | ||
|
||
# Show match stats and prompt enter to continue | ||
print("Stats:") | ||
print(json.dumps(match_stat, indent=4)) | ||
input("Press Enter to confirm...") | ||
|
||
# Play matches | ||
for match in tqdm(matches): | ||
play_a_match_func(match, output_file=output_file) |