
feat(benchmarks) Add LLM evaluation pipeline for general NLP challenge (#3767)

Co-authored-by: jafermarq <[email protected]>
Co-authored-by: Daniel J. Beutel <[email protected]>
3 people authored Sep 2, 2024
1 parent 0f7c64e commit 24e9af9
Showing 9 changed files with 446 additions and 24 deletions.
54 changes: 30 additions & 24 deletions benchmarks/flowertune-llm/README.md
@@ -1,4 +1,4 @@
![](_static/flower_llm.jpg)
![](_static/flower_llm.png)

# FlowerTune LLM Leaderboard

@@ -9,39 +9,40 @@ Please follow the instructions to run and evaluate the federated LLMs.

## Create a new project

As the first step, please register a Flower account on [Flower website](https://flower.ai/login).
Assuming `flwr` package is already installed on your system (check [here](https://flower.ai/docs/framework/how-to-install-flower.html) for `flwr` installation).
We provide a single-line command to create a new project directory based on your selected challenge:
As the first step, please register for a Flower account on [flower.ai/login](https://flower.ai/login).
Then, create a new Python environment and install Flower.

> [!TIP]
> We recommend using `pyenv` and the `virtualenv` plugin to create your environment. Other managers such as Conda would likely work too. Check the [documentation](https://flower.ai/docs/framework/how-to-install-flower.html) for alternative ways of installing Flower.
```shell
flwr new --framework=flwrtune --username=your_flower_account
pip install flwr
```
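
For reference, here is a minimal sketch of creating such an environment with `pyenv` and the `virtualenv` plugin; the Python version and environment name below are only examples:

```shell
pyenv install 3.10.14                 # install a Python version (example version)
pyenv virtualenv 3.10.14 flowertune   # create a virtualenv named "flowertune" (example name)
pyenv activate flowertune             # activate it, then install flwr as shown above
```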

Then you will see a prompt to ask your project name and the choice of LLM challenges from the set of general NLP, finance, medical and code.
Type your project name and select your preferred challenge,
and then a new project directory will be generated automatically.

### Structure
In the new environment, create a new Flower project using the `FlowerTune` template. You will be prompted for a project name, your username, and your choice of LLM challenge:
```shell
flwr new --framework=FlowerTune
```

After running `flwr new`, you will see a new directory generated with the following structure:
The `flwr new` command will generate a directory with the following structure:

```bash
<project-name>
├── README.md                   # <- Instructions
├── pyproject.toml              # <- Environment dependencies
├── pyproject.toml              # <- Environment dependencies and configs
└── <project_name>
    ├── app.py                  # <- Flower ClientApp/ServerApp build
    ├── client.py               # <- Flower client constructor
    ├── server.py               # <- Server-related functions
    ├── models.py               # <- Model build
    ├── client_app.py           # <- Flower ClientApp build
    ├── dataset.py              # <- Dataset and tokenizer build
    ├── conf/config.yaml        # <- User configuration
    └── conf/static_config.yaml # <- Static configuration
    ├── models.py               # <- Model build
    ├── server_app.py           # <- Flower ServerApp build
    └── strategy.py             # <- Flower strategy build
```

This can serve as the starting point for you to build up your own federated LLM fine-tuning methods.
Please note that any modification to the content of `conf/static_config.yaml` is strictly prohibited for those who wish to participate in the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard).
Otherwise, the submission will not be considered.

> [!IMPORTANT]
> Please note that if you intend to submit your project as an entry to the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard), modifications to the `[tool.flwr.app.config.static]` and `[tool.flwr.federations.local-simulation]` sections in the `pyproject.toml` are not allowed and will invalidate the submission.

## Run FlowerTune LLM challenges

@@ -50,12 +51,17 @@ With a new project directory created, running a baseline challenge can be done b
1. Navigate inside the directory that you just created.


2. Follow the `Environments setup` section of `README.md` in the project directory to install project dependencies.
2. Follow the `Environments setup` section of `README.md` in the project directory to install the project dependencies.


3. Run the challenge as indicated in the `Running the challenge` section in the `README.md`.
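
Putting these three steps together, a typical session might look like the following sketch; the exact install and run commands are defined in the generated `README.md` and `pyproject.toml`, so treat the ones below as assumptions:

```shell
cd <project-name>      # 1. navigate inside the newly created project directory
pip install -e .       # 2. install the project dependencies (see "Environments setup")
flwr run .             # 3. run the challenge (see "Running the challenge")
```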

## Evaluate pre-trained LLMs
## Evaluate fine-tuned LLMs

Once the LLM fine-tuning has finished, evaluate the performance of your fine-tuned LLM
by following the `README.md` in the [`evaluation`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation) directory.


After the LLM fine-tuning finished, evaluate the performance of your pre-trained LLMs
following the `README.md` in `evaluation` directory.
> [!NOTE]
> If you have any questions about running FlowerTune LLM challenges or evaluation, please feel free to post in the [Flower Discuss](https://discuss.flower.ai) forum,
> or join our [Slack channel](https://flower.ai/join-slack/) to ask questions in the `#flowertune-llm-leaderboard` channel.
Binary file removed benchmarks/flowertune-llm/_static/flower_llm.jpg
Binary file not shown.
Binary file added benchmarks/flowertune-llm/_static/flower_llm.png
46 changes: 46 additions & 0 deletions benchmarks/flowertune-llm/evaluation/README.md
@@ -0,0 +1,46 @@
# FlowerTune LLM Evaluation

This directory provides various evaluation metrics to assess the quality of your fine-tuned LLMs.
If you are participating in the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard), evaluating your fine-tuned LLM is the final step before your submission is added to the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate). The evaluation scores generated here will be displayed as the definitive values on the LLM Leaderboard.

## How to run

Navigate to the directory corresponding to your selected challenge (`general NLP`, `finance`, `medical`, or `code`) and follow the instructions there to execute the evaluation.
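
For example, a minimal sketch for the general NLP challenge, assuming the challenge directories are named after the challenges as in this repository:

```shell
cd general-nlp   # directory for the general NLP challenge
# ...then follow the README.md in this directory to run the evaluation
```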

> [!NOTE]
> If you wish to participate in the LLM Leaderboard, you must not modify the evaluation code and should use the exact command provided in the respective directory to run the evaluation.

## Baseline results

For each challenge, the default template generated by `flwr new` (see the [Project Creation Instructions](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm#create-a-new-project)) produces the following results, which serve as the lower bound on the LLM Leaderboard.

### General NLP

| | MT-1 | MT-2 | MT-Avg |
|:--------:|:----:|:----:|:------:|
| MT Score | 5.54 | 5.52 | 5.53 |

### Finance

| | FPB | FIQA | TFNS | Avg |
|:-------:|:-----:|:-----:|:-----:|:-----:|
| Acc (%) | 44.55 | 62.50 | 28.77 | 45.27 |

### Medical

| | PubMedQA | MedMCQA | MedQA | Avg |
|:-------:|:--------:|:-------:|:-----:|:-----:|
| Acc (%) | 59.00 | 23.69 | 27.10 | 36.60 |

### Code

| | MBPP | HumanEval | MultiPL-E (JS) | MultiPL-E (C++) | Avg |
|:----------:|:-----:|:---------:|:--------------:|:---------------:|:-----:|
| Pass@1 (%) | 32.60 | 26.83 | 29.81 | 24.22 | 28.37 |


## Make submission on FlowerTune LLM Leaderboard

If your fine-tuned LLM outperforms the baseline results listed above in any challenge,
we encourage you to submit your code and model to the FlowerTune LLM Leaderboard (see the [How-to-participate Instructions](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate)).
63 changes: 63 additions & 0 deletions benchmarks/flowertune-llm/evaluation/general-nlp/README.md
@@ -0,0 +1,63 @@
# Evaluation for General NLP challenge

We leverage the MT-Bench metric provided by [FastChat](https://github.com/lm-sys/FastChat) to evaluate fine-tuned LLMs.
[MT-Bench](https://arxiv.org/abs/2306.05685) represents a comprehensive suite of multi-turn, open-ended questions designed to evaluate chat assistants.
Strong LLMs, such as GPT-4, serve as judges to assess the quality of responses provided by the chat assistants under examination.

## Environment Setup

```shell
git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/general-nlp ./flowertune-eval-general-nlp && rm -rf flower && cd flowertune-eval-general-nlp
```

Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:

```shell
# From a new python environment, run:
pip install -r requirements.txt

# Log in HuggingFace account
huggingface-cli login
```

Download data from [FastChat](https://github.com/lm-sys/FastChat):

```shell
git clone --depth=1 https://github.com/lm-sys/FastChat.git && cd FastChat && git checkout d561f87b24de197e25e3ddf7e09af93ced8dfe36 && mv fastchat/llm_judge/data ../data && cd .. && rm -rf FastChat
```


## Generate model answers from MT-bench questions

```bash
python gen_model_answer.py --peft-path=/path/to/fine-tuned-peft-model-dir/ # e.g., ./peft_1
```
The answers will be saved to `data/mt_bench/model_answer/[base_model_name].jsonl` by default.


## Generate judgments using GPT-4

Please follow these [instructions](https://platform.openai.com/docs/quickstart/developer-quickstart) to create an OpenAI API key.
The estimated cost of running this evaluation is approximately USD 10.

> [!NOTE]
> If you changed the base model of your LLM project, specify it in the command below via `--model-list`.
```bash
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
python gen_judgement.py --model-list Mistral-7B-v0.3
```

The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_single.jsonl` by default.


## Show MT-bench scores

```bash
python show_result.py --model-list Mistral-7B-v0.3
```
GPT-4 gives a score on a scale of 1 to 10 to the first turn (MT-1) and the second turn (MT-2) of the conversations, along with an average value as the third score.

> [!NOTE]
> Please ensure that you provide all **three scores** when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
130 changes: 130 additions & 0 deletions benchmarks/flowertune-llm/evaluation/general-nlp/gen_judgement.py
@@ -0,0 +1,130 @@
"""
This python file is adapted from https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_judgment.py
FastChat (https://github.com/lm-sys/FastChat) is licensed under the Apache License, Version 2.0.
Citation:
@misc{zheng2023judging,
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu
and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang
and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
"""

import argparse
import json

from fastchat.llm_judge.common import (
    NEED_REF_CATS,
    check_data,
    get_model_list,
    load_judge_prompts,
    load_model_answers,
    load_questions,
    play_a_match_single,
)
from fastchat.llm_judge.gen_judgment import make_judge_single, make_match_single
from tqdm import tqdm

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--judge-file",
        type=str,
        default="data/judge_prompts.jsonl",
        help="The file of judge prompts.",
    )
    parser.add_argument("--judge-model", type=str, default="gpt-4")
    parser.add_argument(
        "--model-list",
        type=str,
        nargs="+",
        default=None,
        help="A list of models to be evaluated",
    )
    args = parser.parse_args()

    question_file = "data/mt_bench/question.jsonl"
    answer_dir = "data/mt_bench/model_answer"
    ref_answer_dir = "data/mt_bench/reference_answer"

    # Load questions
    questions = load_questions(question_file, None, None)

    # Load answers
    model_answers = load_model_answers(answer_dir)
    ref_answers = load_model_answers(ref_answer_dir)

    # Load judge
    judge_prompts = load_judge_prompts(args.judge_file)

    if args.model_list is None:
        models = get_model_list(answer_dir)
    else:
        models = args.model_list

    judges = make_judge_single(args.judge_model, judge_prompts)
    play_a_match_func = play_a_match_single
    output_file = f"data/mt_bench/model_judgment/{args.judge_model}_single.jsonl"
    make_match_func = make_match_single
    baseline_model = None

    check_data(questions, model_answers, ref_answers, models, judges)

    question_math = [q for q in questions if q["category"] in NEED_REF_CATS]
    question_default = [q for q in questions if q["category"] not in NEED_REF_CATS]

    # Make matches
    matches = []
    matches += make_match_func(
        question_default, models, model_answers, judges["default"], baseline_model
    )
    matches += make_match_func(
        question_math,
        models,
        model_answers,
        judges["math"],
        baseline_model,
        ref_answers,
    )
    matches += make_match_func(
        question_default,
        models,
        model_answers,
        judges["default-mt"],
        baseline_model,
        multi_turn=True,
    )
    matches += make_match_func(
        question_math,
        models,
        model_answers,
        judges["math-mt"],
        baseline_model,
        ref_answers,
        multi_turn=True,
    )

    match_stat = {}
    match_stat["bench_name"] = "mt_bench"
    match_stat["mode"] = "single"
    match_stat["judge"] = args.judge_model
    match_stat["baseline"] = baseline_model
    match_stat["model_list"] = models
    match_stat["total_num_questions"] = len(questions)
    match_stat["total_num_matches"] = len(matches)
    match_stat["output_path"] = output_file

    # Show match stats and prompt enter to continue
    print("Stats:")
    print(json.dumps(match_stat, indent=4))
    input("Press Enter to confirm...")

    # Play matches
    for match in tqdm(matches):
        play_a_match_func(match, output_file=output_file)
