feat(benchmarks) Add LLM evaluation pipeline for general NLP challenge (#3767)
Co-authored-by: jafermarq <[email protected]>
Co-authored-by: Daniel J. Beutel <[email protected]>
1 parent 0f7c64e, commit 24e9af9
Showing 9 changed files with 446 additions and 24 deletions.
# FlowerTune LLM Evaluation

This directory provides various evaluation metrics to assess the quality of your fine-tuned LLMs.
If you are participating in the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard), evaluating your fine-tuned LLM is the final step before your submission is added to the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate). The evaluation scores generated here will be displayed as the definitive values on the LLM Leaderboard.

## How to run

Navigate to the directory corresponding to your selected challenge (`general NLP`, `finance`, `medical`, or `code`) and follow the instructions there to execute the evaluation.

> [!NOTE]
> If you wish to participate in the LLM Leaderboard, you must not modify the evaluation code and should use the exact command provided in the respective directory to run the evaluation.

## Baseline results

The default template generated by `flwr new` (see the [Project Creation Instructions](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm#create-a-new-project)) for each challenge will produce results as follows, which serve as the lower bound on the LLM Leaderboard.

### General NLP

|          | MT-1 | MT-2 | MT-Avg |
|:--------:|:----:|:----:|:------:|
| MT Score | 5.54 | 5.52 | 5.53   |

### Finance

|         | FPB   | FIQA  | TFNS  | Avg   |
|:-------:|:-----:|:-----:|:-----:|:-----:|
| Acc (%) | 44.55 | 62.50 | 28.77 | 45.27 |

### Medical

|         | PubMedQA | MedMCQA | MedQA | Avg   |
|:-------:|:--------:|:-------:|:-----:|:-----:|
| Acc (%) | 59.00    | 23.69   | 27.10 | 36.60 |

### Code

|            | MBPP  | HumanEval | MultiPL-E (JS) | MultiPL-E (C++) | Avg   |
|:----------:|:-----:|:---------:|:--------------:|:---------------:|:-----:|
| Pass@1 (%) | 32.60 | 26.83     | 29.81          | 24.22           | 28.37 |

## Make submission on FlowerTune LLM Leaderboard

If your LLM outperforms the listed benchmarks in any challenge, we encourage you to submit your code and model to the FlowerTune LLM Leaderboard without hesitation (see the [How-to-participate Instructions](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate)).
63 changes: 63 additions & 0 deletions
benchmarks/flowertune-llm/evaluation/general-nlp/README.md
# Evaluation for General NLP challenge

We leverage the MT-Bench metric provided by [FastChat](https://github.com/lm-sys/FastChat) to evaluate fine-tuned LLMs.
[MT-Bench](https://arxiv.org/abs/2306.05685) is a comprehensive suite of multi-turn, open-ended questions designed to evaluate chat assistants.
Strong LLMs, such as GPT-4, serve as judges to assess the quality of the responses provided by the chat assistants under examination.

## Environment Setup

```shell
git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/general-nlp ./flowertune-eval-general-nlp && rm -rf flower && cd flowertune-eval-general-nlp
```

Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:

```shell
# From a new python environment, run:
pip install -r requirements.txt

# Log in to your Hugging Face account
huggingface-cli login
```

Download data from [FastChat](https://github.com/lm-sys/FastChat):

```shell
git clone --depth=1 https://github.com/lm-sys/FastChat.git && cd FastChat && git checkout d561f87b24de197e25e3ddf7e09af93ced8dfe36 && mv fastchat/llm_judge/data ../data && cd .. && rm -rf FastChat
```
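
If you want to sanity-check the downloaded data before running the evaluation, a minimal Python sketch like the one below can help. It assumes the standard FastChat layout, where each line of `data/mt_bench/question.jsonl` is a JSON object with `question_id`, `category`, and `turns` fields; treat those field names as assumptions rather than a guaranteed format.

```python
# Illustrative sanity check for the downloaded MT-Bench questions.
# Field names (question_id, category, turns) are assumed from the FastChat format.
import json

with open("data/mt_bench/question.jsonl", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(questions)} questions")
print("Categories:", sorted({q.get("category", "unknown") for q in questions}))
sample = questions[0]
print("Example question_id:", sample.get("question_id"))
print("First turn:", sample.get("turns", [""])[0])
```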

## Generate model answers from MT-bench questions

```bash
python gen_model_answer.py --peft-path=/path/to/fine-tuned-peft-model-dir/ # e.g., ./peft_1
```

The answers will be saved to `data/mt_bench/model_answer/[base_model_name].jsonl` by default.
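
To quickly verify that an answer was generated for every question, you can run a short check such as the sketch below. The answer filename (`Mistral-7B-v0.3.jsonl`) and the `question_id` field are assumptions based on the default base model and the FastChat answer format; adjust them to match your setup.

```python
# Illustrative coverage check: is there one generated answer per MT-Bench question?
# The answer filename is an assumption based on the default base model.
import json


def load_ids(path):
    """Collect the question_id of every JSON line in the given file."""
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["question_id"] for line in f if line.strip()}


question_ids = load_ids("data/mt_bench/question.jsonl")
answer_ids = load_ids("data/mt_bench/model_answer/Mistral-7B-v0.3.jsonl")

missing = question_ids - answer_ids
print(f"{len(answer_ids & question_ids)}/{len(question_ids)} questions answered")
if missing:
    print("Missing question_ids:", sorted(missing))
```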

## Generate judgments using GPT-4

Please follow these [instructions](https://platform.openai.com/docs/quickstart/developer-quickstart) to create an OpenAI API key.
The estimated cost of running this evaluation is approximately USD 10.

> [!NOTE]
> If you changed the base model of your LLM project, specify it in the command below via `--model-list`.

```bash
export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
python gen_judgement.py --model-list Mistral-7B-v0.3
```

The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_single.jsonl` by default.

## Show MT-bench scores

```bash
python show_result.py --model-list Mistral-7B-v0.3
```

GPT-4 gives each conversation a score out of 10 for the first turn (MT-1) and the second turn (MT-2), along with an average value as the third score (MT-Avg).

> [!NOTE]
> Please ensure that you provide all **three scores** when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
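
As an optional cross-check before submitting, the sketch below recomputes the three values directly from the raw judgment file. It assumes the single-mode judgment lines carry `model`, `score`, and `turn` fields (with invalid judgments marked by a negative score), which is how FastChat's single-answer grading is commonly structured; treat it as an illustrative sanity check rather than a replacement for `show_result.py`.

```python
# Illustrative recomputation of MT-1, MT-2 and MT-Avg from the judgment file.
# Field names (model, score, turn) are assumed from FastChat's single mode.
import json
from collections import defaultdict

MODEL_NAME = "Mistral-7B-v0.3"  # adjust if you changed the base model
scores = defaultdict(list)  # turn number -> list of scores

with open("data/mt_bench/model_judgment/gpt-4_single.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record["model"] == MODEL_NAME and record["score"] >= 0:
            scores[record["turn"]].append(record["score"])

mt1 = sum(scores[1]) / len(scores[1])
mt2 = sum(scores[2]) / len(scores[2])
all_scores = scores[1] + scores[2]
mt_avg = sum(all_scores) / len(all_scores)
print(f"MT-1: {mt1:.2f}  MT-2: {mt2:.2f}  MT-Avg: {mt_avg:.2f}")
```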
130 changes: 130 additions & 0 deletions
benchmarks/flowertune-llm/evaluation/general-nlp/gen_judgement.py
""" | ||
This python file is adapted from https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_judgment.py | ||
FastChat (https://github.com/lm-sys/FastChat) is licensed under the Apache License, Version 2.0. | ||
Citation: | ||
@misc{zheng2023judging, | ||
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena}, | ||
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu | ||
and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang | ||
and Joseph E. Gonzalez and Ion Stoica}, | ||
year={2023}, | ||
eprint={2306.05685}, | ||
archivePrefix={arXiv}, | ||
primaryClass={cs.CL} | ||
} | ||
""" | ||
|
||
import argparse | ||
import json | ||
|
||
from fastchat.llm_judge.common import ( | ||
NEED_REF_CATS, | ||
check_data, | ||
get_model_list, | ||
load_judge_prompts, | ||
load_model_answers, | ||
load_questions, | ||
play_a_match_single, | ||
) | ||
from fastchat.llm_judge.gen_judgment import make_judge_single, make_match_single | ||
from tqdm import tqdm | ||
|
||
if __name__ == "__main__": | ||
parser = argparse.ArgumentParser() | ||
parser.add_argument( | ||
"--judge-file", | ||
type=str, | ||
default="data/judge_prompts.jsonl", | ||
help="The file of judge prompts.", | ||
) | ||
parser.add_argument("--judge-model", type=str, default="gpt-4") | ||
parser.add_argument( | ||
"--model-list", | ||
type=str, | ||
nargs="+", | ||
default=None, | ||
help="A list of models to be evaluated", | ||
) | ||
args = parser.parse_args() | ||
|
||
question_file = "data/mt_bench/question.jsonl" | ||
answer_dir = "data/mt_bench/model_answer" | ||
ref_answer_dir = "data/mt_bench/reference_answer" | ||
|
||
# Load questions | ||
questions = load_questions(question_file, None, None) | ||
|
||
# Load answers | ||
model_answers = load_model_answers(answer_dir) | ||
ref_answers = load_model_answers(ref_answer_dir) | ||
|
||
# Load judge | ||
judge_prompts = load_judge_prompts(args.judge_file) | ||
|
||
if args.model_list is None: | ||
models = get_model_list(answer_dir) | ||
else: | ||
models = args.model_list | ||
|
||
judges = make_judge_single(args.judge_model, judge_prompts) | ||
play_a_match_func = play_a_match_single | ||
output_file = f"data/mt_bench/model_judgment/{args.judge_model}_single.jsonl" | ||
make_match_func = make_match_single | ||
baseline_model = None | ||
|
||
check_data(questions, model_answers, ref_answers, models, judges) | ||
|
||
question_math = [q for q in questions if q["category"] in NEED_REF_CATS] | ||
question_default = [q for q in questions if q["category"] not in NEED_REF_CATS] | ||
|
||
# Make matches | ||
matches = [] | ||
matches += make_match_func( | ||
question_default, models, model_answers, judges["default"], baseline_model | ||
) | ||
matches += make_match_func( | ||
question_math, | ||
models, | ||
model_answers, | ||
judges["math"], | ||
baseline_model, | ||
ref_answers, | ||
) | ||
matches += make_match_func( | ||
question_default, | ||
models, | ||
model_answers, | ||
judges["default-mt"], | ||
baseline_model, | ||
multi_turn=True, | ||
) | ||
matches += make_match_func( | ||
question_math, | ||
models, | ||
model_answers, | ||
judges["math-mt"], | ||
baseline_model, | ||
ref_answers, | ||
multi_turn=True, | ||
) | ||
|
||
match_stat = {} | ||
match_stat["bench_name"] = "mt_bench" | ||
match_stat["mode"] = "single" | ||
match_stat["judge"] = args.judge_model | ||
match_stat["baseline"] = baseline_model | ||
match_stat["model_list"] = models | ||
match_stat["total_num_questions"] = len(questions) | ||
match_stat["total_num_matches"] = len(matches) | ||
match_stat["output_path"] = output_file | ||
|
||
# Show match stats and prompt enter to continue | ||
print("Stats:") | ||
print(json.dumps(match_stat, indent=4)) | ||
input("Press Enter to confirm...") | ||
|
||
# Play matches | ||
for match in tqdm(matches): | ||
play_a_match_func(match, output_file=output_file) |