feat(benchmarks) Add LLM evaluation pipeline for Code challenge (#3801)
Co-authored-by: jafermarq <[email protected]>
yan-gao-GY and jafermarq authored Sep 10, 2024
1 parent c71a73c commit 801c0fc
Showing 2 changed files with 77 additions and 0 deletions.
70 changes: 70 additions & 0 deletions benchmarks/flowertune-llm/evaluation/code/README.md
@@ -0,0 +1,70 @@
# Evaluation for Code challenge

We leverage the code generation evaluation metrics provided by [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main) to evaluate our fine-tuned LLMs.
Three datasets have been selected for this evaluation: [MBPP](https://huggingface.co/datasets/google-research-datasets/mbpp) (Python), [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) (Python), and [MultiPL-E](https://github.com/nuprl/MultiPL-E) (JavaScript, C++).

> [!WARNING]
> The evaluation process requires ~30 GB of VRAM. On a 40 GB A100, it takes 15-30 minutes to complete, depending on the dataset.

## Environment Setup

```shell
git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/code ./flowertune-eval-code && rm -rf flower && cd flowertune-eval-code
```

Create a new Python environment (we recommend Python 3.10) and activate it.
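
One possible way to do this, using Python's built-in `venv` module, is sketched below; conda or any other environment manager works just as well, and the environment name is arbitrary:

```shell
# Example only: create and activate a fresh Python 3.10 environment
python3.10 -m venv .venv
source .venv/bin/activate
```

Then install the dependencies and log in to your Hugging Face account: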

```shell
# From the new Python environment, run:
pip install -r requirements.txt

# Log in to your Hugging Face account
huggingface-cli login
```
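
If you want to double-check the setup at this point, the following optional commands confirm that the Hugging Face login succeeded and show how much GPU memory is available for the ~30 GB VRAM requirement mentioned above:

```shell
huggingface-cli whoami                                   # prints your Hugging Face username if the login worked
nvidia-smi --query-gpu=name,memory.total --format=csv    # lists each GPU and its total memory
```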

After that, install `Node.js` and `g++`, which are needed for the JavaScript and C++ evaluations:

```shell
# Install nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

# Restart your terminal

# Download and install Node.js
nvm install 20

# Install g++
sudo apt-get install g++
```
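
You can optionally verify that both toolchains are available before moving on:

```shell
node --version   # should report a v20.x release after `nvm install 20`
g++ --version    # any reasonably recent g++ works for the C++ tasks
```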

Then, download the `main.py` script from the `bigcode-evaluation-harness` repository:

```shell
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git && cd bigcode-evaluation-harness && git checkout 0f3e95f0806e78a4f432056cdb1be93604a51d69 && mv main.py ../ && cd .. && rm -rf bigcode-evaluation-harness
```
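
As an optional sanity check (a sketch; the exact help text depends on the pinned harness commit), you can confirm that the script landed in the current directory and that its dependencies import cleanly:

```shell
ls main.py                            # the script should now sit next to this README
python main.py --help | head -n 15    # should print the available options without errors
```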


## Generate model answers & calculate pass@1 score

> [!NOTE]
> Evaluation needs to be run on MBPP, HumanEval, MultiPL-E (JS) and MultiPL-E (C++).

```bash
# --peft_model: path to the fine-tuned PEFT model directory, e.g. ./peft_1
# --max_length_generation: change to 2048 when running mbpp
# --tasks: one of [mbpp, humaneval, multiple-js, multiple-cpp]
# --metric_output_path: change the dataset name to match the chosen task
python main.py \
    --model=mistralai/Mistral-7B-v0.3 \
    --peft_model=/path/to/fine-tuned-peft-model-dir/ \
    --max_length_generation=1024 \
    --batch_size=4 \
    --use_auth_token \
    --allow_code_execution \
    --save_generations \
    --save_references \
    --tasks=humaneval \
    --metric_output_path=./evaluation_results_humaneval.json
```
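
Since all four datasets must be evaluated, it can be convenient to loop over them. The sketch below simply repeats the command above for each task, bumps the generation length for `mbpp`, and keeps the PEFT model path as a placeholder to replace with your own:

```bash
# Run every required task in sequence (adjust --peft_model and --batch_size as needed)
for task in mbpp humaneval multiple-js multiple-cpp; do
    if [ "$task" = "mbpp" ]; then max_len=2048; else max_len=1024; fi
    python main.py \
        --model=mistralai/Mistral-7B-v0.3 \
        --peft_model=/path/to/fine-tuned-peft-model-dir/ \
        --max_length_generation=$max_len \
        --batch_size=4 \
        --use_auth_token \
        --allow_code_execution \
        --save_generations \
        --save_references \
        --tasks=$task \
        --metric_output_path=./evaluation_results_${task}.json
done
```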

The model answers and pass@1 scores will be saved to `generations_{dataset_name}.json` and `evaluation_results_{dataset_name}.json`, respectively.
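
Once all runs have finished, you can read the scores back from the result files. The snippet below assumes each `evaluation_results_{dataset_name}.json` keys its metrics by task name with a `pass@1` entry; double-check this against your own output before relying on it:

```shell
# Print the pass@1 score recorded for each dataset
for task in mbpp humaneval multiple-js multiple-cpp; do
    python - "$task" <<'PY'
import json, sys

task = sys.argv[1]
with open(f"./evaluation_results_{task}.json") as f:
    results = json.load(f)
# Assumed result layout: {"<task>": {"pass@1": <score>}, ...}
print(task, results[task]["pass@1"])
PY
done
```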


> [!NOTE]
> Please ensure that you provide all **four pass@1 scores** for the evaluation datasets when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
7 changes: 7 additions & 0 deletions benchmarks/flowertune-llm/evaluation/code/requirements.txt
@@ -0,0 +1,7 @@
peft==0.6.2
datasets==2.20.0
evaluate==0.3.0
sentencepiece==0.2.0
protobuf==5.27.1
bitsandbytes==0.43.1
git+https://github.com/bigcode-project/bigcode-evaluation-harness.git@0f3e95f0806e78a4f432056cdb1be93604a51d69
