Commit `801c0fc`: feat(benchmarks) Add LLM evaluation pipeline for Code challenge (#3801)
Co-authored-by: jafermarq <[email protected]>
# Evaluation for Code challenge

We leverage the code generation evaluation metrics provided by [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main) to evaluate our fine-tuned LLMs.
Three datasets have been selected for this evaluation: [MBPP](https://huggingface.co/datasets/google-research-datasets/mbpp) (Python), [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) (Python), and [MultiPL-E](https://github.com/nuprl/MultiPL-E) (JavaScript, C++).

> [!WARNING]
> The evaluation process requires ~30 GB of VRAM. On a 40 GB A100 it takes 15-30 minutes to complete, depending on the dataset.

## Environment Setup

```shell
git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/code ./flowertune-eval-code && rm -rf flower && cd flowertune-eval-code
```

Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:

```shell
# From a new python environment, run:
pip install -r requirements.txt

# Log in to your Hugging Face account
huggingface-cli login
```
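The environment-creation step above can be sketched with the standard `venv` module (conda works just as well; the `flowertune-eval` directory name here is arbitrary):

```shell
# Create a fresh virtual environment; substitute python3.10 if your
# default python3 is a different version (Python 3.10 is recommended)
python3 -m venv flowertune-eval

# Activate it, then confirm which interpreter is in use
source flowertune-eval/bin/activate
python --version
```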
|
||
After that, install `Node.js` and `g++` for the evaluation of JavaScript and C++:

```shell
# Install nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

# Restart your terminal

# Download and install Node.js (you may need to restart the terminal)
nvm install 20

# Install g++
sudo apt-get install g++
```
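Before launching the evaluation, it can be worth sanity-checking that both toolchains are actually on `PATH`. A small defensive sketch (the exact version strings will vary with your setup):

```shell
# Report the installed version of each required toolchain,
# without aborting if one of them is missing
for tool in node g++; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $("$tool" --version | head -n 1)"
    else
        echo "$tool: not found - revisit the install steps above"
    fi
done
```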
|
||
Then, download the `main.py` script from the `bigcode-evaluation-harness` repository:

```shell
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git && cd bigcode-evaluation-harness && git checkout 0f3e95f0806e78a4f432056cdb1be93604a51d69 && mv main.py ../ && cd .. && rm -rf bigcode-evaluation-harness
```
|
||

## Generate model answers & calculate pass@1 score

> [!NOTE]
> Evaluation needs to be run on MBPP, HumanEval, MultiPL-E (JS) and MultiPL-E (C++).

```bash
# Notes on the arguments below:
#   --peft_model:            path to your fine-tuned PEFT model, e.g., ./peft_1
#   --max_length_generation: change to 2048 when running mbpp
#   --tasks:                 chosen from [mbpp, humaneval, multiple-js, multiple-cpp]
#   --metric_output_path:    change dataset name based on your choice
python main.py \
--model=mistralai/Mistral-7B-v0.3 \
--peft_model=/path/to/fine-tuned-peft-model-dir/ \
--max_length_generation=1024 \
--batch_size=4 \
--use_auth_token \
--allow_code_execution \
--save_generations \
--save_references \
--tasks=humaneval \
--metric_output_path=./evaluation_results_humaneval.json
```

The model answers and pass@1 scores will be saved to `generations_{dataset_name}.json` and `evaluation_results_{dataset_name}.json`, respectively.

> [!NOTE]
> Please ensure that you provide all **four pass@1 scores** for the evaluation datasets when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
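Once all four runs have finished, the scores can be gathered into one summary. The sketch below is illustrative and not part of the benchmark tooling; it assumes the harness's usual output layout, where each `evaluation_results_{task}.json` keys the metrics by task name (e.g., `{"humaneval": {"pass@1": 0.31}, "config": {...}}`):

```python
import json
from pathlib import Path

# The four datasets whose pass@1 scores the leaderboard submission expects.
TASKS = ["mbpp", "humaneval", "multiple-js", "multiple-cpp"]

def collect_pass_at_1(results_dir: str = ".") -> dict:
    """Read each evaluation_results_{task}.json and pull out its pass@1 score.

    Assumes the harness's usual layout: one top-level key per task that
    contains a "pass@1" entry. Tasks without a results file map to None.
    """
    scores = {}
    for task in TASKS:
        path = Path(results_dir) / f"evaluation_results_{task}.json"
        if not path.exists():
            scores[task] = None  # evaluation not run yet
            continue
        data = json.loads(path.read_text())
        scores[task] = data.get(task, {}).get("pass@1")
    return scores

if __name__ == "__main__":
    for task, score in collect_pass_at_1().items():
        print(f"{task}: {score if score is not None else 'missing'}")
```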

The evaluation environment pins the following dependencies (`requirements.txt`):

```text
peft==0.6.2
datasets==2.20.0
evaluate==0.3.0
sentencepiece==0.2.0
protobuf==5.27.1
bitsandbytes==0.43.1
git+https://github.com/bigcode-project/bigcode-evaluation-harness.git@0f3e95f0806e78a4f432056cdb1be93604a51d69
```