Merge branch 'main' into fix-aggregate-inplace

Showing 84 changed files with 1,885 additions and 579 deletions.

@@ -0,0 +1,69 @@
name: Build Docker Images Main Branch

on:
  push:
    branches:
      - 'main'

jobs:
  parameters:
    if: github.repository == 'adap/flower'
    name: Collect docker build parameters
    runs-on: ubuntu-22.04
    timeout-minutes: 10
    outputs:
      pip-version: ${{ steps.versions.outputs.pip-version }}
      setuptools-version: ${{ steps.versions.outputs.setuptools-version }}
      flwr-version-ref: ${{ steps.versions.outputs.flwr-version-ref }}
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1

      - uses: ./.github/actions/bootstrap
        id: bootstrap

      - id: versions
        run: |
          echo "pip-version=${{ steps.bootstrap.outputs.pip-version }}" >> "$GITHUB_OUTPUT"
          echo "setuptools-version=${{ steps.bootstrap.outputs.setuptools-version }}" >> "$GITHUB_OUTPUT"
          echo "flwr-version-ref=git+${{ github.server_url }}/${{ github.repository }}.git@${{ github.sha }}" >> "$GITHUB_OUTPUT"

  build-docker-base-images:
    name: Build base images
    if: github.repository == 'adap/flower'
    uses: ./.github/workflows/_docker-build.yml
    needs: parameters
    with:
      namespace-repository: flwr/base
      file-dir: src/docker/base/ubuntu
      build-args: |
        PIP_VERSION=${{ needs.parameters.outputs.pip-version }}
        SETUPTOOLS_VERSION=${{ needs.parameters.outputs.setuptools-version }}
        FLWR_VERSION_REF=${{ needs.parameters.outputs.flwr-version-ref }}
      tags: unstable
    secrets:
      dockerhub-user: ${{ secrets.DOCKERHUB_USERNAME }}
      dockerhub-token: ${{ secrets.DOCKERHUB_TOKEN }}

  build-docker-binary-images:
    name: Build binary images
    if: github.repository == 'adap/flower'
    uses: ./.github/workflows/_docker-build.yml
    needs: build-docker-base-images
    strategy:
      fail-fast: false
      matrix:
        images: [
          { repository: "flwr/superlink", file_dir: "src/docker/superlink" },
          { repository: "flwr/supernode", file_dir: "src/docker/supernode" },
          { repository: "flwr/serverapp", file_dir: "src/docker/serverapp" },
          { repository: "flwr/superexec", file_dir: "src/docker/superexec" },
          { repository: "flwr/clientapp", file_dir: "src/docker/clientapp" }
        ]
    with:
      namespace-repository: ${{ matrix.images.repository }}
      file-dir: ${{ matrix.images.file_dir }}
      build-args: BASE_IMAGE=unstable
      tags: unstable
    secrets:
      dockerhub-user: ${{ secrets.DOCKERHUB_USERNAME }}
      dockerhub-token: ${{ secrets.DOCKERHUB_TOKEN }}
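
Assuming the reusable `_docker-build.yml` workflow pushes these images to Docker Hub (the Docker Hub credentials passed as secrets suggest so), the `unstable` tags produced by a successful run of this workflow could be spot-checked locally with something like:

```shell
# Hedged sketch: pull the unstable tags this workflow is expected to publish.
docker pull flwr/base:unstable
docker pull flwr/superlink:unstable
docker pull flwr/supernode:unstable
docker pull flwr/serverapp:unstable
docker pull flwr/superexec:unstable
docker pull flwr/clientapp:unstable
```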

@@ -0,0 +1,46 @@
# FlowerTune LLM Evaluation

This directory provides various evaluation metrics to assess the quality of your fine-tuned LLMs.
If you are participating in the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard), evaluating your fine-tuned LLM is the final step before your submission is added to the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate). The evaluation scores generated here will be displayed as the definitive values on the LLM Leaderboard.

## How to run

Navigate to the directory corresponding to your selected challenge (`general NLP`, `finance`, `medical`, or `code`) and follow the instructions there to execute the evaluation; a minimal example is sketched below.

> [!NOTE]
> If you wish to participate in the LLM Leaderboard, you must not modify the evaluation code and should use the exact command provided in the respective directory to run the evaluation.
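
For example, for the `code` challenge (a sketch only; the `code` directory name is confirmed by the instructions below, while the other challenge directories are assumed to sit alongside it):

```shell
# From a clone of the flower repository:
cd benchmarks/flowertune-llm/evaluation/code
```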

## Baseline results

The default template generated by `flwr new` (see the [Project Creation Instructions](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm#create-a-new-project)) for each challenge will produce the following results, which serve as the lower bound on the LLM Leaderboard.

### General NLP

|          | MT-1 | MT-2 | MT-Avg |
|:--------:|:----:|:----:|:------:|
| MT Score | 5.54 | 5.52 |  5.53  |

### Finance

|         |  FPB  | FIQA  | TFNS  |  Avg  |
|:-------:|:-----:|:-----:|:-----:|:-----:|
| Acc (%) | 44.55 | 62.50 | 28.77 | 45.27 |

### Medical

|         | PubMedQA | MedMCQA | MedQA |  Avg  |
|:-------:|:--------:|:-------:|:-----:|:-----:|
| Acc (%) |  59.00   |  23.69  | 27.10 | 36.60 |

### Code

|            | MBPP  | HumanEval | MultiPL-E (JS) | MultiPL-E (C++) |  Avg  |
|:----------:|:-----:|:---------:|:--------------:|:---------------:|:-----:|
| Pass@1 (%) | 31.60 |   23.78   |     28.57      |      25.47      | 27.36 |
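
The `Avg` and `MT-Avg` columns are consistent with a plain (unweighted) mean of the per-dataset scores in each row; for example, for the Finance row:

```shell
# Unweighted mean of the three Finance accuracies (FPB, FIQA, TFNS):
python3 -c "print((44.55 + 62.50 + 28.77) / 3)"   # prints 45.2733..., i.e. ~45.27
```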

## Make submission on FlowerTune LLM Leaderboard

If your LLM outperforms the listed benchmarks in any challenge, we encourage you to submit your code and model to the FlowerTune LLM Leaderboard (see the [How-to-participate Instructions](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate)).

@@ -0,0 +1,70 @@
# Evaluation for Code challenge

We leverage the code generation evaluation metrics provided by [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main) to evaluate our fine-tuned LLMs.
Three datasets have been selected for this evaluation: [MBPP](https://huggingface.co/datasets/google-research-datasets/mbpp) (Python), [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) (Python), and [MultiPL-E](https://github.com/nuprl/MultiPL-E) (JavaScript, C++).

> [!WARNING]
> The evaluation process requires roughly 30 GB of VRAM. On a 40 GB A100, it takes 15-30 minutes to complete, depending on the dataset.

## Environment Setup

```shell
git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/code ./flowertune-eval-code && rm -rf flower && cd flowertune-eval-code
```

Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:

```shell
# From a new Python environment, run:
pip install -r requirements.txt

# Log in to your Hugging Face account
huggingface-cli login
```

After that, install `Node.js` and `g++` for the evaluation of JavaScript and C++:

```shell
# Install nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

# Restart your terminal

# Download and install Node.js (you may need to restart the terminal)
nvm install 20

# Install g++
sudo apt-get install g++
```
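
Before running the JavaScript and C++ tasks, a quick sanity check confirms both toolchains are available:

```shell
# Both commands should print a version string if the installation succeeded.
node --version
g++ --version
```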

Then, download the `main.py` script from the `bigcode-evaluation-harness` repository:

```shell
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git && cd bigcode-evaluation-harness && git checkout 0f3e95f0806e78a4f432056cdb1be93604a51d69 && mv main.py ../ && cd .. && rm -rf bigcode-evaluation-harness
```

## Generate model answers & calculate pass@1 score

> [!NOTE]
> The evaluation needs to be run on MBPP, HumanEval, MultiPL-E (JS), and MultiPL-E (C++).

```bash
# --peft_model:             e.g., ./peft_1
# --max_length_generation:  change to 2048 when running mbpp
# --tasks:                  chosen from [mbpp, humaneval, multiple-js, multiple-cpp]
# --metric_output_path:     change the dataset name based on your choice
python main.py \
    --model=mistralai/Mistral-7B-v0.3 \
    --peft_model=/path/to/fine-tuned-peft-model-dir/ \
    --max_length_generation=1024 \
    --batch_size=4 \
    --use_auth_token \
    --allow_code_execution \
    --save_generations \
    --save_references \
    --tasks=humaneval \
    --metric_output_path=./evaluation_results_humaneval.json
```
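
Since all four tasks are required (see the note above), a convenience sketch such as the following can run them back to back; it only re-invokes the command above with the per-task settings already described (2048-token generation for MBPP, 1024 otherwise) and keeps the same placeholder PEFT path, so adjust that to your own directory.

```shell
# Sketch: run all four required tasks in sequence.
# Point --peft_model at your fine-tuned adapter directory before running.
for task in mbpp humaneval multiple-js multiple-cpp; do
    if [ "$task" = "mbpp" ]; then max_len=2048; else max_len=1024; fi
    python main.py \
        --model=mistralai/Mistral-7B-v0.3 \
        --peft_model=/path/to/fine-tuned-peft-model-dir/ \
        --max_length_generation="$max_len" \
        --batch_size=4 \
        --use_auth_token \
        --allow_code_execution \
        --save_generations \
        --save_references \
        --tasks="$task" \
        --metric_output_path="./evaluation_results_${task}.json"
done
```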

The model answers and pass@1 scores will be saved to `generations_{dataset_name}.json` and `evaluation_results_{dataset_name}.json`, respectively.

> [!NOTE]
> Please ensure that you provide all **four pass@1 scores** for the evaluation datasets when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
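
To collect those four numbers, one option is simply to pretty-print each result file produced above; the exact JSON layout is determined by `bigcode-evaluation-harness`, so read the pass@1 values off the output rather than assuming specific keys:

```shell
# Pretty-print every result file; the pass@1 value for each task is reported
# inside the corresponding JSON.
for task in mbpp humaneval multiple-js multiple-cpp; do
    echo "== ${task} =="
    python3 -m json.tool "evaluation_results_${task}.json"
done
```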