Merge branch 'main' into add-ci-e2e-test-containers
chongshenng committed Jun 25, 2024
2 parents 5d30d3c + f4ce64c commit 28a9470
Showing 181 changed files with 11,920 additions and 7,718 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/datasets.yml
@@ -16,6 +16,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_id || github.event.pull_request.number || github.ref }}
cancel-in-progress: true

env:
FLWR_TELEMETRY_ENABLED: 0

defaults:
run:
working-directory: datasets
3 changes: 3 additions & 0 deletions .github/workflows/e2e.yml
@@ -164,6 +164,9 @@ jobs:
- name: Run driver test with client authentication
if: ${{ matrix.directory == 'bare-client-auth' }}
run: ./../test_driver.sh bare client-auth
- name: Run reconnection test with SQLite database
if: ${{ matrix.directory == 'bare' }}
run: ./../test_reconnection.sh sqlite
- name: Cache save Python location
id: cache-save-python
uses: actions/cache/save@v4
2 changes: 2 additions & 0 deletions .github/workflows/framework.yml
@@ -31,6 +31,8 @@ jobs:

steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Bootstrap
uses: ./.github/actions/bootstrap
with:
1 change: 1 addition & 0 deletions README.md
@@ -153,6 +153,7 @@ Other [examples](https://github.com/adap/flower/tree/main/examples):
- [Flower with KaplanMeierFitter from the lifelines library](https://github.com/adap/flower/tree/main/examples/federated-kaplan-meier-fitter)
- [Sample Level Privacy with Opacus](https://github.com/adap/flower/tree/main/examples/opacus)
- [Sample Level Privacy with TensorFlow-Privacy](https://github.com/adap/flower/tree/main/examples/tensorflow-privacy)
- [Flower with a Tabular Dataset](https://github.com/adap/flower/tree/main/examples/fl-tabular)

## Community

61 changes: 61 additions & 0 deletions benchmarks/flowertune-llm/README.md
@@ -0,0 +1,61 @@
![](_static/flower_llm.jpg)

# FlowerTune LLM Leaderboard

This repository guides you through the process of federated LLM instruction tuning with a
pre-trained [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.3) model across four domains: general NLP, finance, medical, and code.

Please follow the instructions to run and evaluate the federated LLMs.

## Create a new project

As the first step, please register a Flower account on the [Flower website](https://flower.ai/login) and make sure the `flwr` package is installed on your system (see the [installation guide](https://flower.ai/docs/framework/how-to-install-flower.html)).
We provide a single-line command to create a new project directory based on your selected challenge:

```shell
flwr new --framework=flwrtune --username=your_flower_account
```

You will then be prompted to enter your project name and to choose one of the LLM challenges (general NLP, finance, medical, or code).
Type your project name, select your preferred challenge, and a new project directory will be generated automatically.

### Structure

After running `flwr new`, you will see a new directory generated with the following structure:

```bash
<project-name>
├── README.md # <- Instructions
├── pyproject.toml # <- Environment dependencies
└── <project_name>
├── app.py # <- Flower ClientApp/ServerApp build
├── client.py # <- Flower client constructor
├── server.py # <- Server-related functions
├── models.py # <- Model build
├── dataset.py # <- Dataset and tokenizer build
├── conf/config.yaml # <- User configuration
└── conf/static_config.yaml # <- Static configuration
```

This can serve as the starting point for building your own federated LLM fine-tuning methods.
Please note that any modification to the content of `conf/static_config.yaml` is strictly prohibited for those who wish to participate in the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard); otherwise, the submission will not be considered.

## Run FlowerTune LLM challenges

With the new project directory created, you can run a baseline challenge as follows:

1. Navigate inside the directory that you just created.
2. Follow the `Environments setup` section of `README.md` in the project directory to install project dependencies.
3. Run the challenge as indicated in the `Running the challenge` section in the `README.md`.

## Evaluate pre-trained LLMs

After the LLM fine-tuning has finished, evaluate the performance of your fine-tuned LLMs
by following the `README.md` in the `evaluation` directory.
Binary file added benchmarks/flowertune-llm/_static/flower_llm.jpg
72 changes: 36 additions & 36 deletions datasets/README.md
@@ -7,6 +7,21 @@
[![Slack](https://img.shields.io/badge/Chat-Slack-red)](https://flower.ai/join-slack)

Flower Datasets (`flwr-datasets`) is a library to quickly and easily create datasets for federated learning, federated evaluation, and federated analytics. It was created by the `Flower Labs` team that also created Flower: A Friendly Federated Learning Framework.


> [!TIP]
> For complete documentation, including API docs, how-to guides, and tutorials, please visit the [Flower Datasets Documentation](https://flower.ai/docs/datasets/); for full FL examples, see the [Flower Examples page](https://github.com/adap/flower/tree/main/examples).

## Installation

For a complete installation guide, visit the [Flower Datasets Documentation](https://flower.ai/docs/datasets/).

```bash
pip install flwr-datasets[vision]
```

## Overview

Flower Datasets library supports:
* **downloading datasets** - choose the dataset from Hugging Face's `datasets`,
* **partitioning datasets** - customize the partitioning scheme,
@@ -21,43 +21,36 @@ Thanks to using Hugging Face's `datasets` used under the hood, Flower Datasets i
* Jax,
* Arrow.

Create **custom partitioning schemes** or choose from the **implemented partitioning schemes**:
Create **custom partitioning schemes** or choose from the **implemented [partitioning schemes](https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.html#module-flwr_datasets.partitioner)**:

* Partitioner (the abstract base class) `Partitioner`
* IID partitioning `IidPartitioner(num_partitions)`
* Natural ID partitioner `NaturalIdPartitioner`
* Dirichlet partitioning `DirichletPartitioner(num_partitions, partition_by, alpha)`
* InnerDirichlet partitioning `InnerDirichletPartitioner(partition_sizes, partition_by, alpha)`
* Natural ID partitioner `NaturalIdPartitioner(partition_by)`
* Size partitioner (the abstract base class for partitioners that dictate the division based on the number of samples) `SizePartitioner`
* Linear partitioner `LinearPartitioner`
* Square partitioner `SquarePartitioner`
* Exponential partitioner `ExponentialPartitioner`
* more to come in future releases.

# Installation

## With pip
* Linear partitioner `LinearPartitioner(num_partitions)`
* Square partitioner `SquarePartitioner(num_partitions)`
* Exponential partitioner `ExponentialPartitioner(num_partitions)`
* more to come in future releases (contributions are welcome).
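For intuition only — this is a toy sketch, not the library's `DirichletPartitioner` — label-skew Dirichlet partitioning can be illustrated with the standard library alone, building Dirichlet draws from normalized Gamma samples; `alpha` controls how uneven the label mix per partition is:

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_partitions, alpha, seed=42):
    """Toy sketch of Dirichlet label-skew partitioning (illustrative only)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    partitions = [[] for _ in range(num_partitions)]
    for indices in by_label.values():
        # Dirichlet(alpha) proportions via normalized Gamma(alpha, 1) draws.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_partitions)]
        total = sum(draws)
        proportions = [d / total for d in draws]
        start = 0
        for pid, p in enumerate(proportions):
            count = round(p * len(indices))
            partitions[pid].extend(indices[start:start + count])
            start += count
    return partitions


# 1000 examples with 10 balanced labels, skewed across 4 partitions.
parts = dirichlet_partition([i % 10 for i in range(1000)], num_partitions=4, alpha=0.5)
```

Smaller `alpha` concentrates each label in fewer partitions; larger `alpha` approaches an IID split.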
<p align="center">
<img src="./doc/source/_static/readme/comparison_of_partitioning_schemes.png" alt="Comparison of partitioning schemes."/>
<br>
<em>Comparison of Partitioning Schemes on CIFAR10</em>
</p>

Flower Datasets can be installed from PyPi

```bash
pip install flwr-datasets
```

Install with an extension:

* for image datasets:

```bash
pip install flwr-datasets[vision]
```

* for audio datasets:

```bash
pip install flwr-datasets[audio]
```
PS: This plot was generated using a library function (see [flwr_datasets.visualization](https://flower.ai/docs/datasets/ref-api/flwr_datasets.visualization.html) package for more).

If you plan to convert the dataset to the type used by your ML framework, make sure to have that framework installed too.

# Usage
## Usage

Flower Datasets exposes the `FederatedDataset` abstraction to represent the dataset needed for federated learning/evaluation/analytics. It has two powerful methods that let you handle the dataset preprocessing: `load_partition(partition_id, split)` and `load_split(split)`.

@@ -67,16 +67,16 @@ Here's a basic quickstart example of how to partition the MNIST dataset:
from flwr_datasets import FederatedDataset
# The train split of the MNIST dataset will be partitioned into 100 partitions
mnist_fds = FederatedDataset("mnist", partitioners={"train": 100}
fds = FederatedDataset("mnist", partitioners={"train": 100})
mnist_partition_0 = mnist_fds.load_partition(0, "train")
partition = fds.load_partition(0)
centralized_data = mnist_fds.load_split("test")
centralized_data = fds.load_split("test")
```
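To make the indexing behind such a partitioner concrete, here is a rough, self-contained sketch of a contiguous IID-style split (an illustration under that assumption, not the actual code of the library's `IidPartitioner`):

```python
def iid_partition_indices(num_examples, num_partitions, partition_id):
    """Contiguous IID-style split: partition sizes differ by at most one example.

    Illustrative sketch only, not the flwr-datasets implementation.
    """
    base, remainder = divmod(num_examples, num_partitions)
    # The first `remainder` partitions each receive one extra example.
    start = partition_id * base + min(partition_id, remainder)
    size = base + (1 if partition_id < remainder else 0)
    return range(start, start + size)


# MNIST's train split has 60,000 examples; partition 0 of 100 gets 600 of them.
indices = iid_partition_indices(60_000, 100, 0)
```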

For more details, please refer to the specific how-to guides or tutorials. They showcase customization and more advanced features.

# Future release
## Future release

Here are a few of the things that we will work on in future releases:

@@ -85,6 +85,6 @@ Here are a few of the things that we will work on in future releases:
* ✅ More out-of-the-box `Partitioner`s.
* ✅ Passing `Partitioner`s via `FederatedDataset`'s `partitioners` argument.
* ✅ Customization of the dataset splitting before the partitioning.
* Simplification of the dataset transformation to the popular frameworks/types.
* Simplification of the dataset transformation to the popular frameworks/types.
* Creation of synthetic data.
* Support for Vertical FL.
11 changes: 9 additions & 2 deletions datasets/dev/format.sh
@@ -3,9 +3,16 @@ set -e
cd "$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"/../

# Python
echo "Formatting started"
echo "Formatting started: Python"
python -m isort flwr_datasets/
python -m black -q flwr_datasets/
python -m docformatter -i -r flwr_datasets/
python -m ruff check --fix flwr_datasets/
echo "Formatting done"
echo "Formatting done: Python"

# Notebooks
echo "Formatting started: Notebooks"
python -m black --ipynb -q doc/source/*.ipynb
KEYS="metadata.celltoolbar metadata.language_info metadata.toc metadata.notify_time metadata.varInspector metadata.accelerator metadata.vscode cell.metadata.id cell.metadata.heading_collapsed cell.metadata.hidden cell.metadata.code_folding cell.metadata.tags cell.metadata.init_cell cell.metadata.vscode cell.metadata.pycharm"
python -m nbstripout --keep-output doc/source/*.ipynb --extra-keys "$KEYS"
echo "Formatting done: Notebooks"
2 changes: 1 addition & 1 deletion datasets/doc/source/conf.py
@@ -38,7 +38,7 @@
author = "The Flower Authors"

# The full version, including alpha/beta/rc tags
release = "0.1.0"
release = "0.2.0"


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion datasets/doc/source/how-to-install-flwr-datasets.rst
@@ -42,5 +42,5 @@ If everything worked, it should print the version of Flower Datasets to the comm

.. code-block:: none
0.0.1
0.2.0
1,122 changes: 1,122 additions & 0 deletions datasets/doc/source/how-to-visualize-label-distribution.ipynb

Large diffs are not rendered by default.

56 changes: 34 additions & 22 deletions datasets/doc/source/index.rst
@@ -7,6 +7,15 @@ learning/analytics/evaluation. It is created by the ``Flower Labs`` team that al
Flower Datasets Framework
-------------------------

Install
~~~~~~~

.. code-block:: bash

  python -m pip install "flwr-datasets[vision]"

Check out all the details on how to install Flower Datasets in :doc:`how-to-install-flwr-datasets`.

Tutorials
~~~~~~~~~

@@ -32,6 +41,7 @@ Problem-oriented how-to guides show step-by-step how to achieve a specific goal.
how-to-use-with-tensorflow
how-to-use-with-numpy
how-to-use-with-local-data
how-to-visualize-label-distribution
how-to-disable-enable-progress-bar

References
@@ -47,15 +57,26 @@ Information-oriented API reference and other reference material.

flwr_datasets

.. toctree::
:maxdepth: 1
:caption: Reference docs

ref-telemetry

Main features
-------------
Flower Datasets library supports:

- **downloading datasets** - choose the dataset from Hugging Face's ``dataset``
- **partitioning datasets** - customize the partitioning scheme
- **downloading datasets** - choose the dataset from Hugging Face's ``dataset`` (`link <https://huggingface.co/datasets>`_)
- **partitioning datasets** - choose one of the implemented partitioning schemes or create your own.
- **creating centralized datasets** - leave parts of the dataset unpartitioned (e.g. for centralized evaluation)
- **visualization of the partitioned datasets** - visualize the label distribution of the partitioned dataset (and compare the results on different parameters of the same partitioning schemes, different datasets, different partitioning schemes, or any mix of them)
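The label-distribution feature can be illustrated framework-free; this toy sketch (not the ``flwr_datasets.visualization`` implementation, which also renders plots) just counts labels per partition:

```python
from collections import Counter


def label_distribution(partitions):
    """Count label occurrences per partition.

    Toy sketch of the quantity that the visualization package plots.
    """
    return [Counter(labels) for labels in partitions]


# Two tiny partitions with skewed label mixes.
dist = label_distribution([[0, 0, 1], [1, 1, 2]])
# dist[0] == Counter({0: 2, 1: 1})
```

Comparing such per-partition counts side by side is exactly what the partitioning-scheme comparison figure below visualizes.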


.. image:: ./_static/readme/comparison_of_partitioning_schemes.png
:align: center
:alt: Comparison of Partitioning Schemes on CIFAR10


Because Hugging Face's ``datasets`` is used under the hood, Flower Datasets integrates with the following popular formats/frameworks:

@@ -67,28 +88,19 @@ Thanks to using Hugging Face's ``datasets`` used under the hood, Flower Datasets
- Jax
- Arrow

Install
-------

The simplest install is

.. code-block:: bash
python -m pip install flwr-datasets
If you plan to use the image datasets

.. code-block:: bash
python -m pip install flwr-datasets[vision]
If you plan to use the audio datasets

.. code-block:: bash
Here are a few of the available ``Partitioner`` classes (for a full list, see the `partitioner API reference <ref-api/flwr_datasets.partitioner.html#module-flwr_datasets.partitioner>`_):

python -m pip install flwr-datasets[audio]
* Partitioner (the abstract base class) ``Partitioner``
* IID partitioning ``IidPartitioner(num_partitions)``
* Dirichlet partitioning ``DirichletPartitioner(num_partitions, partition_by, alpha)``
* InnerDirichlet partitioning ``InnerDirichletPartitioner(partition_sizes, partition_by, alpha)``
* Natural ID partitioner ``NaturalIdPartitioner(partition_by)``
* Size partitioner (the abstract base class for partitioners that dictate the division based on the number of samples) ``SizePartitioner``
* Linear partitioner ``LinearPartitioner(num_partitions)``
* Square partitioner ``SquarePartitioner(num_partitions)``
* Exponential partitioner ``ExponentialPartitioner(num_partitions)``
* more to come in future releases (contributions are welcome).

Check out the full installation details in :doc:`how-to-install-flwr-datasets`.

How To Use the library
----------------------