diff --git a/datasets/doc/source/conf.py b/datasets/doc/source/conf.py index 66a506d68ecc..4290b46a2948 100644 --- a/datasets/doc/source/conf.py +++ b/datasets/doc/source/conf.py @@ -110,9 +110,9 @@ def find_test_modules(package_path): # Sphinx redirects, implemented after the doc filename changes. # To prevent 404 errors and redirect to the new pages. -# redirects = { -# } - +redirects = { + "how-to-visualize-label-distribution.html": "tutorial-visualize-label-distribution.html", +} # -- Options for HTML output ------------------------------------------------- diff --git a/datasets/doc/source/index.rst b/datasets/doc/source/index.rst index 6d707de7dd42..79fdf97af67e 100644 --- a/datasets/doc/source/index.rst +++ b/datasets/doc/source/index.rst @@ -26,6 +26,8 @@ A learning-oriented series of tutorials is the best place to start. :caption: Tutorial tutorial-quickstart + tutorial-use-partitioners + tutorial-visualize-label-distribution How-to guides ~~~~~~~~~~~~~ @@ -41,7 +43,6 @@ Problem-oriented how-to guides show step-by-step how to achieve a specific goal. how-to-use-with-tensorflow how-to-use-with-numpy how-to-use-with-local-data - how-to-visualize-label-distribution how-to-disable-enable-progress-bar References diff --git a/datasets/doc/source/tutorial-quickstart.ipynb b/datasets/doc/source/tutorial-quickstart.ipynb index d8bc49102a7a..d0f37ed311dd 100644 --- a/datasets/doc/source/tutorial-quickstart.ipynb +++ b/datasets/doc/source/tutorial-quickstart.ipynb @@ -15,7 +15,7 @@ "id": "e0f34a29f74b13cb", "metadata": {}, "source": [ - "# Install Flower Datasets" + "## Install Flower Datasets" ] }, { @@ -45,7 +45,7 @@ "id": "499dd2f0d23d871e", "metadata": {}, "source": [ - "# Choose the dataset\n", + "## Choose the dataset\n", "\n", "To choose the dataset, go to Hugging Face [Datasets Hub](https://huggingface.co/datasets) and search for your dataset by name. You will pass that names to the `dataset` parameter of `FederatedDataset`. Note that the name is case-sensitive.\n", "\n", @@ -79,7 +79,7 @@ "id": "e0c146753048fb2a", "metadata": {}, "source": [ - "# Partition the dataset\n", + "## Partition the dataset\n", "\n", "To partition a dataset (in a basic scenario), you need to choose two things:\n", "1) A dataset (identified by a name),\n", @@ -99,7 +99,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "a759c5b6f25c9dd4", "metadata": {}, "outputs": [], @@ -131,7 +131,7 @@ "id": "efa7dbb120505f1f", "metadata": {}, "source": [ - "# Investigate the partition" + "## Investigate the partition" ] }, { @@ -139,7 +139,7 @@ "id": "bf986a1a9f0284cd", "metadata": {}, "source": [ - "## Features\n", + "### Features\n", "\n", "Now we will determine the names of the features of your dataset (you can alternatively do that directly on the Hugging Face\n", "website). The names can vary along different datasets e.g. \"img\" or \"image\", \"label\" or \"labels\". Additionally, if the label column is of [ClassLabel](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.ClassLabel) type, we will also see the names of labels." @@ -164,7 +164,7 @@ } ], "source": [ - "# Note this dataset has \n", + "# Note this dataset has\n", "partition.features" ] }, @@ -173,7 +173,7 @@ "id": "2e69ed05193a098a", "metadata": {}, "source": [ - "## Indexing\n", + "### Indexing\n", "\n", "To see the first sample of the partition, we can index it like a Python list." ] @@ -388,7 +388,7 @@ "id": "b5e683cfaddf92f", "metadata": {}, "source": [ - "# Use with PyTorch/NumPy/TensorFlow\n", + "## Use with PyTorch/NumPy/TensorFlow\n", "\n", "For more detailed instructions, go to:\n", "* [how-to-use-with-pytorch](https://flower.ai/docs/datasets/how-to-use-with-pytorch.html)\n", @@ -401,7 +401,7 @@ "id": "de14f09f0ee4f6ac", "metadata": {}, "source": [ - "## PyTorch\n", + "### PyTorch\n", "\n", "Transform the `Dataset` into the `DataLoader`, use the `PyTorch transforms` (`Compose` and all the others are possible)." ] @@ -444,7 +444,7 @@ "id": "71531613", "metadata": {}, "source": [ - "## NumPy\n", + "### NumPy\n", "\n", "NumPy can be used as input to the TensorFlow and scikit-learn models. The transformation is very simple." ] @@ -465,7 +465,7 @@ "id": "e4867834", "metadata": {}, "source": [ - "## TensorFlow Dataset\n", + "### TensorFlow Dataset\n", "\n", "Transformation to TensorFlow Dataset is a one-liner." ] @@ -497,32 +497,23 @@ "id": "61fd797c", "metadata": {}, "source": [ - "# Final remarks" - ] - }, - { - "cell_type": "markdown", - "id": "91ad1252", - "metadata": {}, - "source": [ + "## Final remarks\n", + "\n", "Congratulations, you now know the basics of Flower Datasets and are ready to perform basic dataset preparation for Federated Learning." ] }, { "cell_type": "markdown", - "id": "ade71d23", - "metadata": {}, - "source": [ - "# Next Steps" - ] - }, - { - "cell_type": "markdown", - "id": "f54d8031", + "id": "cbdfe1b5", "metadata": {}, "source": [ + "## Next \n", + "\n", "This is the first quickstart tutorial from the Flower Datasets series. See other tutorials:\n", - "* [Visualize Label Distribution](https://flower.ai/docs/datasets/how-to-visualize-label-distribution.html)" + "\n", + "* [Use Partitioners](https://flower.ai/docs/datasets/tutorial-use-partitioners.html)\n", + "\n", + "* [Visualize Label Distribution](https://flower.ai/docs/datasets/tutorial-visualize-label-distribution.html)" ] } ], @@ -531,18 +522,6 @@ "display_name": "flwr", "language": "python", "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.12" } }, "nbformat": 4, diff --git a/datasets/doc/source/how-to-use-partitioners.ipynb b/datasets/doc/source/tutorial-use-partitioners.ipynb similarity index 88% rename from datasets/doc/source/how-to-use-partitioners.ipynb rename to datasets/doc/source/tutorial-use-partitioners.ipynb index 4621fdee15ea..72fabda1504e 100644 --- a/datasets/doc/source/how-to-use-partitioners.ipynb +++ b/datasets/doc/source/tutorial-use-partitioners.ipynb @@ -4,16 +4,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# How to use `Partitioner`s\n", + "# Use Partitioners\n", "\n", - "The aim of this tutorial is to make you familiar with the available `Partitioner`s that `Flower Datasets` have out-of-the-box." + "Understand `Partitioner`s interactions with `FederatedDataset`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Install" + "## Install" ] }, { @@ -29,24 +29,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# What is `Partitioner`?\n", + "## What is `Partitioner`?\n", "\n", - "`Partitioner` is an object responsible for dividing a dataset according to a chosen strategy. There are many `Partitioner`s that you can use (see the full list [here](https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.html)) and all of them inherit from the `Partitioner` object which is an abstract class providing basic structure and methods that need to be implemented for any new `Partitioner` to integrate with the rest of `Flower Datasets` code. The creation of different `Partitioner` differs, but the behavior is the same = they produce the same type of objects.\n", - "\n" + "`Partitioner` is an object responsible for dividing a dataset according to a chosen strategy. There are many `Partitioner`s that you can use (see the full list [here](https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.html)) and all of them inherit from the `Partitioner` object which is an abstract class providing basic structure and methods that need to be implemented for any new `Partitioner` to integrate with the rest of `Flower Datasets` code. The creation of different `Partitioner` differs, but the behavior is the same = they produce the same type of objects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## `IidPartitioner` Creation\n", + "### `IidPartitioner` Creation\n", "\n", "Let's create (instantiate) the most basic partitioner, [`IidPartitioner`](https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.IidPartitioner.html#flwr_datasets.partitioner.IidPartitioner) and learn how it interacts with `FederatedDataset`." ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -73,7 +72,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## How do you specify the split to partition?\n", + "### How do you specify the split to partition?\n", "\n", "The specification of the split happens as you specify the `partitioners` argument for `FederatedDataset`. It maps `partition_id: str` to the partitioner that will be used for that split of the data. In the example below we're using the `train` split of the `cifar10` dataset to partition.\n", "\n", @@ -82,7 +81,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -94,7 +93,7 @@ "})" ] }, - "execution_count": 4, + "execution_count": null, "metadata": {}, "output_type": "execute_result" } @@ -112,7 +111,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -124,7 +123,7 @@ " 'label': [1, 2, 6]}" ] }, - "execution_count": 5, + "execution_count": null, "metadata": {}, "output_type": "execute_result" } @@ -138,7 +137,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Use Different `Partitioners`" + "### Use Different `Partitioners`" ] }, { @@ -164,14 +163,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Creating non-IID partitions: Use ``PathologicalPartitioner``\n", + "#### Creating non-IID partitions: Use ``PathologicalPartitioner``\n", "\n", "Now, we are going to create partitions that have only a subset of labels in each partition by using [`PathologicalPartitioner`](https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.PathologicalPartitioner.html#flwr_datasets.partitioner.PathologicalPartitioner). In this scenario we have the exact control about the number of unique labels on each partition. The smaller the number is the more heterogenous the division gets. Let's have a look at how it works with `num_classes_per_partition=2`." ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -183,7 +182,7 @@ "})" ] }, - "execution_count": 6, + "execution_count": null, "metadata": {}, "output_type": "execute_result" } @@ -209,7 +208,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -221,7 +220,7 @@ " 'label': [0, 0, 7]}" ] }, - "execution_count": 7, + "execution_count": null, "metadata": {}, "output_type": "execute_result" } @@ -233,7 +232,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -242,7 +241,7 @@ "array([0, 7])" ] }, - "execution_count": 8, + "execution_count": null, "metadata": {}, "output_type": "execute_result" } @@ -260,14 +259,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Creating non-IID partitions: Use ``DirichletPartitioner``\n", + "#### Creating non-IID partitions: Use ``DirichletPartitioner``\n", "\n", "With the [`DirichletParitioner`](https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.DirichletPartitioner.html#flwr_datasets.partitioner.DirichletPartitioner), the primary tool for controlling heterogeneity is the `alpha` parameter; the smaller the value gets, the more heterogeneous the federated datasets are. Instead of choosing the exact number of classes on each partition, here we sample the probability distribution from the Dirichlet distribution, which tells how the samples associated with each class will be divided." ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -279,7 +278,7 @@ "})" ] }, - "execution_count": 10, + "execution_count": null, "metadata": {}, "output_type": "execute_result" } @@ -289,7 +288,9 @@ "\n", "# Set the partitioner to create 10 partitions with alpha 0.1 (so fairly non-IID)\n", "# Partition using column \"label\" (a column in the huggingface representation of CIFAR-10)\n", - "dirichlet_partitioner = DirichletPartitioner(num_partitions=10, alpha=0.1, partition_by=\"label\")\n", + "dirichlet_partitioner = DirichletPartitioner(\n", + " num_partitions=10, alpha=0.1, partition_by=\"label\"\n", + ")\n", "\n", "# Create the federated dataset passing the partitioner\n", "fds = FederatedDataset(\n", @@ -303,7 +304,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -317,7 +318,7 @@ " 'label': [4, 4, 0, 1, 4]}" ] }, - "execution_count": 13, + "execution_count": null, "metadata": {}, "output_type": "execute_result" } @@ -331,22 +332,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Final remarks\n", - "Congratulations, you now know how to use different `Partitioner`s with `FederatedDataset` in Flower Datasets.\n", - "\n", - "# Next Steps\n", - "This is the second quickstart tutorial from the Flower Datasets series. See next tutorials:\n", - "\n", - "* [Visualize Label Distribution](https://flower.ai/docs/datasets/how-to-visualize-label-distribution.html)\n", - "\n", - "Previous tutorials:\n", - "* [Quickstart Basics](https://flower.ai/docs/datasets/quickstart-tutorial.html)" + "## Final remarks\n", + "Congratulations, you now know how to use different `Partitioner`s with `FederatedDataset` in Flower Datasets." ] }, { "cell_type": "markdown", "metadata": {}, - "source": [] + "source": [ + "## Next Steps\n", + "This is the second quickstart tutorial from the Flower Datasets series. See next tutorials:\n", + "\n", + "* [Visualize Label Distribution](https://flower.ai/docs/datasets/tutorial-visualize-label-distribution.html)\n", + "\n", + "Previous tutorials:\n", + "\n", + "* [Quickstart Basics](https://flower.ai/docs/datasets/tutorial-quickstart.html)" + ] } ], "metadata": { @@ -354,18 +356,6 @@ "display_name": "flwr", "language": "python", "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.9" } }, "nbformat": 4, diff --git a/datasets/doc/source/how-to-visualize-label-distribution.ipynb b/datasets/doc/source/tutorial-visualize-label-distribution.ipynb similarity index 99% rename from datasets/doc/source/how-to-visualize-label-distribution.ipynb rename to datasets/doc/source/tutorial-visualize-label-distribution.ipynb index 26db72047cff..d37edde78559 100644 --- a/datasets/doc/source/how-to-visualize-label-distribution.ipynb +++ b/datasets/doc/source/tutorial-visualize-label-distribution.ipynb @@ -13,9 +13,9 @@ "id": "67c54a8d7c872547", "metadata": {}, "source": [ - "If you partition datasets to simulate heterogeneity through label skew and/or size skew, you can now effortlessly visualize the partitioned dataset using `flwr-datasets`.\n", + "Learn how to visualize and compare partitioned datasets when applying different `Partitioner`s or parameters.\n", "\n", - "In this how-to guide, you'll learn how to visualize and compare partitioned datasets when applying different methods or parameters.\n", + "If you partition datasets to simulate heterogeneity through label skew and/or size skew, you can now effortlessly visualize the partitioned dataset using `flwr-datasets`.\n", "\n", "All the described visualization functions are compatible with all ``Partitioner`` you can find in\n", "[flwr_datasets.partitioner](https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.html#module-flwr_datasets.partitioner)\n" @@ -26,15 +26,7 @@ "id": "7220467f2c6ba432", "metadata": {}, "source": [ - "## Common Setup" - ] - }, - { - "cell_type": "markdown", - "id": "4e2ad2f0a0f7174d", - "metadata": {}, - "source": [ - "Install Flower Datasets library:" + "## Install Flower Datasets" ] }, { @@ -1092,7 +1084,7 @@ "source": [ "## More resources\n", "\n", - "If you are looking for more resorces, feel free to check:\n", + "If you are looking for more resources, feel free to check:\n", "\n", "* `flwr-dataset` documentation\n", " * [plot_label_distributions](https://flower.ai/docs/datasets/ref-api/flwr_datasets.visualization.plot_label_distributions.html#flwr_datasets.visualization.plot_label_distributions)\n", @@ -1100,14 +1092,17 @@ "* if you want to do any custom modification of the returned plots\n", " * [matplotlib](https://matplotlib.org/)\n", " * [seaborn](https://seaborn.pydata.org/)\n", - " * or plot directly using pandas object [pd.DataFrame.plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)" + " * or plot directly using pandas object [pd.DataFrame.plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)\n", + "\n", + "\n", + "This was the last tutorial. \n", + "\n", + "Previous tutorials:\n", + "\n", + "* [Quickstart Basics](https://flower.ai/docs/datasets/tutorial-quickstart.html)\n", + "\n", + "* [Use Partitioners](https://flower.ai/docs/datasets/tutorial-use-partitioners.html)" ] - }, - { - "cell_type": "markdown", - "id": "52655972", - "metadata": {}, - "source": [] } ], "metadata": {