Skip to content

Commit

Permalink
docs(datasets) Add recommended datasets list to index (#4627)
Browse files Browse the repository at this point in the history
Co-authored-by: jafermarq <[email protected]>
  • Loading branch information
adam-narozniak and jafermarq authored Dec 4, 2024
1 parent 93d6ae4 commit d8b00b3
Show file tree
Hide file tree
Showing 3 changed files with 169 additions and 153 deletions.
14 changes: 14 additions & 0 deletions datasets/doc/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -154,6 +154,20 @@ The Flower Community is growing quickly - we're a friendly group of researchers,

Join us on Slack

Recommended FL Datasets
-----------------------

Below we present a list of recommended datasets for federated learning research, which can be
used with Flower Datasets ``flwr-datasets``.

.. note::

All datasets from `HuggingFace Hub <https://huggingface.co/datasets>`_ can be used with our library. This page presents just a set of datasets we collected that you might find useful.

For more information about any dataset, visit its page by clicking the dataset name.

.. include:: recommended-fl-datasets-tables.rst

.. _demo:
Demo
----
Expand Down
153 changes: 153 additions & 0 deletions datasets/doc/source/recommended-fl-datasets-tables.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
Image Datasets
~~~~~~~~~~~~~~

.. list-table:: Image Datasets
:widths: 40 40 20
:header-rows: 1

* - Name
- Size
- Image Shape
* - `ylecun/mnist <https://huggingface.co/datasets/ylecun/mnist>`_
- train 60k;
test 10k
- 28x28
* - `uoft-cs/cifar10 <https://huggingface.co/datasets/uoft-cs/cifar10>`_
- train 50k;
test 10k
- 32x32x3
* - `uoft-cs/cifar100 <https://huggingface.co/datasets/uoft-cs/cifar100>`_
- train 50k;
test 10k
- 32x32x3
* - `zalando-datasets/fashion_mnist <https://huggingface.co/datasets/zalando-datasets/fashion_mnist>`_
- train 60k;
test 10k
- 28x28
* - `flwrlabs/femnist <https://huggingface.co/datasets/flwrlabs/femnist>`_
- train 814k
- 28x28
* - `zh-plus/tiny-imagenet <https://huggingface.co/datasets/zh-plus/tiny-imagenet>`_
- train 100k;
valid 10k
- 64x64x3
* - `flwrlabs/usps <https://huggingface.co/datasets/flwrlabs/usps>`_
- train 7.3k;
test 2k
- 16x16
* - `flwrlabs/pacs <https://huggingface.co/datasets/flwrlabs/pacs>`_
- train 10k
- 227x227
* - `flwrlabs/cinic10 <https://huggingface.co/datasets/flwrlabs/cinic10>`_
- train 90k;
valid 90k;
test 90k
- 32x32x3
* - `flwrlabs/caltech101 <https://huggingface.co/datasets/flwrlabs/caltech101>`_
- train 8.7k
- varies
* - `flwrlabs/office-home <https://huggingface.co/datasets/flwrlabs/office-home>`_
- train 15.6k
- varies
* - `flwrlabs/fed-isic2019 <https://huggingface.co/datasets/flwrlabs/fed-isic2019>`_
- train 18.6k;
test 4.7k
- varies
* - `ufldl-stanford/svhn <https://huggingface.co/datasets/ufldl-stanford/svhn>`_
- train 73.3k;
test 26k;
extra 531k
- 32x32x3
* - `sasha/dog-food <https://huggingface.co/datasets/sasha/dog-food>`_
- train 2.1k;
test 0.9k
- varies
* - `Mike0307/MNIST-M <https://huggingface.co/datasets/Mike0307/MNIST-M>`_
- train 59k;
test 9k
- 32x32

Audio Datasets
~~~~~~~~~~~~~~

.. list-table:: Audio Datasets
:widths: 35 30 15
:header-rows: 1

* - Name
- Size
- Subset
* - `google/speech_commands <https://huggingface.co/datasets/google/speech_commands>`_
- train 64.7k
- v0.01
* - `google/speech_commands <https://huggingface.co/datasets/google/speech_commands>`_
- train 105.8k
- v0.02
* - `flwrlabs/ambient-acoustic-context <https://huggingface.co/datasets/flwrlabs/ambient-acoustic-context>`_
- train 70.3k
-
* - `fixie-ai/common_voice_17_0 <https://huggingface.co/datasets/fixie-ai/common_voice_17_0>`_
- varies
- 14 versions
* - `fixie-ai/librispeech_asr <https://huggingface.co/datasets/fixie-ai/librispeech_asr>`_
- varies
- clean/other

Tabular Datasets
~~~~~~~~~~~~~~~~


.. list-table:: Tabular Datasets
:widths: 35 30
:header-rows: 1

* - Name
- Size
* - `scikit-learn/adult-census-income <https://huggingface.co/datasets/scikit-learn/adult-census-income>`_
- train 32.6k
* - `jlh/uci-mushrooms <https://huggingface.co/datasets/jlh/uci-mushrooms>`_
- train 8.1k
* - `scikit-learn/iris <https://huggingface.co/datasets/scikit-learn/iris>`_
- train 150

Text Datasets
~~~~~~~~~~~~~

.. list-table:: Text Datasets
:widths: 40 30 30
:header-rows: 1

* - Name
- Size
- Category
* - `sentiment140 <https://huggingface.co/datasets/sentiment140>`_
- train 1.6M;
test 0.5k
- Sentiment
* - `google-research-datasets/mbpp <https://huggingface.co/datasets/google-research-datasets/mbpp>`_
- full 974; sanitized 427
- General
* - `openai/openai_humaneval <https://huggingface.co/datasets/openai/openai_humaneval>`_
- test 164
- General
* - `lukaemon/mmlu <https://huggingface.co/datasets/lukaemon/mmlu>`_
- varies
- General
* - `takala/financial_phrasebank <https://huggingface.co/datasets/takala/financial_phrasebank>`_
- train 4.8k
- Financial
* - `pauri32/fiqa-2018 <https://huggingface.co/datasets/pauri32/fiqa-2018>`_
- train 0.9k; validation 0.1k; test 0.2k
- Financial
* - `zeroshot/twitter-financial-news-sentiment <https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment>`_
- train 9.5k; validation 2.4k
- Financial
* - `bigbio/pubmed_qa <https://huggingface.co/datasets/bigbio/pubmed_qa>`_
- train 2M; validation 11k
- Medical
* - `openlifescienceai/medmcqa <https://huggingface.co/datasets/openlifescienceai/medmcqa>`_
- train 183k; validation 4.3k; test 6.2k
- Medical
* - `bigbio/med_qa <https://huggingface.co/datasets/bigbio/med_qa>`_
- train 10.1k; test 1.3k; validation 1.3k
- Medical
155 changes: 2 additions & 153 deletions datasets/doc/source/recommended-fl-datasets.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,157 +11,6 @@ see the full FL example with Flower and Flower Datasets open the `quickstart-pyt

All datasets from `HuggingFace Hub <https://huggingface.co/datasets>`_ can be used with our library. This page presents just a set of datasets we collected that you might find useful.

For more information about any dataset, visit its page by clicking the dataset name. For more information how to use the
For more information about any dataset, visit its page by clicking the dataset name.

Image Datasets
--------------

.. list-table:: Image Datasets
:widths: 40 40 20
:header-rows: 1

* - Name
- Size
- Image Shape
* - `ylecun/mnist <https://huggingface.co/datasets/ylecun/mnist>`_
- train 60k;
test 10k
- 28x28
* - `uoft-cs/cifar10 <https://huggingface.co/datasets/uoft-cs/cifar10>`_
- train 50k;
test 10k
- 32x32x3
* - `uoft-cs/cifar100 <https://huggingface.co/datasets/uoft-cs/cifar100>`_
- train 50k;
test 10k
- 32x32x3
* - `zalando-datasets/fashion_mnist <https://huggingface.co/datasets/zalando-datasets/fashion_mnist>`_
- train 60k;
test 10k
- 28x28
* - `flwrlabs/femnist <https://huggingface.co/datasets/flwrlabs/femnist>`_
- train 814k
- 28x28
* - `zh-plus/tiny-imagenet <https://huggingface.co/datasets/zh-plus/tiny-imagenet>`_
- train 100k;
valid 10k
- 64x64x3
* - `flwrlabs/usps <https://huggingface.co/datasets/flwrlabs/usps>`_
- train 7.3k;
test 2k
- 16x16
* - `flwrlabs/pacs <https://huggingface.co/datasets/flwrlabs/pacs>`_
- train 10k
- 227x227
* - `flwrlabs/cinic10 <https://huggingface.co/datasets/flwrlabs/cinic10>`_
- train 90k;
valid 90k;
test 90k
- 32x32x3
* - `flwrlabs/caltech101 <https://huggingface.co/datasets/flwrlabs/caltech101>`_
- train 8.7k
- varies
* - `flwrlabs/office-home <https://huggingface.co/datasets/flwrlabs/office-home>`_
- train 15.6k
- varies
* - `flwrlabs/fed-isic2019 <https://huggingface.co/datasets/flwrlabs/fed-isic2019>`_
- train 18.6k;
test 4.7k
- varies
* - `ufldl-stanford/svhn <https://huggingface.co/datasets/ufldl-stanford/svhn>`_
- train 73.3k;
test 26k;
extra 531k
- 32x32x3
* - `sasha/dog-food <https://huggingface.co/datasets/sasha/dog-food>`_
- train 2.1k;
test 0.9k
- varies
* - `Mike0307/MNIST-M <https://huggingface.co/datasets/Mike0307/MNIST-M>`_
- train 59k;
test 9k
- 32x32

Audio Datasets
--------------

.. list-table:: Audio Datasets
:widths: 35 30 15
:header-rows: 1

* - Name
- Size
- Subset
* - `google/speech_commands <https://huggingface.co/datasets/google/speech_commands>`_
- train 64.7k
- v0.01
* - `google/speech_commands <https://huggingface.co/datasets/google/speech_commands>`_
- train 105.8k
- v0.02
* - `flwrlabs/ambient-acoustic-context <https://huggingface.co/datasets/flwrlabs/ambient-acoustic-context>`_
- train 70.3k
-
* - `fixie-ai/common_voice_17_0 <https://huggingface.co/datasets/fixie-ai/common_voice_17_0>`_
- varies
- 14 versions
* - `fixie-ai/librispeech_asr <https://huggingface.co/datasets/fixie-ai/librispeech_asr>`_
- varies
- clean/other

Tabular Datasets
----------------

.. list-table:: Tabular Datasets
:widths: 35 30
:header-rows: 1

* - Name
- Size
* - `scikit-learn/adult-census-income <https://huggingface.co/datasets/scikit-learn/adult-census-income>`_
- train 32.6k
* - `jlh/uci-mushrooms <https://huggingface.co/datasets/jlh/uci-mushrooms>`_
- train 8.1k
* - `scikit-learn/iris <https://huggingface.co/datasets/scikit-learn/iris>`_
- train 150

Text Datasets
-------------

.. list-table:: Text Datasets
:widths: 40 30 30
:header-rows: 1

* - Name
- Size
- Category
* - `sentiment140 <https://huggingface.co/datasets/sentiment140>`_
- train 1.6M;
test 0.5k
- Sentiment
* - `google-research-datasets/mbpp <https://huggingface.co/datasets/google-research-datasets/mbpp>`_
- full 974; sanitized 427
- General
* - `openai/openai_humaneval <https://huggingface.co/datasets/openai/openai_humaneval>`_
- test 164
- General
* - `lukaemon/mmlu <https://huggingface.co/datasets/lukaemon/mmlu>`_
- varies
- General
* - `takala/financial_phrasebank <https://huggingface.co/datasets/takala/financial_phrasebank>`_
- train 4.8k
- Financial
* - `pauri32/fiqa-2018 <https://huggingface.co/datasets/pauri32/fiqa-2018>`_
- train 0.9k; validation 0.1k; test 0.2k
- Financial
* - `zeroshot/twitter-financial-news-sentiment <https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment>`_
- train 9.5k; validation 2.4k
- Financial
* - `bigbio/pubmed_qa <https://huggingface.co/datasets/bigbio/pubmed_qa>`_
- train 2M; validation 11k
- Medical
* - `openlifescienceai/medmcqa <https://huggingface.co/datasets/openlifescienceai/medmcqa>`_
- train 183k; validation 4.3k; test 6.2k
- Medical
* - `bigbio/med_qa <https://huggingface.co/datasets/bigbio/med_qa>`_
- train 10.1k; test 1.3k; validation 1.3k
- Medical
.. include:: recommended-fl-datasets-tables.rst

0 comments on commit d8b00b3

Please sign in to comment.