docs(datasets) Update information how to handle DatasetDict local data #4057

Merged 4 commits on Aug 22, 2024
datasets/doc/source/how-to-use-with-local-data.rst (33 changes: 13 additions, 20 deletions)
@@ -37,14 +37,6 @@ CSV
  data_files = [ "path-to-my-file-1.csv", "path-to-my-file-2.csv", ...]
  dataset = load_dataset("csv", data_files=data_files)

- # Divided Dataset
- data_files = {
-     "train": single_train_file_or_list_of_files,
-     "test": single_test_file_or_list_of_files,
-     "can-have-more-splits": ...
- }
- dataset = load_dataset("csv", data_files=data_files)
-
  partitioner = ChosenPartitioner(...)
  partitioner.dataset = dataset
  partition = partitioner.load_partition(partition_id=0)
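
For reference, `load_dataset` called with `data_files` returns a `DatasetDict`, not a `Dataset`, so a single split has to be selected before it is handed to a partitioner. A minimal sketch of that step, assuming the files land in the default "train" split and using `IidPartitioner` as one concrete partitioner choice:

.. code-block:: python

    from datasets import load_dataset
    from flwr_datasets.partitioner import IidPartitioner

    # Loading plain data_files produces a DatasetDict with a single "train" split
    dataset_dict = load_dataset("csv", data_files=["path-to-my-file-1.csv"])
    dataset = dataset_dict["train"]  # a datasets.Dataset, as the partitioner expects

    partitioner = IidPartitioner(num_partitions=10)  # any partitioner works here
    partitioner.dataset = dataset
    partition = partitioner.load_partition(partition_id=0)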
@@ -60,18 +52,10 @@ JSON
  # Single file
  data_files = "path-to-my-file.json"

- # Multitple Files
+ # Multiple Files
  data_files = [ "path-to-my-file-1.json", "path-to-my-file-2.json", ...]
  dataset = load_dataset("json", data_files=data_files)

- # Divided Dataset
- data_files = {
-     "train": single_train_file_or_list_of_files,
-     "test": single_test_file_or_list_of_files,
-     "can-have-more-splits": ...
- }
- dataset = load_dataset("json", data_files=data_files)
-
  partitioner = ChosenPartitioner(...)
  partitioner.dataset = dataset
  partition = partitioner.load_partition(partition_id=0)
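
If it is unclear which splits the loaded `DatasetDict` contains, inspecting it before picking one is a quick sanity check. A small sketch, reusing the JSON file paths from the snippet above:

.. code-block:: python

    from datasets import load_dataset

    dataset_dict = load_dataset("json", data_files=["path-to-my-file-1.json"])

    # Show the split names and sizes; with plain data_files this is usually just "train"
    print(dataset_dict)
    print(list(dataset_dict.keys()))

    dataset = dataset_dict["train"]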
@@ -103,7 +87,12 @@ Then, the path you can give is `./mnist`.
  from flwr_datasets.partitioner import ChosenPartitioner

  # Directly from a directory
- dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
+ dataset_dict = load_dataset("imagefolder", data_dir="/path/to/folder")
+ # Note that what we just loaded is a DatasetDict; we need to choose a single split
+ # and assign it to partitioner.dataset,
+ # e.g. the "train" split, but that depends on the structure of your directory
+ dataset = dataset_dict["train"]
+
  partitioner = ChosenPartitioner(...)
  partitioner.dataset = dataset
  partition = partitioner.load_partition(partition_id=0)
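
Alternatively, if the desired split is known up front, passing `split=` to `load_dataset` returns a `Dataset` directly instead of a `DatasetDict`, so no extra selection step is needed. A sketch under that assumption (the split name still has to match your directory layout, and `IidPartitioner` is just an illustrative choice):

.. code-block:: python

    from datasets import load_dataset
    from flwr_datasets.partitioner import IidPartitioner

    # Requesting a split directly yields a datasets.Dataset, not a DatasetDict
    dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")

    partitioner = IidPartitioner(num_partitions=10)
    partitioner.dataset = dataset
    partition = partitioner.load_partition(partition_id=0)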
@@ -134,7 +123,11 @@ Analogously to the image datasets, there are two methods here:
  from datasets import load_dataset
  from flwr_datasets.partitioner import ChosenPartitioner

- dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
+ dataset_dict = load_dataset("audiofolder", data_dir="/path/to/folder")
+ # Note that what we just loaded is a DatasetDict; we need to choose a single split
+ # and assign it to partitioner.dataset,
+ # e.g. the "train" split, but that depends on the structure of your directory
+ dataset = dataset_dict["train"]

  partitioner = ChosenPartitioner(...)
  partitioner.dataset = dataset
@@ -230,7 +223,7 @@ Partitioner abstraction is designed to allow for a single dataset assignment.

.. code-block:: python

- partitioner.dataset = your_dataset
+ partitioner.dataset = your_dataset  # (your_dataset must be of type datasets.Dataset)

If you need to do the same partitioning on a different dataset, create a new Partitioner
for that, e.g.:
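
As a rough sketch of that idea, assuming `IidPartitioner` and two placeholder datasets (`train_dataset` and `test_dataset`) standing in for your own splits:

.. code-block:: python

    from flwr_datasets.partitioner import IidPartitioner

    # train_dataset and test_dataset are placeholders for your own datasets.Dataset objects
    train_partitioner = IidPartitioner(num_partitions=10)
    train_partitioner.dataset = train_dataset

    test_partitioner = IidPartitioner(num_partitions=10)
    test_partitioner.dataset = test_dataset

    train_partition = train_partitioner.load_partition(partition_id=0)
    test_partition = test_partitioner.load_partition(partition_id=0)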