Add an explanation what a shard is
adam-narozniak committed Jan 31, 2024
1 parent 5408c4f commit 8dd7771
Showing 1 changed file with 10 additions and 0 deletions.
10 changes: 10 additions & 0 deletions datasets/flwr_datasets/partitioner/shard_partitioner.py
@@ -30,6 +30,16 @@ class ShardPartitioner(Partitioner): # pylint: disable=R0902
label 1, samples with labels 2 ...], then the shards are created, with each
shard of size = `shard_size` if provided or automatically calculated:
shard_size = len(dataset) / (`num_partitions` * `num_shards_per_node`).
A shard is a block (contiguous part) of the `dataset` that contains `shard_size`
consecutive samples. Some shards may contain samples with more than one unique
label. This happens in two cases. First, because the dataset is always sorted by
label as a preprocessing step, a shard that lies on the border between two classes
contains samples of both, e.g. the "leftover" samples of class 1 followed by
samples of class 2. Second, when the shard size is bigger than the number of
samples of a certain class, any shard that includes samples of that class must
also include samples of a neighboring class.
Each partition is created from `num_shards_per_node` shards that are chosen randomly.
There are a few ways of partitioning data that result in certain properties
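
The shard construction described in the docstring can be illustrated with a small, standalone sketch. It is not the `ShardPartitioner` implementation: the helpers `make_shards` and `assign_shards` are hypothetical names introduced here, the sketch works on a plain list of labels rather than a `datasets.Dataset`, and it assumes that leftover samples that do not fill a whole shard are simply dropped.

```python
import random


def make_shards(labels, shard_size):
    """Sort sample indices by label and cut them into consecutive shards.

    Hypothetical helper for illustration only.
    """
    # Sorting by label is the preprocessing step the docstring refers to.
    sorted_indices = sorted(range(len(labels)), key=lambda i: labels[i])
    # Assumption of this sketch: only full shards are kept.
    num_shards = len(sorted_indices) // shard_size
    return [
        sorted_indices[s * shard_size:(s + 1) * shard_size]
        for s in range(num_shards)
    ]


def assign_shards(shards, num_partitions, num_shards_per_node, seed=0):
    """Give each partition `num_shards_per_node` randomly chosen shards."""
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)
    partitions = []
    for p in range(num_partitions):
        chosen = order[p * num_shards_per_node:(p + 1) * num_shards_per_node]
        partitions.append([idx for s in chosen for idx in shards[s]])
    return partitions


def labels_per_shard(labels, shards):
    """Return the sorted unique labels found in each shard."""
    return [sorted({labels[i] for i in shard}) for shard in shards]


# Case 1: shards on a class border contain two labels.
labels = [0] * 5 + [1] * 3 + [2] * 4
num_partitions, num_shards_per_node = 2, 2
shard_size = len(labels) // (num_partitions * num_shards_per_node)
shards = make_shards(labels, shard_size)
border_case = labels_per_shard(labels, shards)
partitions = assign_shards(shards, num_partitions, num_shards_per_node)

# Case 2: the shard size (4) exceeds the size of class 1 (2 samples),
# so the shard covering class 1 also holds samples of classes 0 and 2.
small_class_labels = [0] * 5 + [1] * 2 + [2] * 5
small_class_case = labels_per_shard(
    small_class_labels, make_shards(small_class_labels, 4)
)

print(border_case)       # [[0], [0, 1], [1, 2], [2]]
print(small_class_case)  # [[0], [0, 1, 2], [2]]
```

In the first case the automatically computed shard size is 3, so the second and third shards straddle class borders. In the second case one shard spans three classes because class 1 is smaller than a shard.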