
Improve speed of NaturalIdPartitioner #3276

Merged
merged 19 commits into main on May 6, 2024

Conversation

@adam-narozniak (Contributor) commented Apr 17, 2024

Issue

When adding and testing the femnist dataset with NaturalIdPartitioner, I discovered that partitioning takes a very long time.

Description

The femnist dataset consists of over 800,000 samples corresponding to over 3,500 unique writers.

The current implementation filters the dataset (using Dataset.filter) once per unique writer: when a particular partition is requested, the whole dataset is filtered for that single unique value. Filtering the data for a single writer_id took about 1 minute, so doing it for all writers would take roughly 3,600 minutes = 60 hours. Moreover, the filter operates on all columns instead of only the single column that is needed.
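For illustration, the old path looks roughly like this (a sketch, not the exact Flower code; the column name and target value are illustrative):

```python
# Old approach (sketch): every partition request runs Dataset.filter,
# i.e., a full scan over all rows and all columns of the dataset.
partition = dataset.filter(
    lambda row: row["writer_id"] == target_writer_id
)
```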

Proposal

Iterate a single time over the column specified by partition_by to build a mapping from each natural id to the row indices that belong to it.
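A minimal sketch of the idea, assuming a Hugging Face datasets.Dataset (the function and variable names are illustrative, not the exact NaturalIdPartitioner internals):

```python
from collections import defaultdict

import numpy as np


def build_natural_id_to_indices(dataset, partition_by: str) -> dict:
    """Map each unique natural id to the row indices that belong to it."""
    # Load only the partition_by column into memory, not the whole dataset.
    natural_ids = np.array(dataset[partition_by])
    natural_id_to_indices = defaultdict(list)
    # Single pass over the column: O(num_samples), instead of one full
    # Dataset.filter scan per unique id.
    for index, natural_id in enumerate(natural_ids):
        natural_id_to_indices[natural_id].append(index)
    return dict(natural_id_to_indices)
```

With the mapping in place, a single partition can then be materialized cheaply with dataset.select(indices) instead of re-filtering the dataset on every request.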

Time

time_old = the time it takes to load 100 partitions using a slightly optimized version of the old approach (still based on filter). Because every load triggers its own filter pass, time(load 100) ~ 100 * time(load 1), and in general time(load n) = n * time(load 1):

 ps = [nip.load_partition(i) for i in range(100)]

time_new = the same measurement with the new implementation. Here the mapping is computed only once, on the first request, so time(load 100) ~ time(load 1) ~ time(load all_partitions).
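The numbers below can be obtained by wrapping the loading loop in a wall-clock timer (a sketch; it assumes nip is a NaturalIdPartitioner whose dataset attribute has already been assigned):

```python
import time

start = time.perf_counter()
ps = [nip.load_partition(i) for i in range(100)]
print(f"Loaded 100 partitions in {time.perf_counter() - start:.2f} s")
```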

dataset          time_old (s)  time_new (s)  num_samples  num_unique_clients
speech_commands  22.8284       12.1519       51,093       1,504
femnist          208.243       1.9382        814,277      3,597
shakespeare      1007.3        11.1693       4,226,160    1,129
synthetic        44.7938       0.540589      107,553      1,000
sentiment140     831.126       8.11362       1,600,000    659,775

Table 1: Time in seconds to load 100 partitions using the old implementation (slightly modified, but still filter-based) and the new implementation, across datasets that vary in total number of samples and number of unique clients.

Changelog entry

@adam-narozniak adam-narozniak marked this pull request as draft April 17, 2024 12:50
@adam-narozniak adam-narozniak marked this pull request as ready for review April 19, 2024 12:08
Comment on lines 63 to 64
natural_ids = np.array(self.dataset[self._partition_by])
unique_natural_ids = self.dataset.unique(self._partition_by)
Contributor
Could we do this without having to load the whole dataset into memory?

@adam-narozniak (Contributor, Author) replied Apr 26, 2024

To make it quicker, we need to load the column specified by self._partition_by into memory. Only that single column is materialized, not the whole dataset.
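For illustration, the difference in what gets materialized (a sketch, assuming an Arrow-backed Hugging Face datasets.Dataset):

```python
import numpy as np

# Pulls only the partition_by column into memory as a NumPy array;
# the other columns remain in the Arrow-backed store.
natural_ids = np.array(dataset[partition_by])

# By contrast, dataset[:] would materialize every column of every row.
```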

@adam-narozniak adam-narozniak marked this pull request as draft April 30, 2024 08:09
@adam-narozniak adam-narozniak marked this pull request as ready for review April 30, 2024 09:33
@jafermarq jafermarq enabled auto-merge (squash) May 6, 2024 11:38
@jafermarq jafermarq merged commit 977844f into main May 6, 2024
34 checks passed
@jafermarq jafermarq deleted the fds-improve-natural-id-partitioner branch May 6, 2024 11:42