
Improve speed of NaturalIdPartitioner #3276

Merged
merged 19 commits into main on May 6, 2024

Conversation

@adam-narozniak (Contributor) commented Apr 17, 2024

Issue

When adding and testing the femnist dataset with NaturalIdPartitioner, I discovered that partitioning takes a very long time.

Description

The femnist dataset consists of over 800,000 samples corresponding to over 3,500 unique writers.

The current implementation filters the dataset (using Dataset.filter) once per unique writer: when a particular partition is requested, the whole dataset is filtered for that single unique value. Filtering the data for a single writer_id took about 1 minute, so doing it for all writers would take roughly 3,600 minutes = 60 hours. Moreover, the filter operates on all columns instead of only the single column that is needed.
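For illustration, the old path looks roughly like this (a sketch, not the exact Flower code; the column name and target value are illustrative):

```python
# Old approach (sketch): every partition request runs Dataset.filter,
# i.e., a full scan over all rows and all columns of the dataset.
partition = dataset.filter(
    lambda row: row["writer_id"] == target_writer_id
)
```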

Proposal

Iterate a single time over the column specified by partition_by to build a mapping from each natural id to the row indices that belong to it.
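A minimal sketch of the idea, assuming a Hugging Face datasets.Dataset (the function and variable names are illustrative, not the exact NaturalIdPartitioner internals):

```python
from collections import defaultdict

import numpy as np


def build_natural_id_to_indices(dataset, partition_by: str) -> dict:
    """Map each unique natural id to the row indices that belong to it."""
    # Load only the partition_by column into memory, not the whole dataset.
    natural_ids = np.array(dataset[partition_by])
    natural_id_to_indices = defaultdict(list)
    # Single pass over the column: O(num_samples), instead of one full
    # Dataset.filter scan per unique id.
    for index, natural_id in enumerate(natural_ids):
        natural_id_to_indices[natural_id].append(index)
    return dict(natural_id_to_indices)
```

With the mapping in place, a single partition can then be materialized cheaply with dataset.select(indices) instead of re-filtering the dataset on every request.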

Time

time_old = the time it takes to load 100 partitions using a slightly optimized version of the old approach (still based on filter). Because every load triggers its own filter pass, time(load 100) ~ 100 * time(load 1), and in general time(load n) = n * time(load 1):

 ps = [nip.load_partition(i) for i in range(100)]

time_new = the same measurement with the new implementation. Here the mapping is computed only once, on the first request, so time(load 100) ~ time(load 1) ~ time(load all_partitions).
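The numbers below can be obtained by wrapping the loading loop in a wall-clock timer (a sketch; it assumes nip is a NaturalIdPartitioner whose dataset attribute has already been assigned):

```python
import time

start = time.perf_counter()
ps = [nip.load_partition(i) for i in range(100)]
print(f"Loaded 100 partitions in {time.perf_counter() - start:.2f} s")
```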

dataset          time_old (s)  time_new (s)  num_samples  num_unique_clients
speech_commands  22.8284       12.1519       51,093       1,504
femnist          208.243       1.9382        814,277      3,597
shakespeare      1007.3        11.1693       4,226,160    1,129
synthetic        44.7938       0.540589      107,553      1,000
sentiment140     831.126       8.11362       1,600,000    659,775

Table 1: Time in seconds to load 100 partitions using the old implementation (slightly modified, but still filter-based) and the new implementation, across datasets that vary in total number of samples and number of unique clients.

Changelog entry

@adam-narozniak adam-narozniak marked this pull request as draft April 17, 2024 12:50
@adam-narozniak adam-narozniak marked this pull request as ready for review April 19, 2024 12:08
Comment on lines 63 to 64
natural_ids = np.array(self.dataset[self._partition_by])
unique_natural_ids = self.dataset.unique(self._partition_by)
Contributor
Could we do this without having to load the whole dataset into memory?

@adam-narozniak (Contributor, Author) replied Apr 26, 2024

To make it quicker, we need to load the column specified by self._partition_by into memory. Only that single column is materialized, not the whole dataset.
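For illustration, the difference in what gets materialized (a sketch, assuming an Arrow-backed Hugging Face datasets.Dataset):

```python
import numpy as np

# Pulls only the partition_by column into memory as a NumPy array;
# the other columns remain in the Arrow-backed store.
natural_ids = np.array(dataset[partition_by])

# By contrast, dataset[:] would materialize every column of every row.
```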

@adam-narozniak adam-narozniak marked this pull request as draft April 30, 2024 08:09
@adam-narozniak adam-narozniak marked this pull request as ready for review April 30, 2024 09:33
@jafermarq jafermarq enabled auto-merge (squash) May 6, 2024 11:38
@jafermarq jafermarq merged commit 977844f into main May 6, 2024
34 checks passed
@jafermarq jafermarq deleted the fds-improve-natural-id-partitioner branch May 6, 2024 11:42