Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use rng.choice without unpacked in subsample_without_replacement with 64-bit support #935

Merged
merged 15 commits into from
Jul 28, 2023

Conversation

sfiligoi
Copy link
Collaborator

No description provided.

@sfiligoi
Copy link
Collaborator Author

sfiligoi commented Jun 13, 2023

Drastically reduces the memory needs when sums are large.
Also allows sums to be >2^31.

biom/_subsample.pyx Outdated Show resolved Hide resolved
biom/_subsample.pyx Outdated Show resolved Hide resolved
biom/_subsample.pyx Outdated Show resolved Hide resolved
biom/_subsample.pyx Outdated Show resolved Hide resolved
biom/_subsample.pyx Outdated Show resolved Hide resolved
biom/_subsample.pyx Outdated Show resolved Hide resolved
biom/_subsample.pyx Outdated Show resolved Hide resolved
biom/_subsample.pyx Outdated Show resolved Hide resolved
biom/_subsample.pyx Show resolved Hide resolved
@sfiligoi sfiligoi requested a review from wasade June 14, 2023 04:23
Copy link
Member

@wasade wasade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could a note be added to the change log with approximate performance boost and memory reduction?

Co-authored-by: Daniel McDonald <[email protected]>
@sfiligoi
Copy link
Collaborator Author

sfiligoi commented Jun 17, 2023

On EMP-style BIOM, the new without_replacement algorithm is about 2x faster (n=1000, on a EPYC 7302 CPU):
33s vs 58s

There is also a small reduction is memory consumption, but it is barely noticeable compared to the rest of the memory use when using the Table object.

For the record, the biom used for testing was mp.90.min25.deblur.withtax.onlytree_ACTUAL_overlap.biom

@sfiligoi
Copy link
Collaborator Author

For BIOM tables with very large per-column sums, it is an enabler;
the old code would outright fail if the columns sum was > 2^31.
(e.g. rna_integrity_metaG_woltka_wolr2_with_plasmids_per_gene_clean_gene_counts_per_g_stool.biom)

(But just fixing that would not help... a test showed it would have needed over 15.1 TiB of RAM)

The new code has a 2^63 limit, and the memory use is proportional to n, only.

The running time is quite fast for small n, but gets slower with high n (on same CPU as above):
n=2000 50s
n=6G 20min

@sfiligoi sfiligoi requested a review from wasade July 28, 2023 19:14
@wasade
Copy link
Member

wasade commented Jul 28, 2023

Thanks!!

@wasade wasade merged commit d3f82fe into biocore:master Jul 28, 2023
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants