
Revise Spruce section to use local alleles implementation #189

Closed · jeromekelleher opened this issue Jan 8, 2025 · 11 comments
Comments

@jeromekelleher (Contributor)

Currently the Spruce example is dominated by the PL field, which local alleles should shrink substantially. It would be good to rerun once the code in sgkit-dev/bio2zarr#298 has been finalised.

@jeromekelleher (Contributor, Author)

@percyfal do you think you could rerun the encode step now using the latest GitHub version of bio2zarr please? This should make a big difference to the overall size, with LPL being much smaller than the current PL.

To do this:

  • Generate a local-alleles schema: `vcf2zarr mkschema --local-alleles spruce.icf > spruce_schema.json`
  • Run encode again using this schema: `vcf2zarr encode -s spruce_schema.json ....` (both steps are sketched below)
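A minimal consolidated sketch of the two steps, assuming the intermediate columnar files live at `spruce.icf` and that `encode` takes the ICF path and output Zarr path as positional arguments (both paths are placeholders; check `vcf2zarr encode --help` for the exact form):

```sh
# 1. Generate a schema that uses the local-alleles fields (LPL rather than PL):
vcf2zarr mkschema --local-alleles spruce.icf > spruce_schema.json

# 2. Re-encode the ICF using that schema; spruce.zarr is a placeholder output path:
vcf2zarr encode -s spruce_schema.json spruce.icf spruce.zarr
```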

Sorry it's late in the day, but it would be great to get more indicative numbers using LPL.

Note that the default chunk size has also changed to 1k variants × 10k samples (from 10k × 1k), so the numbers may shift slightly for that reason too.

@percyfal (Contributor)

No problem, will set this up later today.

@percyfal (Contributor)

I'm running vcf2zarr encode with the schema as input. The running time seems to be 3–4 times longer than without a schema. Is this expected?

Without schema:

Encode: 100%|██████████| 100T/100T [10:19:12<00:00, 2.69GB/s]

With (currently running):

Encode:  11%|█         | 10.6T/100T [5:40:17<38:45:36, 640MB/s]

Same number of threads, same chunking. Maybe it'll pick up steam? 🤞

@jeromekelleher (Contributor, Author)

It's semi-expected that things would be slower with local alleles, as we're doing some extra calculations on the input to derive the local-alleles representation. I don't have a sense of how much slower, though (hopefully not too much).

The chunking should have changed, however, as we're now using 1000 variants by default rather than 10k.

Can you post the schema you've computed, please?

@percyfal (Contributor)

Yes, I figured some slowdown would be expected for the reasons you mention. I just realised I didn't apply the updated chunking parameters to the mkschema command, so yes, the parameters are different. For the sake of comparison, I might rerun the schema generation. Since the number of samples is low (n=1063), I have increased the variant chunk size.

I attach the schema here.

spruce.all.vcf.gz.icf.json
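For comparison, a hypothetical way to bump the chunk sizes in a generated schema before encoding; the `samples_chunk_size` / `variants_chunk_size` key names are an assumption about the schema JSON layout, not confirmed for this bio2zarr version:

```sh
# Rewrite the top-level chunk-size keys (assumed names) with jq,
# using the attached schema as input:
jq '.samples_chunk_size = 1063 | .variants_chunk_size = 10000' \
  spruce.all.vcf.gz.icf.json > spruce_schema_chunked.json
```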

@jeromekelleher (Contributor, Author)

Local alleles hasn't worked here, @percyfal, as we still have `call_PL` instead of `call_LPL`. Are you sure you specified the `--local-alleles` option?
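A quick sanity check along these lines would catch this before encoding (the `call_PL` / `call_LPL` field names come from this thread; that they appear verbatim in the schema JSON is an assumption):

```sh
# Expect at least one match for the local-alleles field...
grep -c '"call_LPL"' spruce_schema.json
# ...and ideally none for the plain PL field:
grep -c '"call_PL"' spruce_schema.json
```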

@percyfal (Contributor)

Sorry, my bad, I somehow misread your original post. Relaunching...

@percyfal (Contributor)

Now it's shaping up, and it actually looks slightly faster:

Encode:  11%|█         | 5.79T/52.2T [45:17<8:34:46, 1.50GB/s]

@jeromekelleher (Contributor, Author)

Aha, excellent. We halved the size of the decompressed zarr 👍
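For reference, per-field stored vs. decompressed sizes can be compared with `vcf2zarr inspect`; the path below is a placeholder, and the exact output columns may vary by bio2zarr version:

```sh
# Print per-array statistics (stored size, uncompressed size, compression ratio, ...):
vcf2zarr inspect spruce.zarr
```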

@jeromekelleher (Contributor, Author)

How's this looking, @percyfal? It should have completed by now 🤞

@percyfal (Contributor)

Yes, I'm compiling the notebook results to generate the final table as we speak. I'll submit a PR shortly with the local-alleles results as Table 2. I also reran without the schema in case we want to discuss the differences. For starters, here are the benchmark times for the two cases:

| mkschema | s | h:m:s | max_rss | max_vms | max_uss | max_pss | io_in | io_out | mean_load | cpu_time |
|----------|---|-------|---------|---------|---------|---------|-------|--------|-----------|----------|
| true | 50978.1766 | 14:09:38 | 106263.22 | 5090017.80 | 103587.82 | 103610.34 | 7810646.80 | 2424700.64 | 5132.47 | 2616936.31 |
| false | 59419.4896 | 16:30:19 | 105749.09 | 5089623.25 | 103216.44 | 103237.12 | 7122025.49 | 7016360.57 | 7178.73 | 4265952.19 |

The Zarr archive sizes are 2.3 TiB vs 6.68 TiB. The latter differs from the original run because I set the chunk sizes to n_samples=1063, n_variants=10_000 to better reflect the dimensionality of the data.
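The quoted archive sizes can be reproduced with a plain disk-usage check (the paths here are placeholders for the two encoded runs):

```sh
du -sh spruce_lpl.zarr spruce_pl.zarr
```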
