Revise Spruce section to use local alleles implementation #189
Comments
@percyfal do you think you could rerun the To do this:
Sorry it's late in the day, but it would be great to get more indicative numbers using LPL. Note the default chunk size has also changed to 1k x 10k (from 10k x 1k) so numbers might change slightly based on this also. |
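For a rough feel of what the chunk-shape change does, here is a minimal sketch that counts chunks along each dimension for the old (10k variants x 1k samples) and new (1k variants x 10k samples) defaults. The dataset dimensions are made up for illustration and are not the Spruce numbers; `chunk_grid` is a hypothetical helper, not a bio2zarr function.

```python
import math


def chunk_grid(num_variants, num_samples, variants_chunk, samples_chunk):
    """Return (chunks along variants, chunks along samples) for a 2D array."""
    return (
        math.ceil(num_variants / variants_chunk),
        math.ceil(num_samples / samples_chunk),
    )


# Hypothetical dataset dimensions (assumption, not taken from the thread).
num_variants, num_samples = 1_000_000, 1_000

old_grid = chunk_grid(num_variants, num_samples, 10_000, 1_000)
new_grid = chunk_grid(num_variants, num_samples, 1_000, 10_000)

print("old default (10k x 1k):", old_grid)
print("new default (1k x 10k):", new_grid)
```

Both shapes hold the same number of cells per full chunk (10 million), but with few samples the new default produces many more, smaller chunks along the variants dimension, which is one reason timings and sizes may shift slightly.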
No problem, will set this up later today. |
I'm running vcf2zarr encode with schema input. Running time seems to be 3-4 times longer than without schema. Is this expected? Without schema:
With (currently running):
Same number of threads, same chunking. Maybe it'll pick up steam? 🤞 |
It's semi-expected that things would be slower with local alleles as we're doing some extra calculations on the input to derive the local alleles representation. I've not got a sense of how much slower though (hopefully not too much). The chunking should have changed though, as we're using 1000 variants by default now not 10k. Can you post the schema you've computed please? |
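To illustrate the kind of extra calculation involved, here is a minimal sketch of deriving a local PL vector from a full PL vector for one diploid call. It assumes the local allele set is the reference plus whatever alleles appear in the genotype, and it uses the standard VCF genotype ordering; `pl_index` and `local_pl` are hypothetical names, and the actual bio2zarr implementation may well differ.

```python
def pl_index(j, k):
    """Index of diploid genotype j/k (j <= k) in VCF genotype ordering."""
    return k * (k + 1) // 2 + j


def local_pl(gt, pl):
    """Return (local_alleles, lpl) for one diploid call.

    Assumption: local alleles are the reference (0) plus the alleles
    called in this sample; missing alleles (negative values) are ignored.
    """
    local = sorted({0, *(a for a in gt if a >= 0)})
    lpl = []
    for b in range(len(local)):
        for a in range(b + 1):
            lpl.append(pl[pl_index(local[a], local[b])])
    return local, lpl


# A site with 4 alleles has 4 * 5 / 2 = 10 PL values per diploid sample.
full_pl = list(range(10))
alleles, lpl = local_pl((0, 2), full_pl)
print(alleles, lpl)
```

For a heterozygous call the local vector has 3 entries however many alleles the site has, whereas the full PL vector grows quadratically with the allele count. That is why a PL-dominated archive can shrink a lot.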
Yes, I figured some slowdown would be expected for reasons you mentioned. I just realized I didn't apply updated chunking parameters to the I attach the schema here. |
Local alleles hasn't worked here @percyfal as we still have call_PL instead of call_LPL. Are you sure you specified the --local-alleles option? |
Sorry, my bad, I somehow misread your original post. Relaunching... |
Now it's shaping up, and it actually looks slightly faster:
|
Aha, excellent. We halved the size of the decompressed zarr 👍 |
How's this looking @percyfal, should have completed by now? 🤞 |
Yes, I'm just compiling the notebook results to generate the final table as we speak. I'll submit a PR shortly with the local alleles results as table 2. I also reran without the schema in case we want to discuss the differences. For starters, here are the benchmark times for the two cases:
Zarr archive sizes are 2.3TiB vs 6.68TiB. The latter differs from the original run as I set the chunk settings to |
Currently the Spruce example is dominated by the PL field, which should be substantially improved by local alleles. It would be good to rerun once the code has been finalised (sgkit-dev/bio2zarr#298).