
Revise Spruce section to use local alleles implementation #189

Closed · jeromekelleher opened this issue Jan 8, 2025 · 11 comments
Comments

@jeromekelleher (Contributor)

Currently the Spruce example is dominated by the PL field, which local alleles should shrink substantially. It would be good to rerun once the code in sgkit-dev/bio2zarr#298 has been finalised.

@jeromekelleher (Contributor, Author)

@percyfal do you think you could rerun the encode step now using the latest GitHub version of bio2zarr please? This should make a big difference to the overall size, with LPL being much smaller than the current PL.

To do this:

  • Generate a local-alleles schema: `vcf2zarr mkschema --local-alleles spruce.icf > spruce_schema.json`
  • Run encode again using this schema: `vcf2zarr encode -s spruce_schema.json ....` (both steps are sketched below)
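A minimal consolidated sketch of the two steps, assuming the intermediate columnar files live at `spruce.icf` and that `encode` takes the ICF path and output Zarr path as positional arguments (both paths are placeholders; check `vcf2zarr encode --help` for the exact form):

```sh
# 1. Generate a schema that uses the local-alleles fields (LPL rather than PL):
vcf2zarr mkschema --local-alleles spruce.icf > spruce_schema.json

# 2. Re-encode the ICF using that schema; spruce.zarr is a placeholder output path:
vcf2zarr encode -s spruce_schema.json spruce.icf spruce.zarr
```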

Sorry it's late in the day, but it would be great to get more indicative numbers using LPL.

Note that the default chunk size has also changed to 1k variants × 10k samples (from 10k × 1k), so the numbers may shift slightly for that reason too.

@percyfal (Contributor)

No problem, will set this up later today.

@percyfal (Contributor)

I'm running vcf2zarr encode with the schema as input. The running time seems to be 3–4 times longer than without a schema. Is this expected?

Without schema:

Encode: 100%|██████████| 100T/100T [10:19:12<00:00, 2.69GB/s]

With (currently running):

Encode:  11%|█         | 10.6T/100T [5:40:17<38:45:36, 640MB/s]

Same number of threads, same chunking. Maybe it'll pick up steam? 🤞

@jeromekelleher (Contributor, Author)

It's semi-expected that things would be slower with local alleles, as we're doing some extra calculations on the input to derive the local-alleles representation. I don't have a sense of how much slower, though (hopefully not too much).

The chunking should have changed, however, as we're now using 1000 variants by default rather than 10k.

Can you post the schema you've computed, please?

@percyfal (Contributor)

Yes, I figured some slowdown would be expected for the reasons you mention. I just realised I didn't apply the updated chunking parameters to the mkschema command, so yes, the parameters are different. For the sake of comparison, I might rerun the schema generation. Since the number of samples is low (n=1063), I have increased the variant chunk size.

I attach the schema here.

spruce.all.vcf.gz.icf.json
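For comparison, a hypothetical way to bump the chunk sizes in a generated schema before encoding; the `samples_chunk_size` / `variants_chunk_size` key names are an assumption about the schema JSON layout, not confirmed for this bio2zarr version:

```sh
# Rewrite the top-level chunk-size keys (assumed names) with jq,
# using the attached schema as input:
jq '.samples_chunk_size = 1063 | .variants_chunk_size = 10000' \
  spruce.all.vcf.gz.icf.json > spruce_schema_chunked.json
```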

@jeromekelleher (Contributor, Author)

Local alleles hasn't worked here, @percyfal, as we still have `call_PL` instead of `call_LPL`. Are you sure you specified the `--local-alleles` option?
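A quick sanity check along these lines would catch this before encoding (the `call_PL` / `call_LPL` field names come from this thread; that they appear verbatim in the schema JSON is an assumption):

```sh
# Expect at least one match for the local-alleles field...
grep -c '"call_LPL"' spruce_schema.json
# ...and ideally none for the plain PL field:
grep -c '"call_PL"' spruce_schema.json
```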

@percyfal (Contributor)

Sorry, my bad, I somehow misread your original post. Relaunching...

@percyfal (Contributor)

Now it's shaping up, and it actually looks slightly faster:

Encode:  11%|█         | 5.79T/52.2T [45:17<8:34:46, 1.50GB/s]

@jeromekelleher (Contributor, Author)

Aha, excellent. We halved the size of the decompressed zarr 👍
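For reference, per-field stored vs. decompressed sizes can be compared with `vcf2zarr inspect`; the path below is a placeholder, and the exact output columns may vary by bio2zarr version:

```sh
# Print per-array statistics (stored size, uncompressed size, compression ratio, ...):
vcf2zarr inspect spruce.zarr
```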

@jeromekelleher (Contributor, Author)

How's this looking, @percyfal? It should have completed by now 🤞

@percyfal (Contributor)

Yes, I'm compiling the notebook results to generate the final table as we speak. I'll submit a PR shortly with the local-alleles results as Table 2. I also reran without the schema in case we want to discuss the differences. For starters, here are the benchmark times for the two cases:

| mkschema | s | h:m:s | max_rss | max_vms | max_uss | max_pss | io_in | io_out | mean_load | cpu_time |
|----------|---|-------|---------|---------|---------|---------|-------|--------|-----------|----------|
| true | 50978.1766 | 14:09:38 | 106263.22 | 5090017.80 | 103587.82 | 103610.34 | 7810646.80 | 2424700.64 | 5132.47 | 2616936.31 |
| false | 59419.4896 | 16:30:19 | 105749.09 | 5089623.25 | 103216.44 | 103237.12 | 7122025.49 | 7016360.57 | 7178.73 | 4265952.19 |

The Zarr archive sizes are 2.3 TiB vs 6.68 TiB. The latter differs from the original run because I set the chunk sizes to n_samples=1063, n_variants=10_000 to better reflect the dimensionality of the data.
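The quoted archive sizes can be reproduced with a plain disk-usage check (the paths here are placeholders for the two encoded runs):

```sh
du -sh spruce_lpl.zarr spruce_pl.zarr
```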
