-
Notifications
You must be signed in to change notification settings - Fork 35
Add concat algorithm parameter to vcf_to_zarr #365 #665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Use variable-length strings for storing alleles in Zarr sgkit-dev#643
Haven't thought hard about this @tomwhite, but I'd be happy to get rid of the |
Fix test coverage.
There's no performance reason for keeping it (it's slower and causes memory errors). If/when it's fixed upstream it would be straightforward to reintroduce it, so I think we can remove it. |
Codecov Report
@@ Coverage Diff @@
## main #665 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 36 36
Lines 2886 2913 +27
=========================================
+ Hits 2886 2913 +27
Continue to review full report at Codecov.
|
I think this can be merged as a first step, with the work to remove the I spent a bit of time running this on a 1GB VCF file (1kg chr2) and it worked fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Merge away
This is an implementation of #365, which will hopefully fix #661. It switches the concat algorithm to use an optimized memory-efficient version (using the same code as #324). Users don't have to change anything to get reduced memory use.
I've tested it on a smaller dataset (1kg chr22), and it reduces memory use significantly, but it still needs more testing.
As a side effect of the way this is implemented, it also makes addressesing #643 straightforward (and in fact more natural than the special cases needed to set the fixed string lengths). Xarray is effectively bypassed, so it is easy to use Zarr's variable-length strings directly.
I wonder if we should remove the
xarray_internal
option entirely for that reason. (It's not the default, the optimized concat and rechunk algorithm is, but it's another code path to maintain.)There are still potentially some race conditions lurking - I think I saw the problem in #486 again while testing locally, however it's very hard to reproduce.