Use of int8 for variant_contig results in integer overflow with fragmented reference genomes #584
Comments
We should probably revisit how we're parsing and representing contig metadata more generally, as noted in https://github.com/pystatgen/sgkit/issues/464 as well. @timothymillar do you have any example VCFs to share that overflow the
We use int16 for bgen and PLINK, so we should probably just change VCF to be the same.
We should make it an option to specify the dtype for Although, I guess this is the sort of thing we should be able to query the IO library for ("how many contigs are there" should be efficiently computable on any indexed VCF), so we should be able to automatically detect the minimal dtype. Even then, though, I suppose people might want to manually specify the dtype, for their own reasons.
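As a rough illustration of the "automatically detect the minimal dtype" idea, the sketch below picks the narrowest signed integer width that can hold every contig index. It is not sgkit code: the function name `smallest_signed_dtype` is hypothetical, it returns a dtype name as a string, and it assumes contig indices run from 0 to `n_contigs - 1` and that a signed type is wanted (as with the int8 and int16 choices mentioned above).

```python
def smallest_signed_dtype(n_contigs: int) -> str:
    """Return the name of the narrowest signed integer dtype that can
    hold every contig index in the range 0 .. n_contigs - 1.

    Hypothetical helper for illustration only; not part of sgkit.
    """
    if n_contigs < 0:
        raise ValueError("n_contigs must be non-negative")
    max_index = max(n_contigs - 1, 0)
    for bits in (8, 16, 32, 64):
        # Largest value representable in a signed integer of this width.
        signed_max = 2 ** (bits - 1) - 1
        if max_index <= signed_max:
            return f"int{bits}"
    raise OverflowError("too many contigs for a 64-bit signed index")


if __name__ == "__main__":
    for n in (25, 128, 129, 50000):
        print(n, smallest_signed_dtype(n))
```

Under these assumptions, an assembly with 50000 contigs needs a 32-bit index, since the largest int16 value is 32767. A user-supplied dtype could still override the detected one.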
Is there a workaround at the moment? I'm working with a non-model species with more than 50,000 contigs.
I'm afraid I can't think of a workaround for this. The fix that @jeromekelleher sketched out above should be fairly straightforward though, so I'll work on a PR to fix it.
Thanks @tomwhite and @jeromekelleher, I'll wait for the upstream fix.
Quick and dirty workaround to reimport the contig names from the vcf, in case someone else is looking for a way to do this in the meantime.
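For anyone wanting a starting point, here is a minimal sketch of one way such a workaround could look. It is not the commenter's snippet and does not use sgkit. The function name `read_variant_contigs` is hypothetical, and it assumes a plain-text or gzip/bgzip-compressed VCF with one record per line, contigs declared in `##contig=<ID=...>` header lines, and dataset variants stored in the same order as the file.

```python
import gzip
import re
from typing import Dict, List, Tuple

# Matches the ID field of a "##contig=<ID=...,length=...>" header line.
_CONTIG_ID = re.compile(r"^##contig=<.*?\bID=([^,>]+)")


def read_variant_contigs(vcf_path: str) -> Tuple[List[str], List[int]]:
    """Return (contig_names, variant_contig) read directly from a VCF.

    variant_contig[i] is the index into contig_names for the i-th record.
    Python ints are unbounded, so the indices cannot overflow here.
    Hypothetical helper for illustration only.
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    contig_index: Dict[str, int] = {}
    variant_contig: List[int] = []
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            match = _CONTIG_ID.match(line)
            if match:
                # Header contigs are numbered in declaration order.
                contig_index.setdefault(match.group(1), len(contig_index))
            elif not line.startswith("#"):
                # The first tab-separated column of a record is CHROM.
                chrom = line.split("\t", 1)[0]
                # Contigs missing from the header are appended on first use.
                variant_contig.append(
                    contig_index.setdefault(chrom, len(contig_index))
                )
    return list(contig_index), variant_contig
```

The returned index list can then be converted to a wide enough integer array (for example int32) and used in place of the overflowed values, provided the record order really does match the dataset.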
…ented reference genomes sgkit-dev#584
Nice! I've created a fix in #667. Hopefully we can get that merged soon for you to use.
Many (non-human) reference genomes contain thousands of contigs that have not been assembled into full chromosomes.
Currently the variant_contig array is hard-coded as int8 (line), which results in integer overflow and makes it impossible to join variants to their contig.
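To make the failure concrete, here is a small sketch of the wraparound. It emulates signed 8-bit storage with the standard library's `ctypes.c_int8` rather than calling sgkit, on the assumption that the int8 array wraps the same way when a larger index is cast into it. The 300-contig assembly and the helper name `as_int8` are hypothetical.

```python
import ctypes


def as_int8(index: int) -> int:
    """Store an index in a signed 8-bit integer, wrapping on overflow.

    ctypes performs no overflow checking, so out-of-range values wrap
    modulo 256 into the range -128 .. 127.
    """
    return ctypes.c_int8(index).value


# Hypothetical fragmented assembly with 300 contigs.
contig_names = [f"contig_{i}" for i in range(300)]
stored = [as_int8(i) for i in range(len(contig_names))]

# Index 127 survives, 128 wraps to a negative value, and 256 collides with 0.
print(stored[127], stored[128], stored[256])
# Only 256 distinct values remain for 300 contigs.
print(len(set(stored)))
```

Once indices past 127 wrap, several contigs share one stored value, so a variant can no longer be mapped back to a unique contig name.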