Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of int8 for variant_contig results in integer overflow with fragmented reference genomes #584

Closed
timothymillar opened this issue May 27, 2021 · 10 comments · Fixed by #667

Comments

@timothymillar
Copy link
Collaborator

Many (non human) reference genomes contain 1000s of contigs that have not been assembled into full chromosomes.
Currently the variant_contig array is hard coded as int8 (line) which results in integer overflow making it impossible to join variants to their contig.

@hammer
Copy link
Contributor

hammer commented May 27, 2021

We should probably revisit how we're parsing and representing contig metadata more generally, as noted in https://github.com/pystatgen/sgkit/issues/464 as well.

@timothymillar do you have any example VCFs to share that overflow the int8 limit? It's trivial to manually construct such a VCF but I generally prefer to have real world data in our tests.

@timothymillar
Copy link
Collaborator Author

@hammer agreed, real data always hits more edge cases. I'll look into an example VCF for this.

For reference I ran into this issue when using the Red5 kiwifruit genome.

@tomwhite
Copy link
Collaborator

tomwhite commented Jun 1, 2021

We use int16 for bgen and PLINK, so we should probably just change VCF to be the same.

@jeromekelleher
Copy link
Collaborator

We should make it an option to specify the dtype for variant_contig probably - even int16 will overflow sometimes. There are lots of VCFs out there with huge numbers of contigs.

Although, I guess this is the sort of thing we should be able to query the IO library for ("how many contigs are there" should be efficiently computable on any indexed VCF), so we should be able to automatically detect the minimal dtype. Even then though, I suppose people might want to manually specify the dtype, for their own reasons.

@alxsimon
Copy link

Is there a workaround at the moment? Working with a non model species with > 50000 contigs

@jeromekelleher
Copy link
Collaborator

Nice - great to see you pushing the limits here @alxsimon! @tomwhite any thoughts on how we should address this?

@tomwhite
Copy link
Collaborator

I'm afraid I can't think of a workaround for this. The fix that @jeromekelleher sketched out above should be fairly straightforward though, so I'll work on a PR to fix it.

@alxsimon
Copy link

Thanks @tomwhite and @jeromekelleher, I'll wait for the upstream fix.

@alxsimon
Copy link

Quick and dirty workaround to reimport the contig names from the vcf, in case someone else is looking for a way to do this in the meantime.

from cyvcf2 import VCF

variant_contig_name = np.empty(ds.dims['variants'], dtype="O")
variant_position = np.empty(ds.dims['variants'], dtype="i4")
for idx, variant in enumerate(VCF(bcf_file)):
    variant_contig_name[idx] = variant.CHROM
    variant_position[idx] = variant.POS

ds = ds.merge(xr.DataArray(variant_contig_name, coords=ds.coords, dims=['variants'], name='variant_contig_name'))

@tomwhite
Copy link
Collaborator

Nice!

I've created a fix in #667. Hopefully we can get that merged soon for you to use.

@mergify mergify bot closed this as completed in #667 Sep 14, 2021
mergify bot pushed a commit that referenced this issue Sep 14, 2021
pentschev pushed a commit to pentschev/sgkit that referenced this issue Sep 15, 2021
pentschev pushed a commit to pentschev/sgkit that referenced this issue Sep 16, 2021
pentschev pushed a commit to pentschev/sgkit that referenced this issue Sep 16, 2021
pentschev pushed a commit to pentschev/sgkit that referenced this issue Sep 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants