Use of int8 for variant_contig results in integer overflow with fragmented reference genomes #584

timothymillar · 2021-05-27T02:52:52Z

Many (non human) reference genomes contain 1000s of contigs that have not been assembled into full chromosomes.
Currently the variant_contig array is hard-coded as int8, which results in integer overflow and makes it impossible to join variants to their contigs.
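To illustrate the failure mode, here is a minimal sketch of what happens when contig indexes are squeezed into int8. It uses plain NumPy and a made-up reference with 300 contigs; it is not sgkit's actual code path, only the same kind of narrowing cast.

```python
import numpy as np

# Hypothetical contig indexes for a reference with 300 contigs (0..299).
contig_index = np.arange(300)

# Casting to int8 silently wraps any index above 127 around.
as_int8 = contig_index.astype(np.int8)
# int16 holds all 300 indexes without loss.
as_int16 = contig_index.astype(np.int16)

print(np.iinfo(np.int8).max)                    # 127
print(as_int8[127], as_int8[128], as_int8[299])  # 127 -128 43
print(np.array_equal(as_int16, contig_index))    # True
```

Once an index has wrapped, it points at the wrong contig (or at a negative position), so the variant can no longer be matched to its contig name.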

hammer · 2021-05-27T13:37:57Z

We should probably revisit how we're parsing and representing contig metadata more generally, as noted in https://github.com/pystatgen/sgkit/issues/464 as well.

@timothymillar do you have any example VCFs to share that overflow the int8 limit? It's trivial to manually construct such a VCF but I generally prefer to have real world data in our tests.

timothymillar · 2021-05-31T05:43:02Z

@hammer agreed, real data always hits more edge cases. I'll look into an example VCF for this.

For reference I ran into this issue when using the Red5 kiwifruit genome.

tomwhite · 2021-06-01T09:35:14Z

We use int16 for bgen and PLINK, so we should probably just change VCF to be the same.

jeromekelleher · 2021-06-01T18:10:48Z

We should make it an option to specify the dtype for variant_contig probably - even int16 will overflow sometimes. There are lots of VCFs out there with huge numbers of contigs.

Although, I guess this is the sort of thing we should be able to query the IO library for ("how many contigs are there" should be efficiently computable on any indexed VCF), so we should be able to automatically detect the minimal dtype. Even then though, I suppose people might want to manually specify the dtype, for their own reasons.
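As a rough sketch of the automatic detection idea, the helper below picks the narrowest signed integer dtype that can hold every contig index, given only the contig count. The function name `smallest_contig_dtype` is hypothetical and not part of sgkit, and the choice of signed types is an assumption made to match the int8/int16 types discussed here.

```python
import numpy as np

def smallest_contig_dtype(n_contigs):
    """Return the narrowest signed integer dtype that can index n_contigs contigs."""
    if n_contigs < 1:
        raise ValueError("n_contigs must be positive")
    # Contig indexes are assumed to run from 0 to n_contigs - 1.
    max_index = n_contigs - 1
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        if max_index <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise OverflowError("too many contigs for a 64-bit index")

print(smallest_contig_dtype(25))      # int8
print(smallest_contig_dtype(1000))    # int16
print(smallest_contig_dtype(50001))   # int32
```

A user-supplied dtype could still override the detected one, with the helper serving only as the default.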

alxsimon · 2021-09-14T09:08:14Z

Is there a workaround at the moment? I'm working with a non-model species with more than 50,000 contigs.

jeromekelleher · 2021-09-14T09:14:07Z

Nice - great to see you pushing the limits here @alxsimon! @tomwhite any thoughts on how we should address this?

tomwhite · 2021-09-14T09:59:48Z

I'm afraid I can't think of a workaround for this. The fix that @jeromekelleher sketched out above should be fairly straightforward though, so I'll work on a PR to fix it.

alxsimon · 2021-09-14T10:08:44Z

Thanks @tomwhite and @jeromekelleher, I'll wait for the upstream fix.

alxsimon · 2021-09-14T10:44:05Z

Quick and dirty workaround to reimport the contig names from the vcf, in case someone else is looking for a way to do this in the meantime.

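import numpy as np
import xarray as xr
from cyvcf2 import VCF

# ds is the dataset already loaded with sgkit; bcf_file is the path to the source VCF/BCF.
n_variants = ds.sizes["variants"]
variant_contig_name = np.empty(n_variants, dtype="O")
variant_position = np.empty(n_variants, dtype="i4")

# Re-read the contig name and position of every record, in file order.
for idx, variant in enumerate(VCF(bcf_file)):
    variant_contig_name[idx] = variant.CHROM
    variant_position[idx] = variant.POS

# Attach the contig names as a new variable along the variants dimension.
ds = ds.merge(xr.DataArray(variant_contig_name, dims=["variants"], name="variant_contig_name"))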
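If integer contig indexes are needed rather than names, the recovered names can be mapped back to indexes in a wider dtype. The sketch below is one way to do that; `contig_indices` is a hypothetical helper, and it numbers contigs in order of first appearance in the file, which may differ from the order of the contigs in the VCF header.

```python
import numpy as np

def contig_indices(names, dtype=np.int32):
    """Map contig names to integer ids, numbered in order of first appearance."""
    ids = {}
    out = np.empty(len(names), dtype=dtype)
    for i, name in enumerate(names):
        # setdefault assigns the next free id the first time a name is seen.
        out[i] = ids.setdefault(name, len(ids))
    return out, list(ids)

# Small made-up example standing in for variant_contig_name.
names = np.array(["ctg2", "ctg2", "ctg1", "ctg3", "ctg1"], dtype="O")
index, contigs = contig_indices(names)
print(index.tolist())  # [0, 0, 1, 2, 1]
print(contigs)         # ['ctg2', 'ctg1', 'ctg3']
```

With int32 as the default, the index has room for roughly two billion contigs, which comfortably covers the 50,000-contig case mentioned above.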

tomwhite · 2021-09-14T11:35:11Z

Nice!

I've created a fix in #667. Hopefully we can get that merged soon for you to use.

timothymillar mentioned this issue on Jul 26, 2021 in "Use of int8 for call_genotype results in integer overflow with complex variants" #640 (closed).

mergify bot closed this as completed in #667 Sep 14, 2021
