Use of int8 for variant_contig results in integer overflow with fragmented reference genomes #584 #667

Merged
3 commits merged into sgkit-dev:main on Sep 14, 2021

Conversation

tomwhite
Collaborator

Fixes #584

I've included a test that uses a VCF with >128 contigs to check that it correctly picks an appropriate dtype. I haven't added a way to override the dtype, as I'm not sure it's really needed. We could add it, but it would clutter the parameter list a bit.
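
To illustrate why more than 128 contigs is the threshold that matters, here is a minimal sketch of the overflow. It assumes contig indexes are zero-based integers that end up cast to int8; the array name and the contig count of 130 are made up for illustration and are not taken from the sgkit code or its test.

```python
import numpy as np

# Hypothetical contig indexes for a fragmented reference with 130 contigs.
contig_index = np.arange(130)

# int8 covers -128..127, so casting wraps indexes 128 and 129 to negatives.
as_int8 = contig_index.astype(np.int8)

# int16 covers -32768..32767, so every index survives the cast unchanged.
as_int16 = contig_index.astype(np.int16)

print(as_int8[127], as_int8[128], as_int8[129])
print(np.array_equal(as_int16, contig_index))
```

The cast raises no error, which is why a wrapped index can go unnoticed until something downstream looks up the wrong contig.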

@codecov-commenter

codecov-commenter commented Sep 14, 2021

Codecov Report

Merging #667 (75fc5d2) into main (32a3067) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #667   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        36           
  Lines         2879      2886    +7     
=========================================
+ Hits          2879      2886    +7     
Impacted Files                 Coverage Δ
sgkit/io/vcf/vcf_reader.py     100.00% <100.00%> (ø)
sgkit/io/vcfzarr_reader.py     100.00% <100.00%> (ø)
sgkit/utils.py                 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Collaborator

@jeromekelleher jeromekelleher left a comment

LGTM, but I think we should just raise an overflow error. If we get an input VCF with anything remotely like 2**63 rows, a whole bunch of other things will break before getting to this point!

sgkit/utils.py Outdated
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        if np.iinfo(dtype).min <= value <= np.iinfo(dtype).max:
            return dtype
    return None
Collaborator

This `return None` case basically can't happen, so why not raise an OverflowError here rather than requiring client code to check the return value (C style)?

Collaborator Author

Yes, that's much better. I've implemented this in pystatgen/sgkit@f63600f. Should be ready to go in now.
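
For readers following the thread, the shape of the suggested change might look roughly like the sketch below. It assumes a standalone helper that takes a single integer; the name `smallest_int_dtype_sketch` and the error message are hypothetical, and the code in the referenced commit may differ.

```python
import numpy as np


def smallest_int_dtype_sketch(value):
    """Return the narrowest signed NumPy integer dtype that can hold value.

    Hypothetical helper for illustration only, not the sgkit function itself.
    """
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= value <= info.max:
            return dtype
    # Raise instead of returning None, so callers need no return-value check.
    raise OverflowError(f"Value {value} cannot be stored in np.int64.")


# Usage: the highest contig index decides the dtype.
print(smallest_int_dtype_sketch(127))  # highest index for 128 contigs
print(smallest_int_dtype_sketch(128))  # highest index for 129 contigs
```

With an exception in place, a caller can pass the result straight to an array constructor without guarding against `None`.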

@jeromekelleher added the auto-merge label (Auto merge label for mergify test flight) on Sep 14, 2021
@mergify bot merged commit 15b3c7a into sgkit-dev:main on Sep 14, 2021