-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
END < POS in vcf/4.3/passed_body_alt.vcf #607
Comments
It's not valid. PR #436 clarified END in the 4.3 spec; IMHO the clarification applies to all VCF versions though the text was not changed in the earlier documents. The description in that PR (“INFO/END must be on the same chromosome as CHROM and come after POS”) makes it clear that the intention was to disallow END < POS. The text actually added to the spec (“indicating the variant spans positions POS–END” and “this is the position of the last base in the REF allele, so it can be derived from POS and the length of REF“ — note that the latter, |
Well, SAM records have TLEN as negative so you can compute the other end of the template by adding it to POS, which may be before or after POS. There's nothing explicitly in the VCF spec that states END >= POS. I'd hope it's pretty obvious, but I could see people concluding some complex variant is permitted to swap POS with END if it's associated with the opposite strand (for example). This is reinforced with the use of signed data types, further giving credence to them being permitted to be negative. I'm not saying I believe this to be so, and as I said I recollected us deciding that END<POS was invalid, but I'm trying to look it this from a layman coming to the spec for the first time without any prior knowledge of what is permitted and what is not. We went through the BAM spec clarifying all of this to explicitly disallow negative values where they shouldn't be. |
As for the BCF part of the spec's description of
All the values for
|
Well, again I refer you back to the SAM spec where we explicitly have TLEN (Template LENgth) which can be negative. :-) I wouldn't argue we can assume people immediately know lengths aren't negative quantities. Granted it's a good default starting position and it's probably only my knowledge of SAM which taints my view of how to interpret VCF, but I'd be happier if the data types were corrected. Edit: we should probably do to BCF what we did to BAM, add a column describing the maximum legal values. Or alternatively discard int vs uint and have the column describing both minimum and maximum values. That then permits us to do tid: -1 to 2^31-1 (or " |
I refer you to go back and read the description of the TLEN field, which is fairly careful to say that its value is a length with a sign attached, and that its absolute value is a distance, and what the sign means. 😄 |
This was clarified in 035946a as illegal. Fixes samtools#607, although for added clarity we may wish to have a follow up PR to amend the BCF parts of the specification to use unsigned data types when the values are infact only permitted to be unsigned.
This was clarified in 035946a as illegal. Fixes samtools#607, although for added clarity we may wish to have a follow up PR to amend the BCF parts of the specification to use unsigned data types when the values are infact only permitted to be unsigned.
This line in the passed test data fails bcftools checks:
Is this valid? I thought we'd decided that we couldn't have negative lengths, with END < POS, but was that restriction only added for 4.4 onwards?
The spec doesn't have anything explicit to say about it other than it's used for computing BCF's
rlen
field, and the BCF spec definesrlen
asint32_t
. It's explicitly stating it to be signed rather than unsigned, so perhaps negative lengths are permitted and this is just a bcftools bug?Either way, this is a spec issue as it needs to be explicit (and perhaps changing to
uint32_t
).The text was updated successfully, but these errors were encountered: