Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

END < POS in vcf/4.3/passed_body_alt.vcf #607

Closed
jkbonfield opened this issue Nov 11, 2021 · 5 comments
Closed

END < POS in vcf/4.3/passed_body_alt.vcf #607

jkbonfield opened this issue Nov 11, 2021 · 5 comments
Labels

Comments

@jkbonfield
Copy link
Contributor

This line in the passed test data fails bcftools checks:

1	4389	.	C	<*>,<DEL>	.	.	END=4388	GT:DP:GQ	0/0:25:45	0/1:23:99

Is this valid? I thought we'd decided that we couldn't have negative lengths, with END < POS, but was that restriction only added for 4.4 onwards?

The spec doesn't have anything explicit to say about it other than it's used for computing BCF's rlen field, and the BCF spec defines rlen as int32_t. It's explicitly stating it to be signed rather than unsigned, so perhaps negative lengths are permitted and this is just a bcftools bug?

Either way, this is a spec issue as it needs to be explicit (and perhaps changing to uint32_t).

@jmarshall jmarshall added the vcf label Nov 11, 2021
@jmarshall
Copy link
Member

It's not valid.

PR #436 clarified END in the 4.3 spec; IMHO the clarification applies to all VCF versions though the text was not changed in the earlier documents.

The description in that PR (“INFO/END must be on the same chromosome as CHROM and come after POS”) makes it clear that the intention was to disallow END < POS.

The text actually added to the spec (“indicating the variant spans positions POS–END” and “this is the position of the last base in the REF allele, so it can be derived from POS and the length of REF“ — note that the latter, len(REF), is by definition non-negative) is less explicit, but IMHO it'd be pretty obtuse to claim that this does not trivially imply that END is to the right of POS. Nonetheless the VCF maintainers may wish to add a truly explicit statement…

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Nov 11, 2021

Well, SAM records have TLEN as negative so you can compute the other end of the template by adding it to POS, which may be before or after POS.

There's nothing explicitly in the VCF spec that states END >= POS. I'd hope it's pretty obvious, but I could see people concluding some complex variant is permitted to swap POS with END if it's associated with the opposite strand (for example). This is reinforced with the use of signed data types, further giving credence to them being permitted to be negative. I'm not saying I believe this to be so, and as I said I recollected us deciding that END<POS was invalid, but I'm trying to look it this from a layman coming to the spec for the first time without any prior knowledge of what is permitted and what is not.

We went through the BAM spec clarifying all of this to explicitly disallow negative values where they shouldn't be.

@jmarshall
Copy link
Member

As for the BCF part of the spec's description of rlen (which I think is the only instance of a signed data type you have in mind?), it should be read in conjunction with its description:

Field Type Notes
rlen int32_t Length of the record as projected onto the reference sequence. Must be the length of the REF allele or the declared length of a symbolic allele respecting the END attribute

All the values for rlen are described as lengths and are thus by definition non-negative. I was unsure whether there might be a special interpretation of -1 (as in SAM's tid field) but there is no evidence of such a thing in htslib/vcf.c. Hence one of these two tweaks should be made:

  1. Change the type to uint32_t, reflecting that there are no negative special values. But what this implies about the validity of rlen values beyond 231 would need to be considered.
  2. Otherwise if there are any special negative values that might be encountered, they need to be described in the Notes column.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Nov 11, 2021

Well, again I refer you back to the SAM spec where we explicitly have TLEN (Template LENgth) which can be negative. :-)

I wouldn't argue we can assume people immediately know lengths aren't negative quantities. Granted it's a good default starting position and it's probably only my knowledge of SAM which taints my view of how to interpret VCF, but I'd be happier if the data types were corrected.

Edit: we should probably do to BCF what we did to BAM, add a column describing the maximum legal values. Or alternatively discard int vs uint and have the column describing both minimum and maximum values. That then permits us to do tid: -1 to 2^31-1 (or "[-1, 2^31)", perhaps with a reminded of [] vs () notation in a footnote).

@jmarshall
Copy link
Member

I refer you to go back and read the description of the TLEN field, which is fairly careful to say that its value is a length with a sign attached, and that its absolute value is a distance, and what the sign means. 😄

jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Nov 11, 2021
This was clarified in 035946a as illegal.

Fixes samtools#607, although for added clarity we may wish to have a follow up
PR to amend the BCF parts of the specification to use unsigned data
types when the values are infact only permitted to be unsigned.
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Feb 2, 2022
This was clarified in 035946a as illegal.

Fixes samtools#607, although for added clarity we may wish to have a follow up
PR to amend the BCF parts of the specification to use unsigned data
types when the values are infact only permitted to be unsigned.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants