-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify the meaning of INFO/END #436
Conversation
@lbergelson's concern has been addressed by mentioning both “symbolic alleles” and “CHROM” in the summary within the table, with the former mostly unchanged. No other concerns have been raised. Can this be merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, one minor wording thing, @jmarshall merge at will.
Thanks @lbergelson — it occurred to me rereading it that perhaps I should add “1-based” to the paragraph — even though the VCF spec doesn't go out of its way to clarify that, it might as well start… 👍 / 👎 ? |
@jmarshall I completely agree that positions (POS, END, etc) being 1-based should be mentioned at least once in the spec, but it seems to me like a different issue than this PR. IMO being 1-based shouldn't be first mentioned in the description of the INFO END field. |
@jmmut: It's already there for POS (“The reference position, with the 1st base having position 1” for VCF, “0-based leftmost coordinate” for BCF). |
Oh, right, I looked in the wrong place. Then it seems a good idea to confirm it here too. |
…ools#436) INFO/END (when present) provides the size of the interval that the variant is located in, along with the CHROM and POS fields. This is also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field. The implications of INFO/END have not previously been clear. In the absence of clear documentation, some SV tools have been using INFO/END fields for their own semi-related purposes (using INFO/CHR2:INFO/END as the other side's position in an interchromosomal rearrangement), leading to broken .csi indexes and region queries that don't work. Fixes samtools#425.
Thanks folks. I have applied @lbergelson's wording tweak and added “1-based”. This should be even more ready for merging now 😄 Usual policy would be for one of the VCF maintainers to do the actual merging… |
Thank you @jmarshall. |
* Update CRAM spec section on substitution matrix and codes. * Respond to review comments. * CRAM Slice and Container ref seq IDs must match (samtools#401) * Code review part 2. * Fix minor typo in the predecessors of BCF2 (samtools#427) * layed -> laid * Update MAINTAINERS.md (samtools#432) Proposal to add Rasko Leinonen as refget maintainer. * add jmmut to MAINTAINERS.md and move Cristina to "Past Members" * change order of maintainers * Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436) INFO/END (when present) provides the size of the interval that the variant is located in, along with the CHROM and POS fields. This is also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field. The implications of INFO/END have not previously been clear. In the absence of clear documentation, some SV tools have been using INFO/END fields for their own semi-related purposes (using INFO/CHR2:INFO/END as the other side's position in an interchromosomal rearrangement), leading to broken .csi indexes and region queries that don't work. Fixes samtools#425. * Allow C and Java native text spellings of NaN and infinities (samtools#409) C's printf doesn't output mixed case, while Java's Double.valueOf and Double.toString parse/output only `Infinity`, not `Inf`. Rather than requiring special-case code for both input and output in both languages, relax the VCF specification to allow NAN/INF/INFINITY case-insensitively. (Add "IEEE-754" to be specific and to improve the line breaks.) * Adding a note about <NON_REF> (samtools#380) * Update PDFs (CRAM and VCF additions; others cosmetic) CRAMv3 PRs samtools#401 and samtools#412. VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END). All: Minor typo and whitespace formatting fixes. * Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385) Includes barebones authorizationCode Oauth2 flow, which should aid/inform code generation. Uses int64 with minimum 0, unsure if that is really an uint64 though. * Codify the existing policy of generally squashing PRs (PR samtools#444) * Permit AP_Delta in multi-ref slices. This means AP_delta can become negative. I have validated this decodes fine in both htsjdk and htslib. This is because AP is ITF8 and hence signed, like all other integers, so it would need explicit code to forbid this (which obviously isn't in the implementations). Hence the limitation is primarily one of an over-zealous specification. The impact of this is for position-sorted multi-ref slices AP can legally be stored efficiently. Also clarified fields in the container compression header when in multi-ref mode. Fixes samtools#431
INFO/END (when present) determines (together with the CHROM and POS fields) the interval that the variant is located in. This is used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field.
In particular, INFO/END must be on the same chromosome as CHROM and come after POS, otherwise .csi/.tbi indexing will not work. This is implied by the existing description of
rlen
:but that is hardly obvious, and people looking up INFO/END are unlikely to read it anyway — as it's in the BCF part of the spec.
There has been some discussion of this on PR #266 (starting at #266 (comment), hat tip @cwhelan), but that has not resulted in any clarifying text being added in that PR. That PR is likely to take a while yet to land, so here is a separate PR that clarifies INFO/END and can hopefully be merged quickly.
Fixes #425.