Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the meaning of INFO/END #436

Merged
merged 1 commit into from
Aug 19, 2019
Merged

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented Jul 25, 2019

INFO/END (when present) determines (together with the CHROM and POS fields) the interval that the variant is located in. This is used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field.

In particular, INFO/END must be on the same chromosome as CHROM and come after POS, otherwise .csi/.tbi indexing will not work. This is implied by the existing description of rlen:

Length of the record as projected onto the reference sequence. Must be the length of the REF allele or the declared length of a symbolic allele respecting the END attribute

but that is hardly obvious, and people looking up INFO/END are unlikely to read it anyway — as it's in the BCF part of the spec.

There has been some discussion of this on PR #266 (starting at #266 (comment), hat tip @cwhelan), but that has not resulted in any clarifying text being added in that PR. That PR is likely to take a while yet to land, so here is a separate PR that clarifies INFO/END and can hopefully be merged quickly.

Fixes #425.

@jmarshall jmarshall added the vcf label Jul 25, 2019
@hts-specs-bot
Copy link

Changed PDFs as of edfc9f5: VCFv4.3 (diff).

@samtools samtools deleted a comment from hts-specs-bot Jul 25, 2019
@hts-specs-bot
Copy link

Changed PDFs as of 72380ed: VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of 10a534a: VCFv4.3 (diff).

@jmarshall
Copy link
Member Author

I think the clarification about symbolic alleles is was useful here, probably more so than the new description including CHROM.

@lbergelson's concern has been addressed by mentioning both “symbolic alleles” and “CHROM” in the summary within the table, with the former mostly unchanged.

No other concerns have been raised. Can this be merged?

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, one minor wording thing, @jmarshall merge at will.

@jmarshall
Copy link
Member Author

jmarshall commented Aug 19, 2019

Thanks @lbergelson — it occurred to me rereading it that perhaps I should add “1-based” to the paragraph — even though the VCF spec doesn't go out of its way to clarify that, it might as well start…

👍 / 👎 ?

@jmmut
Copy link
Contributor

jmmut commented Aug 19, 2019

@jmarshall I completely agree that positions (POS, END, etc) being 1-based should be mentioned at least once in the spec, but it seems to me like a different issue than this PR. IMO being 1-based shouldn't be first mentioned in the description of the INFO END field.

@jmarshall
Copy link
Member Author

@jmmut: It's already there for POS (“The reference position, with the 1st base having position 1” for VCF, “0-based leftmost coordinate” for BCF).

@jmmut
Copy link
Contributor

jmmut commented Aug 19, 2019

Oh, right, I looked in the wrong place. Then it seems a good idea to confirm it here too.

@hts-specs-bot
Copy link

Changed PDFs as of 09a9c44: VCFv4.3 (diff).

…ools#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.
@hts-specs-bot
Copy link

Changed PDFs as of 3545d09: VCFv4.3 (diff).

@jmarshall
Copy link
Member Author

Thanks folks. I have applied @lbergelson's wording tweak and added “1-based”.

This should be even more ready for merging now 😄 Usual policy would be for one of the VCF maintainers to do the actual merging…

@lbergelson lbergelson merged commit 035946a into samtools:master Aug 19, 2019
@lbergelson
Copy link
Member

Thank you @jmarshall.

@jmarshall jmarshall deleted the vcf-end branch August 19, 2019 17:00
jmarshall added a commit that referenced this pull request Aug 22, 2019
CRAMv3 PRs #401 and #412.
VCFv4.3 PRs #380 (<NON_REF>), #409 (infinity/NaN), and #436 (INFO/END).
All: Minor typo and whitespace formatting fixes.
thefferon added a commit to thefferon/hts-specs that referenced this pull request Oct 9, 2019
* Update CRAM spec section on substitution matrix and codes.

* Respond to review comments.

* CRAM Slice and Container ref seq IDs must match (samtools#401)

* Code review part 2.

* Fix minor typo in the predecessors of BCF2 (samtools#427)

* layed -> laid

* Update MAINTAINERS.md (samtools#432)

Proposal to add Rasko Leinonen as refget maintainer.

* add jmmut to MAINTAINERS.md and move Cristina to "Past Members"

* change order of maintainers

* Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.

* Allow C and Java native text spellings of NaN and infinities (samtools#409)

C's printf doesn't output mixed case, while Java's Double.valueOf and
Double.toString parse/output only `Infinity`, not `Inf`. Rather than
requiring special-case code for both input and output in both languages,
relax the VCF specification to allow NAN/INF/INFINITY case-insensitively.

(Add "IEEE-754" to be specific and to improve the line breaks.)

* Adding a note about <NON_REF> (samtools#380)

* Update PDFs (CRAM and VCF additions; others cosmetic)

CRAMv3 PRs samtools#401 and samtools#412.
VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END).
All: Minor typo and whitespace formatting fixes.

* Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385)

Includes barebones authorizationCode Oauth2 flow, which should aid/inform
code generation.

Uses int64 with minimum 0, unsure if that is really an uint64 though.

* Codify the existing policy of generally squashing PRs (PR samtools#444)

* Permit AP_Delta in multi-ref slices.

This means AP_delta can become negative.  I have validated this
decodes fine in both htsjdk and htslib.  This is because AP is ITF8
and hence signed, like all other integers, so it would need explicit
code to forbid this (which obviously isn't in the implementations).
Hence the limitation is primarily one of an over-zealous specification.

The impact of this is for position-sorted multi-ref slices AP can
legally be stored efficiently.

Also clarified fields in the container compression header when in
multi-ref mode.

Fixes samtools#431
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

VCF spec should highlight INFO/END's use in indexing
6 participants