Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF spec should highlight INFO/END's use in indexing #425

Closed
jmarshall opened this issue Jun 26, 2019 · 4 comments · Fixed by #436
Closed

VCF spec should highlight INFO/END's use in indexing #425

jmarshall opened this issue Jun 26, 2019 · 4 comments · Fixed by #436
Labels

Comments

@jmarshall
Copy link
Member

The VCF specification's §1.6.1's Table 1 simply describes END as

END    1    Integer    End position (for use with symbolic alleles)

and describes its value in more detail later in §3 for structural variants:

For precise variants, END is POS + length of REF allele − 1, and the for imprecise variants the corresponding best estimate.

Readers would not guess from this that END, when present, provides the size of the interval that the variant is located in — and both htslib and htsjdk use it when constructing indexes for VCF/BCF files. So in fact it is vital that this field, if present, is correct and is used with the intended meaning.

The specification should highlight END's use in indexing. Ideally the specification should have a more detailed description of END's value.

@jmarshall jmarshall added the vcf label Jun 26, 2019
@jmarshall
Copy link
Member Author

For example, gnomAD's SV files use END for their own purposes with a different meaning and value. It was recently noticed that this breaks bcftools view with a query region (hat tip @lindenb): see this twitter thread, this biostars posting, and macarthur-lab/gnomad_browser#140.

@cwhelan
Copy link

cwhelan commented Jun 26, 2019

There has been some discussion about fixing this as part of #266

@jmarshall
Copy link
Member Author

@cwhelan: In particular, this comment of yours: #266 (comment). I wish I had realised its significance better at the time!

@jmarshall
Copy link
Member Author

Delly has also been using END like this for a number of years; see dellytools/delly#159.

jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jul 25, 2019
INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jul 25, 2019
INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Aug 19, 2019
INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Aug 19, 2019
…ools#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.
lbergelson pushed a commit that referenced this issue Aug 19, 2019
#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes #425.
jmarshall added a commit to jmarshall/htslib that referenced this issue Aug 22, 2019
Inserting malformed ranges into the index leads to a broken index
for which hts_itr_query() produces incorrect results for nearby
query regions.

For SAM/BAM/CRAM records, end is always > beg by construction, so
this test never triggers. For VCF/BCF, invalid values of INFO/END
can lead to end<beg -- so hts_idx_push() should refuse to create
a non-functional index. See samtools/hts-specs#425.
thefferon added a commit to thefferon/hts-specs that referenced this issue Oct 9, 2019
* Update CRAM spec section on substitution matrix and codes.

* Respond to review comments.

* CRAM Slice and Container ref seq IDs must match (samtools#401)

* Code review part 2.

* Fix minor typo in the predecessors of BCF2 (samtools#427)

* layed -> laid

* Update MAINTAINERS.md (samtools#432)

Proposal to add Rasko Leinonen as refget maintainer.

* add jmmut to MAINTAINERS.md and move Cristina to "Past Members"

* change order of maintainers

* Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.

* Allow C and Java native text spellings of NaN and infinities (samtools#409)

C's printf doesn't output mixed case, while Java's Double.valueOf and
Double.toString parse/output only `Infinity`, not `Inf`. Rather than
requiring special-case code for both input and output in both languages,
relax the VCF specification to allow NAN/INF/INFINITY case-insensitively.

(Add "IEEE-754" to be specific and to improve the line breaks.)

* Adding a note about <NON_REF> (samtools#380)

* Update PDFs (CRAM and VCF additions; others cosmetic)

CRAMv3 PRs samtools#401 and samtools#412.
VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END).
All: Minor typo and whitespace formatting fixes.

* Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385)

Includes barebones authorizationCode Oauth2 flow, which should aid/inform
code generation.

Uses int64 with minimum 0, unsure if that is really an uint64 though.

* Codify the existing policy of generally squashing PRs (PR samtools#444)

* Permit AP_Delta in multi-ref slices.

This means AP_delta can become negative.  I have validated this
decodes fine in both htsjdk and htslib.  This is because AP is ITF8
and hence signed, like all other integers, so it would need explicit
code to forbid this (which obviously isn't in the implementations).
Hence the limitation is primarily one of an over-zealous specification.

The impact of this is for position-sorted multi-ref slices AP can
legally be stored efficiently.

Also clarified fields in the container compression header when in
multi-ref mode.

Fixes samtools#431
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants