Clarify that INFO/END is used to form a CHROM:POS-END region (PR #436) (#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes #425.
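
As an illustration of the interval described in this message, the following is a minimal Python sketch of how a reader might derive the CHROM:POS-END region for a record. The helper names `parse_info` and `variant_span` are hypothetical and not part of any library, and the sketch assumes the span length (the value BCF stores as rlen) equals END - POS + 1 when both are 1-based inclusive positions.

```python
def parse_info(info):
    """Split a VCF INFO string (key[=data[,data]];...) into a dict.

    Flag keys without a value map to True. Hypothetical helper for illustration.
    """
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def variant_span(chrom, pos, ref, info):
    """Return (chrom, start, end, rlen) using 1-based inclusive coordinates.

    If INFO/END is present it defines the end of the interval; otherwise the
    end is the position of the last base of REF.
    """
    end_value = parse_info(info).get("END")
    if end_value is not None and end_value is not True:
        end = int(end_value)
    else:
        end = pos + len(ref) - 1
    # Assumption: the span length is END - POS + 1.
    rlen = end - pos + 1
    return chrom, pos, end, rlen


# A plain 4-base REF allele needs no END field.
print(variant_span("chr1", 100, "ACGT", "DP=10"))
# A symbolic deletion relies on an explicit END.
print(variant_span("chr1", 100, "A", "SVTYPE=DEL;END=200"))
```

Under this reading, an END value that refers to a position on another chromosome would yield a meaningless or negative span, which is the kind of indexing breakage the message describes.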
jmarshall authored and lbergelson committed Aug 19, 2019
1 parent 24ef1bb commit 035946a
Showing 1 changed file with 12 additions and 2 deletions.
14 changes: 12 additions & 2 deletions VCFv4.3.tex
@@ -327,7 +327,7 @@ \subsubsection{Fixed fields}
INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: key[=data[,data]].
INFO keys must match the regular expression \texttt{\^{}([A-Za-z\_][0-9A-Za-z\_.]*|1000G)\$}, please note that ``1000G'' is allowed as a special legacy value.
Duplicate keys are not allowed.
-Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} are reserved (albeit optional).
+Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} and described below are reserved (albeit optional).

The exact format of each INFO key should be specified in the meta-information (as described above).
Example for an INFO field: DP=154;MQ=52;H2.
@@ -358,7 +358,7 @@ \subsubsection{Fixed fields}
CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\
DB & 0 & Flag & dbSNP membership \\
DP & 1 & Integer & Combined depth across samples \\
-END & 1 & Integer & End position (for use with symbolic alleles) \\
+END & 1 & Integer & End position on CHROM (used with symbolic alleles; see below) \\
H2 & 0 & Flag & HapMap2 membership \\
H3 & 0 & Flag & HapMap3 membership \\
MQ & 1 & Float & RMS mapping quality \\
@@ -370,6 +370,15 @@ \subsubsection{Fixed fields}
1000G & 0 & Flag & 1000 Genomes membership \\
\end{longtable}

+\begin{itemize}
+\renewcommand{\labelitemii}{$\circ$}
+\item END: End reference position (1-based), indicating the variant spans positions POS--END on reference/contig CHROM.
+Normally this is the position of the last base in the REF allele, so it can be derived from POS and the length of REF, and no END INFO field is needed.
+However when symbolic alleles are used, e.g.\ in gVCF or structural variants, an explicit END INFO field provides variant span information that is otherwise unknown.
+
+This field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position.
+\end{itemize}
+
\subsubsection{Genotype fields}
If genotype information is present, then the same types of data must be present for all samples.
First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed).
@@ -1496,6 +1505,7 @@ \subsection{BCF2 records}
Compression of a BCF file is recommended but not required.

\subsubsection{Site encoding}
+\label{BcfSiteEncoding}

{\small
\begin{tabular}{|l | l | p{30em} | } \hline