From 3545d09e2b4a2e1870a51ffc48f5c982460682b1 Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Thu, 25 Jul 2019 00:48:30 +0100
Subject: [PATCH] Clarify that INFO/END is used to form a CHROM:POS-END region (PR #436)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END as
the other side's position in an interchromosomal rearrangement), leading
to broken .csi indexes and region queries that don't work.

Fixes #425.
---
 VCFv4.3.tex | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/VCFv4.3.tex b/VCFv4.3.tex
index 0196099f6..2e7e3c2f7 100644
--- a/VCFv4.3.tex
+++ b/VCFv4.3.tex
@@ -327,7 +327,7 @@ \subsubsection{Fixed fields}
 INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: key[=data[,data]].
 INFO keys must match the regular expression \texttt{\^{}([A-Za-z\_][0-9A-Za-z\_.]*|1000G)\$}, please note that ``1000G'' is allowed as a special legacy value.
 Duplicate keys are not allowed.
- Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} are reserved (albeit optional).
+ Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} and described below are reserved (albeit optional).
 The exact format of each INFO key should be specified in the meta-information (as described above).
 Example for an INFO field: DP=154;MQ=52;H2.
@@ -358,7 +358,7 @@ \subsubsection{Fixed fields}
 CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\
 DB & 0 & Flag & dbSNP membership \\
 DP & 1 & Integer & Combined depth across samples \\
- END & 1 & Integer & End position (for use with symbolic alleles) \\
+ END & 1 & Integer & End position on CHROM (used with symbolic alleles; see below) \\
 H2 & 0 & Flag & HapMap2 membership \\
 H3 & 0 & Flag & HapMap3 membership \\
 MQ & 1 & Float & RMS mapping quality \\
@@ -370,6 +370,15 @@ \subsubsection{Fixed fields}
 1000G & 0 & Flag & 1000 Genomes membership \\
 \end{longtable}

+\begin{itemize}
+\renewcommand{\labelitemii}{$\circ$}
+\item END: End reference position (1-based), indicating the variant spans positions POS--END on reference/contig CHROM.
+Normally this is the position of the last base in the REF allele, so it can be derived from POS and the length of REF, and no END INFO field is needed.
+However when symbolic alleles are used, e.g.\ in gVCF or structural variants, an explicit END INFO field provides variant span information that is otherwise unknown.
+
+This field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position.
+\end{itemize}
+
 \subsubsection{Genotype fields}
 If genotype information is present, then the same types of data must be present for all samples.
 First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed).
@@ -1496,6 +1505,7 @@ \subsection{BCF2 records}
 Compression of a BCF file is recommended but not required.

 \subsubsection{Site encoding}
+\label{BcfSiteEncoding}
 {\small
 \begin{tabular}{|l | l | p{30em} | }
 \hline