Clarify the meaning of INFO/END #436

Merged
merged 1 commit into from
Aug 19, 2019
14 changes: 12 additions & 2 deletions VCFv4.3.tex
@@ -327,7 +327,7 @@ \subsubsection{Fixed fields}
INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: key[=data[,data]].
INFO keys must match the regular expression \texttt{\^{}([A-Za-z\_][0-9A-Za-z\_.]*|1000G)\$}, please note that ``1000G'' is allowed as a special legacy value.
Duplicate keys are not allowed.
-Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} are reserved (albeit optional).
+Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} and described below are reserved (albeit optional).

The exact format of each INFO key should be specified in the meta-information (as described above).
Example for an INFO field: DP=154;MQ=52;H2.
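
As an illustration of the INFO rules quoted in this hunk (semicolon-separated keys, optional comma-separated values, the key pattern, and no duplicate keys), here is a minimal Python sketch. The function name `parse_info` is hypothetical, and treating a lone `.` as "no INFO keys" is an assumption not stated in the lines above. Values are kept as strings, since their types come from the meta-information header.

```python
import re

# Key pattern transcribed from the spec text; "1000G" is the legacy exception.
# fullmatch() is used, so the ^ and $ anchors are implied.
INFO_KEY_RE = re.compile(r"[A-Za-z_][0-9A-Za-z_.]*|1000G")


def parse_info(info):
    """Parse an INFO column string into a dict (hypothetical helper).

    Flags (keys without '=') map to True; other keys map to a list of
    string values split on commas.
    """
    fields = {}
    if info == ".":  # assumption: '.' denotes an empty INFO column
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not INFO_KEY_RE.fullmatch(key):
            raise ValueError(f"invalid INFO key: {key!r}")
        if key in fields:
            raise ValueError(f"duplicate INFO key: {key!r}")
        fields[key] = value.split(",") if sep else True
    return fields


if __name__ == "__main__":
    print(parse_info("DP=154;MQ=52;H2"))
```

Calling it on the example string from the spec text yields `DP` and `MQ` as single-element lists and `H2` as a flag.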
@@ -358,7 +358,7 @@ \subsubsection{Fixed fields}
CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\
DB & 0 & Flag & dbSNP membership \\
DP & 1 & Integer & Combined depth across samples \\
-END & 1 & Integer & End position (for use with symbolic alleles) \\
+END & 1 & Integer & End position on CHROM (used with symbolic alleles; see below) \\
H2 & 0 & Flag & HapMap2 membership \\
H3 & 0 & Flag & HapMap3 membership \\
MQ & 1 & Float & RMS mapping quality \\
@@ -370,6 +370,15 @@ \subsubsection{Fixed fields}
1000G & 0 & Flag & 1000 Genomes membership \\
\end{longtable}

\begin{itemize}
\renewcommand{\labelitemii}{$\circ$}
\item END: End reference position (1-based), indicating the variant spans positions POS--END on reference/contig CHROM.
Normally this is the position of the last base in the REF allele, so it can be derived from POS and the length of REF, and no END INFO field is needed.
However when symbolic alleles are used, e.g.\ in gVCF or structural variants, an explicit END INFO field provides variant span information that is otherwise unknown.

This field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position.
\end{itemize}
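
The rule added in this hunk can be sketched in Python as follows: use an explicit END when present, otherwise derive the end from POS and the length of REF. The function names are hypothetical, `info` is assumed to be a plain dict mapping INFO keys to string values, and the formula `rlen = END - POS + 1` is my reading of "the variant spans positions POS--END" rather than something this diff states outright.

```python
def variant_end(pos, ref, info):
    """Return the 1-based, inclusive end position of a record.

    pos  -- 1-based POS column value
    ref  -- REF allele string
    info -- dict of INFO keys to string values (assumed representation)
    """
    if "END" in info:
        # Explicit END, e.g. for symbolic alleles in gVCF or structural variants.
        return int(info["END"])
    # Otherwise the record ends at the last base of the REF allele.
    return pos + len(ref) - 1


def reference_length(pos, ref, info):
    """Number of reference bases spanned (assumed to correspond to BCF rlen)."""
    return variant_end(pos, ref, info) - pos + 1


if __name__ == "__main__":
    print(variant_end(100, "ACGT", {}))        # derived from POS and REF
    print(variant_end(100, "A", {"END": "250"}))  # taken from INFO/END
```

An indexer built on this would register each record under the interval from POS to the returned end, which is why a missing END on a symbolic allele would make the record appear to cover only a single base.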

\subsubsection{Genotype fields}
If genotype information is present, then the same types of data must be present for all samples.
First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed).
@@ -1496,6 +1505,7 @@ \subsection{BCF2 records}
Compression of a BCF file is recommended but not required.

\subsubsection{Site encoding}
\label{BcfSiteEncoding}

{\small
\begin{tabular}{|l | l | p{30em} | } \hline