From f090acb4905d89166597e46f9b59e4397edbfe4c Mon Sep 17 00:00:00 2001
From: fridells51
Date: Sat, 17 Jun 2023 11:31:09 -0400
Subject: [PATCH] copying vc docs rst

.. _Overview:

LCDB-WF Variant Calling Overview
================================

This Snakemake workflow handles detection and annotation of germline and
somatic variants in DNA sequencing reads. Analysis-ready VCFs are generated by
following a GATK best-practices workflow. Annotations are attached from
variation databases like dbNSFP using SnpEff. The workflow also provides QC
analysis of the input fastq files, as well as metrics recorded throughout the
variant calling workflow.

The workflow is primarily designed for human sequencing data, but it also
supports other organisms. However, due to limited reference availability in
other organisms, some features of the workflow may not be available. See the
section on `configuring other organisms `__ for details on how to run the
workflow for non-human organisms.

The workflow can be thought of as having 5 main components: references,
mapping, calling, annotating, and QC.


.. _References:

References
----------

There are a handful of files that can be considered references that are
involved in calling and annotating variants. References can either be provided
externally, or they will be generated inside of the workflow by running the
LCDB-WF references workflow. See the section on `configuring references `__
for details on how to properly provide or generate references.

- Reference genome

  * The reference genome will be downloaded by the LCDB-WF references
    workflow if it is not supplied.
    The workflow supports chromosome nomenclature from both GRCh38/hg38 and
    GRCh37/hg19 references.
  * By default, Ensembl’s GRCh38 genome build is used because the known
    variation file is also provided by Ensembl.

- Known variation

  * This is a VCF file that contains sites of “known variation” in an
    organism’s genome. It is used to recalibrate bam files with `base quality
    score recalibration `_ so that extremely common variants are not mistaken
    for sequencing errors.
  * Note that this file is not essential to variant calling, and the workflow
    can run without it. For some organisms, this type of file is not
    available.

- BWA Index Files

  * These index files are generated by ``bwa index``.
  * The references workflow will build these files, but if the references
    workflow is not used (see `configuring references `__), they will be
    generated by the Snakefile at workflow execution.

- Sequence Dictionary

  * Generated by ``picard CreateSequenceDictionary`` and used by several GATK
    commands.

- Fasta Index

  * Generated by ``samtools faidx`` and used by the workflow to establish
    contigs for joint-calling.

- Variation Databases

  * These are (usually) VCFs that contain variant annotations that are used
    by SnpEff for annotating VCFs in the workflow.
  * The dbNSFP database may be downloaded by the references workflow, or you
    can download it yourself (see `Downloading dbNSFP `__). It contains a
    comprehensive set of annotations for humans. This file can be provided
    externally if you are not using the references workflow (see the section
    on `Configuring Annotation Databases `__).
  * These are also not required, and annotation can still run without the
    file present (we just won’t have all the fancy annotations that these
    databases provide).


.. _Mapping:


Mapping
-------

First, adapters are trimmed from the reads using ``cutadapt``. After that,
reads are mapped to the genome using ``bwa mem``.
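
A rough per-sample sketch of these two steps (the file names, adapter
sequence, and read-group string below are hypothetical illustrations, not the
workflow's actual rule definitions):

.. code-block:: bash

    # Hypothetical sample from units.tsv; the platform column feeds the read group.
    sample="sample1"
    rg="@RG\tID:${sample}\tSM:${sample}\tPL:Illumina"

    if [ -e "${sample}_R1.fastq.gz" ] && command -v cutadapt >/dev/null; then
        # Trim a standard Illumina adapter from both mates (paired-end shown).
        cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
            -o "${sample}_R1.trimmed.fastq.gz" -p "${sample}_R2.trimmed.fastq.gz" \
            "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz"

        # Map with bwa mem, attaching the read group, then sort to BAM.
        bwa mem -R "$rg" genome.fa \
            "${sample}_R1.trimmed.fastq.gz" "${sample}_R2.trimmed.fastq.gz" \
            | samtools sort -o "${sample}.sorted.bam" -
    else
        echo "inputs not present; sketch only (read group would be: $rg)"
    fi

The read group (``-R``) matters: downstream GATK tools require ``ID`` and
``SM`` tags to associate reads with samples.
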

After bams are generated, we mark and remove duplicates using ``picard
MarkDuplicates``. Duplicate reads are uninformative for variant calling, and
removing them speeds up the workflow.

Once duplicates are removed, base quality scores are recalibrated using GATK
``BaseRecalibrator`` and ``ApplyBQSR``, provided that a known variation file
is present and the config option for base recalibration is set (see
`filtering and processing `__).


.. _Calling:

Calling Variants
----------------

The workflow contains support for both germline and somatic variant calling.

For germline variants, this workflow takes advantage of Snakemake
parallelization by splitting up variant calling into regions. By default,
these regions are defined by the contigs identified in the fasta index file.

However, you can provide a .bed file with regions that you want to restrict
variant calling to, such as in a targeted exon sequencing project (see
`Processing `__).

Once regions are established, variants are called with GATK
``HaplotypeCaller`` to produce a `GVCF `_ for every sample for every contig.
The GVCFs from each sample are combined for each contig to produce GVCFs that
contain variants from ALL samples for each contig. Finally, GVCFs are
genotyped and merged to produce a single VCF containing mutations in all
samples for all contigs.

Somatic variant calling roughly follows the same procedure as above; however,
a bit more configuration is involved. See the section on `Somatic Variant
Calling with Mutect2 `__. Additionally, for filtering somatic variants we run
GATK `LearnReadOrientationModel `_ to help estimate artifacts produced by
read orientation bias.


.. _Annotation:

Annotation
----------

Annotating variants is the process of looking up a variant in a database and
adding a description or metric about that variant to a section in the
``INFO`` column of a VCF file.
`dbNSFP `_ is a popular database that contains annotations from hundreds of
sources compiled into a single VCF. See the section on `Configuring
Annotation Databases `__ for information on how to use dbNSFP and other
variation databases.

Annotations are attached using `SnpEff and SnpSift
`_. These are also useful tools for downstream analysis, which is not
included in the workflow, but discussed `here `__.

Finally, we run ``bcftools norm`` on all final VCFs in order to split
multi-allelic variant sites into a series of bi-allelic records.


.. _QC:

QC
--

The workflow generates QC metrics at several points and aggregates them using
`MultiQC `_. These include metrics from:

- ``samtools stats``
- ``picard MarkDuplicates``
- ``FastQC``
- ``SnpEff`` variant annotation summary statistics

MultiQC reads from all of these sources and generates an interactive html
document that allows you to visualize and explore QC data.


.. _Configuration:

Configuration and config.yaml
=============================

The config file, ``config.yaml``, is the main point of communication between
the user and the Snakefile that the workflow executes. Users should always
approach this file first before editing the Snakefile. This is how we pass
paths to reference files, arguments to rules in the Snakefile, and configure
patterns for somatic variant calling. Each section in the ``config.yaml`` is
documented here.


.. _samples:

Input
-----

In order to pass sequencing data to the workflow, samples need to be
configured into two tables, ``units.tsv`` and ``samples.tsv``.

``units.tsv`` is configured like this:


========= ========================== =================== ================ ============================================
sample    unit                       platform            fq1              fq2
========= ========================== =================== ================ ============================================
Sample ID Technical replicate number Sequencing platform Path to R1 fastq Path to R2 fastq (if using paired-end reads)
========= ========================== =================== ================ ============================================


Note:

- sample names needn’t be unique, but no sample-unit combination can be the
  same. Increment the unit starting from 1 for identical sample names to
  represent technical replicates. A sampletable with no technical replicates
  should have the value for unit set to 1 for every sample.
- platform is used to attach read groups. It should be the sequencing
  platform of your data, for example, "Illumina."
- fq1 and fq2 represent the R1 reads and R2 reads from a paired-end sample.
  If your samples are not paired-end, then leave the value in the fq2 column
  empty.

``samples.tsv`` is a single column consisting of the sample names. You can
use ``cut -f1 units.tsv > samples.tsv`` to generate this file.


.. _confall:

Configuring the All Rule
------------------------

The all rule is how we initially generate the DAG of which rules to execute,
and in which order, to produce the final outputs of the workflow.
We specify the outputs we want at the end of the workflow here:

- Germline variants, annotated, multiallelic-split (with ``bcftools norm``):
  "results/annotated/ann.vcf.gz"

  * If you decide to use dbNSFP to attach annotations AFTER you've already
    generated the "results/annotated/ann.vcf.gz" file, you need to `configure
    dbNSFP `__ and delete "results/annotated/ann.vcf.gz" to regenerate the
    output.

- MultiQC: "results/qc/multiqc.html"
- Mutect2 annotated: ``expand("results/mutect2_annotated/snpeff.{comp}.vcf.gz",
  comp = config['mutect2'].keys())``

  * See the `Mutect2 configuration `__.

If you do not wish to attach annotations to your variants:

- Germline (multiallelic-split): "results/filtered/all.normed.vcf.gz"
- Somatic
  (multiallelic-split): ``expand("results/somatic_filtered/normed.{comp}.vcf.gz",
  comp = config['mutect2'].keys())``

If for some reason you do not want your variants to be multiallelic-split,
then you will have to edit the ``merge_calls`` rule for germline variants and
change its output to not be marked ``temp()``. For somatic variants, you will
have to edit the ``filter_mutect2_calls`` rule in the same fashion. Specify
the output of those rules in your all rule.


.. _confref:

Configuring References in config.yaml
-------------------------------------

The ``ref:`` section of the config.yaml is how we determine the patterns of
the reference files generated by the references workflow, or pass the paths
of external references to the workflow.

If you intend to use the LCDB-WF references workflow, set
``use_references_workflow: true``. The Snakefile reads this argument and
determines which references to use. At the bottom of the config file is an
``include_references:`` key that must point to the reference config from the
LCDB-WF references workflow that you wish to use. The structure of the
reference config that you include is like so:

.. code-block:: yaml

    references:
      organism:
        tag:
          type:

It is crucial that the values for organism and tag in the ``ref:`` block of
``config.yaml`` match the reference config exactly. The ``aligner:`` and
``faidx:`` sub-fields must also match what is found in the ``indexes:``
subfield of the ``genome:`` "type" from the above yaml structure in the
reference config.

For example, if the reference config looks like this:

.. code-block:: yaml

    references:
      human:
        ensembl-104:
          genome:
            url: 'dummy.fasta.download.url'
            indexes:
              - 'bwa'
              - 'faidx'

Then we will configure the ``config.yaml`` like this:

.. code-block:: yaml

    ref:
      use_references_workflow: true
      organism: 'human'
      genome:
        tag: 'ensembl-104'
        build: 'GRCh38'
      aligner:
        index: 'bwa'
        tag: 'ensembl-104'
      faidx:
        index: 'faidx'
        tag: 'ensembl-104'

The ``build:`` key in the ``genome:`` block must match what is provided in
the metadata of the reference config.

For variation databases, the version of the database used is controlled by a
key-value provided to the reference config under the ``variation:`` "tag".
This value must match what is given in the ``config.yaml``. You must also
supply the build of your genome to the reference config, and it has to match
the build you are using for your reference genome. This is because we need to
process the dbNSFP file in the references workflow to make it compatible with
older genomes. To mirror the example above, if this is what is in our
reference config:

.. code-block:: yaml

    references:
      human:
        known:
          type: 'all'

        variation:
          dbnsfp:
            version: 'dbNSFPv4.4'
            url: 'dummy.download.dbsnfp.url'
            build: 'GRCh38'

Then we will configure the ``config.yaml`` like this:

.. code-block:: yaml

    ref:
      use_references_workflow: true
      organism: 'human'

      variation:
        known: 'known'
        dbnsfp: 'dbNSFPv4.4'

The workflow supports the option to supply the ``known:`` and ``dbnsfp:``
keys with an ABSOLUTE path to a file you have locally. The path MUST start
with ``"/"``. Make sure to read on below to see what sort of processing and
modifications must be made to these files in order for the workflow to use
them.

Note that in the reference config, the ``type:`` field under ``known:``
corresponds to the type of known variation file you wish to download.
Available options are 'somatic', 'structural_variation', or 'all'. At the
moment, this workflow does not support structural variant calling. Providing
'all' to this field will download germline known variation for all
chromosomes. If your organism does not have a variation database, then simply
leave these fields empty in the variant calling workflow config, and the
workflow will run without them.

If you are providing your own references to the workflow, set them in the
paths section of the ref block:

.. code-block:: yaml

    ref:
      paths:
        ref: path/to/known/genome/fasta
        known: path/to/known/variation/file
        index: path/to/fasta/index
        dbnsfp: path/to/dbnsfp/file

If you are providing a known variation file that contains `IUPAC-coded
alleles `_, then you must run ``rust-bio-tools`` on your variation file and
provide this output as your known variation file:

.. code-block:: bash

    rbt vcf-fix-iupac-alleles < input | bcftools view -Oz > known-variation.vcf.gz


Of these, the only file that you absolutely must always provide is the genome
fasta. NEVER provide the workflow with a "top level" assembly of the genome
when working with human data. This contains a lot of unaligned contigs and
haplotypes, and BWA will not be able to map your reads to the genome. The
workflow accepts both compressed and uncompressed fasta files.
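
When the fasta index and sequence dictionary are not supplied, the workflow
derives them itself; as a sketch of the equivalent commands (the reference
path here is hypothetical):

.. code-block:: bash

    # Hypothetical reference path; the workflow generates these files next to
    # the provided reference when they are missing.
    fasta="genome.fa"
    dict="${fasta%.fa}.dict"

    if [ -e "$fasta" ] && command -v samtools >/dev/null; then
        samtools faidx "$fasta"                              # writes genome.fa.fai
        # Legacy picard argument style shown; -R/-O also works in newer picard.
        picard CreateSequenceDictionary R="$fasta" O="$dict"
    else
        echo "reference not present; sketch only (would write ${fasta}.fai and ${dict})"
    fi
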
Some notes on references such as the known variation, dbNSFP, BWA index, and
fasta index:

- If you cannot provide the files in the config above, just leave their
  values empty in the ``config.yaml`` field.

  * If the known variation file is absent, we will not be able to recalibrate
    bases, and our bam file processing ends at marking and removing
    duplicates.
  * If the dbNSFP file is absent, we will not be able to run SnpSift to
    attach fields from the dbNSFP file (see `Configuring Annotation Databases
    `__), but we will still be able to run SnpEff.
  * If the BWA index is not used from the LCDB-WF references workflow, then
    it will be created by the Snakefile in the same directory as the provided
    reference.
  * If the fasta index is not provided, it will be created in the same
    directory as the provided reference during the workflow.


.. _dldbnsfp:

Downloading dbNSFP
------------------

To download the dbNSFP file, grab the Box link from the `dbNSFP website `_.
Alternatively, you can use the `ftp server `_. You can grab the file from
here, but the zips are very large, and clicking the link will open the ftp
server in Finder on your machine. You can get the name of the file you want
and fill in the string like this:

.. code-block:: bash

    version=
    wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP${version}.zip
    # Alternatively, wget

The most recent version may not be on the ftp server. Watch your disk space
when downloading and processing this file; it will use hundreds of gigabytes.
You may want to do this from a shell script writing to a tmp dir. Do not
attempt to download this on a node that is not suitable for heavy
computation.

Pay attention to the comments in the code block. The process differs for
GRCh38/hg38 genomes and GRCh37/hg19 genomes. Recent versions of dbNSFP (v3.X
and later) switched to GRCh38 coordinates.
You can still use these later versions for GRCh37 coordinates, but you have
to process the files differently.

.. code-block:: bash

    unzip
    zcat dbNSFP*_variant.chr1* | head -n1 > h
    # head works when run line by line, but under "set -euo pipefail" it can
    # make the pipe exit non-zero. Alternatively, pipe to awk instead:
    # zcat dbNSFP*_variant.chr1* | awk 'NR <= 1' > h

    # For GRCh38/hg38 data (include the genome version somewhere in your output file name for clarity):
    zgrep -v "^#chr" dbNSFP*_variant.chr* | sort -k1,1 -k2,2n - | cat h - | bgzip -c >
    tabix -s 1 -b 2 -e 2

    # For GRCh37/hg19 data:
    zgrep -h -v "^#chr" dbNSFP*_variant.chr* | awk '$8 != "." ' | sort -k8,8 -k9,9n - | cat h - | bgzip -c >
    tabix -s 8 -b 9 -e 9

It is highly recommended to run ``sort`` with the ``--parallel=N`` option on
a node with multiple CPUs, as this will take a very long time otherwise. See
the `sort man page `_ for details on how to properly parallelize.


.. _conforg:

Support for Non-Human Organisms
-------------------------------

If you wish to call variants on a non-human organism, then you will have to
supply the references yourself following the directions above, or edit the
reference config for your organism in the top level of the cloned LCDB-WF
directory in ``include/reference_configs/``. You can add to the
``variant-calling.yaml`` reference config if you are using an Ensembl genome.
The tag should be configured as "ensembl-release". ONLY include the variation
block in the reference config if your organism has support for an annotation
variation database like dbNSFP. Downloading the known variation file in the
references workflow relies on the build, release, and species values
specified in the ``metadata:`` block.

WARNING: The workflow relies heavily on references downloaded from Ensembl,
mainly due to Ensembl housing known variation files. You will run into issues
if your references have a "chr" prefix for chromosome nomenclature.
Check the ``get_contigs()`` function in ``lib/helpers.smk`` (from the
top-level directory of LCDB-WF) to see if this will cause any issues with
your organism.


.. _fp:

Filtering and Processing
------------------------

There are several filtering and processing configurations that can be
provided to the workflow in the ``processing:`` and ``filtering:`` sections
of ``config.yaml``.

In ``processing:`` there are several options:

- ``remove-duplicates:`` If ``true``, duplicates are removed with ``picard
  MarkDuplicates`` (you should do this unless you are interested in the
  duplicated reads in your data).
- ``remove-mitochondrial:`` If ``true``, mitochondrial contigs will be
  ignored for variant calling (default).
- ``pcr:`` This option takes a string according to the `GATK PCR INDEL model
  docs `_. The default in the workflow is to not specify this argument, but
  if you know that your data is PCR-free sequencing data, then you should set
  this option to 'NONE' for the most specific results.
- ``restrict-regions:`` Supply a capture file in .bed format. Commonly used
  for whole exome sequencing (WES) and targeted exome sequencing.

  * ``region-padding:`` Increase the size of the regions specified in the
    .bed file by a flat amount of base pairs. Leave both of these options
    blank if you do not have a .bed file.

In ``filtering:`` we have these options:

- ``bqsr:`` This option controls whether or not we run the `GATK base
  recalibrator `_. Basically, if we are able to, we should. Whether we are
  able to depends on whether the known variation file is available for the
  organism we are calling variants on. Setting this option to false means
  that our bam processing ends after removing duplicates.
- ``hard:`` This is the hard-filtering option as outlined in the `GATK
  filtering docs `_.
  These are basic, quality-based filters that remove low-confidence or
  low-quality variants from our callset.

  * ``snvs:`` and ``indels:`` We split the variants into SNPs and INDELs
    because they require different hard-filtering options. The filter values
    are read from these options and can be adjusted. The defaults are GATK’s
    standard recommendations.


.. _confdbnsfp:

Configuring Annotation Databases in config.yaml
-----------------------------------------------

The `SnpEff and SnpSift docs `_ are a very helpful resource for running
SnpEff and SnpSift.

In the ``snpeff:`` section of ``config.yaml`` you can configure arguments
that are passed to the annotation rules.


- ``somatic:`` This is how we configure SnpEff summary output so that MultiQC
  aggregates the correct output files for its report. Set this value to
  ``true`` if you are calling somatic variants.
- ``germline:`` Same as ``somatic:``, but set to ``true`` if you are calling
  germline variants.
- ``genome:`` For help with this value, see SnpEff’s `docs `_ for which
  database to use.
- ``annotations:`` This key is an optional configuration that allows you to
  specify which fields from dbNSFP you would like to attach to your VCF.
  Leaving this field blank will attach ALL fields from dbNSFP (of which there
  are many, so specifying your fields will save you time and computation).
  The value given to this key is a comma-separated string (WITHOUT
  WHITESPACE!) of the names of the columns in the dbNSFP file you would like
  to attach. For example, to get FATHMM and SIFT pathogenic predictions, you
  would configure it like this: ``annotations: 'FATHMM_pred,SIFT_pred'``


.. _mutect:

Somatic Variant Calling with Mutect2
------------------------------------

When talking about somatic calling, we say “tumor” and “normal” as the two
components of a Mutect2 contrast. In reality, the relationship does not have
to be tumor and normal tissue.
However, the “tumor” and “normal” samples should come from the same organism
or from very genetically similar organisms. In somatic calling, we are trying
to identify regions where the “tumor” sample differs from the “normal”
sample.

Directly above the ``mutect2:`` section is ``PON:``. PON stands for panel of
normals, and it is a file that is very similar in use to a known variation
file. It is a file made from “normal” samples believed to be free from
somatic alterations. Its purpose is to help identify technical artifacts. You
can read more about panels of normals `here `_. This file is optional, and
leaving it blank will not affect our ability to call somatic variants.

In the ``mutect2:`` section of the ``config.yaml`` there is (importantly)
only one option. In the default config, this single option is
``tumor-normal:``. This is how we tell Mutect2 which samples should be
analyzed together and which of those are “normals” or “tumors”. However, this
key need not be named “tumor-normal”. Below is an example configuration:

.. code-block:: yaml

    mutect2:
      patient-1:
        tumor:
          - 'p1_tumor'
        normal:
          - 'p1_normal'
      patient-2:
        tumor:
          - 'p2_tumor'
          - 'p2_tumor_metastasis'
        normal:
          - 'p2_normal'

The keys one level below ``mutect2:`` correspond to the names of the
comparisons we are making. This is how we establish wildcards in the
Snakefile for the Mutect2 rule. However, the keys inside of the comparison
name MUST be named ``tumor:`` and ``normal:``. Their values should be
formatted as a yaml list. Each value under ``tumor:`` or ``normal:``
corresponds to a sample name found in the `sampletable `__. The workflow
handles combining technical replicates, so you only have to provide the
sample name.


.. _down:

Downstream Analysis
===================


.. _sliv:

Slivar for Rare Disease or Pedigrees
------------------------------------

Several projects in the past have focused on calling variants on pedigree or
trio data. This is a common archetype in investigating inheritance patterns
of rare diseases. `Slivar `_ is a set of command-line tools that query and
filter VCF files in order to investigate Mendelian and non-Mendelian
inheritance patterns in pedigree data. One of the most powerful aspects of
this tool is the ability to create custom filtering patterns using basic
boolean logic.


.. _patho:

Filtering Annotations
---------------------

One of the benefits of SnpSift is the ability to filter your VCF files based
on which annotations are attached. The `docs page `_ on SnpSift filter has
great documentation on using this. Similar to Slivar, you can use arbitrary
expressions and access fields in the VCF, such as quality (``QUAL``) and
genotype (``GEN[]``), to enhance your annotation filters.

For example, to get all variants marked as “Damaging” by the FATHMM
pathogenic prediction annotation from dbNSFP, you can run this on your
annotated vcf:

``SnpSift filter "( dbNSFP_FATHMM_pred = 'D' )" input.vcf > output.vcf``
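
SnpSift filter expressions can also combine several clauses with ``&`` (and)
and ``|`` (or). A sketch (the quality threshold and file names are
hypothetical, and this assumes a ``SnpSift`` wrapper script on your PATH,
e.g. from bioconda; otherwise use ``java -jar SnpSift.jar filter ...``):

.. code-block:: bash

    # Hypothetical combined filter: FATHMM "Damaging" calls with QUAL >= 30.
    filter_expr="( dbNSFP_FATHMM_pred = 'D' ) & ( QUAL >= 30 )"

    if command -v SnpSift >/dev/null 2>&1; then
        SnpSift filter "$filter_expr" input.vcf > damaging.hiqual.vcf
    else
        echo "SnpSift not found; expression sketch only: $filter_expr"
    fi
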