From f090acb4905d89166597e46f9b59e4397edbfe4c Mon Sep 17 00:00:00 2001
From: fridells51
Date: Sat, 17 Jun 2023 11:31:09 -0400
Subject: [PATCH] copying vc docs rst

.. _Overview:

LCDB-WF Variant Calling Overview
================================

This Snakemake workflow handles detection and annotation of germline and
somatic variants in DNA sequencing reads. Analysis-ready VCFs are generated by
following a GATK best-practices workflow. Annotations are attached from
variation databases like dbNSFP using SnpEff. The workflow also provides QC
analysis of the input fastq files, as well as metrics recorded throughout the
variant calling workflow.

The workflow is primarily designed for human sequencing data, but it also
supports other organisms. However, due to limited reference availability in
other organisms, some features of the workflow may not be available. See the
section on `configuring other organisms `__ for details on how to run the
workflow for non-human organisms.

The workflow can be thought of as having 5 main components: references,
mapping, calling, annotating, and QC.


.. _References:

References
----------

There are a handful of files that can be considered references that are
involved in calling and annotating variants. References can either be provided
externally, or they will be generated inside of the workflow by running the
LCDB-WF references workflow. See the section on `configuring references `__
for details on how to properly provide or generate references.

- Reference genome

  * The reference genome will be downloaded by the LCDB-WF references
    workflow if it is not supplied.
    The workflow supports chromosome nomenclature from both GRCh38/hg38 and
    GRCh37/hg19 references.
  * By default, Ensembl’s GRCh38 genome build is used because the known
    variation file is also provided by Ensembl.

- Known variation

  * This is a VCF file that contains sites of “known variation” in an
    organism’s genome. It is used to recalibrate bam files with `base quality
    score recalibration `_ so that extremely common variants are not mistaken
    for sequencing errors.
  * Note that this file is not essential to variant calling, and the workflow
    can run without it. For some organisms, this type of file is not
    available.

- BWA Index Files

  * These index files are generated by ``bwa index``.
  * The references workflow will build these files, but if the references
    workflow is not used (see `configuring references `__), they will be
    generated by the Snakefile at workflow execution.

- Sequence Dictionary

  * Generated by ``picard CreateSequenceDictionary`` and used by several GATK
    commands.

- Fasta Index

  * Generated by ``samtools faidx`` and used by the workflow to establish
    contigs for joint-calling.

- Variation Databases

  * These are (usually) VCFs that contain variant annotations that are used
    by SnpEff for annotating VCFs in the workflow.
  * The dbNSFP database may be downloaded by the references workflow, or you
    can download it yourself (see `Downloading dbNSFP `__). It contains a
    comprehensive set of annotations for humans. This file can be provided
    externally if you are not using the references workflow (see the section
    on `Configuring Annotation Databases `__).
  * These are also not required, and annotation can still run without the
    file present (we just won’t have all the fancy annotations that these
    databases provide).


.. _Mapping:


Mapping
-------

First, adapters are trimmed from the reads using ``cutadapt``. After that,
reads are mapped to the genome using ``bwa mem``.
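
A rough per-sample sketch of these two steps (the file names, adapter
sequence, and read-group string below are hypothetical illustrations, not the
workflow's actual rule definitions):

.. code-block:: bash

    # Hypothetical sample from units.tsv; the platform column feeds the read group.
    sample="sample1"
    rg="@RG\tID:${sample}\tSM:${sample}\tPL:Illumina"

    if [ -e "${sample}_R1.fastq.gz" ] && command -v cutadapt >/dev/null; then
        # Trim a standard Illumina adapter from both mates (paired-end shown).
        cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
            -o "${sample}_R1.trimmed.fastq.gz" -p "${sample}_R2.trimmed.fastq.gz" \
            "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz"

        # Map with bwa mem, attaching the read group, then sort to BAM.
        bwa mem -R "$rg" genome.fa \
            "${sample}_R1.trimmed.fastq.gz" "${sample}_R2.trimmed.fastq.gz" \
            | samtools sort -o "${sample}.sorted.bam" -
    else
        echo "inputs not present; sketch only (read group would be: $rg)"
    fi

The read group (``-R``) matters: downstream GATK tools require ``ID`` and
``SM`` tags to associate reads with samples.
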

After bams are generated, we mark and remove duplicates using ``picard
MarkDuplicates``. Duplicate reads are uninformative for variant calling, and
removing them speeds up the workflow.

Once duplicates are removed, base quality scores are recalibrated using GATK
``BaseRecalibrator`` and ``ApplyBQSR``, provided that a known variation file
is present and the config option for base recalibration is set (see
`filtering and processing `__).


.. _Calling:

Calling Variants
----------------

The workflow contains support for both germline and somatic variant calling.

For germline variants, this workflow takes advantage of Snakemake
parallelization by splitting up variant calling into regions. By default,
these regions are defined by the contigs identified in the fasta index file.

However, you can provide a .bed file with regions that you want to restrict
variant calling to, such as in a targeted exon sequencing project (see
`Processing `__).

Once regions are established, variants are called with GATK
``HaplotypeCaller`` to produce a `GVCF `_ for every sample for every contig.
The GVCFs from each sample are combined for each contig to produce GVCFs that
contain variants from ALL samples for each contig. Finally, GVCFs are
genotyped and merged to produce a single VCF containing mutations in all
samples for all contigs.

Somatic variant calling roughly follows the same procedure as above; however,
a bit more configuration is involved. See the section on `Somatic Variant
Calling with Mutect2 `__. Additionally, for filtering somatic variants we run
GATK `LearnReadOrientationModel `_ to help estimate artifacts produced by
read orientation bias.


.. _Annotation:

Annotation
----------

Annotating variants is the process of looking up a variant in a database and
adding a description or metric about that variant to a section in the
``INFO`` column of a VCF file.
`dbNSFP `_ is a popular database that contains annotations from hundreds of
sources compiled into a single VCF. See the section on `Configuring
Annotation Databases `__ for information on how to use dbNSFP and other
variation databases.

Annotations are attached using `SnpEff and SnpSift
`_. These are also useful tools for downstream analysis, which is not
included in the workflow, but discussed `here `__.

Finally, we run ``bcftools norm`` on all final VCFs in order to split
multi-allelic variant sites into a series of bi-allelic records.


.. _QC:

QC
--

The workflow generates QC metrics at several points and aggregates them using
`MultiQC `_. These include metrics from:

- ``samtools stats``
- ``picard MarkDuplicates``
- ``FastQC``
- ``SnpEff`` variant annotation summary statistics

MultiQC reads from all of these sources and generates an interactive html
document that allows you to visualize and explore QC data.


.. _Configuration:

Configuration and config.yaml
=============================

The config file, ``config.yaml``, is the main point of communication between
the user and the Snakefile that the workflow executes. Users should always
approach this file first before editing the Snakefile. This is how we pass
paths to reference files, arguments to rules in the Snakefile, and configure
patterns for somatic variant calling. Each section in the ``config.yaml`` is
documented here.


.. _samples:

Input
-----

In order to pass sequencing data to the workflow, samples need to be
configured into two tables, ``units.tsv`` and ``samples.tsv``.

``units.tsv`` is configured like this:


========= ========================== =================== ================ ============================================
sample    unit                       platform            fq1              fq2
========= ========================== =================== ================ ============================================
Sample ID Technical replicate number Sequencing platform Path to R1 fastq Path to R2 fastq (if using paired-end reads)
========= ========================== =================== ================ ============================================


Note:

- sample names needn’t be unique, but no sample-unit combination can be the
  same. Increment the unit starting from 1 for identical sample names to
  represent technical replicates. A sampletable with no technical replicates
  should have the value for unit set to 1 for every sample.
- platform is used to attach read groups. It should be the sequencing
  platform of your data, for example, "Illumina."
- fq1 and fq2 represent the R1 reads and R2 reads from a paired-end sample.
  If your samples are not paired-end, then leave the value in the fq2 column
  empty.

``samples.tsv`` is a single column consisting of the sample names. You can
use ``cut -f1 units.tsv > samples.tsv`` to generate this file.


.. _confall:

Configuring the All Rule
------------------------

The all rule is how we initially generate the DAG of which rules to execute,
and in which order, to produce the final outputs of the workflow.
We specify the outputs we want at the end of the workflow here:

- Germline variants, annotated, multiallelic-split (with ``bcftools norm``):
  "results/annotated/ann.vcf.gz"

  * If you decide to use dbNSFP to attach annotations AFTER you've already
    generated the "results/annotated/ann.vcf.gz" file, you need to `configure
    dbNSFP `__ and delete "results/annotated/ann.vcf.gz" to regenerate the
    output.

- MultiQC: "results/qc/multiqc.html"
- Mutect2 annotated: ``expand("results/mutect2_annotated/snpeff.{comp}.vcf.gz",
  comp = config['mutect2'].keys())``

  * See the `Mutect2 configuration `__.

If you do not wish to attach annotations to your variants:

- Germline (multiallelic-split): "results/filtered/all.normed.vcf.gz"
- Somatic
  (multiallelic-split): ``expand("results/somatic_filtered/normed.{comp}.vcf.gz",
  comp = config['mutect2'].keys())``

If for some reason you do not want your variants to be multiallelic-split,
then you will have to edit the ``merge_calls`` rule for germline variants and
change its output to not be marked ``temp()``. For somatic variants, you will
have to edit the ``filter_mutect2_calls`` rule in the same fashion. Specify
the output of those rules in your all rule.


.. _confref:

Configuring References in config.yaml
-------------------------------------

The ``ref:`` section of the config.yaml is how we determine the patterns of
the reference files generated by the references workflow, or pass the paths
of external references to the workflow.

If you intend to use the LCDB-WF references workflow, set
``use_references_workflow: true``. The Snakefile reads this argument and
determines which references to use. At the bottom of the config file is an
``include_references:`` key that must point to the reference config from the
LCDB-WF references workflow that you wish to use. The structure of the
reference config that you include is like so:

.. code-block:: yaml

    references:
      organism:
        tag:
          type:

It is crucial that the values for organism and tag in the ``ref:`` block of
``config.yaml`` match the reference config exactly. The ``aligner:`` and
``faidx:`` sub-fields must also match what is found in the ``indexes:``
subfield of the ``genome:`` "type" from the above yaml structure in the
reference config.

For example, if the reference config looks like this:

.. code-block:: yaml

    references:
      human:
        ensembl-104:
          genome:
            url: 'dummy.fasta.download.url'
            indexes:
              - 'bwa'
              - 'faidx'

Then we will configure the ``config.yaml`` like this:

.. code-block:: yaml

    ref:
      use_references_workflow: true
      organism: 'human'
      genome:
        tag: 'ensembl-104'
        build: 'GRCh38'
      aligner:
        index: 'bwa'
        tag: 'ensembl-104'
      faidx:
        index: 'faidx'
        tag: 'ensembl-104'

The ``build:`` key in the ``genome:`` block must match what is provided in
the metadata of the reference config.

For variation databases, the version of the database used is controlled by a
key-value provided to the reference config under the ``variation:`` "tag".
This value must match what is given in the ``config.yaml``. You must also
supply the build of your genome to the reference config, and it has to match
the build you are using for your reference genome. This is because we need to
process the dbNSFP file in the references workflow to make it compatible with
older genomes. To mirror the example above, if this is what is in our
reference config:

.. code-block:: yaml

    references:
      human:
        known:
          type: 'all'

        variation:
          dbnsfp:
            version: 'dbNSFPv4.4'
            url: 'dummy.download.dbsnfp.url'
            build: 'GRCh38'

Then we will configure the ``config.yaml`` like this:

.. code-block:: yaml

    ref:
      use_references_workflow: true
      organism: 'human'

      variation:
        known: 'known'
        dbnsfp: 'dbNSFPv4.4'

The workflow supports the option to supply the ``known:`` and ``dbnsfp:``
keys with an ABSOLUTE path to a file you have locally. The path MUST start
with ``"/"``. Make sure to read on below to see what sort of processing and
modifications must be made to these files in order for the workflow to use
them.

Note that in the reference config, the ``type:`` field under ``known:``
corresponds to the type of known variation file you wish to download.
Available options are 'somatic', 'structural_variation', or 'all'. At the
moment, this workflow does not support structural variant calling. Providing
'all' to this field will download germline known variation for all
chromosomes. If your organism does not have a variation database, then simply
leave these fields empty in the variant calling workflow config, and the
workflow will run without them.

If you are providing your own references to the workflow, set them in the
paths section of the ref block:

.. code-block:: yaml

    ref:
      paths:
        ref: path/to/known/genome/fasta
        known: path/to/known/variation/file
        index: path/to/fasta/index
        dbnsfp: path/to/dbnsfp/file

If you are providing a known variation file that contains `IUPAC-coded
alleles `_, then you must run ``rust-bio-tools`` on your variation file and
provide this output as your known variation file:

.. code-block:: bash

    rbt vcf-fix-iupac-alleles < input | bcftools view -Oz > known-variation.vcf.gz


Of these, the only file that you absolutely must always provide is the genome
fasta. NEVER provide the workflow with a "top level" assembly of the genome
when working with human data. This contains a lot of unaligned contigs and
haplotypes, and BWA will not be able to map your reads to the genome. The
workflow accepts both compressed and uncompressed fasta files.
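
When the fasta index and sequence dictionary are not supplied, the workflow
derives them itself; as a sketch of the equivalent commands (the reference
path here is hypothetical):

.. code-block:: bash

    # Hypothetical reference path; the workflow generates these files next to
    # the provided reference when they are missing.
    fasta="genome.fa"
    dict="${fasta%.fa}.dict"

    if [ -e "$fasta" ] && command -v samtools >/dev/null; then
        samtools faidx "$fasta"                              # writes genome.fa.fai
        # Legacy picard argument style shown; -R/-O also works in newer picard.
        picard CreateSequenceDictionary R="$fasta" O="$dict"
    else
        echo "reference not present; sketch only (would write ${fasta}.fai and ${dict})"
    fi
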
Some notes on references such as the known variation, dbNSFP, BWA index, and
fasta index:

- If you cannot provide the files in the config above, just leave their
  values empty in the ``config.yaml`` field.

  * If the known variation file is absent, we will not be able to recalibrate
    bases, and our bam file processing ends at marking and removing
    duplicates.
  * If the dbNSFP file is absent, we will not be able to run SnpSift to
    attach fields from the dbNSFP file (see `Configuring Annotation Databases
    `__), but we will still be able to run SnpEff.
  * If the BWA index is not used from the LCDB-WF references workflow, then
    it will be created by the Snakefile in the same directory as the provided
    reference.
  * If the fasta index is not provided, it will be created in the same
    directory as the provided reference during the workflow.


.. _dldbnsfp:

Downloading dbNSFP
------------------

To download the dbNSFP file, grab the Box link from the `dbNSFP website `_.
Alternatively, you can use the `ftp server `_. You can grab the file from
here, but the zips are very large, and clicking the link will open the ftp
server in Finder on your machine. You can get the name of the file you want
and fill in the string like this:

.. code-block:: bash

    version=
    wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP${version}.zip
    # Alternatively, wget

The most recent version may not be on the ftp server. Watch your disk space
when downloading and processing this file; it will use hundreds of gigabytes.
You may want to do this from a shell script writing to a tmp dir. Do not
attempt to download this on a node that is not suitable for heavy
computation.

Pay attention to the comments in the code block. The process differs for
GRCh38/hg38 genomes and GRCh37/hg19 genomes. Recent versions of dbNSFP (v3.X
and later) switched to GRCh38 coordinates.
You can still use these later versions for GRCh37 coordinates, but you have
to process the files differently.

.. code-block:: bash

    unzip
    zcat dbNSFP*_variant.chr1* | head -n1 > h
    # head works when run line by line, but under "set -euo pipefail" it can
    # make the pipe exit non-zero. Alternatively, pipe to awk instead:
    # zcat dbNSFP*_variant.chr1* | awk 'NR <= 1' > h

    # For GRCh38/hg38 data (include the genome version somewhere in your output file name for clarity):
    zgrep -v "^#chr" dbNSFP*_variant.chr* | sort -k1,1 -k2,2n - | cat h - | bgzip -c >
    tabix -s 1 -b 2 -e 2

    # For GRCh37/hg19 data:
    zgrep -h -v "^#chr" dbNSFP*_variant.chr* | awk '$8 != "." ' | sort -k8,8 -k9,9n - | cat h - | bgzip -c >
    tabix -s 8 -b 9 -e 9

It is highly recommended to run ``sort`` with the ``--parallel=N`` option on
a node with multiple CPUs, as this will take a very long time otherwise. See
the `sort man page `_ for details on how to properly parallelize.


.. _conforg:

Support for Non-Human Organisms
-------------------------------

If you wish to call variants on a non-human organism, then you will have to
supply the references yourself following the directions above, or edit the
reference config for your organism in the top level of the cloned LCDB-WF
directory in ``include/reference_configs/``. You can add to the
``variant-calling.yaml`` reference config if you are using an Ensembl genome.
The tag should be configured as "ensembl-release". ONLY include the variation
block in the reference config if your organism has support for an annotation
variation database like dbNSFP. Downloading the known variation file in the
references workflow relies on the build, release, and species values
specified in the ``metadata:`` block.

WARNING: The workflow relies heavily on references downloaded from Ensembl,
mainly due to Ensembl housing known variation files. You will run into issues
if your references have a "chr" prefix for chromosome nomenclature.
Check the ``get_contigs()`` function in ``lib/helpers.smk`` (from the
top-level directory of LCDB-WF) to see if this will cause any issues with
your organism.


.. _fp:

Filtering and Processing
------------------------

There are several filtering and processing configurations that can be
provided to the workflow in the ``processing:`` and ``filtering:`` sections
of ``config.yaml``.

In ``processing:`` there are several options:

- ``remove-duplicates:`` If ``true``, duplicates are removed with ``picard
  MarkDuplicates`` (you should do this unless you are interested in the
  duplicated reads in your data).
- ``remove-mitochondrial:`` If ``true``, mitochondrial contigs will be
  ignored for variant calling (default).
- ``pcr:`` This option takes a string according to the `GATK PCR INDEL model
  docs `_. The default in the workflow is to not specify this argument, but
  if you know that your data is PCR-free sequencing data, then you should set
  this option to 'NONE' for the most specific results.
- ``restrict-regions:`` Supply a capture file in .bed format. Commonly used
  for whole exome sequencing (WES) and targeted exome sequencing.

  * ``region-padding:`` Increase the size of the regions specified in the
    .bed file by a flat amount of base pairs. Leave both of these options
    blank if you do not have a .bed file.

In ``filtering:`` we have these options:

- ``bqsr:`` This option controls whether or not we run the `GATK base
  recalibrator `_. Basically, if we are able to, we should. Whether we are
  able to depends on whether the known variation file is available for the
  organism we are calling variants on. Setting this option to false means
  that our bam processing ends after removing duplicates.
- ``hard:`` This is the hard-filtering option as outlined in the `GATK
  filtering docs `_.
  These are basic, quality-based filters that remove low-confidence or
  low-quality variants from our callset.

  * ``snvs:`` and ``indels:`` We split the variants into SNPs and INDELs
    because they require different hard-filtering options. The filter values
    are read from these options and can be adjusted. The defaults are GATK’s
    standard recommendations.


.. _confdbnsfp:

Configuring Annotation Databases in config.yaml
-----------------------------------------------

The `SnpEff and SnpSift docs `_ are a very helpful resource for running
SnpEff and SnpSift.

In the ``snpeff:`` section of ``config.yaml`` you can configure arguments
that are passed to the annotation rules.


- ``somatic:`` This is how we configure SnpEff summary output so that MultiQC
  aggregates the correct output files for its report. Set this value to
  ``true`` if you are calling somatic variants.
- ``germline:`` Same as ``somatic:``, but set to ``true`` if you are calling
  germline variants.
- ``genome:`` For help with this value, see SnpEff’s `docs `_ for which
  database to use.
- ``annotations:`` This key is an optional configuration that allows you to
  specify which fields from dbNSFP you would like to attach to your VCF.
  Leaving this field blank will attach ALL fields from dbNSFP (of which there
  are many, so specifying your fields will save you time and computation).
  The value given to this key is a comma-separated string (WITHOUT
  WHITESPACE!) of the names of the columns in the dbNSFP file you would like
  to attach. For example, to get FATHMM and SIFT pathogenic predictions, you
  would configure it like this: ``annotations: 'FATHMM_pred,SIFT_pred'``


.. _mutect:

Somatic Variant Calling with Mutect2
------------------------------------

When talking about somatic calling, we say “tumor” and “normal” as the two
components of a Mutect2 contrast. In reality, the relationship does not have
to be tumor and normal tissue.
However, the “tumor” and “normal” samples should come from the same organism
or from very genetically similar organisms. In somatic calling, we are trying
to identify regions where the “tumor” sample differs from the “normal”
sample.

Directly above the ``mutect2:`` section is ``PON:``. PON stands for panel of
normals, and it is a file that is very similar in use to a known variation
file. It is a file made from “normal” samples believed to be free from
somatic alterations. Its purpose is to help identify technical artifacts. You
can read more about panels of normals `here `_. This file is optional, and
leaving it blank will not affect our ability to call somatic variants.

In the ``mutect2:`` section of the ``config.yaml`` there is (importantly)
only one option. In the default config, this single option is
``tumor-normal:``. This is how we tell Mutect2 which samples should be
analyzed together and which of those are “normals” or “tumors”. However, this
key need not be named “tumor-normal”. Below is an example configuration:

.. code-block:: yaml

    mutect2:
      patient-1:
        tumor:
          - 'p1_tumor'
        normal:
          - 'p1_normal'
      patient-2:
        tumor:
          - 'p2_tumor'
          - 'p2_tumor_metastasis'
        normal:
          - 'p2_normal'

The keys one level below ``mutect2:`` correspond to the names of the
comparisons we are making. This is how we establish wildcards in the
Snakefile for the Mutect2 rule. However, the keys inside of the comparison
name MUST be named ``tumor:`` and ``normal:``. Their values should be
formatted as a yaml list. Each value under ``tumor:`` or ``normal:``
corresponds to a sample name found in the `sampletable `__. The workflow
handles combining technical replicates, so you only have to provide the
sample name.


.. _down:

Downstream Analysis
===================


.. _sliv:

Slivar for Rare Disease or Pedigrees
------------------------------------

Several projects in the past have focused on calling variants on pedigree or
trio data. This is a common archetype in investigating inheritance patterns
of rare diseases. `Slivar `_ is a set of command-line tools that query and
filter VCF files in order to investigate Mendelian and non-Mendelian
inheritance patterns in pedigree data. One of the most powerful aspects of
this tool is the ability to create custom filtering patterns using basic
boolean logic.


.. _patho:

Filtering Annotations
---------------------

One of the benefits of SnpSift is the ability to filter your VCF files based
on which annotations are attached. The `docs page `_ on SnpSift filter has
great documentation on using this. Similar to Slivar, you can use arbitrary
expressions and access fields in the VCF, such as quality (``QUAL``) and
genotype (``GEN[]``), to enhance your annotation filters.

For example, to get all variants marked as “Damaging” by the FATHMM
pathogenic prediction annotation from dbNSFP, you can run this on your
annotated vcf:

``SnpSift filter "( dbNSFP_FATHMM_pred = 'D' )" input.vcf > output.vcf``
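
SnpSift filter expressions can also combine several clauses with ``&`` (and)
and ``|`` (or). A sketch (the quality threshold and file names are
hypothetical, and this assumes a ``SnpSift`` wrapper script on your PATH,
e.g. from bioconda; otherwise use ``java -jar SnpSift.jar filter ...``):

.. code-block:: bash

    # Hypothetical combined filter: FATHMM "Damaging" calls with QUAL >= 30.
    filter_expr="( dbNSFP_FATHMM_pred = 'D' ) & ( QUAL >= 30 )"

    if command -v SnpSift >/dev/null 2>&1; then
        SnpSift filter "$filter_expr" input.vcf > damaging.hiqual.vcf
    else
        echo "SnpSift not found; expression sketch only: $filter_expr"
    fi
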