Germline Variant Calling Pipeline built in Snakemake


Graphical visualization of the workflow steps

Instructions

Raw metal (or Guix shell environment) execution
N.B. No matter which method you use, you always need to edit the workspace/config.yaml and workspace/samples.tsv files with the correct paths, samples, etc.
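A minimal pre-flight sketch for the note above. It only checks that the two files exist and are non-empty; it does not validate their contents. The function name check_workspace is made up for this example and is not part of Selma.

```shell
# check_workspace DIR
# Hypothetical helper: succeeds only if DIR contains a non-empty
# config.yaml and samples.tsv. Missing or empty files are reported on stderr.
check_workspace() {
    dir="${1:-workspace}"
    status=0
    for f in config.yaml samples.tsv; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (from the repository root):
#   check_workspace workspace || exit 1
```

Running it before snakemake catches a forgotten samples.tsv early instead of partway into a run.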

Use the raw metal method if you have installed the dependencies yourself.

snakemake -j --config version=hg38 interval=/path/to/hg38/interval_list
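Since a full run takes many hours, it can help to sanity-check the two --config values and do a dry run first. The sketch below assumes version only accepts hg38 or b37 (the two values this README mentions) and that interval points at an existing path. selma_config is a hypothetical helper name, not something Selma ships.

```shell
# selma_config VERSION INTERVAL
# Hypothetical helper: prints the --config arguments used in this README
# after checking that VERSION is hg38 or b37 and that INTERVAL exists.
selma_config() {
    version="$1"
    interval="$2"
    case "$version" in
        hg38|b37) ;;
        *) echo "version must be hg38 or b37, got: $version" >&2; return 1 ;;
    esac
    if [ ! -e "$interval" ]; then
        echo "interval list not found: $interval" >&2
        return 1
    fi
    printf '%s\n' "--config version=$version interval=$interval"
}

# Usage: -n is Snakemake's dry-run flag, it lists the jobs without running them.
# $args is left unquoted on purpose so it splits into separate arguments,
# which means this sketch does not handle paths containing spaces.
#   args=$(selma_config hg38 /path/to/hg38/interval_list) || exit 1
#   snakemake -n $args
#   snakemake -j $args
```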

The following instructions show how to set up the Selma environment using Docker, Conda or Guix.

Docker execution
Either download the image manually with docker pull oskarv/selma or run

docker run --rm -ti -v $PWD:/data -w /data oskarv/selma snakemake -j --config version=hg38 interval=/path/to/hg38/interval_list

and the image will be downloaded automatically. Alternatively, build it manually as described under "Guix docker build" further down.

Conda execution
Selma can also run in an isolated Conda environment:

conda env create -n selma --file conda/env.yaml
conda activate selma
snakemake -j --config version=hg38 interval=/path/to/interval_list

Set version to either hg38 or b37 and point interval at the matching interval list.

Alternatively you can use the --use-conda flag:

snakemake -j --config version=hg38 interval=/path/to/hg38/interval_list --use-conda

Guix execution
Guix can be used to install all dependencies apart from gatk4:

guix shell -m manifest.scm

gatk4 can be installed in a location of your choice with this:

wget --no-check-certificate https://github.com/broadinstitute/gatk/releases/download/4.3.0.0/gatk-4.3.0.0.zip -O $PWD/gatk4.zip && \
unzip -q $PWD/gatk4.zip -d $PWD/ && \
mv $PWD/gatk*/gatk* $PWD/ && \
rm -r $PWD/gatk*/ $PWD/gatk4.zip && \
export PATH="$PATH:$PWD/" && \
export GATK_LOCAL_JAR=$PWD/gatk-package-4.3.0.0-local.jar

Running gatk should now work as expected.
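A quick way to confirm the installation above, as a sketch: it checks that GATK_LOCAL_JAR points at an existing file and that a gatk launcher can be found on PATH. check_gatk is a hypothetical name used only for this example.

```shell
# check_gatk
# Hypothetical helper: succeeds if GATK_LOCAL_JAR names an existing file
# and a gatk executable is found on PATH. Problems are reported on stderr.
check_gatk() {
    if [ -z "${GATK_LOCAL_JAR:-}" ] || [ ! -f "$GATK_LOCAL_JAR" ]; then
        echo "GATK_LOCAL_JAR is unset or does not point to a file" >&2
        return 1
    fi
    if ! command -v gatk >/dev/null 2>&1; then
        echo "gatk not found on PATH" >&2
        return 1
    fi
}

# Usage:
#   check_gatk && gatk --version
```

Note that the export lines above only affect the current shell session, so a new terminal needs them again (or an entry in your shell profile).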

Guix docker build
The docker image was built as follows:

guix pack -f docker -S python=python3 -S /usr/bin/env=bin/env -S /bin=bin samtools gnuplot bwa snakemake bcftools python-pandas openjdk nss-certs bash wget unzip coreutils python2-minimal python-minimal sed python-matplotlib tectonic texlive-base

This produces a tar.gz named /gnu/store/a-long-hash-samtools-gnuplot-bwa-snakemake-bcftools-docker-pack.tar.gz and this file is then loaded as a docker image like this:

docker load < /gnu/store/a-long-hash-samtools-gnuplot-bwa-snakemake-bcftools-docker-pack.tar.gz

Then it's on to building the Selma docker image to add and configure gatk4:

docker build -t oskarv/selma .

And now you can run Selma with docker!

docker run --rm -ti -v $PWD:/data -w /data oskarv/selma snakemake -j --config version=hg38 interval=/path/to/hg38/interval_list

Reference files
The default reference files are the hg38 reference files from the Broad Institute, which hosts them on its public FTP server:
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
There is no password. You can download the whole hg38 folder with this command:
wget -m ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38

If you haven't indexed the fasta file with bwa you must do that before you run the pipeline.
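The sketch below shows one way to check whether the index is already there. It relies on bwa index writing five files next to the fasta (.amb, .ann, .bwt, .pac, .sa) and on bwa also accepting the .64 variants produced by bwa index -6. missing_bwa_index is a hypothetical helper name and the fasta path is a placeholder.

```shell
# missing_bwa_index FASTA
# Hypothetical helper: prints one line per bwa index file that is absent
# next to FASTA, and prints nothing if the index looks complete.
missing_bwa_index() {
    fasta="$1"
    for ext in amb ann bwt pac sa; do
        # Accept either FASTA.<ext> or FASTA.64.<ext> (from "bwa index -6")
        if [ ! -e "$fasta.$ext" ] && [ ! -e "$fasta.64.$ext" ]; then
            echo "$fasta.$ext"
        fi
    done
}

# Usage:
#   fasta=/path/to/hg38/reference.fasta
#   [ -z "$(missing_bwa_index "$fasta")" ] || bwa index "$fasta"
```

Indexing a human genome takes a while, so skipping it when the files already exist is worth the extra check.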

Hardware requirements and optimizations

In its current state the pipeline is optimized for a single server with 16 threads, 64 GB RAM and at least 500 GB of storage, assuming 8 fastq.gz files totalling 51 GB at ~30x coverage. With the test files in the fastq folder it should run on any laptop with 2 threads and 8 GB RAM, though 4 threads and 16 GB RAM are preferable. Apart from the reference files, the storage requirements for the test files are negligible.
On my current test machine, which has 16 threads and 64 GB RAM, the run time has been between 16 hours 14 minutes and 16 hours 25 minutes with 8 fastq.gz file pairs totalling ~51 GB at 30x coverage.
On a server with 16 threads and 16 GB RAM, the execution time is roughly 18 hours 30 minutes with the same input files when each scatter-gather tool is given 2 GB RAM.
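To compare a machine against the numbers above, something like the following could be used. This is a sketch that assumes Linux (nproc and /proc/meminfo); all three function names are made up for this example. Memory is rounded to the nearest GiB because the kernel reports slightly less than the nominal amount.

```shell
# enough_resources THREADS MEM_GB MIN_THREADS MIN_MEM_GB
# Hypothetical helper: succeeds if both values meet the given minimums.
enough_resources() {
    [ "$1" -ge "$3" ] && [ "$2" -ge "$4" ]
}

# Linux-only detection helpers (assumption: nproc and /proc/meminfo exist).
detect_threads() { nproc; }
detect_mem_gb() {
    # MemTotal is reported in kB; convert to GiB and round to nearest integer
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 + 0.5 }' /proc/meminfo
}

# Usage: 2 threads / 8 GB for the test files, 16 threads / 64 GB for a full run.
#   enough_resources "$(detect_threads)" "$(detect_mem_gb)" 2 8 \
#       || echo "below the minimum for the test files" >&2
```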


About Selma

Selma is a whole-genome germline variant calling workflow, initially developed at the University of Bergen and heavily inspired by the GATK best practices workflow. The guiding philosophy is that it should be easy to set up, easy to use, and efficient in its use of system resources. The workflow is based on Snakemake and supports Conda, Guix, Docker and (soon to be tested) Singularity execution modes.
Selma is named after the mythical Norwegian sea serpent that supposedly lives in Lake Seljord.

Tools

bwa version 0.7.17 - Maps fastq files to the reference genome
samtools version 1.14 - Receives the piped output from bwa and writes it to a bam file
The following tools are all from gatk version 4.3.0.0:
SplitIntervals - Splits interval list for scatter gather parallelization
FastqToSam - Converts fastq files to unmapped bam files
MergeBamAlignment - Merge aligned BAM file from bwa with the unmapped BAM file from FastqToSam
MarkDuplicates - Identifies duplicate reads
BaseRecalibrator - Generates recalibration table for Base Quality Score Recalibration
GatherBQSRReports - Gather base recalibration files from BaseRecalibrator
ApplyBQSR - Apply base recalibration from BaseRecalibrator
GatherBamFiles - Efficiently concatenate BAM files from ApplyBQSR
HaplotypeCaller - Call germline SNPs and indels via local re-assembly of haplotypes
GenotypeGVCFs - Perform genotyping on one pre-called sample from HaplotypeCaller
VariantRecalibrator - Build a recalibration model to score variant quality for filtering purposes
ApplyVQSR - Apply a score cutoff to filter variants based on a recalibration table