
QCD - Quality Control and Contamination Detection workflow.

QCD is a workflow for microbial Illumina sequencing quality control and contamination detection implemented in Snakemake.

Summary

As part of the SOP in the Snitkin lab, this pipeline should be run on raw sequencing data as soon as the data is available from the sequencing core.

In short, it performs the following steps:

  • FastQC generates HTML reports to assess the quality of sequencing reads before and after trimming.
  • Trimmomatic trims and filters low-quality bases and adapter sequences from the raw FASTQ reads.
  • fastq-scan estimates the genome coverage of the FASTQ files.
  • SPAdes assembles the trimmed reads into contigs.
  • Kraken2 provides detailed reports on the taxonomic composition of the trimmed reads.
  • The assembled contigs from SPAdes are then passed through Prokka for annotation, QUAST for assembly statistics, MLST for determining sequence type from housekeeping-gene sequences, AMRFinderPlus for identifying antimicrobial resistance genes, skani for identifying the closest reference genome, and BUSCO for assembly completeness statistics.
  • MultiQC aggregates the outputs from FastQC, Kraken2, Prokka, and QUAST into a single HTML report.

The workflow writes all output to the output prefix folder set in the config file (setup instructions below). Each workflow step gets its own folder, as shown:

results/2024-05-21_QCD_Project_MDHHS/
├── sample_name
│   ├── amrfinder
│   ├── downsample
│   ├── kraken
│   ├── mlst
│   ├── prokka
│   ├── quality_aftertrim
│   ├── quality_raw
│   ├── quast
│   ├── raw_coverage
│   ├── spades
│   └── trimmomatic
└── another_sample_name

Installation

If you are using Great Lakes HPC, ensure you are cloning the repository in your scratch directory. Change your_uniqname to your uniqname.


cd /scratch/esnitkin_root/esnitkin1/your_uniqname/

Clone the GitHub repository onto your system.


git clone https://github.com/Snitkin-Lab-Umich/QCD.git

Ensure you have successfully cloned QCD: type ls and you should see the newly created directory QCD. Move into the new directory.


cd QCD

Load the Bioinformatics, snakemake, and singularity modules on Great Lakes.


module load Bioinformatics snakemake singularity

This workflow makes use of Singularity containers available through the State Public Health Bioinformatics (StaPH-B) group. If you are working on Great Lakes (the UMich cluster), you can load the snakemake and singularity modules as shown above. However, if you are running the pipeline on your local machine or another computing platform, ensure you have snakemake and singularity installed.
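As a quick sanity check, both tools should print a version number once the modules are loaded (or once they are installed locally):


snakemake --version
singularity --version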

Set up the config, samples, and cluster files

If you are just testing this pipeline, the config and sample files are already loaded with test data, so you do not need to make any changes to them. However, it is a good idea to change the prefix (the name of your output folder) in the config file, to get a sense of which variables need to be modified when running your own samples through QCD.

Config

As input, the snakemake file takes a config file where you can set the path to sample.csv, the path to your raw sequencing reads, the path to the adapter FASTA file, etc. Instructions on how to modify config/config.yaml are found in config.yaml itself.
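For orientation, the settings you will touch most often look roughly like the sketch below. This is illustrative only: the key names here are hypothetical, so rely on the comments in config/config.yaml for the real ones.


prefix: 2024-05-21_QCD_Project_MDHHS   # hypothetical key: name of the output folder under results/
samples: config/sample.csv             # hypothetical key: path to the sample sheet
reads: path_to_your_raw_reads          # hypothetical key: directory containing your raw FASTQ files
adapters: config/adapters.fasta        # hypothetical key: adapter FASTA file passed to Trimmomatic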

Samples

Add samples to config/sample.csv following the explanation provided below. sample.csv should be a comma-separated file consisting of two columns: sample_id and illumina_r1.

  • sample_id is the prefix that should be extracted from your FASTQ reads. For example, in your raw FASTQ files directory, if you have a file called Rush_KPC_110_R1.fastq.gz, your sample_id would be Rush_KPC_110.

  • illumina_r1 is the name of the entire raw FASTQ file. In the same directory, if your file is called Rush_KPC_110_R1.fastq.gz, your illumina_r1 would be Rush_KPC_110_R1.fastq.gz. Only include forward reads.

You can create the sample.csv file using the following for loop. Replace path_to_your_raw_reads below with the actual path to your raw sequencing reads.


echo "sample_id,illumina_r1" > config/sample.csv

for read1 in path_to_your_raw_reads/*_R1.fastq.gz; do
    # Strip the directory and the _R1.fastq.gz suffix to get the sample ID
    sample_id=$(basename "$read1" _R1.fastq.gz)
    read1_basename=$(basename "$read1")
    echo "$sample_id,$read1_basename"
done >> config/sample.csv
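After the loop runs, config/sample.csv should look something like this (using the example file from above):


sample_id,illumina_r1
Rush_KPC_110,Rush_KPC_110_R1.fastq.gz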

Cluster file

Reduce the walltime (to ~6 hours) in config/cluster.json to ensure the jobs are being submitted in a timely manner.
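The resource keys in cluster.json correspond to the {cluster.account}, {cluster.partition}, {cluster.nodes}, {cluster.walltime}, {cluster.procs}, and {cluster.pmem} placeholders in the snakemake commands below. As a rough sketch (the shipped file may be structured differently, so treat this as an assumption), the default entry could look like:


{
    "__default__": {
        "account": "esnitkin1",
        "partition": "standard",
        "nodes": 1,
        "procs": 1,
        "pmem": "10gb",
        "walltime": "06:00:00"
    }
}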

Quick start

Start an interactive session before you run the pipeline.

salloc --mem-per-cpu=10G --account=esnitkin1

Run QCD on a set of samples.

Preview the steps in QCD by performing a dryrun of the pipeline.


snakemake -s QCD.smk --dryrun -p

Run QCD locally


snakemake -s QCD.smk -p --configfile config/config.yaml --cores all

Run QCD on Great Lakes HPC


snakemake -s QCD.smk -p --use-conda --use-singularity --use-envmodules -j 999 --cluster "sbatch -A {cluster.account} -p {cluster.partition} -N {cluster.nodes}  -t {cluster.walltime} -c {cluster.procs} --mem-per-cpu {cluster.pmem} --output=slurm_out/slurm-%j.out" --conda-frontend conda --cluster-config config/cluster.json --configfile config/config.yaml --latency-wait 1000 --nolock

Submit QCD as a batch job.

Change these SBATCH options: --job-name to a more descriptive name like run_QCD, --mail-user to your email address, and --time depending on the number of samples you have (it should be more than what you specified in cluster.json). Feel free to change the other flags if you are comfortable doing so. Once you have made the necessary changes, save the script below as run_QCD.sbat.

#!/bin/bash

#SBATCH --job-name=QCD
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=10gb
#SBATCH --time=06:15:00
#SBATCH --account=esnitkin1
#SBATCH --partition=standard

# Load necessary modules
module load Bioinformatics
module load snakemake singularity

# Run Snakemake
snakemake -s QCD.smk -p --use-conda --use-singularity --use-envmodules -j 999 --cluster "sbatch -A {cluster.account} -p {cluster.partition} -N {cluster.nodes}  -t {cluster.walltime} -c {cluster.procs} --mem-per-cpu {cluster.pmem} --output=slurm_out/slurm-%j.out" --conda-frontend conda --cluster-config config/cluster.json --configfile config/config.yaml --latency-wait 1000 --nolock
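Don't forget to submit QCD to Slurm:


sbatch run_QCD.sbat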


Dependencies

Near Essential

  • Snakemake
  • Singularity

Tool stack used in workflow

  • FastQC
  • Trimmomatic
  • fastq-scan
  • SPAdes
  • Kraken2
  • Prokka
  • QUAST
  • MLST
  • AMRFinderPlus
  • skani
  • BUSCO
  • MultiQC
