Luke Thompson edited this page Sep 23, 2020 · 45 revisions

Tourmaline Wiki

This Wiki describes in detail how to use Tourmaline. Navigate using the sidebar on the right.

Tourmaline is an amplicon sequence processing workflow for Illumina sequence data that uses QIIME 2 and the software packages it wraps. Tourmaline manages commands, inputs, and outputs using the Snakemake workflow management system.

Why should I use Tourmaline?

  • QIIME 2. The core commands of Tourmaline, including the DADA2 package, are all commands of QIIME 2, one of the most popular amplicon sequence analysis software tools available. You can print all of the QIIME 2 and other shell commands of your workflow before or while running the workflow.
  • Snakemake. Managing the workflow with Snakemake provides several benefits:
    • The configuration file holds all parameters in one place, so you can see what your workflow is doing and make changes for a subsequent run.
    • Directory structure is the same for every Tourmaline run, so you always know where your outputs are.
    • On-demand commands mean that only the commands required for output files not yet generated are run, saving time and computation when re-running part of a workflow.
  • Parameter optimization. The configuration file and standard directory structure make it simple to test and compare different parameter sets to optimize your workflow. Included code helps choose read truncation parameters and identify outliers in representative sequences (ASVs).
  • Reports. Every Tourmaline run produces an HTML report containing a summary of your metadata and outputs, with links to web-viewable QIIME 2 visualization files.
  • Tourmaline Toolkit. Analyze multiple outputs programmatically using the provided code and notebooks written in R and Python.
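To see which commands a rule would run before committing to it, Snakemake's dry-run mode can be combined with its option to print shell commands. The sketch below wraps that in a small helper; the function name preview_rule is hypothetical and not part of Tourmaline.

```shell
# Hypothetical helper (not part of Tourmaline): list the jobs and shell
# commands Snakemake would run for a target, without executing them.
preview_rule() {
    # --dryrun: plan only, run nothing
    # --printshellcmds: show the shell command behind each job
    snakemake --dryrun --printshellcmds "$1"
}
```

For example, preview_rule dada2_pe_denoise run from the tourmaline directory should list only the jobs whose output files do not exist yet.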

Ready to get started? If this is your first time using Tourmaline or Snakemake, you may want to browse the Wiki pages on the right. If you want to get started right away, check out the Quick Start below.

Quick Start

Tourmaline provides Snakemake rules for DADA2 (single-end and paired-end) and Deblur (single-end). For each type of processing, there are five steps:

  1. the denoise rule imports FASTQ data and runs denoising, generating a feature table and representative sequences;
  2. the taxonomy rule assigns taxonomy to representative sequences;
  3. the filter rule (optional) filters out undesired taxonomic groups or individual sequences from the feature table and representative sequences;
  4. the diversity rule performs representative sequence curation, core diversity analyses, and alpha and beta group significance tests on filtered or unfiltered data; and
  5. the report rule generates an HTML report of the outputs (filtered or unfiltered) plus metadata, inputs, and parameters.
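The rule names used on this page follow a {method}_{step} pattern, with the filter state appended for the diversity and report steps. The sketch below prints the main targets in run order. Only the dada2_pe names come from this page; the dada2_se and deblur_se prefixes, and the function name tourmaline_targets, are assumptions made for illustration. The optional filter rule is left out because its exact name is not given here.

```shell
# Hypothetical helper: print the main Snakemake targets for one method,
# in the order they are normally run.
#   $1 method: dada2_pe (from this page); dada2_se and deblur_se are assumed
#   $2 filter: unfiltered (default) or filtered
tourmaline_targets() {
    method=$1
    filter=${2:-unfiltered}
    echo "${method}_denoise"
    echo "${method}_taxonomy"
    echo "${method}_diversity_${filter}"
    echo "${method}_report_${filter}"
}
```

Calling tourmaline_targets dada2_pe prints the four targets used in the walkthrough below, one per line.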

Install

Tourmaline requires a Conda installation of QIIME 2, Snakemake, and other dependencies:

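# Download the QIIME 2 2020.8 environment file (macOS shown; on Linux, use qiime2-2020.8-py36-linux-conda.yml)
wget https://data.qiime2.org/distro/core/qiime2-2020.8-py36-osx-conda.yml
# Create and activate the QIIME 2 environment
conda env create -n qiime2-2020.8 --file qiime2-2020.8-py36-osx-conda.yml
conda activate qiime2-2020.8
# Install Snakemake and the other Tourmaline dependencies
conda install -c bioconda snakemake biopython tabulate pandoc tabview
conda install -c bioconda bioconductor-msa bioconductor-odseq
pip install git+https://github.com/biocore/empress.git
# Refresh the QIIME 2 cache so the newly installed plugin is registered
qiime dev refresh-cache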

Setup

Start by cloning the Tourmaline directory and files:

git clone https://github.com/aomlomics/tourmaline.git

If this is your first time running Tourmaline, you'll need to set up your directory. See the Wiki's Setup page for instructions. Briefly, to process the Test data:

  • Put reference database taxonomy and FASTA (as imported QIIME 2 archives) in 01-imported.
  • Edit FASTQ manifests manifest_se.csv and manifest_pe.csv in 00-data so file paths match the location of your tourmaline directory.
  • Create a symbolic link from Snakefile_mac or Snakefile_linux (depending on your system) to Snakefile.

Or to process Your data:

  • Put reference database taxonomy and FASTA files in 00-data or imported QIIME 2 archives in 01-imported.
  • Edit FASTQ manifests manifest_se.csv and manifest_pe.csv so file paths point to your .fastq.gz files (they can be anywhere on your computer) and sample names match the metadata file.
  • Edit metadata file metadata.tsv to contain your sample names and any relevant metadata for your samples.
  • Edit configuration file config.yaml to change PCR locus/primers, DADA2/Deblur parameters, and rarefaction depth.
  • Create a symbolic link from Snakefile_mac or Snakefile_linux (depending on your system) to Snakefile.
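The symbolic link step in both lists can be sketched as follows. The function name link_snakefile is hypothetical, and treating every non-macOS system as Linux is an assumption of this sketch.

```shell
# Hypothetical helper: point Snakefile at the platform-specific file.
# Assumes it is run from the top of the tourmaline directory.
link_snakefile() {
    # uname -s prints "Darwin" on macOS; anything else is treated as Linux here.
    case "$(uname -s)" in
        Darwin) snakefile_src=Snakefile_mac ;;
        *)      snakefile_src=Snakefile_linux ;;
    esac
    # -s creates a symbolic link; -f replaces an existing Snakefile link.
    ln -sf "$snakefile_src" Snakefile
}
```

After running link_snakefile, ls -l Snakefile should show which file the link points to.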

If you've run Tourmaline on your dataset before, you can initialize a new Tourmaline directory with the files and symlinks of an existing one using the command below:

cd /PATH/TO/NEW/TOURMALINE
scripts/initialize_dir_from_existing_tourmaline_dir.sh /PATH/TO/EXISTING/TOURMALINE
# then make any changes to your configuration before running

Run Snakemake

Shown here is the DADA2 paired-end workflow. From the tourmaline directory (which you may rename), run Snakemake with the denoise rule as the target:

snakemake dada2_pe_denoise

Pausing after the denoise step allows you to make changes before proceeding:

  • Check the table summaries and representative sequence lengths to determine if DADA2 or Deblur parameters need to be modified. If so, you can rename or delete the output directories and then rerun the denoise rule.
  • View the table visualization to decide on an appropriate subsampling (rarefaction) depth. Then modify the parameters "alpha_max_depth" and "core_sampling_depth" in config.yaml.
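Editing config.yaml by hand works fine. If you prefer to script the change, here is a minimal sketch. It assumes each parameter sits on its own line in the form "key: value", which may not match your config.yaml exactly. The function name set_config_value and the depth of 1200 in the usage note are placeholders.

```shell
# Hypothetical helper: replace the value of a top-level "key: value" line.
#   $1 key, $2 new value, $3 file (defaults to config.yaml)
# Note: anything after the colon on that line, including a trailing
# comment, is overwritten.
set_config_value() {
    cfg_key=$1
    cfg_value=$2
    cfg_file=${3:-config.yaml}
    # Write to a scratch file and move it into place, which avoids the
    # differences between GNU and BSD (macOS) "sed -i".
    sed "s|^${cfg_key}:.*|${cfg_key}: ${cfg_value}|" "$cfg_file" > "${cfg_file}.tmp" &&
        mv "${cfg_file}.tmp" "$cfg_file"
}
```

For example, set_config_value core_sampling_depth 1200 followed by set_config_value alpha_max_depth 1200 would set both depths, with 1200 standing in for whatever depth you chose from the table visualization.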

After you are satisfied with your parameters and files, run the taxonomy rule:

snakemake dada2_pe_taxonomy

You may wish to pause again to examine the taxonomy summary and bar plot. Here you have the option to filter your feature table and representative sequences by taxonomy or feature ID; see the Wiki's Run page for instructions. For now, let's run the workflow without filtering. When you do filter your data, all output from that point on will go in a separate folder so you can compare output with and without filtering.

Continue with the diversity rule (for unfiltered data):

snakemake dada2_pe_diversity_unfiltered

Finally, run the report rule (for unfiltered data):

snakemake dada2_pe_report_unfiltered

Troubleshooting

  • The whole workflow should take ~3–5 minutes to complete with the test data. A normal dataset may take several hours to complete.
  • If any of the above commands don't work, read the error messages carefully, try to figure out what went wrong, and attempt to fix the offending file. A common issue is that the file paths in your FASTQ manifest file need to be updated.
  • Do not use the --cores option. Tourmaline should be run with 1 core (default).
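Because stale manifest paths are a common cause of failure, a quick check before running the denoise rule can save time. The sketch below assumes the manifest is a comma-separated file with a header line starting with sample-id and the FASTQ path in the second column; adjust it if your manifest differs. The function name check_manifest is hypothetical.

```shell
# Hypothetical helper: report manifest entries whose FASTQ file is missing.
# Assumes a CSV with header "sample-id,..." and the file path in column 2.
# Returns 0 if every listed file exists, 1 otherwise.
check_manifest() {
    missing=0
    while IFS=, read -r sample fastq direction; do
        [ "$sample" = "sample-id" ] && continue   # skip the header line
        [ -z "$fastq" ] && continue               # skip blank lines
        if [ ! -e "$fastq" ]; then
            echo "missing: $fastq"
            missing=$((missing + 1))
        fi
    done < "$1"
    [ "$missing" -eq 0 ]
}
```

Running check_manifest 00-data/manifest_pe.csv prints one line per path that does not exist and nothing if all paths resolve.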

Power tips

  • The whole workflow can be run with just the command snakemake dada2_pe_report_unfiltered (without filtering representative sequences) or snakemake dada2_pe_report_filtered (after filtering representative sequences). Warning: If your parameters are not optimized, the results will be suboptimal (garbage in, garbage out).
  • If you want to make a fresh run and not save the previous output, simply delete the output directories (e.g., 02-output-{method}-{filter} and 03-report) generated in the previous run.
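The cleanup for a fresh run can be sketched as below. The directory pattern 02-output-{method}-{filter} and the name 03-report come from the tip above; the concrete value dada2-pe in the usage note is an assumed example of {method}, and the function name clean_outputs is hypothetical. Check the names in your own tourmaline directory before deleting anything.

```shell
# Hypothetical helper: remove the output directories of a previous run.
#   $1 method part of the directory name (e.g. dada2-pe, assumed)
#   $2 filter part of the directory name (unfiltered or filtered)
# Run from the top of the tourmaline directory.
clean_outputs() {
    out_method=$1
    out_filter=$2
    for out_dir in "02-output-${out_method}-${out_filter}" 03-report; do
        if [ -d "$out_dir" ]; then
            echo "removing $out_dir"
            rm -r "$out_dir"
        fi
    done
}
```

For example, clean_outputs dada2-pe unfiltered removes only those two directories and leaves 00-data and 01-imported untouched.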