Here, we will use QIIME 2 to analyze and visualize microbial diversity starting from raw DNA sequences in fastq files. Compared with QIIME 1, QIIME 2 offers several new ways of analyzing NGS data; the pipeline has changed significantly on the bioinformatics side, but NOT on the biological side.
Below is a list of a few terms you should know when utilizing QIIME 2.
- Action - a generic term for a method or visualizer.
- Artifact - zipped input or output data for QIIME actions. Artifacts have the file extension .qza.
- Parameter - an input to an action. For instance, --p-trim-left is a parameter that takes an integer value for the number of bases that the user wishes to trim off from the start of the sequence.
- Plugin - a general term for an external tool that is built around QIIME 2.
- Visualization - output data from a QIIME visualizer, with the file extension .qzv. These can be viewed online.
Please use this pipeline if your fastq files are already demultiplexed, meaning that each pair of fastq files (R1 and R2) contains sequences from ONE sample.
Please also keep in mind that QIIME 2 is a work in progress, so some features may not yet be available. The QIIME 2 team is working hard on it, and if you have very specific questions that we cannot answer, please sign up and post them on the QIIME 2 forum.
Link to the main QIIME 2 website (for more tutorials and detailed documentation of the pipeline).
QIIME 2 is a microbiome analysis pipeline, and it is significantly different from the previous version QIIME 1. Instead of directly using data files such as FASTQ and FASTA files, QIIME 2 utilizes artifacts. See definition above.
Here is a list of files you must have in order to run the QIIME 2 pipeline.
A mapping file
R1 fastq
- This file contains reads returned by the sequencer first.
R2 fastq
- This file contains reads returned by the sequencer second.
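Before importing, it can help to confirm that every R1 file really has an R2 mate. The sketch below is only a convenience check, not part of QIIME 2; the function name check_pairs is made up for this example, and it assumes CASAVA-style file names in which "_R1_" and "_R2_" mark the read number.

```shell
# Sketch: list R1 files in a directory that have no matching R2 file.
# Assumes CASAVA-style names in which "_R1_" / "_R2_" marks the read number.
check_pairs() (
    dir="$1"
    missing=0
    for r1 in "$dir"/*_R1_*.fastq.gz; do
        [ -e "$r1" ] || continue    # skip the literal pattern when nothing matches
        base="$(basename "$r1")"
        # Build the expected mate name by swapping the read-number field.
        r2="$dir/$(printf '%s\n' "$base" | sed 's/_R1_/_R2_/')"
        if [ ! -e "$r2" ]; then
            echo "missing mate for: $base"
            missing=$((missing + 1))
        fi
    done
    echo "unpaired R1 files: $missing"
)

# Example: check_pairs raw_reads_paired
```

If the last line reports anything other than 0, fix the missing files before running the import step.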
QIIME 2 supports various data formats for sequence files and BIOM tables; however, the descriptions of these formats are still being developed. Some common data formats are described in the Importing Data tutorial.
source activate qiime2-2017.12
source tab-qiime
mkdir q2-tutorial
cd q2-tutorial
cp -r /data/share/BITMaB-2018/18S_metabarcoding_Project_FranPanama/* .
Here is an overview of the general steps of the QIIME pipeline for already demultiplexed reads that we will carry out during the BITMaB workshop (click links to jump to detailed instructions for each step):
Step 1: Importing data, summarizing the results, and examining the quality of the reads
Step 2: Quality controlling sequences and building Feature Table and Feature Data
Step 3: Assigning Taxonomy
Step 4: Summarizing Feature Table and Feature Data
Step 5: Generating a phylogenetic tree
Step 6: Analyzing Alpha and Beta diversities
- NOTE: For the purposes of this tutorial, we are running all the analysis in a single directory and using non-descriptive names when assigning output files.
In order to work with your data within QIIME 2, we must first import the FASTQ files as a QIIME artifact. The action to import files is qiime tools import.
Let's start by pulling up the help menu for the qiime tools action. To do this for any QIIME 2 action, run that action followed by --help, as shown below.
qiime tools --help
And, to get more info on the commands associated with an action, run the action along with the desired command as shown below.
qiime tools import --help
To get more info on an option associated with a command, run the action and command along with that option, as shown below. FYI, this may not work for all options.
qiime tools import --show-importable-formats --help
Now, we can use the import command to import our files as a QIIME artifact. The data format used here is called CasavaOneEightSingleLanePerSampleDirFmt. In this format, there are two fastq.gz files for each sample. The forward and reverse read file names for a single sample might look like L2S357_15_L001_R1_001.fastq.gz and L2S357_15_L001_R2_001.fastq.gz, respectively. The underscore-separated fields in this file name are the sample identifier, the barcode sequence or a barcode identifier, the lane number, the read number, and the set number.
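To make the five fields concrete, here is a small sketch that splits one such file name apart. The function name parse_casava_name is hypothetical (it is not a QIIME 2 command), and peeling the fields off from the right is just one way to do it; it is chosen here so that a sample identifier containing an underscore would still come out in one piece.

```shell
# Sketch: split a CASAVA 1.8 style file name into its five fields.
parse_casava_name() (
    name="${1%.fastq.gz}"                      # drop the extension
    # Peel fields off from the right-hand side, one underscore at a time.
    set_no="${name##*_}";  name="${name%_*}"   # set number, e.g. 001
    read_no="${name##*_}"; name="${name%_*}"   # read number, R1 or R2
    lane="${name##*_}";    name="${name%_*}"   # lane number, e.g. L001
    barcode="${name##*_}"; name="${name%_*}"   # barcode sequence or identifier
    # Whatever is left is the sample identifier.
    printf 'sample=%s barcode=%s lane=%s read=%s set=%s\n' \
        "$name" "$barcode" "$lane" "$read_no" "$set_no"
)

parse_casava_name L2S357_15_L001_R1_001.fastq.gz
# prints: sample=L2S357 barcode=15 lane=L001 read=R1 set=001
```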
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path raw_reads_paired/ \
--source-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path demux-paired-end.qza
- NOTE: If your paired-end data are multiplexed, you may use the following command, after importing the multiplexed files as a QIIME artifact, to demultiplex your reads based on sample names.
qiime demux emp-paired \
--m-barcodes-file sample-metadata.tsv \
--m-barcodes-category BarcodeSequence \
--i-seqs emp-paired-end-sequences.qza \
--o-per-sample-sequences demux
qiime demux summarize \
--i-data demux-paired-end.qza \
--o-visualization demux.qzv
- Here, you must copy the .qzv output over to your computer and open demux.qzv in the QIIME 2 viewer (view.qiime2.org).
QIIME 2 has plugins for various quality control methods such as DADA2 and Deblur. The result of both of these methods will be a FeatureTable[Frequency] QIIME 2 artifact containing counts (frequencies) of each unique sequence in each sample in the dataset, and a FeatureData[Sequence] QIIME 2 artifact, which maps feature identifiers in the FeatureTable to the sequences they represent. We will use DADA2 in this tutorial. The FeatureTable[Frequency] and FeatureData[Sequence] artifacts are analogous to QIIME 1's Biom table and rep_set fasta file, respectively.
dada2 denoise-paired requires four parameters: --p-trim-left-f, --p-trim-left-r, --p-trunc-len-f, and --p-trunc-len-r. --p-trim-left m trims off the first m bases of each sequence, and --p-trunc-len n truncates each sequence at position n. The f and r in each parameter stand for the forward and reverse reads, respectively.
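To see what the two kinds of parameter do to a single read, the sketch below applies a trim-left value m and a truncation length n to a plain string. It is only an illustration and does not call DADA2; the function name trim_and_truncate is made up, n is assumed to be greater than 0, and the "discarded" branch reflects our understanding that DADA2 drops reads shorter than the truncation length.

```shell
# Sketch: mimic --p-trim-left m and --p-trunc-len n on one read (a plain string).
trim_and_truncate() (
    seq="$1"; m="$2"; n="$3"
    if [ "${#seq}" -lt "$n" ]; then
        # Assumption: reads shorter than the truncation length are thrown away.
        echo "discarded"
    else
        # Keep bases m+1 through n, so the result has length n - m.
        printf '%s\n' "$seq" | cut -c "$((m + 1))-$n"
    fi
)

trim_and_truncate ACGTACGTAC 2 8
# prints: GTACGT
```

Note that the surviving read length is n - m, which matters for paired-end data because the forward and reverse reads still need to overlap after trimming.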
Please consider the question below before you quality trim the sequences.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux-paired-end.qza \
--p-trim-left-f VALUE \
--p-trim-left-r VALUE \
--p-trunc-len-f VALUE \
--p-trunc-len-r VALUE \
--p-n-threads 12 \
--o-representative-sequences rep-seqs.qza \
--o-table table.qza
If this step completed correctly, your command line prompt should notify you with the following information:
Saved FeatureTable[Frequency] to: table.qza
Saved FeatureData[Sequence] to: rep-seqs.qza
The default QIIME 2 workflow does not include a typical OTU picking step. The developers now recommend working with "Amplicon Sequence Variants", whereby you go directly into taxonomy assignment after using DADA2/Deblur to quality filter your dataset.
Here, we are comparing our metabarcoding sequences to the SILVA reference database to assign taxonomy based on pairwise identity of rRNA sequences.
We are using the manually curated SILVA database to assign taxonomy to unknown (eukaryotic) 18S rRNA sequences.
The databases have been pre-downloaded onto the server from the ARB-SILVA website:
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path /usr/local/share/SILVA_databases/SILVA_128_QIIME_release/rep_set/rep_set_18S_only/99/99_otus_18S.fasta \
--output-path 99_otus_18S
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-path /usr/local/share/SILVA_databases/SILVA_128_QIIME_release/taxonomy/18S_only/99/majority_taxonomy_7_levels.txt \
--source-format HeaderlessTSVTaxonomyFormat \
--output-path majority_taxonomy_7_levels
Taxonomy assignment can be done using either SILVA's "consensus" or "majority" taxonomy mapping files - we STRONGLY recommend you read the SILVA release notes to understand the differences in how these have been constructed:
For eukaryotic 18S data - especially for meiofaunal groups where the databases are pretty sparse - we recommend using the majority_taxonomy_7_levels.txt taxonomy mapping file, since it does a better job of incorporating "environmental" rRNA sequences and the seven levels have been manually curated to better reflect the known phylogenetic classifications of diverse eukaryotic groups.
Here is the complete explanation of the taxonomy differences from the SILVA database curators:
Taxonomy strings that are either consensus (all taxa strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster). If a taxonomy string fails to be consensus or majority, then it becomes ambiguous, moving up the levels of taxonomy until consensus/majority taxonomy strings are met.
For example, if a cluster had two reads, and one taxonomy string was:
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter sp. HW3
and the second taxonomy string was:
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter smithii
Then for either consensus or majority strings, the level 7 (0 is the first level, the domain) data would become ambiguous, as the species levels do not match. The above string for the representative sequence taxonomy mapping file becomes:
Because the taxonomy strings are not perfectly matched in terms of names/depths across all of the SILVA data, this can lead to some taxonomies being more ambiguous with my approach (exact string matches) than they actually are, particularly for the eukaryotes. There are over 1.5 million taxonomy strings in the non-redundant SILVA 119 release (even more in later releases), so I can't fault the maintainers of SILVA for these taxonomy strings being imperfect from a parsing/bioinformatics perspective.
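The exact-string-matching idea described above can be sketched for the simplest case of two taxonomy strings. This is our own illustration, not the curators' script: the function name consensus_taxonomy is made up, and the AMBIGUOUS label is a placeholder rather than the wording used in the SILVA files. It walks the levels from the domain downwards, keeps every level where the two strings agree, and stops with the placeholder at the first level where they differ.

```shell
# Sketch: collapse two taxonomy strings to the levels they share.
# "AMBIGUOUS" is a placeholder label chosen for this example only.
consensus_taxonomy() (
    printf '%s\n%s\n' "$1" "$2" | awk -F';' '
        NR == 1 { n = split($0, first, ";"); next }   # remember the first string
        {
            out = ""
            for (i = 1; i <= NF && i <= n; i++) {
                sep = (i > 1) ? ";" : ""
                # Stop at the first level where the exact strings differ.
                if ($i != first[i]) { out = out sep "AMBIGUOUS"; break }
                out = out sep $i
            }
            print out
        }'
)

consensus_taxonomy \
    "D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria" \
    "D_0__Archaea;D_1__Euryarchaeota;D_2__Thermococci"
# prints: D_0__Archaea;D_1__Euryarchaeota;AMBIGUOUS
```

The two input strings in the example call are shortened stand-ins, not entries copied from the SILVA files. A majority version would need all strings in a cluster and a 90% vote at each level, which this two-string sketch does not attempt.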
Here we are using BLAST to assign taxonomy to environmental rRNA sequences, using a 90% pairwise identity cutoff against the curated SILVA database (so any rRNA OTUs with <90% identity will come back with a taxonomic string as "unassigned").
qiime feature-classifier classify-consensus-blast \
--i-query rep-seqs.qza \
--i-reference-taxonomy majority_taxonomy_7_levels.qza \
--i-reference-reads 99_otus_18S.qza \
--o-classification taxonomy \
--p-perc-identity 0.90 \
--p-maxaccepts 1
Change the filename on your table to "unfiltered" so we can keep track of the original qiime output.
mv table.qza unfiltered-table.qza
QIIME2 has a number of different options for classifying your sequences. For simplicity (and familiarity) we are using BLAST, but other options offer more sophisticated algorithmic methods for taxonomy assignment:
classify-consensus-blast BLAST+ consensus taxonomy classifier
classify-consensus-vsearch VSEARCH consensus taxonomy classifier
classify-sklearn Pre-fitted sklearn-based taxonomy classifier
extract-reads Extract reads from reference
fit-classifier-naive-bayes Train the naive_bayes classifier
fit-classifier-sklearn Train an almost arbitrary scikit-learn
qiime taxa filter-table \
--i-table unfiltered-table.qza \
--i-taxonomy taxonomy.qza \
--p-include metazoa \
--o-filtered-table table.qza
qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv \
--m-sample-metadata-file mapping_file_panama_MAY_2017.tsv
qiime feature-table tabulate-seqs \
--i-data rep-seqs.qza \
--o-visualization rep-seqs.qzv
- Here, you must copy the .qzv outputs over to your computer and open table.qzv and rep-seqs.qzv in the QIIME 2 viewer (view.qiime2.org).
qiime alignment mafft --i-sequences rep-seqs.qza --o-alignment aligned-rep-seqs.qza
qiime alignment mask --i-alignment aligned-rep-seqs.qza --o-masked-alignment masked-aligned-rep-seqs.qza
qiime phylogeny fasttree --i-alignment masked-aligned-rep-seqs.qza --o-tree unrooted-tree.qza
qiime phylogeny midpoint-root --i-tree unrooted-tree.qza --o-rooted-tree rooted-tree.qza
Here you must make a decision about the rarefaction values you will use to carry out ecological diversity analyses on your dataset - this is dependent on the sequencing depth you observe across all your samples (e.g. the minimum value will throw out any samples with a sequencing depth below that threshold).
View the table.qzv QIIME 2 visualization, and in particular the Interactive Sample Detail tab in that visualization. What value would you choose to pass for --p-sampling-depth below? How many samples will be excluded from your analysis based on this choice? How many total sequences will you be analyzing in the core-metrics-phylogenetic command?
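One way to reason about these questions is to try a few candidate depths against the per-sample sequence counts. The sketch below assumes you have copied the counts from the Interactive Sample Detail tab into tab-separated lines of the form sample, count, with no header line; the function name rarefaction_summary is made up for this example. Because rarefaction subsamples every retained sample to the same depth, the total number of sequences analyzed is simply the depth multiplied by the number of samples kept.

```shell
# Sketch: summarize the effect of a candidate sampling depth.
# Input on stdin: one "sample<TAB>count" line per sample, no header.
rarefaction_summary() (
    depth="$1"
    awk -F'\t' -v d="$depth" '
        ($2 + 0) >= (d + 0) { kept++ }      # sample has enough sequences
        ($2 + 0) <  (d + 0) { dropped++ }   # sample falls below the depth
        END {
            printf "kept=%d dropped=%d total_sequences=%d\n", kept, dropped, kept * d
        }'
)

printf 'S1\t5000\nS2\t1200\nS3\t800\n' | rarefaction_summary 1000
# prints: kept=2 dropped=1 total_sequences=2000
```

The sample names and counts in the example call are invented; substitute your own values.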
In the script below, replace MINIMUM and MAXIMUM with the values you choose to use for rarefaction.
qiime diversity alpha-rarefaction \
--i-table table.qza \
--i-phylogeny rooted-tree.qza \
--p-min-depth MINIMUM \
--p-max-depth MAXIMUM \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--o-visualization alpha-rarefaction.qzv
- Here, you must copy the alpha-rarefaction.qzv output over to your computer and open it in the QIIME 2 viewer (view.qiime2.org).
Script to generate taxonomy bar charts:
First do this for the unfiltered data, and view the .qzv output in the QIIME 2 viewer:
qiime taxa barplot \
--i-table unfiltered-table.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--o-visualization unfiltered-taxa-bar-plots.qzv
Now generate the same taxonomy plots for the filtered (Metazoa-only) 18S dataset, and visualize this file as well:
qiime taxa barplot \
--i-table table.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--o-visualization taxa-bar-plots.qzv
Beta Diversity Core Analyses (runs a whole bunch of metrics at once):
qiime diversity core-metrics-phylogenetic \
--i-phylogeny rooted-tree.qza \
--i-table table.qza \
--p-sampling-depth VALUE \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--output-dir core-metrics-results
qiime diversity alpha-group-significance \
--i-alpha-diversity core-metrics-results/faith_pd_vector.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--o-visualization core-metrics-results/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity core-metrics-results/evenness_vector.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--o-visualization core-metrics-results/evenness-group-significance.qzv
- View the .qzv outputs in the QIIME 2 viewer and answer the following questions.
qiime diversity alpha-correlation \
--i-alpha-diversity core-metrics-results/faith_pd_vector.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--o-visualization core-metrics-results/faith-pd-correlation.qzv
qiime diversity alpha-correlation \
--i-alpha-diversity core-metrics-results/evenness_vector.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--o-visualization core-metrics-results/evenness-correlation.qzv
qiime diversity beta-group-significance \
--i-distance-matrix core-metrics-results/unweighted_unifrac_distance_matrix.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--m-metadata-category Matrix \
--o-visualization core-metrics-results/unweighted-unifrac-Matrix-group-significance.qzv
qiime emperor plot \
--i-pcoa core-metrics-results/unweighted_unifrac_pcoa_results.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--p-custom-axis Depths \
--o-visualization core-metrics-results/unweighted-unifrac-emperor-Depths.qzv
qiime emperor plot \
--i-pcoa core-metrics-results/bray_curtis_pcoa_results.qza \
--m-metadata-file mapping_file_panama_MAY_2017.tsv \
--p-custom-axis Depths \
--o-visualization core-metrics-results/bray-curtis-emperor-Depths.qzv
qiime metadata tabulate \
--m-input-file taxonomy.qza \
--o-visualization taxonomy.qzv
- View the .qzv outputs in the QIIME 2 viewer.
Loading a phylogenetic tree into iTOL - instructions and a guide on how to load taxonomy metadata onto a phylogeny and view it in iTOL
Follow the instructions on this QIIME 2 forum post to convert your Feature Table into a tsv file.
An interactive, browser-based data visualization framework for exploring QIIME outputs. Use your OTU table with the taxonomy and mapping file embedded - instructions are here:
- NOTE: Phinch only works with BIOM 1.0 files, which are no longer the default output in QIIME 1.9 and higher - see the file conversion instructions at the link above.
Statistical test that looks for enrichment or depletion of OTUs across sample metadata categories (e.g. Prespill vs. Postspill samples). You can run this analysis on the Huttenhower Lab's online Galaxy server (above link); you will need to convert your OTU table into tab-delimited format and add metadata headings before you can run LEfSe.
Phyloseq - An R package for visualizing QIIME outputs
Offers sleeker visualizations and more flexibility than the visualizations offered within QIIME. You can produce heatmaps and box plots, and trim down your OTU table to look only at community patterns within certain OTUs or taxonomic groups. Great for generating publication-ready figures, but requires quite a bit of R knowledge and tweaking to get working.