Maliampi is a phylogenetic placement-based pipeline for handling 16S rRNA amplicons (the most common type of microbiome study data).
Maliampi's overall philosophy is to embrace and work with the inherent ambiguity of any PCR- and sequencing-based approach, and to deal rationally with the limits of the available references mapping between sequence space and taxonomy.
Maliampi proceeds in four steps:
- Create sequence variants
- Make a reference package
- Place on the reference package
- Classify
Containerized dependencies
Maliampi uses docker containers for all of the software called by the pipeline. This means no fussing with getting arcane academic software properly matched up with its dependencies, and no issues reproducing the pipeline environment on your system.
Probability right to the end
Using 16S rRNA gene amplicons to establish the structure and composition of a microbial community is an inherently uncertain process. This pipeline uses software and approaches that respect this uncertainty and carry it through to the terminal analyses.
Alpha-diversity, phylogenetic distance, and taxonomic classification are all done respecting the uncertainty involved in each step.
Runs the same locally (via docker), on slurm or PBS (via singularity), or on the cloud.
Maliampi's use of containerized software and nextflow allows the pipeline to run the same way on a variety of different computational resources. You have slurm, but your collaborators have access to a PBS cluster? No issue. On the go? You can run the pipeline's computationally heavier tasks on AWS while running simpler steps with docker right on your computer.
Modular design
(TODO) While maliampi can start with fastq files straight from the sequencer, you can also opt to use your own sequence variant / OTU generation strategy and just use maliampi to make a reference package.
Maliampi uses nextflow as a workflow engine.
Using maliampi requires nextflow to be installed and configured on your computer. Here we will go through the steps of doing so.
(A previous attempt at maliampi used sciluigi. This is now deprecated, with the code remaining around for now.)
You will need:
- nextflow, which in turn requires the java runtime environment
- Docker (if you want to run locally)
For a variety of reasons, maliampi works best on a unix-like operating system (linux, MacOS, WSL). The directions here assume a unix-like system.
Install nextflow
Installing nextflow takes only three steps:
i) Install the java runtime environment (at least version 8). I suggest OpenJDK 8.
ii) Use curl to download the latest nextflow binary:
curl -s https://get.nextflow.io | bash
iii) Test nextflow
./nextflow run hello
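(Optional) To invoke nextflow as nextflow rather than ./nextflow, you can move the binary somewhere on your PATH; /usr/local/bin is one common choice:
chmod +x nextflow
sudo mv nextflow /usr/local/bin/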
(Optional) Install docker engine community edition.
This is needed if you want to run things locally for testing and is strongly suggested. Directions can be found on the docker site.
Windows + Docker is not a happy marriage right now, particularly when attempting to bind local directories into containers.
Create (or edit) a common nextflow config file
We need to tell nextflow what sort of computational resources are available in your setup.
Different steps in maliampi require different resources. Maliampi gives each step (or process) a label:
- 'io_limited': processes that are not multithreaded and are generally limited by the speed at which files can be read and written.
- 'io_limited_local': processes that are I/O-limited and should also be executed locally.
- 'io_mem': processes that are limited by the speed of reading and writing files, but also require a lot of memory.
- 'multithread': processes that have relatively modest cpu and memory requirements, but can benefit from multiple cpus and are not limited by reading and writing files.
- 'mem_veryhigh': multithreaded processes that require very large amounts of memory.
Maliampi works best when the number of cpus and the amount of memory for each of these labels are established in the common nextflow config file.
In addition, the config file should tell nextflow to use docker locally (or AWS / GCS / slurm / PBS / singularity, depending on your setup). Details can be found in the nextflow documentation, and your local high-performance computing staff can also help here.
An example config file that uses docker locally (and can be copied into ~/.nextflow/config to get you started):
profiles {
    standard {
        process {
            executor = 'local'
            withLabel: 'io_limited' {
                cpus = 1
                memory = 4.GB
            }
            withLabel: 'io_limited_local' {
                cpus = 1
                memory = 4.GB
            }
            withLabel: 'io_mem' {
                cpus = 1
                memory = 16.GB
            }
            withLabel: 'multithread' {
                cpus = 4
                memory = 16.GB
            }
            withLabel: 'mem_veryhigh' {
                cpus = 2
                memory = 32.GB
            }
        }
        docker {
            enabled = true
        }
    } // end standard profile
    hybrid { // start hybrid profile combining local and awsbatch
        process {
            withLabel: 'io_limited' {
                cpus = 1
                memory = 4.GB
                executor = 'awsbatch'
            }
            withLabel: 'io_limited_local' {
                cpus = 1
                memory = 4.GB
                executor = 'local'
                docker.enabled = true
            }
            withLabel: 'io_mem' {
                cpus = 1
                memory = 16.GB
                executor = 'awsbatch'
            }
            withLabel: 'multithread' {
                cpus = 4
                memory = 16.GB
                executor = 'awsbatch'
            }
            withLabel: 'mem_veryhigh' {
                cpus = 2
                memory = 32.GB
                executor = 'awsbatch'
            }
        }
    } // end hybrid profile
}
If your computer setup is not that rich in memory, feel free to cut down the memory allotted to any of these labels.
This setup is useful for testing and work with modest data sets, including our test data. For larger data, setting up nextflow to use a high-performance computing cluster and/or cloud computing (AWS, GCS, etc) will be required in most situations.
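When profiles are defined as above, nextflow uses the standard profile by default; to run with a different one, such as the hybrid profile, pass it via the -profile flag, for example:
nextflow run jgolob/maliampi -latest -profile hybrid <ARGUMENTS>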
The tutorial presumes you have nextflow installed and the ~/.nextflow/config file set up as described above.
The maliampi repository contains some sample data we can work with. We will download the current version of the repository, keep the practice data, and then clean up the rest.
wget https://github.com/jgolob/maliampi/archive/master.zip
unzip master.zip
mv maliampi-master/maliampi-practice-data/ .
rm -r maliampi-master && rm master.zip
cd maliampi-practice-data
mkdir -p working/ && mkdir -p output/sv/
Maliampi uses a manifest file to keep track of specimens and their read files. At a minimum this needs two columns: specimen (a unique identifier) and R1 (the path to the fastq file containing the forward reads). Almost always there will be a third column, R2, for the reverse reads.
Another important column to include is batch, corresponding to a given extraction-and-sequencing run.
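For illustration, a minimal manifest with paired-end reads and a batch column could look like this (the specimen names and file paths here are hypothetical):
specimen,R1,R2,batch
specimen_1,fastq/specimen_1.R1.fastq.gz,fastq/specimen_1.R2.fastq.gz,run_1
specimen_2,fastq/specimen_2.R1.fastq.gz,fastq/specimen_2.R2.fastq.gz,run_1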
Take a look at manifest.csv to see this simple example of samples from the ATCC mock community.
To see the available options, run:
nextflow run jgolob/maliampi -latest --help
You should see:
N E X T F L O W ~ version 19.07.0
Pulling jgolob/maliampi ...
downloaded from https://github.com/jgolob/maliampi.git
Launching `jgolob/maliampi` [modest_saha] - revision: fc96dbdbfb [master]
Usage:
nextflow run jgolob/maliampi <ARGUMENTS>
Required Arguments:
--manifest CSV file listing samples
At a minimum must have columns:
specimen: A unique identifier
R1: forward read
R2: reverse read fq
optional columns:
batch: sequencing / library batch. Should be filename safe
I1: forward index file (for checking demultiplexing)
I2: reverse index file
--repo_fasta Repository of 16S rRNA genes.
--repo_si Information about the 16S rRNA genes.
--email Email (for NCBI)
Options:
Common to all:
--output Directory to place outputs (default invocation dir)
Maliampi will create a directory structure under this directory
-w Working directory. Defaults to `./work`
-resume Attempt to restart from a prior run, only completely changed steps
SV-DADA2 options:
--trimLeft How far to trim on the left (default = 0)
--maxN (default = 0)
--maxEE (default = Inf)
--truncLenF (default = 0)
--truncLenR (default = 0)
--truncQ (default = 2)
Ref Package required:
--repo_fasta FASTA file containing reference sequences (required)
--repo_si CSV file with information about the repo reads (required)
Ref Package options (defaults generally fine):
--repo_min_id Minimum percent ID to a SV to be recruited (default = 0.8)
--repo_max_accepts Maximum number of recruits per SV (default = 10)
--cmalign_mxsize Infernal cmalign mxsize (default = 8196)
--raxml_model RAxML model for tree formation (default = 'GTRGAMMA')
--raxml_parsiomony_seed (default = 12345)
--taxdmp Path to taxdmp.zip. If not provided, it will be downloaded
Placement / Classification Options (defaults generally fine):
--pp_classifer pplacer classifer (default = 'hybrid2')
--pp_likelihood_cutoff (default = 0.9)
--pp_bayes_cutoff (default = 1.0)
--pp_multiclass_min (default = 0.2)
--pp_bootstrap_cutoff (default = 0.8)
--pp_bootstrap_extension_cutoff (default = 0.4)
--pp_nbc_boot (default = 100)
--pp_nbc_target_rank (default = 'genus')
--pp_nbc_word_length (default = 8)
--pp_seed (default = 1)
maliampi takes a lot of options, but only a few are required:
- --manifest: the CSV file telling maliampi where to find the read files
- --repo_fasta: a FASTA file of 16S rRNA genes
- --repo_si: a CSV file linking the 16S rRNA gene IDs in the FASTA file with the NCBI taxonomy
- --email: a valid email address to use with NCBI
nextflow will automatically download maliampi from its github repository; the `-latest` flag ensures the most recent version is always used. To run maliampi on the practice data:
nextflow run jgolob/maliampi -latest -resume \
--manifest manifest.csv \
--output tutorial_output \
--email [email protected] \
--repo_fasta test.repo.fasta \
--repo_si test.repo.seq_info.csv
Maliampi will run. It will take a bit to complete even on the test data.
If this all works, you should see something like:
N E X T F L O W ~ version 19.07.0
Pulling jgolob/maliampi ...
Already-up-to-date
Launching `jgolob/maliampi` [furious_lalande] - revision: 2160a88d19 [master]
executor > local (96)
[03/d23f0b] process > barcodecop (9) [100%] 9 of 9 ✔
[e0/e314a2] process > dada2_ft (9) [100%] 9 of 9 ✔
[1e/9fc05c] process > dada2_derep (9) [100%] 9 of 9 ✔
[9c/2cbe37] process > dada2_learn_error (3) [100%] 6 of 6 ✔
[1b/e8a4b9] process > dada2_derep_batches (6) [100%] 6 of 6 ✔
[de/8cbc57] process > dada2_dada (6) [100%] 6 of 6 ✔
[2e/5034c2] process > dada2_demultiplex_dada (6) [100%] 6 of 6 ✔
[62/fa4431] process > dada2_merge (9) [100%] 9 of 9 ✔
[7e/2748f5] process > dada2_seqtab_sp (9) [100%] 9 of 9 ✔
[67/003c58] process > dada2_seqtab_batch_combine (2) [100%] 3 of 3 ✔
[de/6a7a43] process > dada2_seqtab_combine_all [100%] 1 of 1 ✔
[e2/90044f] process > dada2_remove_bimera [100%] 1 of 1 ✔
[0a/920e9f] process > dada2_convert_output [100%] 1 of 1 ✔
[bd/3bd963] process > output_failed [100%] 1 of 1 ✔
[7b/68d107] process > refpkgSearchRepo [100%] 1 of 1 ✔
[cb/c47830] process > filterSeqInfo [100%] 1 of 1 ✔
[7b/5fdc72] process > DlBuildTaxtasticDB [100%] 1 of 1 ✔
[6f/da64ae] process > confirmSI [100%] 1 of 1 ✔
[ba/01d471] process > alignRepoRecruits [100%] 1 of 1 ✔
[de/b68cf4] process > convertAlnToFasta [100%] 1 of 1 ✔
[f7/7251c8] process > raxmlTree [100%] 1 of 1 ✔
[d3/9099c7] process > raxmlTree_cleanupInfo [100%] 1 of 1 ✔
[d0/657abb] process > taxtableForSI [100%] 1 of 1 ✔
[2b/9ddac1] process > obtainCM [100%] 1 of 1 ✔
[d6/eabb78] process > combineRefpkg [100%] 1 of 1 ✔
[0e/075e7e] process > alignSV [100%] 1 of 1 ✔
[5d/be1074] process > extractRefpkgAln [100%] 1 of 1 ✔
[ed/97c83b] process > combineAln_SV_refpkg [100%] 1 of 1 ✔
[11/4b6835] process > pplacerPlacement [100%] 1 of 1 ✔
[57/5bf9d9] process > pplacerReduplicate [100%] 1 of 1 ✔
[91/afe8bd] process > pplacerADCL [100%] 1 of 1 ✔
[48/8e6d1b] process > pplacerEDPL [100%] 1 of 1 ✔
[25/5d3693] process > pplacerPCA [100%] 1 of 1 ✔
[90/ee0880] process > pplacerAlphaDiversity [100%] 1 of 1 ✔
[3a/348adb] process > pplacerKR [100%] 1 of 1 ✔
[34/e44686] process > classifyDB_Prep [100%] 1 of 1 ✔
[14/afd81c] process > classifySV [100%] 1 of 1 ✔
[d8/0d2f39] process > classifyMCC [100%] 1 of 1 ✔
[d7/d5ca1e] process > classifyTables (6) [100%] 6 of 6 ✔
Completed at: 12-Sep-2019 14:58:56
Duration : 13m 16s
CPU hours : 0.9
Succeeded : 96
Cached : 0
There are four subdirectories in the output:
tutorial_output/
├── classify
├── placement
├── refpkg
└── sv
tutorial_output/sv/
├── dada2.combined.seqtabs.nochimera.csv
├── dada2.combined.seqtabs.nochimera.rds
├── dada2.specimen.sv.long.csv
├── dada2.sv.fasta
├── dada2.sv.map.csv
├── dada2.sv.shared.txt
├── dada2.sv.weights.csv
├── errM
│ ├── m22_p4
│ │ ├── m22_p4.R1.errM.csv
│ │ ├── m22_p4.R1.errM.rds
│ │ ├── m22_p4.R2.errM.csv
│ │ └── m22_p4.R2.errM.rds
│ ├── m23_p3
│ │ ├── m23_p3.R1.errM.csv
│ │ ├── m23_p3.R1.errM.rds
│ │ ├── m23_p3.R2.errM.csv
│ │ └── m23_p3.R2.errM.rds
│ └── m48_p4
│ ├── m48_p4.R1.errM.csv
│ ├── m48_p4.R1.errM.rds
│ ├── m48_p4.R2.errM.csv
│ └── m48_p4.R2.errM.rds
└── failed_specimens.csv
Some highlights here:
- `dada2.sv.fasta` contains the sequence variants in FASTA format.
- `dada2.specimen.sv.long.csv` contains the SV count by community in long format (i.e. a CSV file with columns community, sv, and count).
- `dada2.sv.shared.txt` contains the SV counts in a TSV format similar to a mothur sharefile.
- `failed_specimens.csv` contains all of the specimens that failed at some point (files empty, none surviving filter-trim, merging, etc.), along with the reason why. This is useful for troubleshooting.
- The `errM` subdirectory contains the error models for each batch.
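As a quick sanity check, you can count how many sequence variants were recovered by counting the FASTA headers (assuming the tutorial's --output of tutorial_output):
grep -c '>' tutorial_output/sv/dada2.sv.fasta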
By default, maliampi will try to ignore failed specimens and finish with as many specimens as possible.
tutorial_output/refpkg/
└── refpkg.tgz
A reference package is used for placement and is stored in tar-gzip format.
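You can list its contents without unpacking it:
tar -tzvf tutorial_output/refpkg/refpkg.tgz
This produces a listing like the one below (the randomized suffixes on the phylo_model and tree files will likely differ between runs):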
drwxr-xr-x 0 root root 0 Sep 12 14:57 ./
-rw-r--r-- 0 root root 2217 Sep 12 14:57 ./CONTENTS.json
-rw-r--r-- 0 root root 381 Sep 12 14:57 ./phylo_modeldqujt6x8.json
-rw-r--r-- 0 root root 7402 Sep 12 14:57 ./RAxML_bestTree.refpkg
-rw-r--r-- 0 root root 2239 Sep 12 14:57 ./RAxML_info.refpkg
-rw-r--r-- 0 root root 207936 Sep 12 14:57 ./recruits.aln.fasta
-rw-r--r-- 0 root root 252268 Sep 12 14:57 ./recruits.aln.sto
-rw-r--r-- 0 root root 1012905 Sep 12 14:57 ./refpkg.cm
-rw-r--r-- 0 root root 28026 Sep 12 14:57 ./refpkg.seq_info.corr.csv
-rw-r--r-- 0 root root 27359 Sep 12 14:57 ./refpkg.taxtable.csv
-rw-r--r-- 0 root root 4340 Sep 12 14:57 ./treejkkusc5k.tre
- `CONTENTS.json`: a manifest.
- `RAxML_bestTree.refpkg`: a tree in newick format.
- `recruits.aln.fasta`: an alignment of the full-length 16S rRNA gene recruits (FASTA format).
- `recruits.aln.sto`: the same alignment in Stockholm format.
- `refpkg.cm`: the covariance model used for the alignment.
- `refpkg.seq_info.corr.csv`: a sequence information file describing the full-length rRNA genes (including their taxonomic assignments).
- `refpkg.taxtable.csv`: a mothur-style table of taxonomic information for each full-length 16S rRNA gene in the reference.
tutorial_output/placement/
├── adcl.csv.gz
├── alpha_diversity.csv.gz
├── dedup.jplace
├── edpl.csv.gz
├── kr_distance.csv.gz
├── pca
│ ├── epca.proj
│ ├── epca.trans
│ ├── epca.xml
│ ├── lpca.proj
│ ├── lpca.trans
│ └── lpca.xml
└── redup.jplace.gz
- `dedup.jplace` is the deduplicated placement file in JSON format.
- `redup.jplace.gz` is the reduplicated placement file.
- `alpha_diversity.csv.gz` has alpha-diversity information for each community.
- `kr_distance.csv.gz` has the weighted pairwise phylogenetic (Kantorovich-Rubinstein) distance between communities in long format.
- The `pca/` subdirectory has the PCA coordinates for each community.
- `adcl.csv.gz` and `edpl.csv.gz` contain information about each sequence variant's placement (average distance to the closest leaf and expected distance between placement locations, respectively). A high ADCL for an SV is indicative of poor representation in the reference package.
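All of these tables are gzipped CSV files, so you can peek at them straight from the shell; for example, to see the first few rows of the alpha-diversity table:
gunzip -c tutorial_output/placement/alpha_diversity.csv.gz | head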
Finally, the classify subdirectory holds the taxonomic classifications of the sequence variants and of each specimen / community:
tutorial_output/classify/
├── classify.mcc.db
└── tables
├── by_specimen.class.csv
├── by_specimen.family.csv
├── by_specimen.genus.csv
├── by_specimen.order.csv
├── by_specimen.phylum.csv
├── by_specimen.species.csv
├── by_taxon.class.csv
├── by_taxon.family.csv
├── by_taxon.genus.csv
├── by_taxon.order.csv
├── by_taxon.phylum.csv
├── by_taxon.species.csv
├── tallies_wide.class.csv
├── tallies_wide.family.csv
├── tallies_wide.genus.csv
├── tallies_wide.order.csv
├── tallies_wide.phylum.csv
└── tallies_wide.species.csv
- `classify.mcc.db` contains all of the classification information in a sqlite3 database. The most important tables are `multiclass` and `multiclass_concat`; the documentation for `pplacer` contains further information.
- The `tables/` subdirectory contains by-community and by-target-rank classification information.
- `tables/tallies_wide.*` contain the by-community classified output in wide format.
- `tables/by_specimen.*` contain the by-community taxon counts and fractional abundances in long format.
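Since `classify.mcc.db` is an ordinary sqlite3 database, you can explore it with the sqlite3 command-line shell; for example, to list the tables and peek at `multiclass_concat`:
sqlite3 tutorial_output/classify/classify.mcc.db '.tables'
sqlite3 tutorial_output/classify/classify.mcc.db 'SELECT * FROM multiclass_concat LIMIT 5;'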