- Introduction
- Using Nextflow Locally
- Using Nextflow with Docker Containers
- Using Nextflow with AWS S3
- Using Nextflow with AWS Batch
- A larger example: SARS-CoV-2 in Singapore
Nextflow is a workflow management system in the vein of Snakemake or even GNU Make. Nextflow makes containerisation and configuration for cloud computing easy, allowing for reproducible and scalable computational analysis.
In this post, we provide a worked example of Nextflow in the construction of a phylogenetic tree. To do so, we use MAFFT for multiple sequence alignment, and FastTree for tree construction.
In this first section, we will construct a small phylogenetic tree based on orthologs of the human haemoglobin alpha 1 subunit. Three sequences are provided in the FASTA file at hba1.fasta.gz:
- Human, HBA1 or ENSG00000206172.
- Mouse, Hba-a1 or ENSMUSG00000069919.
- Zebrafish, zgc:163057 or ENSDARG00000045144.
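Before writing any Nextflow, you can peek at this input; it is just a gzipped FASTA file (a quick sketch, assuming standard command-line tools are available):
gunzip --to-stdout hba1.fasta.gz | grep '>'   # prints the three sequence header lines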
Channels are sources of data. They are useful for holding input or intermediate files. For example, to store our sequences from the aforementioned hba1.fasta.gz in a channel, we can use the fromPath method to create a channel named hbaSequences containing that single FASTA file. Create a file named hba1-local.nf, and write:
hbaSequences = Channel.fromPath("hba1.fasta.gz")
To execute this Nextflow script, simply use the nextflow run subcommand:
nextflow run hba1-local.nf
Nothing will happen just yet, but that's to be expected: We've declared our data, but we haven't yet defined what should be done to that data!
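If you would like to confirm that the channel really does hold the file, you can append the view operator, which prints each item and re-emits it so the channel can still be consumed later (a minimal sketch, assuming the same DSL1-style syntax used throughout this post):
hbaSequences = Channel.fromPath("hba1.fasta.gz").view()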
Processes can act on data from channels. A process starts with the process keyword, followed by a name, then a block body. The data on which the process acts is specified by the input keyword, while output declares the data produced by the process. At the end of the block body is the script to be executed. For example, a process to align the sequences from our just-created hbaSequences channel can be written like so:
process alignMultipleSequences {
input: file sequences from hbaSequences
output: file "hba-alignment.fasta.gz" into hbaAlignment
"""
gunzip --to-stdout $sequences | mafft --auto - > hba-alignment.fasta
gzip hba-alignment.fasta
"""
}
Note that the output implicitly creates a new channel named hbaAlignment, which contains the single file hba-alignment.fasta.gz produced by the script. This new channel can then be used in a subsequent tree construction process:
process buildTree {
input: file alignment from hbaAlignment
output: file "hba-tree" into hbaTree
"""
gunzip --to-stdout $alignment | FastTree > hba-tree
"""
}
By default, processes create their output files in the Nextflow-managed workDir, which is usually a directory named work in the current working directory. To place process output files elsewhere, you will need to specify a publish directory using the publishDir directive. So, if we want to place the tree file produced by the buildTree process in the current working directory, we can rewrite it as:
process buildTree {
publishDir './'
input: file alignment from hbaAlignment
output: file "hba-tree" into hbaTree
"""
gunzip --to-stdout $alignment | FastTree > hba-tree
"""
}
You should end up with a file that looks like hba-local.nf.
Now, if we execute the script with nextflow run hba-local.nf, there should be an output file hba-tree produced in the current working directory.
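A quick sanity check from the shell (a sketch; assumes a POSIX shell and the script as written above):
nextflow run hba-local.nf
head hba-tree   # FastTree writes the tree in Newick format, so this should show a parenthesised tree string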
Visualise this tree with any tree visualiser to make sure it's working (we used iroki.net).
Next, let's containerise each of our two processes. This can be done in three steps. First, specify the Docker container to be used for each process in the Nextflow script, using the container directive:
process alignMultipleSequences {
container "biocontainers/mafft:v7.407-2-deb_cv1"
// ...
}
process buildTree {
container "biocontainers/fasttree:v2.1.10-2-deb_cv1"
publishDir './'
// ...
}
This should result in a modified Nextflow script which looks like hba-docker.nf.
Second, we need to specify a profile via a configuration file. Create a new file nextflow.config, and define a new profile docker which sets the enabled property of the docker scope to true.
profiles {
docker {
docker.enabled = true
}
}
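Depending on your host, you may also want to pass extra options to docker run via the runOptions setting; for example (an optional sketch, assuming a Linux host where root-owned output files would be a nuisance):
profiles {
    docker {
        docker.enabled = true
        // run the container as the calling user so files written to the
        // work directory are not owned by root
        docker.runOptions = '-u $(id -u):$(id -g)'
    }
}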
Lastly, run Nextflow with the -profile docker argument (that's -profile with a single dash, not --profile!):
nextflow run hba-docker.nf -profile docker
Without the -profile docker argument, Nextflow will not use the Docker containers. This allows one to alternate between "direct" local and containerised execution environments by changing command-line arguments only, as shown below.
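For example (assuming mafft and FastTree are installed on the host for the non-containerised run):
# direct local execution, using whatever mafft/FastTree are on the host
nextflow run hba-docker.nf
# containerised execution inside the biocontainers images
nextflow run hba-docker.nf -profile docker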
If you're working with AWS, you might already store your files on S3, or may want to. Nextflow supports reading from and publishing to S3: simply use the s3 protocol in your file paths:
hbaSequences = Channel.fromPath("s3://nextflow-awsbatch/hba1.fasta.gz")
// ...
process buildTree {
container "biocontainers/fasttree:v2.1.10-2-deb_cv1"
publishDir "s3://nextflow-awsbatch/"
// ...
}
With these changes, you should have a Nextflow script that looks like hba-s3.nf.
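You can still run this version locally with the docker profile; Nextflow will stage the S3 inputs into its work directory and upload the published tree back to the bucket, provided AWS credentials are available in the usual places (a sketch; assumes credentials configured via the AWS CLI or environment variables, and the bucket name used above):
nextflow run hba-s3.nf -profile docker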
For Nextflow to use AWS Batch, you must (i) specify publicly available Docker containers for your processes, (ii) set up a compute environment (possibly with a custom AMI) and a job queue on AWS, (iii) configure AWS in the Nextflow configuration file, and (iv) specify an S3 bucket to use as the working directory for intermediate files.
The use of Docker containers (i) has already been covered in "Using Nextflow with Docker Containers", and we will also skip (ii) setting up AWS, in order to focus on Nextflow.
Just as with Docker, we will create a new profile awsbatch in our configuration file to provide Nextflow with the information it needs to use AWS Batch. Minimally, you only need to set four variables:
profiles {
docker {
docker.enabled = true
}
// Set up a new awsbatch profile
awsbatch {
process.executor = 'awsbatch'
process.queue = 'nextflow-awsbatch-queue-2'
aws.region = 'ap-southeast-1'
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
}
}
Make sure to replace process.queue with the name of your job queue in AWS Batch, aws.region with the AWS region you are operating in, and aws.batch.cliPath with the file path to the AWS CLI binary on the AMI you have configured for your compute environment. If you are certain that the AMI you are using has the AWS CLI on its $PATH, then it is okay to omit the aws.batch.cliPath line.
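If you are unsure of the path, one way to find it is to log into an instance launched from your custom AMI and locate the binary (a sketch; the miniconda path shown is simply the value used in this example's configuration):
# on an instance started from the custom AMI
which aws
# e.g. /home/ec2-user/miniconda/bin/aws  -> use this value for aws.batch.cliPath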
The remaining step is (iv): giving Nextflow an S3 bucket to use as its working directory. Note that this working directory is not the same as in "Using Nextflow with AWS S3"; that previous section only specified S3 as the input and output locations. Here, we are providing a bucket for Nextflow to store intermediate files. This is accomplished by means of an additional command-line flag, -work-dir (or equivalently, -bucket-dir).
nextflow run hba-s3.nf -profile awsbatch -work-dir s3://nextflow-awsbatch/temp
Take note of how the new profile is being used: -profile awsbatch.
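Since the work directory here is an S3 bucket, the equivalent -bucket-dir form of the same command would be:
nextflow run hba-s3.nf -profile awsbatch -bucket-dir s3://nextflow-awsbatch/temp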
Now, let's try to scale our analysis up from three HBA sequences to a thousand SARS-CoV-2 viral genomes. First, download a collection of SARS-CoV-2 genomes from the GISAID database (registration required), filtering by location to "Asia/Singapore" or a region of your liking. Next, upload the gzipped FASTA file of the downloaded genomes to AWS S3.
Now, all we have to do is to replace the input channel path with the newly uploaded SARS-CoV-2 genomes:
covidSequences = Channel.fromPath("s3://nextflow-awsbatch/sars-cov2-singapore.fasta.gz")
// ...also change the variable names so that they are self-documenting!
and the alignMultipleSequences process is good to go!
However, the buildTree process might fail as a result of running out of memory, so we will also increase the memory allocated to it using the memory directive:
process buildTree {
container "biocontainers/fasttree:v2.1.10-2-deb_cv1"
publishDir "s3://nextflow-awsbatch/"
memory "16 GB"
// ...
}
You should end up with a script which looks like hba-covid.nf.
Given the large number of sequences involved, this process can take about 4-5 hours to run from start to finish. Once completed, visualise your tree using the phylogenetic tree viewer of your choice!