Merge pull request #54 from EBI-Metagenomics/dev_martinh
Dev martinh
mberacochea authored Jun 21, 2021
2 parents 9b1e796 + 1853b3f commit 53a99b8
Showing 5 changed files with 76 additions and 33 deletions.
68 changes: 50 additions & 18 deletions README.md
@@ -1,6 +1,7 @@
![](https://img.shields.io/badge/CWL-1.2-green)
![](https://img.shields.io/badge/nextflow-20.01.0-brightgreen)
![](https://img.shields.io/badge/nextflow-21.04.0-brightgreen)
![](https://img.shields.io/badge/uses-docker-blue.svg)
![](https://img.shields.io/badge/uses-singularity-red.svg)
![](https://img.shields.io/badge/uses-conda-yellow.svg)
[![Build Status](https://travis-ci.com/EBI-Metagenomics/emg-viral-pipeline.svg?branch=master)](https://travis-ci.com/EBI-Metagenomics/emg-viral-pipeline)

Expand All @@ -14,7 +15,9 @@

# VIRify
![Sankey plot](nextflow/figures/sankey.png)
VIRify is a recently developed pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by [MGnify](https://www.ebi.ac.uk/metagenomics/). VIRify’s taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,014 orthologous protein domains and referred to as ViPhOGs.
VIRify is a recently developed pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by [MGnify](https://www.ebi.ac.uk/metagenomics/). VIRify’s taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,014 orthologous protein domains and [referred to as ViPhOGs](https://doi.org/10.3390/v13061164).

The pipeline is implemented and available in [CWL](#cwl) and [Nextflow](#nf).

<a name="cwl"></a>

@@ -32,13 +35,12 @@ For instructions go to the [CWL README](cwl/README.md)
<a name="nf"></a>

# Nextflow
Email: [email protected]

A nextflow implementation of the VIRify pipeline for the detection of viruses from metagenomic assemblies. The same scripts are used in the CWL and Nextflow implementation.
A Nextflow implementation of the VIRify pipeline. In the backend, the same scripts are used as in the CWL implementation.

## What do I need?

This pipeline runs with the workflow manager [Nextflow](https://www.nextflow.io/) using [Docker](https://docs.docker.com/v17.09/engine/installation/linux/docker-ce/ubuntu/#install-docker-ce) (Conda will be implemented soonish, hopefully). All other programs and databases are automatically downloaded by Nextflow. _Attention_, the workflow will download databases with a size of roughly 19 GB (49 GB with `--hmmextend` and `--blastextend`) the first time it is executed.
This pipeline runs with the workflow manager [Nextflow](https://www.nextflow.io/) and additionally requires either [Docker](https://docs.docker.com/v17.09/engine/installation/linux/docker-ce/ubuntu/#install-docker-ce) or [Singularity](https://sylabs.io/guides/3.0/user-guide/quick_start.html) as a container engine (Conda will be implemented soonish, hopefully; for now we highly recommend using the stable containers). All other programs and databases are downloaded automatically by Nextflow. _Attention_: the first time it is executed, the workflow downloads databases of roughly 19 GB (49 GB with `--hmmextend` and `--blastextend`).
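
Because the first run fetches ~19 GB of databases, a quick free-space check before launching can save you a failed download. A minimal sketch (the database path below is just an example, not a pipeline default):

```bash
# Check free space where the databases will live (example path, adjust to taste)
DB_DIR="${HOME}/virify-databases"
mkdir -p "$DB_DIR"
df -h "$DB_DIR"
```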

### Install Nextflow
```bash
@@ -55,10 +57,20 @@ sudo usermod -a -G docker $USER
* restart your computer
* see [more instructions about Docker](https://docs.docker.com/v17.09/engine/installation/linux/docker-ce/ubuntu/#install-docker-ce)

### Install Singularity

While Singularity can be installed via Conda, we recommend setting up a _true_ Singularity installation. For HPCs, ask a system administrator you trust. [This manual](https://github.com/hpcng/singularity/blob/master/INSTALL.md) is also a good starting point. _Please note_: you only need Docker or Singularity, not both.
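
A quick way to see which of the two engines Nextflow will find on your machine (an illustrative snippet, not part of the pipeline):

```bash
# Report which container engine is on the PATH
if command -v docker >/dev/null 2>&1; then
    echo "engine=docker"
elif command -v singularity >/dev/null 2>&1; then
    echo "engine=singularity"
else
    echo "engine=none (install Docker or Singularity first)"
fi
```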

## Basic execution

Simply clone this repository or get or update the workflow via Nextflow:
Simply clone this repository and execute `virify.nf`:
```bash
git clone https://github.com/EBI-Metagenomics/emg-viral-pipeline.git
cd emg-viral-pipeline
nextflow run virify.nf --help
```

or (__recommended__) let Nextflow handle the installation; the same command also updates the pipeline:
```bash
nextflow pull EBI-Metagenomics/emg-viral-pipeline
```
@@ -68,17 +80,25 @@ Get help:
nextflow run EBI-Metagenomics/emg-viral-pipeline --help
```

Pull and run a certain release:
We __highly recommend__ running stable releases, also for reproducibility:
```bash
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.1 --help
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.2.0 --help
```

Run annotation for a small assembly file (10 contigs, 0.78 Mbp) on your local machine (`--cores 4`; takes approximately 10min + time for database download; ~19 GB on a 8 core i7 laptop):
Run annotation for a small assembly file (10 contigs, 0.78 Mbp) on your local machine using Docker containers (`--cores 4` per default; takes approximately 10 min on an 8-core i7 laptop, plus time for the initial ~19 GB database download):
```bash
nextflow run EBI-Metagenomics/emg-viral-pipeline --fasta "/home/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" --cores 4 -profile local,docker
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.2.0 --fasta "/home/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" --cores 4 -profile local,docker
```

EBI cluster:
__Please note__ that further parameters such as

* `--workdir` or `-w` (where your work directories are saved)
* `--databases` (where the databases are saved; the workflow checks whether they are already available)
* `--cachedir` (where Singularity containers are cached; not needed for Docker)

are important for controlling where Nextflow writes files.
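
As a generic example, these paths can be collected in shell variables before the call is assembled (all paths below are placeholders, not pipeline defaults):

```bash
# Placeholder paths; adapt to your cluster's filesystem layout
WORKDIR="/scratch/$USER/virify-work"
DATABASES="/data/$USER/virify-databases"
CACHEDIR="/data/$USER/singularity-cache"

# Assemble the call once, inspect it, then launch it
CMD="nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.2.0 \
  --fasta assembly.fasta \
  -w $WORKDIR --databases $DATABASES --cachedir $CACHEDIR \
  -profile slurm,singularity"
echo "$CMD"
# eval "$CMD"   # uncomment to actually run the pipeline
```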

Execution on the EBI cluster:
```bash
source /hps/nobackup2/production/metagenomics/virus-pipeline/CONFIG

@@ -89,24 +109,30 @@ DIR=$PWD
cd $OUTPUT
# this will pull the pipeline if it is not already available
# use `nextflow pull EBI-Metagenomics/emg-viral-pipeline` to update the pipeline
nextflow run EBI-Metagenomics/emg-viral-pipeline --fasta "/homes/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" --output $OUTPUT --workdir $OUTPUT/work $DATABASES --cachedir $SINGULARITY -profile ebi
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.2.0 \
--fasta "/homes/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" \
--output $OUTPUT --workdir $OUTPUT/work --databases $DATABASES \
--cachedir $SINGULARITY -profile ebi
cd $DIR
```


## Profiles

The Nextflow uses the merged profile handling system so you have to define an executor (`local`, `lsf`, `slurm`) and an engine (`docker`, `singularity`, `conda`).
Nextflow uses a merged profile handling system, so you have to define an executor (e.g. `local`, `lsf`, `slurm`) and an engine (`docker`, `singularity`) that match your needs and infrastructure.

Per default, the workflow runs locally (e.g. on your laptop) with Docker. When you execute the workflow on an HPC you can, for example, switch to a specific job scheduler and Singularity instead of Docker:

Per default, the workflow is run with Docker-support. When you execute the workflow on a HPC you can switch to
* SLURM (``-profile slurm,singularity``)
* LSF (``-profile lsf,singularity``)
and then you should also define the parameters
* `--workdir` (here your work directories will be save)

Don't forget, especially on an HPC, to define further important parameters such as

* `--workdir` or `-w` (where your work directories are saved)
* `--databases` (where the databases are saved; the workflow checks whether they are already available)
* `--cachedir` (here Docker/Singularity containers will be cached)
* `--cachedir` (where Singularity containers are cached)

The engine `conda` is not working at the moment until there is a conda recipe for PPR-Meta. Sorry. Use Docker. Please. Or install PPR-Meta by yourself.
The engine `conda` does not work at the moment because there is no Conda recipe for PPR-Meta. Sorry. Use Docker. Please. Or install PPR-Meta yourself and then use the `conda` profile.

## DAG chart

@@ -123,3 +149,9 @@ Although VIRify has been benchmarked and validated with metagenomic data in mind
<b>3. Post-processing:</b> Metatranscriptomes generate highly fragmented assemblies. Therefore, filtering contigs by a minimum length has a substantial impact on the number of contigs processed by VIRify. It has also been observed that the number of false-positive detections of [VirFinder](https://github.com/jessieren/VirFinder/releases) (one of the tools included in VIRify) is lower among larger contigs. The choice of a length threshold will depend on the complexity of the sample and the sequencing technology used, but in our experience any contigs <2 kb should be analysed with caution.
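
If you want to apply such a length cut-off yourself, before (or instead of) the pipeline's own `--length` filter, a small awk sketch does the job (dedicated tools such as seqkit are more robust; file names are examples):

```bash
# Keep only contigs >= 2000 bp (handles multi-line FASTA records)
awk -v min=2000 'BEGIN{RS=">"; ORS=""} NR>1 {
    n = index($0, "\n")
    header = substr($0, 1, n-1)     # first line of the record
    seq = substr($0, n+1)
    gsub(/\n/, "", seq)             # join wrapped sequence lines
    if (length(seq) >= min) printf(">%s\n%s\n", header, seq)
}' assembly.fasta > assembly.min2kb.fasta
```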

<b>4. Classification:</b> The classification module of VIRify depends on the presence of a minimum number and proportion of phylogenetically-informative genes within each contig in order to confidently assign a taxonomic lineage. Therefore, short contigs typically obtained from metatranscriptome assemblies remain generally unclassified. For targeted classification of RNA viruses (for instance, to search for Coronavirus-related sequences), alternative DNA- or protein-based classification methods can be used. Two of the possible options are: (i) using [MashMap](https://github.com/marbl/MashMap/releases) to screen the VIRify contigs against a database of RNA viruses (e.g. Coronaviridae) or (ii) using [hmmsearch](http://hmmer.org/download.html) to screen the proteins obtained in the VIRify contigs against marker genes of the taxon of interest.

# Cite

If you use VIRify in your work, please cite:

[TBA](https://www.lipsum.com/)
7 changes: 4 additions & 3 deletions nextflow.config
@@ -8,8 +8,8 @@ tower {
}

params {
cores = '4'
max_cores = Runtime.runtime.availableProcessors()
cores = Runtime.runtime.availableProcessors().intdiv(4)
memory = '12'
help = false
profile = false
@@ -68,7 +68,7 @@ params {
dbs = 'nextflow-autodownload-databases'

// optional profile configurations, mostly necessary for HPC execution [lsf, slurm]
workdir = "/tmp/nextflow-work-$USER"
workdir = "work"
cachedir = false
}

@@ -127,11 +127,12 @@ profiles {
}

conda {
// not working right now due to missing conda package for PPR-Meta!
includeConfig 'nextflow/configs/conda.config'
}


//pre-merged
//pre-merged profiles for direct usage
standard {
executor {
name = "local"
Binary file modified nextflow/figures/chart.png
2 changes: 1 addition & 1 deletion nextflow/modules/blast_filter.nf
@@ -1,7 +1,7 @@
process blast_filter {
publishDir "${params.output}/${assembly_name}/${params.blastdir}/", mode: 'copy', pattern: "*.meta"
publishDir "${params.output}/${assembly_name}/${params.finaldir}/blast/", mode: 'copy', pattern: "*.meta"
label 'ruby'
label 'python3'

errorStrategy 'retry'
maxRetries 1
32 changes: 21 additions & 11 deletions virify.nf
@@ -335,6 +335,7 @@ workflow download_kaiju_db {
**************************/

/* Comment section:
Rename all contigs and filter by length.
*/
workflow preprocess {
take: assembly
@@ -351,6 +352,7 @@
}

/* Comment section:
Restore original contig names.
*/
workflow postprocess {
take: fasta
@@ -363,6 +365,7 @@


/* Comment section:
Run virus detection tools and parse the predictions according to defined filters.
*/
workflow detect {
take: assembly_renamed_length_filtered
@@ -390,6 +393,9 @@


/* Comment section:
Predict ORFs and search them against HMMs to taxonomically annotate each contig. Apply bit score cutoffs and filters to identify informative ViPhOG HMMs and, where possible, assign a taxonomic lineage to each contig.
Additional HMM databases are searched if configured, and a simple BLAST approach based on IMG/VR can also be run. Finally, mashmap can be used to detect a specific reference virus sequence.
Then, all results are summarized for reporting and plotting.
*/
workflow annotate {
take: predicted_contigs
@@ -453,6 +459,7 @@


/* Comment section:
Plot results. Basically runs krona and sankey. ChromoMap and Balloon are still experimental features and should be used with caution.
*/
workflow plot {
take:
@@ -490,7 +497,7 @@


/* Comment section:
Maybe as an pre-step
Optional assembly step, not fully implemented and tested.
*/
workflow assemble_illumina {
take: reads
@@ -514,7 +521,9 @@
* WORKFLOW ENTRY POINT
**************************/

/* Comment section: */
/* Comment section:
Here the main workflow starts and runs the defined sub workflows.
*/

workflow {

@@ -609,11 +618,11 @@ def helpMSG() {
VIRify
${c_yellow}Usage example:${c_reset}
nextflow run main.nf --fasta 'assembly.fasta'
nextflow run virify.nf --fasta 'assembly.fasta'
${c_yellow}Input:${c_reset}
${c_green} --illumina ${c_reset} '*.R{1,2}.fastq.gz' -> file pairs
${c_green} --fasta ${c_reset} '*.fasta' -> one sample per file, no assembly produced
${c_green} --illumina ${c_reset} '*.R{1,2}.fastq.gz' -> file pairs, experimental feature that performs SPAdes assembly first
${c_dim} ..change above input to csv:${c_reset} ${c_green}--list ${c_reset}
${c_yellow}Options:${c_reset}
@@ -622,7 +631,7 @@ def helpMSG() {
--memory max memory for local use [default: $params.memory]
--output name of the result folder [default: $params.output]
${c_yellow}Databases:${c_reset}
${c_yellow}Databases (automatically downloaded by default):${c_reset}
--virsorter a virsorter database provided as 'virsorter/virsorter-data' [default: $params.virsorter]
--virfinder a virfinder model [default: $params.virfinder]
--viphog the ViPhOG database, hmmpress'ed [default: $params.viphog]
@@ -634,7 +643,8 @@
--imgvr the IMG/VR, viral (meta)genome sequences [default: $params.imgvr]
--pprmeta the PPR-Meta github [default: $params.pprmeta]
--meta the tsv dictionary w/ meta information about ViPhOG models [default: $params.meta]
Important! If you provide your own hmmer database follow this format:
Important! If you provide your own HMM database, follow this format:
rvdb/rvdb.hmm --> <folder>/<name>.hmm && 'folder' == 'name'
and provide the database following this command structure
--rvdb /path/to/your/rvdb
@@ -649,7 +659,7 @@
--chromomap WIP feature to activate chromomap plot [default: $params.chromomap]
--balloon WIP feature to activate balloon plot [default: $params.balloonp]
--length Initial length filter in kb [default: $params.length]
--sankey select the x taxa with highest count for sankey plot, try and error to change plot [default: $params.sankey]
--sankey select the x taxa with the highest count for the sankey plot; adjust by trial and error with '-resume' [default: $params.sankey]
--chunk WIP: chunk FASTA files into smaller pieces for parallel calculation [default: $params.chunk]
--onlyannotate Only annotate the input FASTA (no virus prediction, only contig length filtering) [default: $params.onlyannotate]
--mashmap Map the viral contigs against the provided reference ((fasta/fastq)[.gz]) with mashmap [default: $params.mashmap]
@@ -659,18 +669,18 @@
--viphog_version define the ViPhOG db version to be used [default: $params.viphog_version]
v1: no additional bit score filter (--cut_ga not applied, just e-value filtered)
v2: --cut_ga, min score used as sequence-specific GA, 3 bit trimmed for domain-specific GA
v3: --cut_ga, like v2 but seq-specific GA trimmed by 3 bits if second best score is 'nan'
v3: --cut_ga, like v2 but seq-specific GA trimmed by 3 bits if second best score is 'nan' (current default)
--meta_version define the metadata table version to be used [default: $params.meta_version]
v1: older version of the meta data table using an outdated NCBI virus taxonomy, for reproducibility
v2: 2020 version of NCBI virus taxonomy
v2: 2020 version of NCBI virus taxonomy (current default)
${c_dim}Nextflow options:
-with-report rep.html cpu / ram usage (may cause errors)
-with-dag chart.html generates a flowchart for the process tree
-with-timeline time.html timeline (may cause errors)
${c_yellow}HPC computing:${c_reset}
For execution of the workflow on a HPC (LSF, SLURM) adjust the following parameters if needed:
Especially for execution of the workflow on an HPC (LSF, SLURM) adjust the following parameters if needed:
--databases defines the path where databases are stored [default: $params.dbs]
--workdir defines the path where nextflow writes tmp files [default: $params.workdir]
--cachedir defines the path where images (singularity) are cached [default: $params.cachedir]
@@ -686,7 +696,7 @@ def helpMSG() {
${c_blue}Engines${c_reset} (choose one):
docker
singularity
conda
conda (not fully supported unless you manually install PPR-Meta)
Or use a ${c_yellow}pre-configured${c_reset} setup instead:
standard (local,docker) [default]
