Re-structure pipeline #134

Merged on Oct 11, 2024 (12 commits)
9 changes: 8 additions & 1 deletion .github/workflows/unit_tests.yml
@@ -20,10 +20,17 @@ jobs:
uses: actions/setup-python@v3
with:
python-version: "3.10"

- name: update pip
run: |
python -m pip install --upgrade pip

- name: Install dependencies
run: |
pip install -r requirements-test.txt
pip install --upgrade numpy pandas
- name: Unit tests
run: |
# TODO, improve the pythonpath handling
PYTHONPATH="$PYTHONPATH:bin" python -m unittest discover tests
export PYTHONPATH=$PYTHONPATH:bin
python -m unittest discover tests
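The `TODO` above notes that the `PYTHONPATH` handling could be improved. One possible direction, sketched here under the assumption that the tests only need the scripts in `bin/` to be importable, is a small helper that the test package calls once at import time. The helper name `add_bin_to_path` and its location are hypothetical and not part of this PR.

```python
import sys
from pathlib import Path


def add_bin_to_path(repo_root: Path) -> str:
    """Prepend <repo_root>/bin to sys.path (only once) and return the path added.

    Hypothetical helper: it could live in tests/__init__.py so that
    `python -m unittest discover tests` works without exporting PYTHONPATH.
    """
    bin_dir = str((repo_root / "bin").resolve())
    if bin_dir not in sys.path:
        sys.path.insert(0, bin_dir)
    return bin_dir
```

Called as `add_bin_to_path(Path(__file__).resolve().parent.parent)` from `tests/__init__.py`, this would keep the workflow step down to the plain `python -m unittest discover tests` command.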
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2019 EMBL-EBI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
29 changes: 6 additions & 23 deletions README.md
@@ -6,11 +6,10 @@

1. [ The VIRify pipeline ](#virify)
2. [ Nextflow execution ](#nf)
3. [ CWL execution (discontinued) ](#cwl)
4. [ Pipeline overview ](#overview)
5. [ Detour: Metatranscriptomics ](#metatranscriptome)
6. [ Resources ](#resources)
7. [ Citations ](#cite)
3. [ Pipeline overview ](#overview)
4. [ Detour: Metatranscriptomics ](#metatranscriptome)
5. [ Resources ](#resources)
6. [ Citations ](#cite)

<a name="virify"></a>

@@ -22,14 +21,12 @@ VIRify is a pipeline for the detection, annotation, and taxonomic classification

The pipeline is implemented in [Nextflow](#nf) and additionally only Docker or Singularity are needed to run VIRify. Details about installation and usage are given below.

**Please note**, that until v1.0 the pipeline was also implemented in [CWL](#cwl) as an alternative to [Nextflow](#nf). However, later updates were only included in the [Nextflow](#nf) version of the pipeline.


<a name="nf"></a>

# Nextflow

A [Nextflow](https://www.nextflow.io/) implementation of the VIRify pipeline. In the backend, the same scripts are used as in the [CWL](#cwl) implementation.
A [Nextflow](https://www.nextflow.io/) implementation of the VIRify pipeline.

## What do I need?

@@ -155,21 +152,7 @@ The labels used in the Type column of the gff file correspond to the following n
| prophage | [SO:0001006](http://www.sequenceontology.org/browser/current_svn/term/SO:0001006) |
| CDS | [SO:0000316](http://www.sequenceontology.org/browser/current_svn/term/SO:0000316) |

Note that CDS are reported only when a ViPhOG match has been found.


<a name="cwl"></a>

# Common Workflow Language (discontinued)

**Until VIRify v1.0**, VIRify was implemented in [Common Workflow Language (CWL)](https://www.commonwl.org/) next to the Nextflow implementation. Both Workflow Management Systems were previously supported.

## What do I need?
The implementation until v1.0 of VIRify uses CWL version 1.2. It was tested using Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.

## How?
For instructions go to the [CWL README](cwl/README.md).

Note that CDS are reported only when a ViPhOG match has been found.

<a name="overview"></a>

35 changes: 35 additions & 0 deletions assets/methods_description_template.yml
@@ -0,0 +1,35 @@
id: "ebi-metagenomics/emg-viral-pipeline-methods-description"
description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
section_name: "ebi-metagenomics/emg-viral-pipeline Methods Description"
section_href: "https://github.com/EBI-Metagenomics/emg-viral-pipeline"
plot_type: "html"
data: |
  <h4>Methods</h4>
  <p>Data was processed using ebi-metagenomics/genomes-generation v${workflow.manifest.version} (${doi_text}; <a href="https://doi.org/10.1093/nargab/lqac007">Krakau <em>et al.</em>, 2022</a>) of the nf-core collection of workflows (<a href="https://doi.org/10.1038/s41587-020-0439-x">Ewels <em>et al.</em>, 2020</a>), utilising reproducible software environments from the Bioconda (<a href="https://doi.org/10.1038/s41592-018-0046-7">Grüning <em>et al.</em>, 2018</a>) and Biocontainers (<a href="https://doi.org/10.1093/bioinformatics/btx192">da Veiga Leprevost <em>et al.</em>, 2017</a>) projects.</p>
  <p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
  <pre><code>${workflow.commandLine}</code></pre>
  <p>${tool_citations}</p>
  <h4>References</h4>
  <ul>
    <li>
      Informative Regions In Viral Genomes
      <i>Viruses (2021)</i>
      doi: <a href="https://doi.org/10.3390/v13061164">10.3390/v13061164</a>
      Moreno-Gallego, Jaime Leonardo, and Alejandro Reyes
    </li>
    <li>
      VIRify: an integrated detection, annotation and taxonomic classification pipeline using virus-specific protein profile hidden Markov models
      <i>bioRxiv</i>
      doi: <a href="https://doi.org/10.1101/2022.08.22.504484">10.1101/2022.08.22.504484</a>
      Rangel-Pineros, Guillermo, et al.
    </li>
    ${tool_bibliography}
  </ul>
  <div class="alert alert-info">
    <h5>Notes:</h5>
    <ul>
      ${nodoi_text}
      <li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
      <li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
    </ul>
  </div>
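The template above relies on `${...}` placeholders such as `${workflow.manifest.version}` and `${tool_citations}` being filled in at run time; the code that does this is not part of the hunks shown here. As a rough illustration of the substitution step only, the following Python sketch replaces dotted placeholders from a flat dictionary. The function name `render` and the example values are assumptions for illustration, not the pipeline's actual mechanism.

```python
import re

# Matches ${name} and dotted names such as ${workflow.manifest.version}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def render(template: str, values: dict) -> str:
    """Replace each ${key} with values[key]; unknown keys are left untouched."""

    def substitute(match):
        key = match.group(1)
        return str(values.get(key, match.group(0)))

    return PLACEHOLDER.sub(substitute, template)
```

Leaving unknown placeholders untouched makes a missing value easy to spot in the rendered HTML instead of silently producing an empty string.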
Binary file added assets/mgnify_logo.png
61 changes: 61 additions & 0 deletions assets/multiqc_config.yml
@@ -0,0 +1,61 @@
report_comment: >

  This report has been generated by the <a href="https://github.com/ebi-metagenomics/emg-viral-pipeline/" target="_blank">ebi-metagenomics/emg-viral-pipeline</a> pipeline.

report_section_order:
  "ebi-metagenomics/emg-viral-pipeline-methods-description":
    order: -1000
  software_versions:
    order: -1001
  "ebi-metagenomics/emg-viral-pipeline-summary":
    order: -1002

export_plots: true

data_format: "yaml"

run_modules:
- fastqc
- fastp

## Module order
module_order:
- fastqc
- fastp

## File name cleaning
extra_fn_clean_exts:
- "_fastp"

## Prettification
custom_logo: "mgnify_logo.png"
custom_logo_url: https://github.com/ebi-metagenomics/emg-viral-pipeline/
custom_logo_title: "ebi-metagenomics/emg-viral-pipeline"

## General Stats customisation
table_columns_visible:
  "fastp":
    pct_duplication: False
    after_filtering_q30_rate: False
    after_filtering_q30_bases: False
    filtering_result_passed_filter_reads: 3300
    after_filtering_gc_content: False
    pct_surviving: True
    pct_adapter: True

table_columns_placement:
  "fastp":
    pct_duplication: 3000
    after_filtering_q30_rate: 3100
    after_filtering_q30_bases: 3200
    filtering_result_passed_filter_reads: 3300
    after_filtering_gc_content: 3400
    pct_surviving: 3500
    pct_adapter: 3600

custom_table_header_config:
  general_stats_table:
    "Total length":
      hidden: True
    N50:
      hidden: True
48 changes: 48 additions & 0 deletions assets/schema_input.json
@@ -0,0 +1,48 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/ebi-metagenomics/miassembler/master/assets/schema_input.json",
"title": "ebi-metagenomics/emg-viral-pipeline - params.input schema",
"description": "Schema for the file provided with params.input",
"type": "array",
"items": {
"type": "object",
"properties": {
"id": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample identifier",
"minLength": 3
},
"assembly": {
"type": "string",
"pattern": "^\\S+\\.f(ast)?a(\\.gz)?$",
"errorMessage": "Assembly file in FASTA format",
"minLength": 3
},
"fastq_1": {
"type": "string",
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
},
"fastq_2": {
"type": "string",
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 2 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
}
},
"required": ["id"],
"oneOf": [
{
"required": ["assembly"],
"description": "An assembly file must be provided"
},
{
"required": ["fastq_1", "fastq_2"],
"description": "Both fastq_1 and fastq_2 files must be provided"
}
],
"errorMessage": {
"oneOf": "You must specify either an assembly file or both fastq_1 and fastq_2 files."
}
}
}
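To make the rules in this schema concrete, here is a minimal Python sketch that applies the same patterns and the `oneOf` constraint to a single samplesheet row. It is not how the pipeline validates input (that is presumably delegated to a JSON Schema validator); the function name `validate_row` is hypothetical, and treating an empty string like a missing field is an assumption of this sketch.

```python
import re

# Patterns copied from the schema above.
ID_PATTERN = re.compile(r"^\S+$")
ASSEMBLY_PATTERN = re.compile(r"^\S+\.f(ast)?a(\.gz)?$")
FASTQ_PATTERN = re.compile(r"^\S+\.f(ast)?q\.gz$")

ONE_OF_MESSAGE = (
    "You must specify either an assembly file or both fastq_1 and fastq_2 files."
)


def validate_row(row):
    """Return a list of error messages for one row (empty list if valid)."""
    errors = []

    # "id" is required, at least 3 characters, no whitespace.
    sample_id = row.get("id") or ""
    if len(sample_id) < 3 or not ID_PATTERN.fullmatch(sample_id):
        errors.append("id: required, at least 3 characters, no whitespace")

    assembly = row.get("assembly") or ""
    fastq_1 = row.get("fastq_1") or ""
    fastq_2 = row.get("fastq_2") or ""

    # oneOf: exactly one branch may match (assembly, or both read files).
    assembly_branch = bool(assembly)
    reads_branch = bool(fastq_1) and bool(fastq_2)
    if assembly_branch == reads_branch:
        errors.append(ONE_OF_MESSAGE)

    if assembly and (len(assembly) < 3 or not ASSEMBLY_PATTERN.fullmatch(assembly)):
        errors.append("assembly: must end in .fa/.fasta, optionally .gz")
    for name, value in (("fastq_1", fastq_1), ("fastq_2", fastq_2)):
        if value and not FASTQ_PATTERN.fullmatch(value):
            errors.append(f"{name}: must end in .fq.gz or .fastq.gz")

    return errors
```

Note that, because `oneOf` requires exactly one branch to match, a row that supplies an assembly together with both read files is rejected, which the sketch reproduces.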
14 changes: 12 additions & 2 deletions bin/write_viral_gff.py
@@ -112,6 +112,14 @@ def aggregate_annotations(virify_annotation_files):
return viral_sequences, cds_annotations


def open_fasta_file(filename):
    if filename.endswith('.gz'):
        f = gzip.open(filename, "rt")
    else:
        f = open(filename, "rt")
    return f


def write_gff(
checkv_files,
taxonomy_files,
@@ -181,11 +189,13 @@ def empty_if_number(string):
taxonomy_dict[contig] = taxonomy_string

# Read unmodified contig length from the renamed assembly file
for record in SeqIO.parse(assembly_file, "fasta"):
handle = open_fasta_file(assembly_file)
for record in SeqIO.parse(handle, "fasta"):
contig_id = str(record.id)
seq_len = len(str(record.seq))
contigs_len_dict[contig_id] = seq_len

handle.close()

with open(output_filename, "w") as gff:
print("##gff-version 3", file=gff)
# Constants
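The change above opens the assembly through `open_fasta_file` and closes the handle explicitly after the loop. Since both `open` and `gzip.open` return context managers, the same idea can be written with a `with` block so the handle is also closed if parsing raises. The sketch below is standalone and uses a plain-Python FASTA reader instead of `SeqIO` so it has no third-party dependency; the names `open_fasta` and `contig_lengths` are hypothetical.

```python
import gzip


def open_fasta(filename):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    opener = gzip.open if str(filename).endswith(".gz") else open
    return opener(filename, "rt")


def contig_lengths(filename):
    """Map each contig ID (first word of the header) to its sequence length."""
    lengths = {}
    current = None
    # The with-block closes the handle even if an error occurs mid-file.
    with open_fasta(filename) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                current = parts[0] if parts else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths
```

In `write_gff` the equivalent would be `with open_fasta_file(assembly_file) as handle:` around the existing `SeqIO.parse(handle, "fasta")` loop, which would make the separate `handle.close()` call unnecessary.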
64 changes: 64 additions & 0 deletions configs/base.config
@@ -0,0 +1,64 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
VIRify (ebi-metagenomics/emg-viral-pipeline) Nextflow base config file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A 'blank slate' config file, appropriate for general use on most high performance
compute environments. Assumes that all software is installed and available on
the PATH. Runs in `local` mode - all jobs will be run on the logged in environment.
----------------------------------------------------------------------------------------
*/

process {

cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }

errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
maxRetries = 3
maxErrors = '-1'

// Process-specific resource requirements
// NOTE - Please try and re-use the labels below as much as possible.
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors

withLabel:process_single {
cpus = { check_max( 1 , 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
}
withLabel:process_low {
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
memory = { check_max( 12.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
}
withLabel:process_medium {
cpus = { check_max( 6 * task.attempt, 'cpus' ) }
memory = { check_max( 36.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
}
withLabel:process_high {
cpus = { check_max( 12 * task.attempt, 'cpus' ) }
memory = { check_max( 72.GB * task.attempt, 'memory' ) }
time = { check_max( 16.h * task.attempt, 'time' ) }
}
withLabel:process_long {
time = { check_max( 20.h * task.attempt, 'time' ) }
}
withLabel:process_high_memory {
memory = { check_max( 200.GB * task.attempt, 'memory' ) }
}
withLabel:error_ignore {
errorStrategy = 'ignore'
}
withLabel:error_retry {
errorStrategy = 'retry'
maxRetries = 2
}
withName:CUSTOM_DUMPSOFTWAREVERSIONS {
cache = false
}
}
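The config above combines two ideas: resource requests grow with `task.attempt`, and only certain exit codes trigger a retry. The Python sketch below mirrors that logic for illustration. `check_max` is not defined in this file; the sketch assumes it simply caps the request at a configured maximum, and the names `scaled_request` and `error_strategy` are hypothetical.

```python
# Exit codes that trigger a retry: the inclusive range 130..145 plus 104,
# as in the errorStrategy closure above.
RETRY_EXIT_CODES = set(range(130, 146)) | {104}


def error_strategy(exit_status: int) -> str:
    """Return 'retry' for the retryable exit codes, otherwise 'finish'."""
    return "retry" if exit_status in RETRY_EXIT_CODES else "finish"


def scaled_request(base: float, attempt: int, maximum: float) -> float:
    """Scale a resource request by the attempt number, capped at a maximum.

    Assumption: this is what check_max does for cpus, memory and time.
    """
    return min(base * attempt, maximum)
```

For example, under this assumption a `process_medium` task asking for 36 GB would request 72 GB on its second attempt unless the configured memory ceiling is lower, in which case the ceiling is used.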
20 changes: 20 additions & 0 deletions configs/conda.config
@@ -0,0 +1,20 @@
process {
    withName: ANNOTATION { conda = "$baseDir/envs/python3.yaml" }
    withName: ASSIGN { conda = "$baseDir/envs/python3.yaml" }
    withName: BALLOON { conda = "$baseDir/envs/balloon.yaml" }
    withLabel: basics { conda = "$baseDir/envs/python3.yaml" }
    withName: BLAST { conda = "$baseDir/envs/blast.yaml" }
    withName: HMMSCAN { conda = "$baseDir/envs/hmmer.yaml" }
    withName: KAIJU { conda = "$baseDir/envs/kaiju.yaml" }
    withName: KRONA { conda = "$baseDir/envs/krona.yaml" }
    withName: PLOT_CONTIG_MAP { conda = "$baseDir/envs/r.yaml" }
    withName: PARSE { conda = "$baseDir/envs/python3.yaml" }
    withName: PRODIGAL { conda = "$baseDir/envs/prodigal.yaml" }
    withName: PHANOTATE { conda = "$baseDir/envs/phanotate.yaml" }
    withLabel: python3 { conda = "$baseDir/envs/python3.yaml" }
    withName: RATIO_EVALUE { conda = "$baseDir/envs/python3.yaml" }
    withLabel: ruby { conda = "$baseDir/envs/ruby.yaml" }
    withName: VIRSORTER { conda = "$baseDir/envs/virsorter.yaml" }
    withName: VIRFINDER { conda = "$baseDir/envs/virfinder.yaml" }
    withName: CHECKV { conda = "$baseDir/envs/checkv.yaml" }
}
31 changes: 31 additions & 0 deletions configs/local.config
@@ -0,0 +1,31 @@
process.executor = 'local'

process {
withName: ANNOTATION { cpus = 1; }
withName: ASSIGN { cpus = 1; }
withName: BALLOON { cpus = 1; }
withLabel: basics { cpus = 1; }
withName: BLAST { cpus = params.cores; }
withName: CHROMOMAP { cpus = 1; }
withName: CHECKV { cpus = params.cores }
withName: FASTP { cpus = params.cores; }
withName: FASTQC { cpus = params.cores; }
withName: HMMSCAN { cpus = params.cores; }
withName: KAIJU { cpus = params.cores; }
withName: KRONA { cpus = params.cores; }
withName: PLOT_CONTIG_MAP { cpus = 1; }
withName: PPRMETA { cpus = params.cores; }
withName: MULTIQC { cpus = params.cores; }
withName: PARSE { cpus = 1; }
withName: PRODIGAL { cpus = 1; }
withName: PHANOTATE { cpus = 1; }
withLabel: python3 { cpus = 1; }
withName: RATIO_EVALUE { cpus = 1; }
withLabel: ruby { cpus = 1; }
withName: SPADES { cpus = params.cores; }
withName: SANKEY { cpus = 1; }
withName: VIRSORTER { cpus = params.cores; }
withName: VIRFINDER { cpus = 1; }
withName: MASHMAP { cpus = params.cores; }
}
