- Introduction
- Install the pipeline
- Running the pipeline
- Main arguments
- Mandatory arguments
- Generic arguments
- Installing annotated sequence banks
- BUSCO analysis
- PLAST search
- BLAST or diamond search
- InterProScan analysis
- eggNOG mapper annotation
- BeeDeeM annotation
- Job resources
- Other command line parameters
Nextflow handles job submissions on SLURM or other environments, and supervises the running jobs. The Nextflow process must therefore keep running until the pipeline is finished. We recommend that you run the process in the background through `screen`/`tmux` or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted to your job scheduler.
It is recommended to limit the memory used by the Nextflow Java virtual machine. We recommend adding the following line to your environment (typically in `~/.bashrc` or `~/.bash_profile`):

```bash
export NXF_OPTS='-Xms1g -Xmx4g'
```
Make sure that Nextflow is installed on your system, along with either Docker or Singularity, to allow full reproducibility.
How to install orson:

```bash
git clone https://gitlab.ifremer.fr/bioinfo/orson
```
To use this workflow on a computing cluster, you must provide a configuration file for your system. For some institutes this file already exists and is referenced on nf-core/configs. If so, you can simply download your institute's custom config file and use `-c <institute_config_file>` in the launch command.
If your institute does not have a referenced config file, you can create one using the files from other infrastructures as templates.
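As a starting point, a minimal custom config for a SLURM cluster might look like the sketch below. The file name, queue name and resource values are hypothetical placeholders to adapt to your infrastructure; the scopes shown (`process`, `singularity`) are standard Nextflow configuration:

```
// conf/custom.config -- hypothetical sketch, adapt to your cluster
process {
    executor = 'slurm'
    queue    = 'fast'      // placeholder partition name
    cpus     = 2
    memory   = '8.GB'
    time     = '2.h'
}
singularity {
    enabled    = true
    autoMounts = true
}
```

You would then pass it to the launch command with `-c conf/custom.config`.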
The simplest command for running the pipeline is:

```bash
nextflow run main.nf -profile test,singularity
```

This will launch the pipeline with the `test` configuration profile using `singularity`. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:

```
work            # Directory containing the Nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs
```
When you run the above command, Nextflow runs the pipeline code from your local git clone, even if the pipeline has been updated upstream since. To make sure that you are running the latest version of the pipeline, update your clone regularly:

```bash
cd orson
git pull
```
It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the ORSON releases page and find the latest version number (e.g. `v1.0.0`). Then configure your local orson installation to use your desired version as follows:

```bash
cd orson
git checkout v1.0.0
```
Use this parameter to choose a configuration profile. Profiles provide configuration presets for different compute environments. Note that multiple profiles can be loaded, for example: `-profile test,singularity`.
If `-profile` is not specified at all, the pipeline will run locally and expect all software to be installed and available on the `PATH`.
- `singularity`
  - A generic configuration profile to be used with Singularity
  - Pulls software from DockerHub: ORSON
Profiles are also available to configure the workflow itself and can be combined with the execution profiles listed above.
- `test`
  - A profile with a complete configuration for automated testing of the annotation workflow
  - Includes a test dataset, so no other parameters are needed
- `custom`
  - A profile to complete according to your dataset and experiment
Path to the input FASTA file to annotate. Please note that the input data must not be compressed.
Set to "n" for nucleic acid sequence input or to "p" for protein sequences.
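If you are unsure which value applies to your file, a quick way to check is to look at the residue alphabet of the first sequence. The helper below is a hypothetical illustration, not part of the pipeline:

```shell
# Hypothetical helper: guess whether a FASTA file holds nucleic ("n") or
# protein ("p") sequences by checking the first sequence against the
# nucleotide alphabet (ACGTUN). Heuristic only.
guess_seq_type() {
    local first_seq
    # Keep only the lines of the first record, uppercase them.
    first_seq=$(awk '/^>/{n++; next} n==1{print}' "$1" | tr -d '\n' | tr 'a-z' 'A-Z')
    if [ -z "${first_seq//[ACGTUN]/}" ]; then
        echo "n"
    else
        echo "p"
    fi
}
```

For example, `guess_seq_type input.fasta` (hypothetical file name) would print `n` for a transcriptome.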
The tool used to compare your sequences to the reference database. Can be "PLAST", "BLAST" or "diamond".
Size of the FASTA file chunks.
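Chunking splits the input FASTA into pieces so they can be processed in parallel. A standalone sketch of the idea, splitting by number of sequences (hypothetical helper; the pipeline's own splitting may differ):

```shell
# Hypothetical sketch: split a FASTA file into chunks of N sequences each
# (chunk_000.fa, chunk_001.fa, ...), written to the current directory.
split_fasta() {
    awk -v size="$2" '
        /^>/ { if (n % size == 0) file = sprintf("chunk_%03d.fa", n / size); n++ }
        { print > file }
    ' "$1"
}
```

For example, `split_fasta input.fasta 1000` would produce chunks of 1000 sequences each.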
Set to true or false to enable or disable the automated installation of banks (default = true).
Path to annotated sequence banks.
List of banks to install. Accepted values are: Uniprot_SwissProt, Refseq_protein, Uniprot_TrEMBL. This list can be completed with: NCBI_Taxonomy, Enzyme. Multiple bank names can be set with a comma separator.
Before the annotation processes, if your input file is a transcriptome, ORSON can perform a completeness analysis of your transcriptome using BUSCO.
Set to true or false to enable or disable the BUSCO completeness analysis of your transcriptome (default = false).
Path to the BUSCO lineage matching your transcriptome. Multiple lineages can be set with a comma separator.
If you set `--hit_tool` to "PLAST", sequence comparison will be done using PLAST.
Set the path to the PLAST-formatted database of your choice. The reference database must contain protein sequences (default = UniProt SwissProt).
If you set `--hit_tool` to "BLAST" or "diamond", sequence comparison will be done using BLAST or diamond.
Enable BLAST search against a taxonomically restricted nr database. Only active with an nr BLAST search (default = false).
NCBI Taxonomy ID used to restrict the nr database for a restricted BLAST search.
Set the path to the BLAST-formatted database of your choice. The reference database must contain protein sequences (default = UniProt SwissProt).
This optional process uses InterProScan to provide functional analysis of proteins by classifying them into families and predicting domains and important sites.
Set to true or false to enable or disable the InterProScan analysis (default = true).
This optional process uses eggNOG mapper to provide fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database to transfer functional information from fine-grained orthologs only.
Set to true or false to enable or disable the eggNOG mapper annotation (default = false).
This optional process uses BeeDeeM to add complementary annotations to previously identified hits.
Set to true or false to enable or disable the BeeDeeM annotation (default = true).
Type of annotation to include in the results. Can be "bco" or "full". Use "bco" to only retrieve biological classification information (e.g. IDs from Gene Ontology, Enzyme, NCBI Taxonomy, InterPro, Pfam). Use "full" to retrieve full feature tables in addition to biological classification information.
Look at BeeDeeM Annotator documentation for more information.
Each step in the pipeline has a default set of requirements for the number of CPUs, memory and time. For most steps, if a job exits with error code 143 (exceeded requested resources), it is automatically resubmitted with higher requests (2 × original, then 3 × original). If it still fails after three attempts, the pipeline is stopped.
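The escalation amounts to multiplying the base request by the attempt number and giving up after three attempts. A hypothetical sketch in shell (Nextflow itself implements this through its retry `errorStrategy`, not through a script like this):

```shell
# Hypothetical sketch of the retry escalation: the memory request grows with
# the attempt number (1x, 2x, 3x the original); after three failed attempts
# the pipeline stops.
next_memory_gb() {
    local base_gb=$1 attempt=$2
    if [ "$attempt" -gt 3 ]; then
        echo "stopped"    # pipeline gives up after the third attempt
    else
        echo "$((base_gb * attempt))"
    fi
}
```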
The output directory where the results will be published.
The temporary directory where intermediate data will be written.
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits.
Same as `--email`, except it only sends mail if the workflow is not successful.
Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.
NB: Single hyphen (core Nextflow option)
Specify the path to a specific config file (this is a core Nextflow option).
NB: Single hyphen (core Nextflow option)
Note - you can use this to override pipeline defaults.
Use to set a top limit for the default memory requirement of each process. Should be a string in the format integer-unit, e.g. `--max_memory '8.GB'`.
Use to set a top limit for the default time requirement of each process. Should be a string in the format integer-unit, e.g. `--max_time '2.h'`.
Use to set a top limit for the default CPU requirement of each process. Should be an integer, e.g. `--max_cpus 1`.
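If you prefer, the three caps can also be set in a config file rather than on the command line. This relies on the standard Nextflow `params` scope; the values below are only examples:

```
// hypothetical snippet for a custom config file
params {
    max_memory = '8.GB'
    max_time   = '2.h'
    max_cpus   = 4
}
```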
Set to receive plain-text e-mails instead of HTML formatted.
Set to disable colourful command line output and live life in monochrome.