- Introduction
- Install the pipeline
- Running the pipeline
- Main arguments
- Mandatory arguments
- Generic arguments
- Installing annotated sequence banks
- BUSCO analysis
- PLAST search
- BLAST or diamond search
- InterProScan analysis
- eggNOG mapper annotation
- BeeDeeM annotation
- Job resources
- Other command line parameters
Nextflow handles job submissions on SLURM or other environments, and supervises the running jobs. The Nextflow process must therefore keep running until the pipeline is finished. We recommend that you run the process in the background through `screen`/`tmux` or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted to your job scheduler.
It is recommended to limit the memory used by the Nextflow Java virtual machine. We recommend adding the following line to your environment (typically in `~/.bashrc` or `~/.bash_profile`):

```bash
export NXF_OPTS='-Xms1g -Xmx4g'
```
Make sure that Nextflow is installed on your system, along with either Docker or Singularity, to allow full reproducibility.
How to install orson:

```bash
git clone https://gitlab.ifremer.fr/bioinfo/orson
```
To use this workflow on a computing cluster, you must provide a configuration file for your system. For some institutes this file already exists and is referenced on nf-core/configs. If so, you can simply download your institute's custom config file and use `-c <institute_config_file>` in the launch command.
If your institute does not have a referenced config file, you can create one using the files from other infrastructures as templates.
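As a starting point, a minimal custom config for a SLURM cluster might look like the sketch below. The file name, queue name and resource values are hypothetical placeholders to adapt to your infrastructure; the scopes shown (`process`, `singularity`) are standard Nextflow configuration:

```
// conf/custom.config -- hypothetical sketch, adapt to your cluster
process {
    executor = 'slurm'
    queue    = 'fast'      // placeholder partition name
    cpus     = 2
    memory   = '8.GB'
    time     = '2.h'
}
singularity {
    enabled    = true
    autoMounts = true
}
```

You would then pass it to the launch command with `-c conf/custom.config`.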
The simplest command for running the pipeline is:

```bash
nextflow run main.nf -profile test,singularity
```

This will launch the pipeline with the `test` configuration profile using `singularity`. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:

```
work            # Directory containing the Nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs
```
When you run the above command, Nextflow runs the pipeline code from your local git clone, even if the pipeline has been updated upstream since. To make sure that you are running the latest version of the pipeline, update your clone regularly:

```bash
cd orson
git pull
```
It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the ORSON releases page and find the latest version number (e.g. `v1.0.0`). Then configure your local orson installation to use your desired version as follows:

```bash
cd orson
git checkout v1.0.0
```
Use this parameter to choose a configuration profile. Profiles provide configuration presets for different compute environments. Note that multiple profiles can be loaded, for example: `-profile test,singularity`.
If `-profile` is not specified at all, the pipeline will run locally and expect all software to be installed and available on the `PATH`.
- `singularity`
  - A generic configuration profile to be used with Singularity
  - Pulls software from DockerHub: ORSON
Profiles are also available to configure the workflow itself and can be combined with the execution profiles listed above.
- `test`
  - A profile with a complete configuration for automated testing of the annotation workflow
  - Includes a test dataset, so no other parameters are needed
- `custom`
  - A profile to complete according to your dataset and experiment
Path to the input FASTA file to annotate. Please note that the input data must not be compressed.
Set to "n" for nucleic acid sequence input or to "p" for protein sequences.
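If you are unsure which value applies to your file, a quick way to check is to look at the residue alphabet of the first sequence. The helper below is a hypothetical illustration, not part of the pipeline:

```shell
# Hypothetical helper: guess whether a FASTA file holds nucleic ("n") or
# protein ("p") sequences by checking the first sequence against the
# nucleotide alphabet (ACGTUN). Heuristic only.
guess_seq_type() {
    local first_seq
    # Keep only the lines of the first record, uppercase them.
    first_seq=$(awk '/^>/{n++; next} n==1{print}' "$1" | tr -d '\n' | tr 'a-z' 'A-Z')
    if [ -z "${first_seq//[ACGTUN]/}" ]; then
        echo "n"
    else
        echo "p"
    fi
}
```

For example, `guess_seq_type input.fasta` (hypothetical file name) would print `n` for a transcriptome.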
The tool used to compare your sequences to the reference database. Can be "PLAST", "BLAST" or "diamond".
Size of the FASTA file chunks.
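Chunking splits the input FASTA into pieces so they can be processed in parallel. A standalone sketch of the idea, splitting by number of sequences (hypothetical helper; the pipeline's own splitting may differ):

```shell
# Hypothetical sketch: split a FASTA file into chunks of N sequences each
# (chunk_000.fa, chunk_001.fa, ...), written to the current directory.
split_fasta() {
    awk -v size="$2" '
        /^>/ { if (n % size == 0) file = sprintf("chunk_%03d.fa", n / size); n++ }
        { print > file }
    ' "$1"
}
```

For example, `split_fasta input.fasta 1000` would produce chunks of 1000 sequences each.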
Set to true or false to enable or disable the automated installation of banks (default = true).
Path to annotated sequence banks.
List of banks to install. Accepted values are: Uniprot_SwissProt, Refseq_protein, Uniprot_TrEMBL. This list can be completed with: NCBI_Taxonomy, Enzyme. Multiple bank names can be set with a comma separator.
Before the annotation processes, if your input file is a transcriptome, ORSON can perform a completeness analysis of your transcriptome using BUSCO.
Set to true or false to enable or disable the BUSCO completeness analysis of your transcriptome (default = false).
Path to the BUSCO lineage matching your transcriptome. Multiple lineages can be set with a comma separator.
If you set `--hit_tool` to "PLAST", sequence comparison will be done using PLAST.
Set the path to the PLAST-formatted database of your choice. The reference database must contain protein sequences (default = UniProt SwissProt).
If you set `--hit_tool` to "BLAST" or "diamond", sequence comparison will be done using BLAST or diamond.
Enable BLAST search against a taxonomically restricted nr database. Only active with an nr BLAST search (default = false).
NCBI Taxonomy ID used to restrict the nr database for a restricted BLAST search.
Set the path to the BLAST-formatted database of your choice. The reference database must contain protein sequences (default = UniProt SwissProt).
This optional process uses InterProScan to provide functional analysis of proteins by classifying them into families and predicting domains and important sites.
Set to true or false to enable or disable the InterProScan analysis (default = true).
This optional process uses eggNOG mapper to provide fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database to transfer functional information from fine-grained orthologs only.
Set to true or false to enable or disable the eggNOG mapper annotation (default = false).
This optional process uses BeeDeeM to add complementary annotations to previously identified hits.
Set to true or false to enable or disable the BeeDeeM annotation (default = true).
Type of annotation to include in the results. Can be "bco" or "full". Use "bco" to only retrieve biological classification information (e.g. IDs from Gene Ontology, Enzyme, NCBI Taxonomy, InterPro, Pfam). Use "full" to retrieve full feature tables in addition to biological classification information.
Look at BeeDeeM Annotator documentation for more information.
Each step in the pipeline has a default set of requirements for the number of CPUs, memory and time. For most steps, if a job exits with error code 143 (exceeded requested resources), it is automatically resubmitted with higher requests (2 × original, then 3 × original). If it still fails after three attempts, the pipeline is stopped.
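The escalation amounts to multiplying the base request by the attempt number and giving up after three attempts. A hypothetical sketch in shell (Nextflow itself implements this through its retry `errorStrategy`, not through a script like this):

```shell
# Hypothetical sketch of the retry escalation: the memory request grows with
# the attempt number (1x, 2x, 3x the original); after three failed attempts
# the pipeline stops.
next_memory_gb() {
    local base_gb=$1 attempt=$2
    if [ "$attempt" -gt 3 ]; then
        echo "stopped"    # pipeline gives up after the third attempt
    else
        echo "$((base_gb * attempt))"
    fi
}
```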
The output directory where the results will be published.
The temporary directory where intermediate data will be written.
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits.
Same as `--email`, except it only sends mail if the workflow is not successful.
Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.
NB: Single hyphen (core Nextflow option)
Specify the path to a specific config file (this is a core Nextflow option).
NB: Single hyphen (core Nextflow option)
Note - you can use this to override pipeline defaults.
Use to set a top limit for the default memory requirement of each process. Should be a string in the format integer-unit, e.g. `--max_memory '8.GB'`.
Use to set a top limit for the default time requirement of each process. Should be a string in the format integer-unit, e.g. `--max_time '2.h'`.
Use to set a top limit for the default CPU requirement of each process. Should be an integer, e.g. `--max_cpus 1`.
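If you prefer, the three caps can also be set in a config file rather than on the command line. This relies on the standard Nextflow `params` scope; the values below are only examples:

```
// hypothetical snippet for a custom config file
params {
    max_memory = '8.GB'
    max_time   = '2.h'
    max_cpus   = 4
}
```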
Set to receive plain-text e-mails instead of HTML formatted.
Set to disable colourful command line output and live life in monochrome.