Important
This project is still under development. A license and a Zenodo DOI will be added after the first release.
exscan.nf is a bioinformatics pipeline designed to scan DNA or protein sequences for key features of interest, with a focus on aiding in genomic annotation and comparative genomics studies.
The pipeline is implemented using Nextflow, and performs the following steps:
- Translation of ORFs (if input is DNA): uses seqkit2 to translate sequences into all possible open reading frames (ORFs).
- Profile HMM search: queries each translated ORF or raw protein sequence against a profile HMM database using hmmscan.
- Perform different operations on each query result. These include, among others:
  - Filtering the results by e-value, score, and coverage.
  - Selecting the best scoring hit for each translated ORF or full protein sequence.
  - Comparing hits with a GFF file to retain the features intersecting with the hits.
  - Writing profile HMM hits as FASTA, GFF, CSV, and other formats.
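The filtering and best-hit selection listed above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the `Hit` record, its field names, the threshold values, and the definition of coverage (fraction of the profile covered by the alignment) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One profile HMM hit; field names are hypothetical, not the pipeline's schema."""
    query: str      # translated ORF or protein sequence ID
    target: str     # profile HMM name
    evalue: float
    score: float    # bit score
    hmm_from: int   # 1-based start of the alignment on the profile
    hmm_to: int     # 1-based inclusive end of the alignment on the profile
    hmm_len: int    # total length of the profile

    @property
    def coverage(self) -> float:
        """Fraction of the profile covered by the alignment (assumed definition)."""
        return (self.hmm_to - self.hmm_from + 1) / self.hmm_len


def filter_hits(hits, max_evalue=1e-5, min_score=25.0, min_coverage=0.5):
    """Keep hits that pass all three thresholds (values are illustrative only)."""
    return [
        h for h in hits
        if h.evalue <= max_evalue
        and h.score >= min_score
        and h.coverage >= min_coverage
    ]


def best_hit_per_query(hits):
    """Return a dict mapping each query ID to its highest-scoring hit."""
    best = {}
    for h in hits:
        current = best.get(h.query)
        if current is None or h.score > current.score:
            best[h.query] = h
    return best


# Made-up example data.
hits = [
    Hit("orf1", "profileA", 1e-30, 120.0, 1, 90, 100),
    Hit("orf1", "profileB", 1e-10, 45.0, 10, 80, 100),
    Hit("orf2", "profileC", 1e-2, 12.0, 1, 50, 100),   # fails e-value and score
    Hit("orf3", "profileD", 1e-20, 80.0, 1, 20, 100),  # fails coverage (0.2)
]
kept = filter_hits(hits)
best = best_hit_per_query(kept)
for query, hit in best.items():
    print(query, hit.target, hit.score)
```

Filtering before selecting the best hit means a query whose top-scoring hit fails a threshold can still be represented by a weaker hit that passes. Whether the pipeline applies the steps in this order is not stated here.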
All operations are handled via python, biopython, jq, and bedtools.
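To make the GFF comparison step concrete, here is a minimal pure-Python sketch of retaining features that overlap a hit. It is a conceptual illustration only and does not reproduce how the pipeline invokes its tools. The function names and the `(seqid, start, end)` hit layout are hypothetical, and the sketch assumes hit coordinates have already been mapped to the same coordinate system as the GFF file.

```python
def parse_gff_line(line):
    """Parse one GFF feature line into (seqid, type, start, end).

    GFF coordinates are 1-based and inclusive.
    """
    cols = line.rstrip("\n").split("\t")
    return cols[0], cols[2], int(cols[3]), int(cols[4])


def overlaps(start_a, end_a, start_b, end_b):
    """True if two closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


def features_intersecting_hits(gff_lines, hit_intervals):
    """Return GFF features that overlap any hit on the same sequence.

    hit_intervals: list of (seqid, start, end) tuples, 1-based inclusive
    (a hypothetical layout chosen for this example).
    """
    retained = []
    for line in gff_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        seqid, ftype, start, end = parse_gff_line(line)
        if any(
            seqid == h_seq and overlaps(start, end, h_start, h_end)
            for h_seq, h_start, h_end in hit_intervals
        ):
            retained.append((seqid, ftype, start, end))
    return retained


# Made-up example data.
gff_lines = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t500\t.\t+\t.\tID=gene1",
    "chr1\texample\tgene\t900\t1200\t.\t-\t.\tID=gene2",
    "chr2\texample\tgene\t100\t500\t.\t+\t.\tID=gene3",
]
hit_intervals = [("chr1", 450, 600)]
retained = features_intersecting_hits(gff_lines, hit_intervals)
print(retained)
```

Only the first feature is retained: the second lies on the same sequence but does not overlap the hit, and the third overlaps in coordinates but sits on a different sequence.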
You can run the pipeline using:
nextflow run main.nf --fasta sequences.fasta --hmmdb hmmdb.hmm
Or alternatively:
nextflow run main.nf -params-file param_files/params.yaml
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
#TODO To see the results of an example test run with a full size dataset refer to
For more details about the output files and reports, please refer to the output documentation.
exscan.nf was originally written by Joan LLuis Pons Ramon at the Station Biologique de Roscoff.
We thank the following people for their extensive assistance in the development of this pipeline:
- #TODO
This work was supported by the HORIZON–MSCA-2022-DN program of the European Commission under the Grant Agreement No 101120280.
Still to be decided.
- Wei Shen, Botond Sipos, and Liuyang Zhao, “SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing,” iMeta 3, no. 3 (June 2024): e191, https://doi.org/10.1002/imt2.191.
- Sean R. Eddy, “Accelerated Profile HMM Searches,” ed. William R. Pearson, PLoS Computational Biology 7, no. 10 (October 2011): e1002195, https://doi.org/10.1371/journal.pcbi.1002195.
- Peter J. A. Cock et al., “Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics,” Bioinformatics 25, no. 11 (June 2009): 1422–23, https://doi.org/10.1093/bioinformatics/btp163.
- Aaron R. Quinlan and Ira M. Hall, “BEDTools: A Flexible Suite of Utilities for Comparing Genomic Features,” Bioinformatics 26, no. 6 (March 2010): 841–42, https://doi.org/10.1093/bioinformatics/btq033.