MetEOr is an innovative pipeline for discovering enzymatic orthologs using metagenomic data. Developed as part of a Master's thesis, this project aims to streamline the identification and analysis of novel enzyme variants of biotechnological and ecological interest.
- Automated processing of metagenomic data
- Identification of enzymatic orthologs
- Network-based analysis of enzyme relationships
- Phylogenetic tree construction and annotation
- Integration with KEGG Orthology (KO) database
- `emap_env.yml`: Conda environment file for eggNOG-mapper dependencies.
- `ete_diamond_env.yml`: Conda environment file for ETE Toolkit and DIAMOND dependencies.
- `KOs_ids.txt`: Input file containing KEGG Orthology (KO) identifiers for analysis.
- `main_script.py`: The primary Python script for the pipeline.
- `bin/`: SLURM scripts for job submission in a cluster environment:
  - `KO2fasta.slurm`: Step 1 - Retrieve homologous sequences
  - `treebuilder.slurm`: Step 2 - Build tree and identify candidate enzymes
  - `SSN.slurm`: Step 3 - Build Sequence Similarity Networks (SSN)
  - `emapper.slurm`: Step 4 - Functional annotation of sequences
  - `treeannotator.slurm`: Step 5 - Tree annotation
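To illustrate how the five steps could be submitted in order, here is a minimal sketch that only builds the `sbatch` command lines (it does not submit anything). It assumes that every step must wait for the previous one and that each SLURM script takes the KO identifier as its first argument; both are assumptions made for illustration, and `build_sbatch_commands` is a hypothetical helper, not part of the repository.

```python
from typing import List

# Step order as listed above; the scripts live in bin/.
STEPS = [
    "KO2fasta.slurm",
    "treebuilder.slurm",
    "SSN.slurm",
    "emapper.slurm",
    "treeannotator.slurm",
]


def build_sbatch_commands(ko_id: str, bin_dir: str = "bin") -> List[str]:
    """Return shell command lines that would submit the steps as a chain.

    Assumption: each script accepts the KO identifier as its first argument.
    """
    commands = []
    for index, script in enumerate(STEPS, start=1):
        dependency = ""
        if index > 1:
            # afterok: start only if the previous job finished successfully.
            dependency = f"--dependency=afterok:$JOB{index - 1} "
        # --parsable makes sbatch print just the job ID, so it can be captured.
        commands.append(
            f"JOB{index}=$(sbatch --parsable {dependency}{bin_dir}/{script} {ko_id})"
        )
    return commands


commands = build_sbatch_commands("K21105")
print("\n".join(commands))
```

The printed lines can be pasted into a shell script on the cluster. Steps that do not actually depend on each other could drop the `--dependency` flag and run in parallel.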
Python scripts launched by the SLURM scripts:
- `build_network.py`: Builds a network for enzymatic ortholog analysis.
- `build_network_structure.py`: Structures or extends the network built by `build_network.py`.
- `clean_diamond.py`: Cleans up data using the DIAMOND tool.
- `filter_by_length.py`: Filters sequences based on length criteria.
- `get_seqs_by_ko_hit.py`: Retrieves sequences based on hits to KEGG Orthology (KO).
- `identify_clusters_net.py`: Identifies clusters within a network context.
- `identify_clusters_structural.py`: Identifies structural clusters or patterns.
- `midpoint_tree.py`: Constructs midpoint trees for phylogenetic analysis.
- `plot_algorithm.py`: Generates plots related to algorithms or data analysis results.
- `threshold_algorithm.py`: Implements algorithms with specified thresholds.
- `trim_alignment.py`: Trims sequence alignments.
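As a rough illustration of what network-based clustering of all-vs-all hits can look like, the sketch below groups sequences into connected components. It assumes tab-separated hits whose first three columns are query ID, subject ID, and percent identity (the standard BLAST-style tabular layout that DIAMOND also produces), and it links two sequences when identity meets a threshold. The function name, the threshold, and the choice of connected components are assumptions; the repository's scripts may cluster differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def cluster_hits(lines: Iterable[str], min_identity: float = 30.0) -> List[Set[str]]:
    """Group sequence IDs into connected components of a similarity network."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        query, subject, identity = fields[0], fields[1], float(fields[2])
        # Register both sequences so unlinked ones become singleton clusters.
        root_q, root_s = find(query), find(subject)
        if query != subject and identity >= min_identity and root_q != root_s:
            parent[root_q] = root_s  # merge the two components

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    # Largest clusters first, ties broken by the smallest member ID.
    return sorted(clusters.values(), key=lambda c: (-len(c), min(c)))


hits = [
    "seqA\tseqB\t85.0",
    "seqB\tseqC\t40.0",
    "seqD\tseqE\t25.0",   # below the threshold: no edge
    "seqD\tseqD\t100.0",  # self-hit: ignored as an edge
]
clusters = cluster_hits(hits, min_identity=30.0)
print(clusters)
```

Here `seqA`, `seqB`, and `seqC` end up in one cluster, while `seqD` and `seqE` remain singletons because their only mutual hit falls below the identity threshold.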
MetEOr is designed to run on an HPC cluster. Detailed installation instructions will be provided as the project progresses; we plan to implement the pipeline in Nextflow or distribute it via a Singularity container. Initial instructions follow.
Before you begin, ensure you have the following:
- Python 3.7+
- SLURM on an HPC (High-Performance Computing) cluster to run the program
- Miniconda (for managing environments)
If you don't have Miniconda installed, follow these steps:
- Download Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- Make the installer executable
chmod +x Miniconda3-latest-Linux-x86_64.sh
- Run the installer
./Miniconda3-latest-Linux-x86_64.sh
- Restart your terminal or run:
source ~/.bashrc
- Install mamba for faster environment solving
conda install mamba -n base -c conda-forge
- Clone the repository and move into it
git clone https://github.com/jacgonisa/MSc_thesis.git
cd MSc_thesis
- Create the eggNOG-mapper environment
mamba env create -f emap_env.yml
- Create the ETE Toolkit and DIAMOND environment
mamba env create -f ete_diamond_env.yml
Further instructions on this will be shared.
Comprehensive usage instructions and examples will be added as the project develops. The pipeline will be designed for ease of use in both local and cluster environments.
To execute `main_script.py`, use the following command in your terminal:
python main_script.py --min_identity 30 --min_seq_cov 30 --min_kegg_seq_cov 30 --database combined --annotation swissprot KO_list.txt
- `--min_identity`: Specifies the minimum identity value for sequence similarity. In this example, it is set to `30`. This value determines how similar sequences must be to be retrieved.
- `--min_gtdb_seq_cov`: Sets the minimum sequence coverage from GTDB (Genome Taxonomy Database). It is also set to `30` in this example. Sequence coverage indicates how much of the target sequence is covered by the query sequence.
- `--min_kegg_seq_cov`: Defines the minimum sequence coverage from KEGG (Kyoto Encyclopedia of Genes and Genomes). Similar to the other metrics, it is set to `30`.
- `--database`: Specifies which database to use for analysis. In this case, `combined` is chosen, indicating a combination of data from GTDB and metagenomic databases. Alternatively, `gtdb` can be specified to use only GTDB data.
- `--annotation`: Defines the annotation method to use during the analysis. Here, `swissprot` is selected, indicating the use of annotations from the Swiss-Prot database. Alternatively, `default` annotations may be applied, which include only eggNOG-mapper annotations.
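The options described above could be declared with `argparse` roughly as follows. This is a sketch, not the parser in `main_script.py`: the option names follow the descriptions above, while the numeric types, the default values, and the positional name `ko_list` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the MetEOr command-line options"
    )
    parser.add_argument("--min_identity", type=float, default=30.0,
                        help="Minimum percent identity for a sequence to be retrieved")
    parser.add_argument("--min_gtdb_seq_cov", type=float, default=30.0,
                        help="Minimum sequence coverage for GTDB hits")
    parser.add_argument("--min_kegg_seq_cov", type=float, default=30.0,
                        help="Minimum sequence coverage for KEGG hits")
    parser.add_argument("--database", choices=["combined", "gtdb"],
                        default="combined",
                        help="GTDB plus metagenomic data, or GTDB only")
    parser.add_argument("--annotation", choices=["swissprot", "default"],
                        default="default",
                        help="Swiss-Prot annotations, or eggNOG-mapper only")
    parser.add_argument("ko_list",
                        help="Text file with KEGG Orthology identifiers")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args([
    "--min_identity", "30",
    "--min_gtdb_seq_cov", "30",
    "--min_kegg_seq_cov", "30",
    "--database", "combined",
    "--annotation", "swissprot",
    "KO_list.txt",
])
print(args)
```

Restricting `--database` and `--annotation` with `choices` makes the parser reject unsupported values before any job is submitted.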
The pipeline generates the following directory structure for each KO code (e.g., K21105):
K21105/
├── 01-KO2GTDB/
│ ├── K21105_matches_onlyIDs_combined.tsv
│ ├── K21105_matches_onlyIDs_gtdb.tsv
│ ├── K21105_matches_onlyIDs.tsv
│ ├── K21105_matches.tsv
│ └── KEGG_sim_cov.tsv
├── 02-GTDB2fasta/
│ ├── blastp_swissprot_filtered.txt
│ ├── blastp_swissprot.txt
│ ├── K21105_protein_length.tsv
│ ├── K21105_sequences.faa
│ ├── K21105_sequences_filtered.faa (MSA IS BUILT BASED ON THIS FILE!)
│ ├── K21105_sequences_filtered.faa.db.dmnd
│ ├── K21105_sequences.fna
│ ├── out.emapper.annotations
│ ├── out.emapper.decorated.gff
│ ├── out.emapper.hits
│ ├── out.emapper.pfam
│ └── out.emapper.seed_orthologs
├── 03-MSA/
│ ├── KO_code.faa.alg
│ └── KO_code_trimmed.faa.alg
├── 04-trees/
│ ├── annotation_alignment_taxa_SSN/
│ │ ├── K21105_rooted.faa.alg_annotated.ete (READY TO VISUALIZE!)
│ │ ├── K21105_rooted.faa.alg_annotated.nw
│ │ ├── K21105_rooted.faa.alg_annotated.tsv
│ │ └── K21105_rooted.faa.alg_prop2type.txt
│ ├── annotation_alignment_taxa_SSN_emapper/
│ │ ├── K21105_rooted.faa.alg_annotated_annotated.ete (READY TO VISUALIZE!)
│ │ ├── K21105_rooted.faa.alg_annotated_annotated.nw
│ │ ├── K21105_rooted.faa.alg_annotated_annotated.tsv
│ │ └── K21105_rooted.faa.alg_annotated_prop2type.txt
│ ├── K21105.faa.alg.nw
│ ├── K21105_rooted.faa.alg.nw
│ └── KEGG_sim_cov.tsv
├── 05-SSN/
│ ├── K21105_allvsall
│ ├── K21105_allvsall.clean
│ ├── K21105_allvsall.clean.net
│ ├── K21105_allvsall.clean.net.clusters.tmp
│ └── K21105_allvsall.clean.net.clusters.tsv (METADATA WITH SSN CLUSTERING INFORMATION!)
└── threshold_analysis/
  ├── candidate_genes.tsv (IMPORTANT OUTPUT!)
  ├── dotplot.png
  ├── heatmap.png
  ├── K21105_rooted.faa.alg_leaf_names.txt
  ├── K21105_surviving_columns.txt
  ├── K21105_threshold_analysis.csv
  ├── scatterplot_candidate_genes.png
  ├── scatterplot.png
  └── special_genes.tsv
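Once a run finishes, the files flagged in the tree above can be located programmatically. The sketch below builds their paths from a KO code and reports which ones are missing; the helper names `key_outputs` and `missing_outputs` are hypothetical, and the layout is taken directly from the listing above.

```python
from pathlib import Path
from typing import Dict, List


def key_outputs(ko_id: str, root: str = ".") -> Dict[str, Path]:
    """Map a short label to the path of each highlighted output file."""
    base = Path(root) / ko_id
    return {
        # Input of the multiple sequence alignment.
        "filtered_sequences": base / "02-GTDB2fasta" / f"{ko_id}_sequences_filtered.faa",
        # Annotated tree that includes eggNOG-mapper annotations.
        "annotated_tree": base / "04-trees" / "annotation_alignment_taxa_SSN_emapper"
        / f"{ko_id}_rooted.faa.alg_annotated_annotated.ete",
        # Metadata with SSN clustering information.
        "ssn_clusters": base / "05-SSN" / f"{ko_id}_allvsall.clean.net.clusters.tsv",
        # Candidate genes from the threshold analysis.
        "candidate_genes": base / "threshold_analysis" / "candidate_genes.tsv",
    }


def missing_outputs(ko_id: str, root: str = ".") -> List[str]:
    """Return the labels of highlighted outputs that do not exist on disk."""
    return sorted(
        label for label, path in key_outputs(ko_id, root).items()
        if not path.is_file()
    )


paths = key_outputs("K21105", "results")
print(paths["candidate_genes"].as_posix())
print(missing_outputs("K21105", "results"))
```

An empty list from `missing_outputs` suggests that all highlighted files were written for that KO code; it does not check their contents.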
We welcome contributions to MetEOr! Here's how you can help:
- Fork the repository
- Create a new branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License. See the LICENSE file for details.
For questions, suggestions, or collaborations, please contact: