The MetaGeneMark-2 source code, data, and experiments to reproduce published results.





Experiments for MetaGeneMark-2

Georgia Institute of Technology, Atlanta, Georgia, USA

Reference: PAPER LINK


This repository contains the data and source code needed to reproduce all results found in the MetaGeneMark-2 paper.

Program Versions

MetaGeneMark-2 is a standalone tool, but building the initial set of models relies on GeneMarkS-2 predictions. Results are also compared against several external tools, whose versions are listed here:

  • GeneMarkS-2:
  • (Meta)Prodigal:
  • MetaGeneAnnotator:
  • FragGeneScan:
  • MetaGeneMark:

MetaGeneMark-2 is a C++ program. The experiments, however, are all executed and analyzed in Python. For reproducibility, we recommend creating a conda environment with all required packages from the file install/conda_mgm2.yml using the following command:

conda env create -f install/conda_mgm2.yml --name mgms

The environment can then be activated via

conda activate mgms
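
Since the experiment drivers are Python scripts, it can be useful to confirm that they run inside the environment created above. The following is a minimal sketch, not part of the repository: it relies on the CONDA_DEFAULT_ENV variable that conda sets on activation, and the helper names are hypothetical.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if unset."""
    env = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return env.get("CONDA_DEFAULT_ENV") or None


def in_expected_env(expected="mgms", environ=None):
    """True if the active conda environment matches the expected name."""
    return active_conda_env(environ) == expected
```

A driver could call in_expected_env() at startup and print a warning when it returns False.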

See info/reproduce.[html|pdf] for more information.

Installing MetaGeneMark-2 locally

Running MetaGeneMark-2 with automatic genetic code detection is done through the script found in $code/hmm_src. The commands below compile the C++ binary and copy all the relevant components to $bin_external/mgm2_auto.

 cd code/hmm_src;
 pf_makefile=Makefile.macos    # NOTE: change based on operating system
 make -f $pf_makefile

This generates a binary named gmhmmp2.

Running MetaGeneMark-2

To run MetaGeneMark-2 with automatic genetic code detection, the following files should be in the same directory: gmhmmp2, mgm2_11.mod, mgm2_4.mod. MetaGeneMark-2 can then be run (from anywhere) using:

$path_to_binary/ --seq [name]  --out [name]
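
The requirement that the three files listed above share one directory can be checked before launching a run. This is a minimal sketch under that assumption only; the function name missing_files is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("gmhmmp2", "mgm2_11.mod", "mgm2_4.mod")


def missing_files(directory):
    """Return the required files that are absent from the given directory."""
    directory = Path(directory)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

An empty list means the directory is complete; otherwise the returned names show what still needs to be copied.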

Required options:
     --seq  [name]            nucleotide sequence of metagenome in FASTA format.
     --out  [name]            output file with coordinates of predicted protein coding genes.

Output options:

       --nt  [name]           output file with nucleotide sequences of predicted genes in FASTA format.
       --aa  [name]           output file with protein sequences of predicted genes in FASTA format.
       --format  [gtf]        format of output file with gene coordinates: gtf or gff3.
       --clean                delete temporary files
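
When MetaGeneMark-2 is called from Python drivers, the options above can be assembled into an argument list. The sketch below is illustrative only: the function names are hypothetical, and the executable path is supplied by the caller because its exact name is not spelled out here. Only the flags documented above are used.

```python
import subprocess


def build_command(executable, seq, out, nt=None, aa=None, fmt=None, clean=False):
    """Assemble the argument list for one MetaGeneMark-2 run.

    `executable` is whatever path you use to invoke the program (an assumption
    of this sketch); all flags are the ones listed in the options above.
    """
    if fmt is not None and fmt not in ("gtf", "gff3"):
        raise ValueError("format must be 'gtf' or 'gff3'")
    cmd = [str(executable), "--seq", str(seq), "--out", str(out)]
    if nt is not None:
        cmd += ["--nt", str(nt)]      # nucleotide sequences of predicted genes
    if aa is not None:
        cmd += ["--aa", str(aa)]      # protein sequences of predicted genes
    if fmt is not None:
        cmd += ["--format", fmt]      # gene coordinate format
    if clean:
        cmd.append("--clean")         # delete temporary files
    return cmd


def run_command(cmd):
    """Run the assembled command and raise if it exits with an error."""
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.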

Other parameters:

Reproducing Results

We provide a document detailing how to reproduce all results; it can be found at info/reproduce.[html|pdf].

Folder structure

The following directory structure should be maintained while running experiments:

├── bin                                   # Executables constructed from python/bash drivers (via
├── bin_external                          # External tools
├── config                                # Configuration files, e.g. MetaGeneMark-2 learning parameters
├──                             # Load bash variables for paths to different directories
├── install                               # Conda environment file for easy installation
├── lists                                 # Lists of genomes (main input method to scripts)
├── info                                  # Information about reproducing results
├── metadata                              # Non-genomic data, including taxonomy information
├── data                                  # Data Location: where all raw data will be stored during runs
│   ├── GCFID 1                           # ID of genome 1
│   │   ├── ncbi.gff                      # RefSeq annotation
│   │   ├── sequence.fasta                # Genomic sequence file
│   ├── GCFID 2                           # ID of genome 2
│   │   ├── ncbi.gff                      # RefSeq annotation
│   │   ├── sequence.fasta                # Genomic sequence file
│   │   ...
├── code                                  # Source code
│   ├── python                            # Python code
│   │   ├── driver                        # Drivers that can be executed
│   │   ├── lib                           # Library files
│   ├── mgms                              # MetaGeneMarkS (C++) source code and Makefile
│   ├── bash                              # Bash scripts
│   │   ├── driver                        # Drivers that can be executed
│   │   ├── lib                           # Library files
├── runs                                  # Data Location: where all raw data will be stored during runs
│   ├── GCFID 1                           # ID of genome 1
│   │   ├── startlink                     # StartLink runs
│   │   ├── mgms                          # MGMS runs
│   │   ├── others...                     # Other tools
│   ├── GCFID 2                           # ID of genome 2
│   │   ├── startlink                     # StartLink runs
│   │   ├── mgms                          # MGMS runs
│   │   ├── others...                     # Other tools
│   │   ...
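
The per-genome layout of the data directory shown above (one folder per genome ID holding ncbi.gff and sequence.fasta) can be verified before starting experiments. This is a minimal sketch assuming exactly that layout; the function name incomplete_genomes is hypothetical and not part of the repository.

```python
from pathlib import Path

# Files expected in every genome folder, per the tree above.
EXPECTED_FILES = ("ncbi.gff", "sequence.fasta")


def incomplete_genomes(data_dir):
    """Map each genome folder name to the expected files it is missing."""
    problems = {}
    for genome_dir in sorted(Path(data_dir).iterdir()):
        if not genome_dir.is_dir():
            continue  # ignore stray files at the top level of data/
        missing = [n for n in EXPECTED_FILES if not (genome_dir / n).is_file()]
        if missing:
            problems[genome_dir.name] = missing
    return problems
```

An empty dictionary indicates that every genome folder contains both the annotation and the sequence file.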

