Benchmarks of several metagenome community profilers

This repository contains Snakemake workflows for benchmarking tools which report microbial taxonomies and their relative abundance in microbial communities, using metagenome sequencing as input. It focuses particularly on comparing the SingleM microbial profiler to others, but can be adapted to new profilers so long as they can output GTDB R207-based taxonomy profiles.

The benchmarks are:

1_novel_strains/ (i.e. 'known species benchmark') - benchmark profilers using communities simulated from genomes which have been assigned taxonomies at the species level in GTDB (the genomes chosen are not representative genomes, however).
2_phylogenetic_novelty/ - benchmark profilers on community profiles made up of a novel lineage and a known species, at equal abundance. This benchmark tests the ability of profilers to detect and classify new lineages.
3_cami2_marine/ - benchmark profilers on CAMI2 marine datasets, after converting the taxonomy to GTDB R207-based taxonomy.
4_complex_and_novel/ - benchmark profilers on a complex community (defined by the CAMI2 marine coverages), where 0-100% of the community is new in GTDB R214 compared to R207.

To get this repository, clone it with the --recursive option so that the submodules are fetched as well:

git clone --recursive https://github.com/wwood/singlem-benchmarking
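If the repository was already cloned without --recursive, the submodules can be fetched in place. The following is a minimal sketch: `submodule_populated` is a hypothetical helper, not part of the repository, and it assumes the submodule lives in a directory named singlem at the top level of the checkout.

```shell
# Hypothetical helper (not part of the repository): succeeds only if the
# given directory exists and contains at least one entry.
submodule_populated() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Only act inside a git checkout. The directory name singlem is an assumption.
if [ -e .git ] && ! submodule_populated singlem; then
    git submodule update --init --recursive
fi
```

Run this from the top level of the checkout. It does nothing when the submodule directory already has content.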

To run a benchmark, first create a conda environment:

cd singlem-benchmarking
mamba env create -n singlem-benchmarking -f env.yml

Then activate it:

conda activate singlem-benchmarking
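Before starting a long run, it can help to confirm that the command-line tools used in this README are on the PATH. This is a sketch only: `check_tools` is a hypothetical helper, and whether env.yml provides every tool named in the usage note below is an assumption.

```shell
# Hypothetical helper: print every named command that is not on PATH and
# return non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_tools snakemake datasets parallel` checks the three commands that appear in the steps of this README.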

Next, download the reference databases for each tool:

snakemake --snakefile gather_tool_databases.smk --use-conda -c 8

The Metabuli R207 database is downloaded separately. Download the tar.gz file from https://connectqutedu.sharepoint.com/:u:/s/metabuli_gtdb_207/EYk7N71mp-NAtET5_X_fBDABM6AC_DCbxGiDc2rdVVlNiw?e=Ra5rVZ and put it into a new folder, tool_reference_data/metabuli. Then extract it there with:

tar -xvf metabuli.tar.gz
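The manual Metabuli steps can be sketched as a small function. `install_metabuli_db` is a hypothetical name, and the sketch assumes that the archive has already been downloaded as metabuli.tar.gz and that it should be unpacked inside tool_reference_data/metabuli. The layout the workflows expect after extraction is not confirmed here.

```shell
# Hypothetical helper: copy the downloaded archive into the target folder
# and unpack it there.
install_metabuli_db() {
    archive="$1"                               # path to the downloaded metabuli.tar.gz
    dest="${2:-tool_reference_data/metabuli}"  # folder named in the text above
    mkdir -p "$dest" || return 1
    cp "$archive" "$dest/" || return 1
    # Unpack in a subshell so the caller's working directory is unchanged.
    (cd "$dest" && tar -xvf "$(basename "$archive")")
}
```

From the repository root, `install_metabuli_db ~/Downloads/metabuli.tar.gz` would create the folder and extract the archive into it. The download path is only an example.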

Then run a benchmark, for instance #1:

```bash
cd 1_novel_strains
./run_benchmark.sh
```

Results can be viewed by rerunning the plot.ipynb in each benchmark directory, and then the plot_overall.ipynb notebook in the base directory.

Download genomes for benchmark #2

Using the NCBI datasets CLI (available on conda as ncbi-datasets-cli=14.29.0):

cd 2_phylogenetic_novelty
cd genomes
datasets download genome accession --inputfile ../genome_accessions.txt
unzip ncbi_dataset.zip

# Rename files to simple names (e.g. GCA_000508305.1_genomic.fna)
parallel --col-sep "\t" cp {1} {2} :::: ../genome_ncbi_names.tsv

cd ../genome_pairs
datasets download genome accession --inputfile ../genome_pairs_accessions.txt
unzip ncbi_dataset.zip

# Rename files to simple names (e.g. GCA_000508305.1_genomic.fna)
parallel --col-sep "\t" cp {1} {2} :::: ../genome_pairs_ncbi_names.tsv
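The two renaming steps read a tab-separated file and copy the path in the first column to the name in the second. Where GNU parallel is not available, a sequential loop is one possible substitute. This is a sketch under assumptions: `copy_from_tsv` is a hypothetical helper, and it assumes each TSV line holds exactly two columns (source path, then destination path).

```shell
# Hypothetical helper: for each line of a two-column, tab-separated file,
# copy the path in column 1 to the path in column 2.
copy_from_tsv() {
    tsv="$1"
    # The "|| [ -n "$src" ]" clause keeps a final line without a trailing newline.
    while IFS="$(printf '\t')" read -r src dst || [ -n "$src" ]; do
        [ -n "$src" ] || continue      # skip blank lines
        cp "$src" "$dst" || return 1
    done < "$tsv"
}
```

For instance, `copy_from_tsv ../genome_ncbi_names.tsv` run inside the genomes directory would stand in for the first parallel command. It copies one file at a time, so it is slower than parallel on large lists.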
