Deep learning for Metagenome Assembly Error Detection

leylabmpi/DeepMAsED

DeepMAsED

Deep learning for Metagenome Assembly Error Detection (DeepMAsED)

"mased"

Middle English term: misled, bewildered, amazed, or perplexed

Citation

Mineeva, Olga, Mateo Rojas-Carulla, Ruth E. Ley, Bernhard Schölkopf, and Nicholas D. Youngblut. 2020. "DeepMAsED: Evaluating the Quality of Metagenomic Assemblies." Bioinformatics, February.

Main Description

The tool is divided into two main parts:

  • DeepMAsED-SM
    • A snakemake pipeline for generating DeepMAsED train/test datasets from reference genomes
  • DeepMAsED-DL
    • A python package for misassembly detection via deep learning

Setup

Via the conda recipe

The simplest approach is to use the conda recipe:

conda create -n deepmased bioconda::deepmased

[alternative] The piecemeal setup

Dependency setup via conda

  • [If needed] Install miniconda (or anaconda)
  • See the conda create line in the .travis.yml file.
  • If just using DeepMAsED-SM:
    • conda create -n snakemake conda-forge::pandas bioconda::snakemake

Testing the DeepMAsED package (optional)

pytest -s

Installing the DeepMAsED package into the conda environment

  • Via setup.py
    • python setup.py install
  • Via pip
    • pip install DeepMAsED

Usage

Example of classifying contig misassemblies

You need to have the following input:

  • fasta of metagenome assembly contigs (uncompressed)
  • BAM file of metagenome reads mapped to the contigs

Create table mapping BAM & fasta files

If you have multiple sets of contigs (e.g., MAGs) and BAM files, this table specifies which contigs go with which BAM file.

Create a tab-delimited table with the columns: bam<tab>fasta (header required)

This will be your bam_fasta_table, which is needed for creating the features.
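As a rough illustration of the table layout described above, the following Python sketch builds the tab-delimited text with the required header. The function name `format_bam_fasta_table` and the file names are hypothetical and not part of DeepMAsED; only the `bam` and `fasta` column names come from the description above.

```python
def format_bam_fasta_table(pairs):
    """Return tab-delimited text with a 'bam<tab>fasta' header row.

    `pairs` is an iterable of (bam_path, fasta_path) tuples: one tuple per
    set of contigs and the BAM file of reads mapped to those contigs.
    """
    lines = ["bam\tfasta"]  # the header row is required
    for bam, fasta in pairs:
        lines.append(f"{bam}\t{fasta}")
    return "\n".join(lines) + "\n"


# Hypothetical file names, for illustration only.
example_pairs = [
    ("sample1.bam", "sample1_contigs.fasta"),
    ("sample2.bam", "sample2_contigs.fasta"),
]
print(format_bam_fasta_table(example_pairs), end="")
```

Saving the printed text to a file gives a table that can be passed as $bam_fasta_table.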

Create feature table(s)

DeepMAsED features $bam_fasta_table

This generates one or more feature tables and a table listing all output files (the "feature_file_table"). The feature_file_table is the input for DeepMAsED predict.

Predict misassemblies

DeepMAsED predict $feature_file_table

...where feature_file_table is the path to the table that lists all feature files (see above).

--force-overwrite forces the re-creation of the pkl files, which is a bit slower but can prevent issues.

Change --save-path to set the output directory. Use --cpu-only to just use CPUs instead of a GPU.

Inspect the output

By default, the predictions will be written to deepmased_predictions.tsv.

Example output
Collection     Contig  Deepmased_score
0       NODE_1156_length_5232_cov_4.046938      0.0007264018
0       NODE_1563_length_3868_cov_5.851298      0.03783685
0       NODE_4288_length_1225_cov_3.235897      0.070887744
1       k141_9081       8.8751316e-05
1       k141_2594       6.720424e-05
1       k141_4878       0.0015754104
2       NODE_5204_length_1290_cov_3.283401      0.00036007166
2       NODE_2848_length_2164_cov_2.982456      0.0005029738
2       NODE_446_length_6027_cov_5.812291       0.068261534

See Mineeva et al., 2020 to help decide what score cutoff is prudent for classifying misassembled contigs.
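Once a cutoff has been chosen, flagging contigs is a simple filter over the predictions table. The sketch below assumes the file is tab-delimited with the header shown in the example output. The function name `contigs_above_cutoff` is hypothetical, and the cutoff of 0.05 is an arbitrary placeholder, not a recommended value.

```python
import csv
import io


def contigs_above_cutoff(tsv_text, cutoff):
    """Return (Collection, Contig, score) tuples with score >= cutoff."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    for row in reader:
        score = float(row["Deepmased_score"])
        if score >= cutoff:
            flagged.append((row["Collection"], row["Contig"], score))
    return flagged


# A few rows copied from the example output above (assumed tab-delimited).
example = (
    "Collection\tContig\tDeepmased_score\n"
    "0\tNODE_1156_length_5232_cov_4.046938\t0.0007264018\n"
    "0\tNODE_4288_length_1225_cov_3.235897\t0.070887744\n"
    "1\tk141_9081\t8.8751316e-05\n"
)

# The cutoff here is a placeholder; choose one based on the paper.
for collection, contig, score in contigs_above_cutoff(example, cutoff=0.05):
    print(collection, contig, score)
```

To apply this to real output, read the contents of deepmased_predictions.tsv and pass the text to the function.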

Creating training datasets with DeepMAsED-SM

This is useful for training DeepMAsED-DL with a custom train/test dataset (e.g., just biome-specific taxa).

Input

  • A table listing reference genomes. Two possible formats:
    • Genome-accession: <Taxon>\t<Accession>
      • "Taxon" = the species/strain name
      • "Accession" = the NCBI genbank genome accession
      • The genomes will be downloaded based on the accession
    • Genome-fasta: <Taxon>\t<Fasta>
      • "Taxon" = the species/strain name of the genome
      • "Fasta" = the fasta of the genome sequence
      • Use this option if you already have the genome fasta files (uncompressed or gzip'ed)
  • The snakemake config file (e.g., config.yaml). This includes:
    • Config params on MG communities
    • Config params on assemblers & parameters

The column order for the tables doesn't matter, but the column names must be exact.
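Because the column names must match exactly, it can help to check the header of a genome table before starting the pipeline. This is a minimal sketch of such a check and is not part of DeepMAsED-SM. The function name `genome_table_format` and the example rows are hypothetical, and the sketch assumes the table is tab-delimited with the column names listed above.

```python
import csv
import io


def genome_table_format(tsv_text):
    """Return which input format a genome table header matches.

    Column order is ignored, but names are compared exactly
    (case-sensitive), mirroring the rule stated above.
    """
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"), [])
    if "Taxon" not in header:
        raise ValueError("missing required column: Taxon")
    if "Accession" in header:
        return "Genome-accession"
    if "Fasta" in header:
        return "Genome-fasta"
    raise ValueError("expected an 'Accession' or 'Fasta' column")


# Hypothetical table; note that the column order differs from the docs.
example = (
    "Fasta\tTaxon\n"
    "/path/to/genomeA.fna\tExample taxon A\n"
    "/path/to/genomeB.fna.gz\tExample taxon B\n"
)
print(genome_table_format(example))
```

A header such as `taxon<tab>fasta` (lowercase) would raise a ValueError in this sketch, since the names are not an exact match.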

Running locally

See the "Setup" section above for snakemake installation instructions.

cd ./DeepMAsED-SM/

Edit the config.yaml file as needed (e.g., changing input & output paths).

snakemake --use-conda -j <NUMBER_OF_THREADS> --configfile <MY_CONFIG.yaml_FILE>

Running on SGE cluster

./snakemake_sge.sh <MY_CONFIG.yaml_FILE> cluster.json <PATH_FOR_SGE_LOGS> <NUMBER_OF_PARALLEL_JOBS> [additional snakemake options]

It should be rather easy to update the code to run on other cluster architectures. See the following resources for help:

Output

The output will be the same as for feature generation, but with extra directories:

  • ./output/genomes/
    • Reference genomes
  • ./output/MGSIM/
    • Simulated metagenomes
  • ./output/assembly/
    • Metagenome assemblies
  • ./output/true_errors/
    • Metagenome assembly errors determined by using the references
  • ./output/map/
    • Feature tables for each simulation

DeepMAsED-DL

Main interface: DeepMAsED -h

DeepMAsED [train|predict] can be run without GPUs, but they will be substantially slower.

Predicting with existing model

See DeepMAsED predict -h

Training a new model

See DeepMAsED train -h

Evaluating a model

See DeepMAsED evaluate -h

Creating features for predict

See DeepMAsED features -h

Features table

  • Basic info
    • assembler
      • metagenome assembler used
    • contig
      • contig ID
    • position
      • position on the contig (bp)
    • ref_base
      • nucleotide at that position on the contig
  • Extracted from the bam file
    • num_query_A
      • number of reads mapping to that position with 'A'
    • num_query_C
      • number of reads mapping to that position with 'C'
    • num_query_G
      • number of reads mapping to that position with 'G'
    • num_query_T
      • number of reads mapping to that position with 'T'
    • num_SNPs
      • number of SNPs at that position
    • coverage
      • number of reads mapping to that position
    • num_discordant
      • number of discordant reads, as defined by the read mapper
    • num_supplementary
      • number of reads mapping to that position where the alignment is supplementary
      • see the samtools docs for more info
    • num_secondary
      • number of reads mapping to that position where the alignment is secondary
      • see the samtools docs for more info
  • MetaQUAST info
    • Extensive_misassembly
      • the "extensive misassembly" classification set by MetaQUAST
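Since the features table has one row per contig position, per-contig summaries can be derived by grouping rows on the contig column. The sketch below is illustrative only: it assumes a tab-delimited table with a header using the column names listed above and integer counts in the coverage and num_SNPs columns. The function name `summarize_contigs` and the example rows are made up.

```python
import csv
import io
from collections import defaultdict


def summarize_contigs(tsv_text):
    """Return {contig: {"mean_coverage": float, "num_SNPs": int}}."""
    sums = defaultdict(lambda: {"positions": 0, "coverage": 0, "num_SNPs": 0})
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        entry = sums[row["contig"]]
        entry["positions"] += 1  # one row per position on the contig
        entry["coverage"] += int(row["coverage"])
        entry["num_SNPs"] += int(row["num_SNPs"])
    return {
        contig: {
            "mean_coverage": entry["coverage"] / entry["positions"],
            "num_SNPs": entry["num_SNPs"],
        }
        for contig, entry in sums.items()
    }


# Made-up rows using a subset of the documented columns.
example = (
    "contig\tposition\tcoverage\tnum_SNPs\n"
    "contig_1\t0\t10\t0\n"
    "contig_1\t1\t14\t1\n"
    "contig_2\t0\t3\t0\n"
)
print(summarize_contigs(example))
```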