Deep learning for Metagenome Assembly Error Detection (DeepMAsED)
"mased"
Middle English term: misled, bewildered, amazed, or perplexed
The tool is divided into two main parts:
- DeepMAsED-SM
- A snakemake pipeline for generating DeepMAsED train/test datasets from reference genomes
- DeepMAsED-DL
- A python package for misassembly detection via deep learning
The simplest approach is to use the conda recipe:
conda create -n deepmased bioconda::deepmased
- [If needed] Install miniconda (or anaconda)
- See the conda create line in the .travis.yml file.
- If just using DeepMAsED-SM:
conda create -n snakemake conda-forge::pandas bioconda::snakemake
pytest -s
- Via setup.py:
python setup.py install
- Via pip:
pip install DeepMAsED
You need to have the following input:
- fasta of metagenome assembly contigs (uncompressed)
- BAM file of metagenome reads mapped to the contigs
If you have multiple sets of contigs (e.g., MAGs) and BAM files, you must specify which contigs go with which BAM files.
Create a tab-delimited table with the columns bam<tab>fasta (header required).
This will be your bam_fasta_table, which is needed for creating the features.
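As a minimal sketch, the table can be written with Python's csv module. The file paths and the write_bam_fasta_table helper are hypothetical; only the bam and fasta column names and the tab delimiter come from the description above.

```python
import csv

def write_bam_fasta_table(pairs, out_path):
    """Write a tab-delimited table with the required 'bam' and 'fasta' header."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["bam", "fasta"])  # header row is required
        for bam, fasta in pairs:
            writer.writerow([bam, fasta])

# Hypothetical paths: each BAM is paired with the contigs its reads were mapped to
pairs = [
    ("sample1.bam", "sample1_contigs.fasta"),
    ("sample2.bam", "sample2_contigs.fasta"),
]
write_bam_fasta_table(pairs, "bam_fasta_table.tsv")
```

The resulting path can then be passed as $bam_fasta_table.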
DeepMAsED features $bam_fasta_table
This generates one or more feature tables and a table listing all output files
(the "feature_file_table"). This feature_file_table will be the input
for predict.
DeepMAsED predict $feature_file_table
...where feature_file_table is the path to a table that lists
all feature files (see above).
--force-overwrite forces the re-creation of the pkl files, which is a bit slower
but can prevent issues.
Change --save-path to set the output directory.
Use --cpu-only to use CPUs instead of a GPU.
By default, the predictions will be written to deepmased_predictions.tsv.
Collection Contig Deepmased_score
0 NODE_1156_length_5232_cov_4.046938 0.0007264018
0 NODE_1563_length_3868_cov_5.851298 0.03783685
0 NODE_4288_length_1225_cov_3.235897 0.070887744
1 k141_9081 8.8751316e-05
1 k141_2594 6.720424e-05
1 k141_4878 0.0015754104
2 NODE_5204_length_1290_cov_3.283401 0.00036007166
2 NODE_2848_length_2164_cov_2.982456 0.0005029738
2 NODE_446_length_6027_cov_5.812291 0.068261534
See Mineeva et al., 2020 to help decide what score cutoff is prudent for classifying misassembled contigs.
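Once a cutoff has been chosen, applying it to the predictions table might look like the sketch below. It assumes the file is tab-delimited with the three column names shown above, and that higher scores indicate a more likely misassembly. The flag_contigs helper is hypothetical and the cutoff value is an arbitrary illustration, not a recommendation.

```python
import csv
import io

def flag_contigs(handle, cutoff):
    """Return (collection, contig, score) for rows scoring at or above cutoff."""
    flagged = []
    for row in csv.DictReader(handle, delimiter="\t"):
        score = float(row["Deepmased_score"])
        if score >= cutoff:
            flagged.append((row["Collection"], row["Contig"], score))
    return flagged

# A few rows from the example output, standing in for deepmased_predictions.tsv
example = io.StringIO(
    "Collection\tContig\tDeepmased_score\n"
    "0\tNODE_1563_length_3868_cov_5.851298\t0.03783685\n"
    "1\tk141_9081\t8.8751316e-05\n"
    "2\tNODE_446_length_6027_cov_5.812291\t0.068261534\n"
)
CUTOFF = 0.05  # arbitrary value for illustration only
flagged = flag_contigs(example, CUTOFF)
for collection, contig, score in flagged:
    print(collection, contig, score)
```

To run this on real output, replace the StringIO object with an open file handle on the predictions file.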
This is useful for training DeepMAsED-DL
with a custom
train/test dataset (e.g., just biome-specific taxa).
- A table listing reference genomes. Two possible formats:
- Genome-accession:
<Taxon>\t<Accession>
- "Taxon" = the species/strain name
- "Accession" = the NCBI genbank genome accession
- The genomes will be downloaded based on the accession
- Genome-fasta:
<Taxon>\t<Fasta>
- "Taxon" = the species/strain name of the genome
- "Fasta" = the fasta of the genome sequence
- Use this option if you already have the genome fasta files (uncompressed or gzip'ed)
- The snakemake config file (e.g., config.yaml). This includes:
- Config params on MG communities
- Config params on assemblers & parameters
The column order for the tables doesn't matter, but the column names must be exact.
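Because column order is free but names are strict, a quick header check before launching the pipeline can catch mistakes early. The following is a sketch only: the genome_table_format helper is hypothetical, and it assumes extra columns are tolerated, which the text above does not confirm.

```python
import csv
import io

ACCESSION_COLUMNS = {"Taxon", "Accession"}
FASTA_COLUMNS = {"Taxon", "Fasta"}

def genome_table_format(handle):
    """Identify the table format from its header, ignoring column order."""
    header = set(next(csv.reader(handle, delimiter="\t")))
    if ACCESSION_COLUMNS <= header:
        return "Genome-accession"
    if FASTA_COLUMNS <= header:
        return "Genome-fasta"
    raise ValueError(f"Unrecognized genome table columns: {sorted(header)}")

# Hypothetical table with the columns in swapped order
table = io.StringIO("Fasta\tTaxon\ngenomes/ecoli.fna.gz\tEscherichia coli\n")
fmt = genome_table_format(table)
print(fmt)
```

A header such as taxon<tab>fasta (lowercase) would raise a ValueError here, since the match is case-sensitive.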
See the "Setup" section above for snakemake installation instructions.
cd ./DeepMAsED-SM/
Edit the config.yaml file as needed (e.g., changing input & output paths)
snakemake --use-conda -j <NUMBER_OF_THREADS> --configfile <MY_CONFIG.yaml_FILE>
./snakemake_sge.sh <MY_CONFIG.yaml_FILE> cluster.json <PATH_FOR_SGE_LOGS> <NUMBER_OF_PARALLEL_JOBS> [additional snakemake options]
It should be rather easy to update the code to run on other cluster architectures. See the following resources for help:
The output will be the same as for feature generation, but with extra directories:
./output/genomes/
- Reference genomes
./output/MGSIM/
- Simulated metagenomes
./output/assembly/
- Metagenome assemblies
./output/true_errors/
- Metagenome assembly errors determined by using the references
./output/map/
- Feature tables for each simulation
Main interface: DeepMAsED -h
DeepMAsED [train|predict] can be run without GPUs, but they will be substantially slower.
See DeepMAsED predict -h
See DeepMAsED train -h
See DeepMAsED evaluate -h
See DeepMAsED features -h
- Basic info
- assembler
- metagenome assembler used
- contig
- contig ID
- position
- position on the contig (bp)
- ref_base
- nucleotide at that position on the contig
- Extracted from the bam file
- num_query_A
- number of reads mapping to that position with 'A'
- num_query_C
- number of reads mapping to that position with 'C'
- num_query_G
- number of reads mapping to that position with 'G'
- num_query_T
- number of reads mapping to that position with 'T'
- num_SNPs
- number of SNPs at that position
- coverage
- number of reads mapping to that position
- num_discordant
- discordant reads according to the read mapper definition
- num_supplementary
- number of reads mapping to that position where the alignment is supplementary
- see the samtools docs for more info
- num_secondary
- number of reads mapping to that position where the alignment is secondary
- see the samtools docs for more info
- MetaQUAST info
- Extensive_misassembly
- the "extensive misassembly" classification set by MetaQUAST
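To illustrate how the per-position columns above can be used, the sketch below computes mean coverage per contig. It assumes a tab-delimited feature table with a header row using the column names listed above; real feature files may be compressed or laid out differently. The mean_coverage_per_contig helper and the toy rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def mean_coverage_per_contig(handle):
    """Average the per-position 'coverage' values for each contig."""
    totals = defaultdict(lambda: [0, 0])  # contig -> [coverage sum, position count]
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = totals[row["contig"]]
        entry[0] += int(row["coverage"])
        entry[1] += 1
    return {contig: total / n for contig, (total, n) in totals.items()}

# Toy rows (not real data) with a subset of the columns described above
toy_features = io.StringIO(
    "assembler\tcontig\tposition\tref_base\tcoverage\n"
    "megahit\tcontig_1\t0\tA\t10\n"
    "megahit\tcontig_1\t1\tC\t14\n"
    "megahit\tcontig_2\t0\tG\t7\n"
)
means = mean_coverage_per_contig(toy_features)
print(means)
```

The same pattern extends to other count columns, such as num_SNPs or num_discordant.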