Skip to content

treangenlab/emu

Repository files navigation

Emu: species-level taxonomic abundance for full-length 16S reads

Description

Emu is a relative abundance estimator for 16S genomic sequences. The method is optimized for error-prone full-length reads, but can also be utilized for short-read data.

Demo

Calculate relative abundances for Oxford Nanopore Technologies single-end 16S reads:

emu abundance example/full_length.fa

Calculate relative abundances for short paired-end 16S data:

emu abundance --type sr example/short_read_f.fq example/short_read_r.fq

Calculate relative abundances for short single-end 16S data:

emu abundance --type sr example/short_read_f.fq

Expected output: each of the above commands is expected to create a relative abundance .tsv file in a ./results folder. Output names are: full_length_rel-abundance.tsv, short_read_f-short_read_r_rel-abundance.tsv, short_read_f_rel-abundance.tsv.

Expected run time: each of the above commands is expected to complete successfully in a couple seconds.

Installation

1. Download database

Define <path_to_database> as your desired database directory. If you desire a different database, skip this step and follow steps below in Build Custom Database. Our databases are stored on OSF. To download from the command line, you will first need to install osfclient. Alternatively, you can download directly from the website.

pip install osfclient
export EMU_DATABASE_DIR=<path_to_database>
cd ${EMU_DATABASE_DIR}
osf -p 56uf7 fetch osfstorage/emu-prebuilt/emu.tar
tar -xvf emu.tar

** Note Emu v3.0+ database requirements differ from previous versions. Check you are using the appropriate database for the version you are running. Both databases contain identical information: a combination of rrnDB v5.6 and NCBI 16S RefSeq from 17 September, 2020. Taxonomy is also from NCBI on the same date. The resulting database contains 49,301 sequences from 17,555 unique bacterial and archaeal species.

2. Activate appropriate conda environment

Emu requires Python version to be >=3.6.

Option A: Create new Conda environment
conda create --name py37 python=3.7 
conda activate py37
Option B: Set Python version in current environment
conda install python=3.7
3. Install Emu
Option A: Install Emu via conda

Add the bioconda channel and install Emu.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install emu

Although not required, we recommend using minimap2 version >=2.22 to avoid memory leak bug with highly repetitive sequences.

Option B: Create local Emu conda environment

If you are unable to install Emu via conda as described above, an alternative approach is to install conda and create an environment that supports Emu. The default name of the environment created is emu, but this can be configured in the environment.yml file if desired. The environment will need to be activated before Emu can be run.

conda env create -f environment.yml
conda activate custom-emu

NOTE: with this installation method, all commands will need to be run with <path to emu> instead of emu. For example, in the directory containing the Emu script, a working command would be:

./emu abundance example/full_length.fa

Each step of the installation process is expected to take a matter of seconds.

Abundance Estimation Parameters

Command Default Description
--type map-ont denote sequencer [short-read:sr, Pac-Bio:map-pb, ONT:map-ont, Nanopore Q20:lr:hq, PacBio HiFi:map-hifi, traditional cDNA:splice:hq]
--min-abundance 0.0001 generates results with species relative abundance above this value in addition to full results; .01 = 1%
--db $EMU_DATABASE_DIR path to emu database; directory must include the following files: names_df.tsv, nodes_df.tsv, species_taxid.fasta, unqiue_taxids.tsv
--N 50 max number of alignments utilized for each read in minimap2
--K 500M minibatch size for mapping in minimap2
--mm2-forward-only FALSE force minimap2 to consider the forward transcript strand only (for long mRNA/cDNA reads)
--output-dir ./results directory for output results
--output-basename stem of input_file(s) basename of all output files saved in output-dir; default utilizes basename from input file(s)
--keep-files FALSE keep working files in output-dir ( alignments [.sam], reads of specied length [.fa])
--keep-counts FALSE include estimated read counts for each species in output*
--keep-read-assignments FALSE output .tsv file with read assignment distributions: each row as an input read; each entry as the likelihood it is dervied from that taxa (taxid is the column header); each row sums to 1
--output-unclassified FALSE generate two additional sequence files of unmapped and unclassified mapped input reads**
--threads 3 number of threads utilized by minimap2

*Estimated read counts are based on likelihood probabilities and therefore may not be integer values. They are calculated as the product of estimated relative abundance and total classified reads.

**Here, "unmapped reads" are reads that did not result in a mapping to the provided database with minimap2. "Unclassified mapped reads" are those that mapped only to database sequences of species that are presumed to not be present in the sample by Emu's algorithm (likely due to low overall abundance).

Note: If you are experiencing heavy RAM consumption, first upgrade minimap2 to at least v2.22. If memory is still an issue, try decreasing the number of secondary alignments evaluated for each read (--N).

Build Custom Database

An emu database consists of 2 files:

Filename Description
taxonomy.tsv tab separated datasheet of database taxonomy lineages containing at columns: 'tax_id' and any taxonomic ranks (i.e. species, genus, etc)
species_taxid.fasta database sequences where each sequence header starts with the respective species-level tax id (or lowest level above species-level if missing) preceeding a colon [<species_taxid>:<remainder of header>]

A custom database can be built from either NCBI taxonomy (names.dmp and nodes.dmp files) or a .tsv file containing full taxonomy lineages. If the database if build from NCBI taxonomy, the database fasta sequences will be classified at the species level (or the lowest taxonomic rank provided if already classified above the species level). If taxonomy is provided as a list of lineage, database fasta sequences will be classified at the rank of the tax_id provided in the seq2taxid.map file. Therefore, when using direct taxonomy to build your custom database, we recommend providing a seq2taxid.map at the species-level or higher.

The following files are required to build a custom database:

Command file(s) Description
--sequences database.fasta nucleotide sequences
--seq2tax seq2tax.map.tsv headerless two column tab-separated values, where each row contains (1) sequence header in database.fasta and (2) corresponding tax id
either taxonomy option:
--ncbi-taxonomy names.dmp & nodes.dmp directory containing both names.dmp and nodes.dmp files in NCBI taxonomy format and named accordingly
--taxonomy-list input taxonomy.tsv a .tsv file containing complete taxonomic lineages. The first column MUST be the taxonomy ids. Remaining columns can be in any format, then Emu abundance output will match this format
emu build-database <db_name> --sequences <database.fasta> --seq2tax <seq2taxid.map> --ncbi-taxonomy <dir-to-names/nodes.dmp>
OR
emu build-database <db_name> --sequences <database.fasta> --seq2tax <seq2taxid.map> --taxonomy-list <taxonomy.tsv>

Example:

emu build-database zymo_assembled_db --sequences ./example_customdb/ex.fasta --seq2tax ./example_customdb/ex_seq2tax.map --ncbi-taxonomy ./example_customdb/
OR
emu build-database zymo_assembled_db --sequences ./example_customdb/ex.fasta --seq2tax ./example_customdb/ex_seq2tax.map --taxonomy-list ./example_customdb/ex-taxonomy.tsv
emu abundance ./example_customdb/ex.fasta --db ./zymo_assembled_db --threads 3

If preferred, user can define custom database through the shell variable rather than specifying with each run at the command line.

export EMU_DATABASE_DIR=./zymo_assembled_db
emu abundance ./example_customdb/ex.fasta

Note for NCBI-taxonomy created database: If your taxonomy is missing species-level information, a “pseudo” species will be reported as “unclassified <genus>” where <genus> is the labeled genus in the taxonomic lineage. If genus-level classification is also missing in the lineage, this process will continue moving up the taxonomic lineage until a specified label (<taxa>) is detected. Then, "unclassified <taxa>" will be reported as the species classification instead.

Alternative Databases

Our pre-built databases and files\scripts used to construct the databases are stored on OSF. You can download from our OSF UI or from the command line using osfclient. If downloading from the command line, first define the EMU_PREBUILT_DB variable.

Database Command
RDP v11.5 with NCBI taxonomy has been pre-built for Emu v3.0+. export EMU_PREBUILT_DB='rdp'
SILVA v138.1 has been pre-built for Emu v3.0+ from the DADA2 SILVA species-level database. export EMU_PREBUILT_DB='silva'
UNITE general fasta v8.3 fungi has been pre-built for Emu v3.0+. This database has not yet been tested or validated with Emu. export EMU_PREBUILT_DB='unite-fungi'
UNITE general fasta v8.3 all eukaryotes has been pre-built for Emu v3.0+. This database has not yet been tested or validated with Emu. export EMU_PREBUILT_DB='unite-all'

Then run the following commands:

pip install osfclient
export EMU_DATABASE_DIR=<path_to_database>
cd ${EMU_DATABASE_DIR}
osf -p 56uf7 fetch osfstorage/emu-prebuilt/${EMU_PREBUILT_DB}.tar
tar -xvf ${EMU_PREBUILT_DB}.tar

Collapse Taxonomy

The collapse-taxonomy function can be used to generate an additional relative abundance profile from an emu output <.tsv> file, collapsed at the desired taxonomic rank. The input <.tsv> file must contain a unique column for each taxonomic rank (such as those constructed by the emu default database or the preconstructed Silva database with taxonomy_split, etc). The collapsed taxonomy file is output to the same folder as the input file, with filename:<input_file>-<rank>.tsv. Accepted ranks: ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom']

emu collapse-taxonomy <file_path> <rank>

Combine Outputs

The combine-outputs function can be used to create a single table containing all Emu output relative abundances in a single directory. Note this function will select all the .tsv files in the provided directory that contain 'rel-abundance' in the filename. Combined table will only include all ranks above the specified rank according to this list: [tax_id, species, genus, family, order, class, phylum, superkingdom]. Specified rank must be in this list and in each of the included Emu outputs. Combined table will be created in the provided directory path with the file name: emu-combined-<rank>.tsv. In order to include tax_id in your output, specific <rank> as "tax_id".

emu combine-outputs <directory_path> <rank>

Optional additional parameters:

Command Description
--split-tables output 2 tables: (1) abundances only at specified rank and (2) taxonomic lineages down to specified rank
--counts output estimated counts rather than relative abundance percentage in combined table. Only includes Emu relative abundance outputs that already have 'estimated counts'

System Requirements

All software dependencies are listed in environment.yml. Emu v3.0.0 has been tested on Python v3.7 and used to generate results in manuscript.

Emu Manuscript

Publication: Kristen D. Curry et al., “Emu: Species-Level Microbial Community Profiling of Full-Length 16S RRNA Oxford Nanopore Sequencing Data,” Nature Methods, June 30, 2022, 1–9, https://doi.org/10.1038/s41592-022-01520-4. Repository for reproduction of results in manuscript: Emu-benchmark

Database Citations

Please use citations below if any of the pre-contructed databases are utilized:

Emu default database
  • Stoddard, S. F., Smith, B. J., Hein, R., Roller, B. R. & Schmidt, T. M. (2015) rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucl. Acids Res. 43, D593–D598.
  • O’Leary, N. A. et al. (2016) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucl. Acids Res. 44, D733–D745.
  • Schoch, C. L. et al. (2020) NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford) 2020.
RDP
  • Cole, J. R. et al. (2014) Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucl. Acids Res. 42, D633–D642.
  • Schoch, C. L. et al. (2020) NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford) 2020.
SILVA
  • Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl. Acids Res. 41 (D1): D590-D596.
  • Yilmaz P, Parfrey LW, Yarza P, Gerken J, Pruesse E, Quast C, Schweer T, Peplies J, Ludwig W, Glöckner FO (2014) The SILVA and "All-species Living Tree Project (LTP)" taxonomic frameworks. Nucl. Acids Res. 42:D643-D648
  • Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13:581–583. doi:10.1038/nmeth.3869
UNITE
  • Nilsson, R. H. et al. (2019) The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications. Nucleic Acids Res 47, D259–D264.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages