We are happy to announce anvi'o v8, with the code name "marie"!
After about 4,200 changes that introduced over 36,000 new lines of code, this stable release of anvi'o represents a significant advancement over v7. It introduces many new features for integrated studies of microbial metabolism, genomic inversions, and the phylogeography of proteins, along with performance improvements and fixes for known bugs.
This page intends to give you a summary of some of the notable changes that come with marie.
The code name recognizes Marie Tharp, an American geologist and oceanographic cartographer who made immense contributions to earth sciences. Marie was a pioneer in our understanding of oceans as she created the first map of the Atlantic seafloor with her colleague Bruce Heezen [1]. Her work showed that the bottom of our oceans was not only flat sediment but was also covered with canyons, ridges, and mountain ranges that spanned over 65,000 kilometers around the globe. Marie's revolutionary work emerged from her interpretation of data she was not allowed to collect, since women were not allowed to be on ships during the 1950s. Marie compiled her physiographic diagrams from the data Bruce Heezen was able to collect [2]. She did not step on a ship until 1968, and the early evidence she had for seafloor features was initially dismissed as 'girl talk' [3].
[1] https://en.wikipedia.org/wiki/Marie_Tharp
[2] https://www.lyellcollection.org/doi/abs/10.1144/GSL.SP.2002.192.01.11
[3] https://www.youtube.com/watch?v=gsQGOJtwdv0
The code name was a suggestion by Zena Cardman, a marine microbiologist and a NASA astronaut. The release notes were written by Meren, Iva Veseli, and Matt Schechter, who are among the developers of anvi'o. The notes were proofread by Katy Lambert-Slosarska, who is an MSc student at the International Max Planck Research School of Marine Microbiology (MarMic).
New anvi'o programs, artifacts, and workflows
The new version of anvi'o comes with a few new programs:
- anvi-compute-functional-enrichment-across-genomes
- anvi-compute-functional-enrichment-in-pan
- anvi-compute-metabolic-enrichment
- anvi-delete-functions
- anvi-display-functions
- anvi-get-codon-usage-bias
- anvi-get-metabolic-model-file
- anvi-get-pn-ps-ratio
- anvi-get-tlen-dist-from-bam
- anvi-merge-trnaseq
- anvi-plot-trnaseq
- anvi-profile-blitz
- anvi-reaction-network
- anvi-report-inversions
- anvi-run-cazymes
- anvi-search-palindromes
- anvi-search-primers
- anvi-search-sequence-motifs
- anvi-setup-cazymes
- anvi-setup-kegg-data
- anvi-setup-modelseed-database
- anvi-setup-user-modules
- anvi-summarize-blitz
- anvi-tabulate-trnaseq
- anvi-script-as-markdown
- anvi-script-compute-bayesian-pan-core
- anvi-script-estimate-metabolic-independence
- anvi-script-filter-hmm-hits-table
- anvi-script-gen-function-matrix-across-genomes
- anvi-script-gen-functions-per-group-stats-output
- anvi-script-gen-genomes-file
- anvi-script-gen-user-module-file
- anvi-script-permute-trnaseq-seeds
And a few new artifacts:
- bam-stats-txt
- bams-and-profiles-txt
- cazyme-data
- contig-inspection
- dna-sequence
- enzymes-list-for-module
- enzymes-txt
- external-structures
- functions-across-genomes-txt
- gene-cluster-inspection
- hmm-hits-across-genomes-txt
- hmm-list
- inversions-txt
- markdown-txt
- metabolic-independence-score
- modifications-txt
- paired-end-fastq
- palindromes-txt
- primers-txt
- quick-summary
- reaction-network
- reaction-network-json
- reaction-ref-data
- seeds-non-specific-txt
- seeds-specific-txt
- trnaseq-contigs-db
- trnaseq-plot
- trnaseq-profile-db
- trnaseq-seed-txt
- user-metabolism
- user-modules-data
- variability-profile-xml
In addition, this release makes available three new Snakemake workflows that are accessible via the anvi'o program anvi-run-workflow: trnaseq, ecophylo, and sra_download.
A new subsystem for metabolic modeling
Among the biggest news in this release is the set of programs anvi'o now includes for metabolic modeling. These programs are emerging as a by-product of collaborative projects in C-CoMP, the Center for Chemical Currencies of a Microbial Planet, and under the leadership of Samuel Miller.
Using the integrated anvi'o metabolic modeling subsystem, one can generate a biochemical reaction network suitable for metabolic modeling from the annotations in a genome or a pangenome using the new program anvi-reaction-network. This works on both individual genomes (using a contigs-db) and pangenomes (using a genomes-storage-db). The resulting network is stored in the corresponding anvi'o database for programmatic access, and can be exported into a JSON file for inspection and downstream usage (e.g., as input into a program for flux-balance analysis) via another new program, anvi-get-metabolic-model-file.
These programs rely on KEGG Orthology (KO) annotations of protein-coding genes and reference data in the ModelSEED Biochemistry database, which can be downloaded and set up on your computer using the programs anvi-setup-kegg-data and anvi-setup-modelseed-database, respectively.
For additional information, please see PRs #2058, #2072, and #2123.
Substantial improvements to metabolic pathway prediction in anvi'o
Anvi'o metabolism offers a full suite of integrated tools to study metabolism in microbial genomes and metagenomes, and multiple recent papers from our group (i.e., by Watson et al. and Veseli et al.) propelled a series of improvements thanks to work by Iva Veseli. We hope the improvements summarized below will also help anvi'o users at large.
Improved data download and processing
Multiple aspects of anvi'o now rely on data from KEGG, so we decided to revamp how we download it. The old program anvi-setup-kegg-kofams has been replaced by a new program, anvi-setup-kegg-data. This program has multiple modes for downloading KOfam profiles, KEGG MODULE data, KEGG BRITE hierarchies (PR #1910), and modeling data for anvi-reaction-network. It can be multi-threaded for faster downloads.
However, for most users we recommend the default usage of this program, which downloads a pre-processed snapshot of everything you need for downstream programs working on this data. Please see anvi-setup-kegg-data.
Improvements in pathway prediction
The metabolism framework in anvi'o has undergone a lot of changes in the past year, with the addition of several notable features mainly concerning the use of the program anvi-estimate-metabolism:
"Stepwise" metrics: a new strategy for interpreting metabolic pathway definitions
As of PR #1927, we've added a new way of interpreting metabolic modules which affects how metrics like completeness and copy number are calculated. This strategy is called 'stepwise' interpretation because it considers only the major, non-redundant steps in a metabolic module. In this method, alternative enzymes, or in some cases alternative series of enzymes, are evaluated as one entity. Stepwise metrics may be appropriate for those interested in summarizing generic metabolic capacity with less focus on the specific enzymes that are required.
The former method of interpreting pathways is now referred to as the 'pathwise' strategy because it involves deconstructing the module definition into all possible unique combinations of enzymes required to catalyze the reactions in the metabolic pathway (so it considers all possible 'paths' through the module). Metrics are still calculated using this strategy and are labeled with the term 'pathwise' to distinguish them from the stepwise metrics.
You can find a description of these two strategies, along with examples, here.
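To make the distinction concrete, here is a minimal Python sketch of the two strategies on a toy module. It is not the anvi'o implementation: the module layout (a list of steps, each a list of alternatives, each a list of jointly required enzymes), the function names, and the enzyme accessions are all assumptions made for illustration. It also assumes that pathwise completeness is the best fraction of enzymes found along any single path, and that a step counts as complete only when one of its alternatives is fully present.

```python
from itertools import product

# Hypothetical toy module (not a real KEGG module):
# step 1 needs K00001; step 2 needs K00002 OR (K00003 AND K00004); step 3 needs K00005.
MODULE = [
    [["K00001"]],
    [["K00002"], ["K00003", "K00004"]],
    [["K00005"]],
]


def pathwise_completeness(module, present):
    """Best fraction of enzymes present along any one path through the module."""
    best = 0.0
    # Each path picks exactly one alternative per step.
    for combo in product(*module):
        path = [enzyme for alternative in combo for enzyme in alternative]
        fraction = sum(enzyme in present for enzyme in path) / len(path)
        best = max(best, fraction)
    return best


def stepwise_completeness(module, present):
    """Fraction of steps for which at least one alternative is fully present."""
    complete_steps = sum(
        any(all(enzyme in present for enzyme in alternative) for alternative in step)
        for step in module
    )
    return complete_steps / len(module)


present = {"K00001", "K00003"}  # enzymes annotated in a hypothetical genome
pathwise = pathwise_completeness(MODULE, present)
stepwise = stepwise_completeness(MODULE, present)
print(f"pathwise={pathwise:.2f} stepwise={stepwise:.2f}")
```

In this toy case the two numbers differ because K00003 contributes to a path under the pathwise view, while under the stepwise view the second step remains incomplete until one of its alternatives is fully satisfied.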
Calculation of pathway copy number
This release also introduces a redundancy metric for metagenome-wide analyses: pathway copy number. This metric can be added to your output files using the --add-copy-number flag, and will be calculated using both the pathwise and the stepwise strategies. This metric may be most appropriate when your input data represents a multitude of organisms (as in when you input a metagenome without using the --metagenome-mode flag).
In our documentation, you can find an explanation of the pathwise copy number calculation and the stepwise copy number calculation. This feature was added in PR #1927.
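As a rough illustration of the stepwise variant only, the Python sketch below computes a copy number for a toy module from per-enzyme hit counts. It rests on assumptions that should be checked against the documentation: enzymes required together are limited by the rarest one (minimum), alternatives within a step add up (sum), and the module is limited by its scarcest step (minimum). The module layout, function name, and accessions are hypothetical, and the pathwise calculation is not covered here.

```python
# Hypothetical toy module: a list of steps, each a list of alternatives,
# each alternative a list of enzymes that are all required together.
MODULE = [
    [["K00001"]],
    [["K00002"], ["K00003", "K00004"]],
    [["K00005"]],
]


def stepwise_copy_number(module, counts):
    """Copy number of a module under the assumed min/sum/min rules."""
    step_copies = []
    for step in module:
        # An alternative can only be used as often as its rarest enzyme occurs;
        # alternatives are interchangeable, so their copies add up.
        copies = sum(
            min(counts.get(enzyme, 0) for enzyme in alternative)
            for alternative in step
        )
        step_copies.append(copies)
    # The whole module is limited by its scarcest step.
    return min(step_copies)


# Hypothetical number of annotated genes per enzyme in a metagenome.
counts = {"K00001": 2, "K00002": 1, "K00003": 1, "K00004": 3, "K00005": 2}
print(stepwise_copy_number(MODULE, counts))
```

With these counts, every step can be carried out twice, so the sketch reports two copies; removing all hits for any single-enzyme step would drop the result to zero.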
User-defined metabolic modules
anvi-estimate-metabolism now has the ability to work with user-defined metabolic pathways based on arbitrary functional annotation sources as of PR #1867.
Users wishing to define their own metabolic modules can use the new anvi'o artifact, user-modules-data. The files can either be written manually or generated via the script anvi-script-gen-user-module-file (see PR #1872). The program anvi-setup-user-modules can then convert these module files into a database that can be used with anvi-estimate-metabolism via the --user-modules flag, as described here.
To support the use of arbitrary HMMs as an annotation source for user-defined metabolic modules, the program anvi-run-hmms now has a flag called --add-to-functions-table, which causes any HMM hits to be stored as functional annotations. See here for details.
Miscellaneous updates
Beyond the major features described above, there are a few miscellaneous changes to the metabolism codebase.
- You no longer have to rely on having contigs databases as input to anvi-estimate-metabolism. Thanks to help from Antonio Fernandez-Guerra, this program now can accept a simple list of enzymes as input. See PR #1890 as well as this help section.
- The output options and formats for anvi-estimate-metabolism have changed. See this page for details. One new output feature that may particularly help with interpretation of these data is the addition of columns related to enzymes that are unique to a given metabolic module. These are described in PR #1867.
A new anvi'o workflow to study phylogeography of any gene family
Exploring the ecology and evolution of microbes across environments with metagenomic data is a common task for microbiologists. What if we applied this framework to gene families? The availability of large metagenomic datasets and fast computational biology toolsets provides us with a unique opportunity to explore the limits of gene diversity! To leverage this, Matthew Schechter led the development of the ecophylo workflow, which can simultaneously profile ecological and phylogenetic relationships between gene families and environments.
The final output of the ecophylo workflow is an interactive interface that includes (1) a phylogenetic analysis of all genes detected by the HMM in genomes and/or metagenomes, and (2) the distribution pattern of each of these genes across metagenomes if the user provided metagenomic short reads to survey.
For more details please see the ecophylo documentation.
A new anvi'o framework to identify genomic inversions and quantify their activity
Genetic variants can rapidly proliferate even in populations that grow from a single cell. One class of such variants emerges from 'inversions', a genetic phenomenon through which a microorganism can mediate the ON/OFF orientation of a promoter region regulating the expression of a downstream gene. Using paired-end short reads and quantifying their orientation upon mapping to a genomic context, one can identify and quantify inversions and their activities.
Thanks to Florian Trigodet's efforts, this version of anvi'o comes with a new program, anvi-report-inversions, to study inversions in genomes and metagenomes across environments and to quantify the relative proportion of each inversion orientation in each sample.
The anvi-report-inversions workflow will (1) find genomic regions of interest (based on short-read recruitment data), (2) find palindromic motifs in regions of interest (where the pair of inverted repeats (IR) that surround the inversion site is found), (3) confirm the inversion (by going back to the BAM file and making sure the IR is the true one among multiple potential IRs that may occur in the region of interest), (4) compute the inversion activity (using the raw R1/R2 sequences from FASTQ files to find support for activity), and (5) generate extensive reporting (including the genomic context and genes that surround the inversion site). These reports include a lot of information in text file outputs (see inversions-txt for details), as well as a static HTML output that does not require an anvi'o installation to browse.
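The idea behind step (2) can be sketched with a naive Python search for inverted repeats, that is, a short sequence followed downstream by its reverse complement. This is only an illustration under simplifying assumptions (exact matches, a fixed repeat length, no mismatches or scoring) and is not the algorithm used by anvi-report-inversions or anvi-search-palindromes. The function names, parameters, and the example sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq, k=5, min_gap=1):
    """Find exact inverted repeats of length k separated by at least min_gap bases.

    Returns a list of (left_start, right_start, k) tuples, where the k-mer at
    right_start is the reverse complement of the k-mer at left_start.
    """
    hits = []
    for i in range(len(seq) - k + 1):
        target = reverse_complement(seq[i:i + k])
        # Only look downstream of the left repeat, leaving room for the gap.
        j = seq.find(target, i + k + min_gap)
        while j != -1:
            hits.append((i, j, k))
            j = seq.find(target, j + 1)
    return hits


# Hypothetical region: GTACC ... GGTAC flank a 5-base spacer.
region = "AAGTACCTTTTTGGTACAA"
for left, right, length in find_inverted_repeats(region, k=5):
    print(left, region[left:left + length], right, region[right:right + length])
```

A real analysis also has to tolerate mismatches and distinguish the true IR pair from other candidates in the same region, which is why the workflow goes back to the read data in step (3).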
A new suite of programs to analyze Transfer RNA transcripts
Anvi'o now includes a comprehensive (yet very experimental) software framework to support the analysis of tRNA transcript sequencing (as demonstrated here). The 'tRex Tools', as Samuel Miller calls them, include new programs for the identification of tRNA sequences and their modification sites in tRNA-seq results. The primary output of tRex tools in anvi'o is a set of tRNA seeds, each of which represents a mature tRNA sequence (minus the 3’-CCA acceptor) from the input set of samples. These capabilities are implemented in a set of programs that can be run individually or as part of the tRNAseq workflow:
- The anvi-trnaseq program predicts tRNA sequences, structures, and modifications from a single tRNA-seq library
- The anvi-merge-trnaseq program combines the results across multiple tRNA-seq libraries and computes a final set of tRNA seeds (as well as their coverage across samples)
- To analyze the taxonomy associated with tRNA sequences, there are two programs to be run in sequence: anvi-run-trna-taxonomy and anvi-estimate-trna-taxonomy
- Finally, the program anvi-tabulate-trnaseq exports the tRNA-associated coverage and modification data as tab-delimited files
You can also generate nice plots of the tRNA seed coverages and modification sites with anvi-plot-trnaseq.
A new variant of the contigs database -- the trnaseq-contigs-db, which stores tRNA seeds instead of contigs -- and a new variant of the profile database -- the trnaseq-profile-db, which stores modification positions and both specific and non-specific coverage of tRNA seeds -- make the integration of these new data types possible.
This is quite an experimental workflow, and if you plan to use it, please get in touch with us.
Please follow the latest installation instructions at https://anvio.org/install/ and come to the anvi'o Discord channel if you have any questions or concerns, or simply to join our community.