Design Document

Welcome to Project 3: Microbiome Phylogenetic Tree Pipeline's Design Document

This project was compiled by Rita Mormando, Rohan Rajagopal, and Delaney Sauer as part of Dr. Wheeler's COMP 383/483 Computational Biology class.

All are part of the Bioinformatics Program at Loyola University Chicago.

Overview:

Metagenomics is an increasingly popular technique for investigating microbial communities. Metagenomic analysis obtains genomic sequences directly from various environments and mines that data to discover novel organisms. Characterizing such novel species relies on comparisons with similar microbial communities. The main way these comparisons are made is by placing the newly identified species into the tree of life. To do this, phylogenetic marker genes are processed and used to build phylogenetic trees.

Phylogenetic trees are widely used for genetic and evolutionary studies in various organisms. Once a new genome is sequenced, its taxonomic identity can be determined by inserting it into a prebuilt species tree. Advanced sequencing technologies have dramatically increased the data available for constructing phylogenetic trees from various data inputs. Several existing software platforms allow researchers to infer phylogenies, but most of them require users to manually download genomic data, clean and align sequences, or visualize the tree themselves. These laborious and time-consuming steps might impede or even skew tree reconstruction, especially when all the data must be entered by hand and the number of species is large. Hence, there is a clear need for an efficient pipeline that reduces the time required to build phylogenetic trees from various input data.

This project aims to develop a bioinformatics pipeline for microbiome data analysis. The input can be either a simple list of microbial species or entire microbial metagenomic reads, and the output is a phylogenetic tree of the data with branch lengths. The pipeline is intended to be a user-friendly, efficient alternative to existing software platforms.

Context:

Data analysis using phylogenetic trees has become a valuable tool for looking at taxonomic relationships between species. In metagenomic data analysis, the output of analyzing raw metagenomic sequences is usually a list of microbial taxa names. However, for further analysis and downstream use, it is imperative that the phylogenetic relationships are also available.

The problem we are trying to solve is that there is no efficient way to create a phylogenetic tree from just microbial species names or metagenomic reads. It can be done by pulling phylogenetic trees from another source (e.g. Tree of Life) or by constructing a tree from scratch, but both routes are tedious and require multiple steps and software packages. This project aims to close that gap by developing a functional pipeline that streamlines the process. The problem is of particular interest to the Dong and Gao labs of Loyola University Chicago’s Stritch School of Medicine, who need such trees to perform further downstream metagenomic analysis.

In this project, we will specifically try to output a phylogenetic tree given either microbial taxonomic names or FASTA files of specified metagenomes as input. We will gather the FASTA files for either input and create a local database within the pipeline itself.

There have been many extensive attempts to solve this particular problem, and although the work proposed here has many similarities to those existing efforts, it is a useful further contribution to phylogenetic tooling.

Goals & Non-Goals:

Build a quality pipeline that produces a phylogenetic tree from taxonomic names or raw metagenomic reads
Document the process with step-by-step instructions
Accept only taxonomic names or metagenomic FASTA files as input data; other input types are out of scope
Output only phylogenetic trees (with branch lengths) as the final product; other outputs are out of scope

Proposed Solution:

The goal is to construct a bioinformatics pipeline that can either extract the tree from other public resources (e.g., Tree of Life) using the microbial taxa names, or identify marker gene sequences from the metagenomic reads and build the phylogenetic tree from scratch.

Useful Background:

We found a number of resources that gave us a general understanding of other applications that have attempted to answer this question.

Interactive Tree of Life (iTOL)
- https://itol.embl.de/
NCBI’s Genome Database
- https://www.ncbi.nlm.nih.gov/genome/
PhySpeTree
- Input data: concatenated highly conserved proteins and small subunit ribosomal RNA sequences
- Output: a reconstructed phylogenetic species tree
- https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-019-1541-x
SNPhylo
- Input data: SNP datasets
- Output: a phylogenetic tree
- https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-162
ezTree
- Input data: uncultivated species retrieved from environmental samples
- Output: phylogenetic trees built from the marker genes
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5780852/

Pipeline Breakdown:

We will test the code with a sample data set (provided in the repo) containing both taxonomic species names and pre-downloaded FASTA files. We aim to use identical species for both input types to make sure that both methods produce the same result.

The pipeline is broken down into three steps: retrieval, alignment, and visualization. All steps are developed in Python and automated by one Bash shell script, so that advanced developers can easily extend the steps laid out here or integrate our work with other phylogenetic tools.

First, the pipeline will take in either a list of microbial taxonomic species names or microbial metagenomic reads. The goal here is to create two separate ‘modules’, one for each type of input data. If species names are given, the pipeline will choose the module that automatically handles sequence download, cleaning, alignment, and tree reconstruction. This module will also attempt to extend prebuilt trees by inserting new organisms whose genome annotations may be incomplete. To accomplish this, the pipeline will take the individual species names and retrieve each FASTA file by using the command line to create a local database with access to NCBI’s GenBank database. If, on the other hand, the input consists of FASTA files, the pipeline will choose a separate module that parses the data without needing to create a local database. This way the tree will be reconstructed directly from the sequences provided.
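As a minimal sketch of how the module choice and the name-based lookup could work in Python, the example below picks a module by checking whether the input starts with a FASTA header, and builds (but does not send) an NCBI E-utilities search URL for one species name. The function names, the `[Organism]` query term, and the choice of the `nucleotide` database are assumptions made for illustration; the pipeline's actual retrieval code may differ.

```python
from urllib.parse import urlencode

# Base address of NCBI's public E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def detect_input_kind(text):
    """Return 'fasta' if the first non-blank line is a FASTA header, else 'names'."""
    for line in text.splitlines():
        if line.strip():
            return "fasta" if line.lstrip().startswith(">") else "names"
    raise ValueError("input is empty")


def esearch_url(species, db="nucleotide", retmax=1):
    """Build an esearch URL for one species name (hypothetical helper; no request is sent)."""
    query = {
        "db": db,                          # assumed database; not specified in this document
        "term": f"{species}[Organism]",    # assumed search term
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(query)}"
```

With this sketch, `detect_input_kind` would route a species list to the retrieval module and a FASTA file straight to alignment, while the IDs returned by the search could then be passed to an `efetch` request to download the sequences.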

Second, the retrieved sequences will be cleaned up and a multiple sequence alignment will be performed with Clustal-Omega.
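The cleaning and alignment step could be sketched as follows. This document does not define what "cleaned up" means, so the rules used here (uppercase the sequence, keep only the first word of each header, drop empty and duplicate records) are assumptions, and the helper names are hypothetical. The command uses the standard `clustalo` executable with its `-i`, `-o`, `--outfmt`, and `--force` options.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def clean_records(records):
    """Apply the assumed cleaning rules: short names, uppercase, no empties, no duplicates."""
    seen = set()
    cleaned = []
    for header, seq in records:
        parts = header.split()
        name = parts[0] if parts else ""
        seq = seq.upper()
        if not name or not seq or name in seen:
            continue
        seen.add(name)
        cleaned.append((name, seq))
    return cleaned


def clustalo_command(in_path, out_path):
    """Return the Clustal Omega command line as an argument list (nothing is executed)."""
    return ["clustalo", "-i", in_path, "-o", out_path, "--outfmt=fasta", "--force"]
```

The returned list can be handed to `subprocess.run(..., check=True)` once Clustal Omega is installed and on the `PATH`.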

Finally, a species tree will be reconstructed from the aligned sequences using FastTree and then visualized with iTOL for downstream analysis. The taxonomy of the species will be retrieved directly from NCBI’s Taxonomy database. iTOL is a powerful online tool that allows for tree display, annotation, and manual configuration. It is also very user-friendly, and if further annotation is needed, the tree derived from this pipeline can be uploaded to iTOL.
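A minimal sketch of the tree-building step is shown below, assuming a nucleotide alignment and a FastTree executable named `FastTree` on the `PATH` (some installations name it `fasttree`). FastTree writes the Newick tree to standard output, so the sketch redirects it to a file that can later be uploaded to iTOL. The helper names are hypothetical, and the small regular expression only handles simple unquoted leaf labels; it is meant as a quick check that the output tree carries branch lengths, not as a full Newick parser.

```python
import re
import subprocess


def fasttree_command(alignment_path, nucleotide=True):
    """Return the FastTree command line as an argument list."""
    cmd = ["FastTree"]
    if nucleotide:
        cmd.append("-nt")  # nucleotide alignment; omit for protein alignments
    cmd.append(alignment_path)
    return cmd


def run_fasttree(alignment_path, tree_path):
    """Run FastTree and save the Newick tree it prints to standard output."""
    with open(tree_path, "w") as out:
        subprocess.run(fasttree_command(alignment_path), stdout=out, check=True)


# A leaf label directly follows "(" or ","; internal node labels follow ")".
LEAF = re.compile(r"(?<=[(,])([^(),:;]+):([-+0-9.eE]+)")


def leaf_branch_lengths(newick):
    """Map each leaf name to its branch length (simple unquoted labels only)."""
    return {name: float(length) for name, length in LEAF.findall(newick)}
```

Comparing the leaf names from `leaf_branch_lengths` for the name-based and FASTA-based runs is one simple way to confirm that both modules placed the same species in the tree.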

Figure: Proposed solution (PNG image)

Milestones:

Here is a tentative timeline for the project. Each cell in the table notes what each person plans to accomplish that week. We tried to divide the tasks equally and according to each person's strengths, as this is a team effort and we want everyone to work together to produce a working and efficient pipeline.

The main focus for each person is listed below. Although that is what the person will dedicate most of their time to, everyone will be working together to ensure equal work efforts and to check the code.

Focus:

Rita: Design Document, Application Note, Documentation, creating test data, visualizing the output tree

Rohan: Creating a local database and retrieving the data, creating test data, visualizing the output tree

Delaney: Aligning the retrieved data, creating test data, visualizing the output tree

Week	Notable Happenings	Rita	Rohan	Delaney
March 9	--	Spring Break	Spring Break	Spring Break
March 16	Repo Check #1, Mar 18 & Initial Group Presentation, Mar 23/25	Prepare presentation, plan out weekly milestones, add to powerpoint	Implementation research, add to powerpoint	Introduction, gather background information, add to powerpoint
March 23	--	Keep GitHub Wiki up to date, integrate code to derive tree, meet with Dr. Dong and Dr. Gao	Integrate code to access GenBank FASTA files, integrate code to derive tree, meet with Dr. Dong and Dr. Gao	Integrate code to access Clustal-Omega for alignment, integrate code to derive tree, meet with Dr. Dong and Dr. Gao
March 30	Progress Presentation #1, Apr 1	Test code ready to present for specific task	Test code ready to present for specific task	Test code ready to present for specific task
April 6	Repo Check #2, Apr 8	Begin testing code with others, begin App Note	Finalize tree visualization methods, run through test data	Finalize tree visualization methods, run through test data
April 13	Progress Presentation #3, Apr 15 & App Note Draft, Apr 15	Finalize code and troubleshoot, Finalize App Note Draft	Clean up code, build final presentation	Clean up code, build final presentation
April 20	Repo Check #3, Apr 22	Finishing touches and ensure everything works properly	Finishing touches and ensure everything works properly	Finishing touches and ensure everything works properly
April 27	Final Presentation, Apr 27-29	Final Presentations, Finalize App Note	Final Presentations	Final Presentations
May 6	Final Project Code due & Final App Note due	Turn it in!	Turn it in!	Turn it in!
