Skip to content

A bioinformatics package for creating, filtering, and manipulating supermatrices and phylogenetic datasets using GenBank and/or local sequence data.

License

Notifications You must be signed in to change notification settings

dportik/SuperCRUNCH

Repository files navigation

SuperCrunch Logo


Overview

SuperCRUNCH is a python toolkit for creating and working with phylogenetic datasets. SuperCRUNCH can be run using any set of sequence data, as long as sequences are in fasta format with standard naming conventions (described here).

SuperCRUNCH can be used to:

  • Construct de novo supermatrices from GenBank and/or local sequence data.
  • Construct phylogeographic datasets from GenBank and/or local sequence data, with the ability to detect voucher codes.
  • Parse a large fasta file of GenBank and/or local sequences into gene-specific fasta files, based on a list of genes and list of species names.
  • Perform similarity filtering for all sequences of a gene to remove mis-identified or highly divergent sequences.
  • Extract specific genes from a set of mitochondrial genomes.
  • Use a set of reference sequences to trim all input sequences to match the reference region.
  • Adjust reading frames for coding sequences, and identify problematic coding sequences.
  • Adjust sequence directions to prepare for alignment.
  • Automate multiple sequence alignment using Clustal-O, MAFFT, Muscle, and MACSE.
  • Generate accession tables for supermatrices, for all genes and species included.
  • Relabel GenBank sequences using species names, accession numbers, and/or voucher codes.
  • Trim alignments using trimAl or a custom trimming routine.
  • Convert fasta format to phylip and nexus.
  • Concatenate any number of fasta or phylip alignment files. Outputs a concatenated alignment, data partitions, and counts of loci for all species included.
  • And many other tasks!

A visual overview of the major steps in SuperCRUNCH is shown below:

SuperCrunch workflow

SuperCRUNCH is highly modular and analyses do not require running the full pipeline. There are many useful tools available for manipulating your phylogenetic and phylogeographic datasets.

Citation

SuperCRUNCH is described in more detail in the following publication:

  • Portik, D.M., and J.J. Wiens. (2020) SuperCRUNCH: A bioinformatics toolkit for creating and manipulating supermatrices and other large phylogenetic datasets. Methods in Ecology and Evolution, 11: 763-772. https://doi.org/10.1111/2041-210X.13392

The published article is available here. A pre-print was made available on BioRxiv prior to publication (here).

Installation

There are several dependencies required to run SuperCRUNCH, including Python packages (BioPython and numpy), as well as external tools (NCBI-BLAST+, CD-HIT-EST, MAFFT, Muscle, Clustal-O, MACSE, and trimAl).

Installation of these requirements is fast and easy using conda. The supercrunch-conda-env.yml file can be used to create the correct conda environment:

conda env create -f supercrunch-conda-env.yml

The resulting conda environment can then be activated using:

conda activate supercrunch

You can then run all SuperCRUNCH modules in this environment.

NOTE - this will install all requirements except for MACSE, which is a jar file that must be downloaded from here (get V2.05). This software is only required for the MACSE option in the alignment module (Align.py).

For non-conda installation of these packages, please see the Installation Instructions wiki.

SuperCRUNCH itself consists of a set of modules written in Python (compatible with 2.7 and 3.7) that function as stand-alone command-line scripts. These modules are available in the supercrunch-scripts folder. They can be downloaded and executed independently without the need to install SuperCRUNCH as a Python package or library, making them easy to use and edit. The scripts function independently, and do not require being contained or used in the same directory. SuperCRUNCH scripts can be run using Mac OSX (10.10+) and Linux, and can also work with Windows using a program like Cygwin.

Complete Instructions for Analyses

SuperCRUNCH has extensive documentation which can be accessed through the wiki tab at the top of the page. An overview of the components of SuperCRUNCH can be found on the Analysis Overview page. This section outlines all major topics and navigates to detailed instructions for each step, including usage for all modules, proposed workflows, and common issues. Quick links to wiki pages for the major components of SuperCRUNCH are also provided below:

Tutorials and Examples

Several tutorials were made available as part of the original SuperCRUNCH publication, which cover the full range of analyses available. These tutorials can be found on the OSF SuperCRUNCH project page. Each tutorial includes all data and instructions necessary to replicate the analysis. An overview of the tutorials available can be found on the Tutorials wiki.

A new set of tutorials is available as part of a study testing alignment and trimming combinations (Portik & Wiens, 2021 - Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses?). The complete set of data and instructions are available at https://osf.io/qa9r8/.

Reporting Issues, Getting Help, and Providing Suggestions

For main analysis issues and/or bugs, please create an issue on github here. Make sure you include the details of your analysis (inputs, outputs, commands) to assist the troubleshooting. The SuperCRUNCH user group can also be used to create a post.

Version

The current release of SuperCRUNCH is v1.3.2. Please see below for important changes.

Changes in v1.3.2:

  • Increased speed for Cluster_Blast_Extract.py and Reference_Blast_Extract.py by changing underlying data structure used to gather blast coordinates per accession.

Changes in v1.3.1:

  • Fixed a bug in Parse_Loci.py that was causing vouchers to not be formatted correctly in the output files, which prevented the voucher feature from being used in downstream modules.

Changes in v1.3.0:

  • Added a conda environment recipe for SuperCRUNCH, allowing easy installation of all requirements except MACSE.
  • Parse_Loci.py: Added new feature that allows a term to be added to the loci search terms that will exclude a record if a match is found. For example, adding the negative term pseudogene will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file, where the fourth column is the negative term (N/A in this column indicates no negative term should be used). This module was made backwards-compatible with the three-column search terms file - if a fourth column is not present the N/A is automatically generated.
  • Filter_Seqs_and_Species.py: Added --accessions_include flag. This points to a text file of accession numbers (one per line). When used with the --seq_selection oneseq option, if an accession included in the list is found in the available seqs for a taxon and gene, it must be selected. This is not just an "allowed list", this list will override other settings for selection such as length. Also added the --accessions_exclude flag, which points to a text file of accession numbers (one per line). These accessions will NEVER be selected - they are removed from all searches. This is the equivalent of including a "blocked list".
  • Taxa_Assessment.py: Altered SQL search query for "unmatched" taxa to avoid sql variable limit maximum issue. Also, now invokes the SeqIO.index_db() method for sequence files >5GB, rather than using SeqIO.index() method, which is much more memory efficient for big data. The SeqIO.index_db() method is already used in Parse_Loci.py.
  • Cluster_Blast_Extract.py: Added feature to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all seqs that are 1.3x the length of the 95th percentile of all lengths.
  • Added a new Remove_Long_Accessions.py module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This will eliminate whole genome sequencing records, which are not useful for SuperCRUNCH.
  • Updated recognition for file extensions produced by updated blastn tools (.ndb, .not, .ntf, .nto).

For complete version history please see the releases.

License

GNU Lesser General Public License v3.0

About

A bioinformatics package for creating, filtering, and manipulating supermatrices and phylogenetic datasets using GenBank and/or local sequence data.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages