
Rules


create_database

Puts relevant BOLD data columns into a custom SQLite database. Uses the BOLD data dump TSV as defined in the config file. Implemented by the create_database.py script.
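
A minimal sketch of the idea, assuming a chunked TSV import with pandas; the column names, file paths, and table name below are placeholders, not the pipeline's actual configuration:

```python
import sqlite3
import pandas as pd

COLUMNS = ["processid", "species_name", "family_name", "nucraw"]  # hypothetical column subset

con = sqlite3.connect("bold.db")
# Stream the (large) BOLD dump in chunks and append the selected columns to one table
for chunk in pd.read_csv("BOLD_dump.tsv", sep="\t", usecols=COLUMNS, chunksize=50_000):
    chunk.to_sql("barcode", con, if_exists="append", index=False)
con.execute("CREATE INDEX IF NOT EXISTS idx_family ON barcode (family_name)")
con.commit()
con.close()
```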


map_opentol

Enriches the SQLite database with mappings to OpenToL. Because this operates on the same database file as the previous rule, the output is a 0-byte file map_opentol.ok to indicate that the task was run. Uses the Open Tree of Life API to map BOLD taxon names to Open Tree of Life taxonomy IDs. Implemented by the map_opentol.py script.
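
A hedged sketch of the name-to-OTT-id lookup against the public Open Tree of Life v3 TNRS endpoint; the exact response fields used here are assumptions, and map_opentol.py may batch and store results differently:

```python
import requests

TNRS_URL = "https://api.opentreeoflife.org/v3/tnrs/match_names"

def map_names(names, context="Animals"):
    """Return a dict of taxon name -> OTT id for names the TNRS could match."""
    resp = requests.post(TNRS_URL, json={"names": names, "context_name": context})
    resp.raise_for_status()
    mapping = {}
    for result in resp.json().get("results", []):
        matches = result.get("matches", [])
        if matches:  # take the first match; real code may apply stricter filtering
            mapping[result.get("name")] = matches[0]["taxon"]["ott_id"]
    return mapping

print(map_names(["Apis mellifera", "Danaus plexippus"]))
```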


megatree_loader

Loads the configured OpenToL topology into the SQLite database. This is a multi-step process that first indexes the tree in a temporary database using the conda package perl-bio-phylo-forest-dbtree, exports that database to an SQL dump, imports the dump into the central database, and then creates a 0-byte file megatree_loader.ok.
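
A rough sketch of those steps driven from Python; the megatree-loader flags, file names, and sqlite3 CLI usage are assumptions about how the rule might be wired up, not a verified invocation:

```python
import subprocess
from pathlib import Path

# Index the OpenToL topology into a temporary dbtree database (flags assumed)
subprocess.run(["megatree-loader", "-i", "opentol.tre", "-d", "opentol_tmp.db"], check=True)

# Export the temporary database to an SQL dump
with open("opentol.sql", "w") as dump:
    subprocess.run(["sqlite3", "opentol_tmp.db", ".dump"], stdout=dump, check=True)

# Import the dump into the central database
with open("opentol.sql") as dump:
    subprocess.run(["sqlite3", "bold.db"], stdin=dump, check=True)

# 0-byte sentinel file marking the rule as done
Path("megatree_loader.ok").touch()
```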


family_fasta

Barcodes from the SQLite database are divided into their taxonomic family groups and written to FASTA files: 'fasta/family/{family}.fasta'. This rule is run in parallel using scattergather and is implemented in the family_fasta.py script. Note that the parallelization needs to know ahead of time how many families to expect, which must be specified in the configuration file.
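
A sketch of the per-family split; the table and column names are assumptions standing in for the actual schema used by family_fasta.py:

```python
import sqlite3
from pathlib import Path

con = sqlite3.connect("bold.db")
outdir = Path("fasta/family")
outdir.mkdir(parents=True, exist_ok=True)

# One FASTA file per family, sequences keyed by their BOLD process id
families = [row[0] for row in con.execute("SELECT DISTINCT family_name FROM barcode")]
for family in families:
    with open(outdir / f"{family}.fasta", "w") as fh:
        for processid, seq in con.execute(
            "SELECT processid, nucraw FROM barcode WHERE family_name = ?", (family,)
        ):
            fh.write(f">{processid}\n{seq}\n")
```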


family_constraint

Generates a constraint tree for each family FASTA file. The constraints are generated from the tree indexed by megatree_loader and the family FASTA files produced by family_fasta. This rule is also parallelized using scattergather, and is implemented using the conda package perl-bio-phylo-forest-dbtree.
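
A hedged sketch of extracting one family's constraint topology from the indexed database with megatree-pruner (part of the same dbtree distribution); the flags, the taxon-list file, and the output naming are assumptions:

```python
import subprocess

family = "Nymphalidae"  # hypothetical example family
# Prune the indexed OpenToL topology down to the taxa listed for this family
with open(f"fasta/family/{family}-constraint.tre", "w") as out:
    subprocess.run(
        ["megatree-pruner", "-d", "opentol_tmp.db", "-i", f"fasta/family/{family}.txt"],
        stdout=out,
        check=True,
    )
```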


msa_hmm

Creates a multiple sequence alignment for each family FASTA file using Hidden Markov Model alignment. Here, the sequences are also corrected for possible reverse-complement (revcom) issues. This correction may not be needed at all, because no revcom sequences have been observed in BOLD so far. This rule is implemented by the script msa_hmm.py, which depends on the conda package hmmer, and is parallelized using scattergather.
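
A sketch of the alignment call with hmmalign from hmmer; the profile HMM name and output naming are placeholders, and the revcom correction that msa_hmm.py performs is not shown here:

```python
import subprocess

family = "Nymphalidae"  # hypothetical example family
# Align the family's barcodes against a COI profile HMM (file name assumed)
subprocess.run(
    ["hmmalign", "--trim", "--dna",
     "-o", f"fasta/family/{family}.sto",
     "COI-5P.hmm", f"fasta/family/{family}.fasta"],
    check=True,
)
```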


prep_raxml

Prepares the aligned ingroup sequences and constraint tree for analysis by the conda package raxml-ng. This step also includes outgroup selection. This rule may need to be reimplemented with an eye on doing phylogenetic placement (as opposed to constrained tree searching) instead, which would obviate the need for outgroup selection and subsequent re-rooting. The rule is implemented by the script prep_raxml.py and is parallelized using scattergather.
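
A very rough sketch of the preparation step, assuming the outgroup sequences have already been chosen and aligned to the same profile; how prep_raxml.py actually selects outgroups is not shown, and all file names are hypothetical:

```python
from pathlib import Path

family = "Nymphalidae"  # hypothetical example family
ingroup = Path(f"fasta/family/{family}-aligned.fasta").read_text()
outgroup = Path(f"fasta/family/{family}-outgroup.fasta").read_text()  # hypothetical pre-selected outgroups

# Combine ingroup and outgroup into the single MSA that raxml-ng will read
Path(f"fasta/family/{family}-raxml-input.fasta").write_text(ingroup + outgroup)
```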


run_raxml

Runs a phylogenetic analysis for each family FASTA file + constraint tree. Implemented using the raxml-ng conda package and parallelized using scattergather. Might be changed to phylogenetic placement.
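
A sketch of a constrained raxml-ng tree search for one family; the substitution model, prefix, and file names are placeholders rather than the pipeline's actual settings:

```python
import subprocess

family = "Nymphalidae"  # hypothetical example family
subprocess.run(
    ["raxml-ng", "--search",
     "--msa", f"fasta/family/{family}-raxml-input.fasta",
     "--model", "GTR+G",
     "--tree-constraint", f"fasta/family/{family}-constraint.tre",
     "--prefix", f"fasta/family/{family}"],
    check=True,
)
```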


reroot_raxml_output

Reroots the output from raxml-ng, which emits unrooted trees when it runs a tree search. This is done by identifying the outgroup taxa and rooting the tree on the smallest bipartition that separates the outgroup from the ingroup, splitting the root branch length at the midpoint. There are several issues with this: the midpoint is a crude and possibly inappropriate assumption, and there are cases where the selected outgroups nest inside the ingroup. Implemented by the script reroot_backbone.py and parallelized using scattergather.
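
A hedged sketch of the rerooting idea using dendropy (the actual script may use a different toolkit): root on the edge above the outgroup MRCA and split its length at the midpoint. The outgroup labels and file names are assumptions:

```python
import dendropy

outgroup_labels = ["OUT1", "OUT2"]  # hypothetical outgroup tip labels
tree = dendropy.Tree.get(path="family.bestTree", schema="newick")

# Edge above the most recent common ancestor of the outgroup taxa
mrca = tree.mrca(taxon_labels=outgroup_labels)
edge = mrca.edge

# Place the root at the midpoint of that edge
half = (edge.length or 0.0) / 2
tree.reroot_at_edge(edge, length1=half, length2=half, update_bipartitions=True)
tree.write(path="family-rooted.tre", schema="newick")
```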


choose_exemplars

Chooses two exemplars that represent the ingroup in the backbone topology. At present these are selected by taking the shallowest tips in the ingroup. This may be inappropriate: perhaps the tallest tips should be selected, or the median ones (this needs to be tested empirically). Implemented by the script choose_exemplars.py and parallelized using scattergather.
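
A sketch of the current "shallowest tips" rule using dendropy root-to-tip distances; the outgroup labels used to restrict the choice to the ingroup are an assumption about what choose_exemplars.py keys on:

```python
import dendropy

tree = dendropy.Tree.get(path="family-rooted.tre", schema="newick")
tree.calc_node_root_distances()  # annotates each node with its root-to-tip distance

outgroup_labels = {"OUT1", "OUT2"}  # hypothetical outgroup tip labels
ingroup_tips = [lf for lf in tree.leaf_node_iter() if lf.taxon.label not in outgroup_labels]

# Two shallowest ingroup tips become the exemplars for the backbone
exemplars = sorted(ingroup_tips, key=lambda lf: lf.root_distance)[:2]
print([lf.taxon.label for lf in exemplars])
```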


prep_raxml_backbone

Prepares the backbone constraint tree and alignment by merging the exemplars from the separate family-level analyses. Implemented by the script backbone_constraint.py. From this rule onward, the pipeline is no longer parallel.
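
A sketch of one half of that merge, pooling the per-family exemplar sequences into a single backbone FASTA; the file naming is an assumption, and building the matching backbone constraint tree (the other half of backbone_constraint.py) is not shown:

```python
from pathlib import Path

# Concatenate every family's exemplar FASTA into the backbone alignment input
with open("backbone.fasta", "w") as out:
    for fam_fasta in sorted(Path("fasta/family").glob("*-exemplars.fasta")):  # hypothetical naming
        out.write(fam_fasta.read_text())
```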


run_raxml_backbone

Runs the backbone analysis using raxml-ng. This is done by a heuristic search, which may have to be replaced by a phylogenetic placement operation instead.


reroot_backbone

Reroots the backbone tree. This is done by reconciling the resulting backbone with the orientation of the input constraint tree. This may not be necessary if phylogenetic placement is used instead. Implemented by the reroot_backbone.py script.
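
A hedged interpretation of that reconciliation, sketched with dendropy: take the tips on one side of the constraint tree's root and reroot the backbone on the edge above their MRCA. This is an illustration of the idea, not a copy of reroot_backbone.py, and the file names are assumptions:

```python
import dendropy

ns = dendropy.TaxonNamespace()
constraint = dendropy.Tree.get(path="backbone-constraint.tre", schema="newick", taxon_namespace=ns)
backbone = dendropy.Tree.get(path="backbone.bestTree", schema="newick", taxon_namespace=ns)

# Tip labels on one side of the constraint tree's root bipartition
one_side = constraint.seed_node.child_nodes()[0].leaf_nodes()
labels = sorted(lf.taxon.label for lf in one_side)

# Reroot the backbone so it reproduces that bipartition
mrca = backbone.mrca(taxon_labels=labels)
backbone.reroot_at_edge(mrca.edge, update_bipartitions=True)
backbone.write(path="backbone-rooted.tre", schema="newick")
```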


graft_clades

Grafts the family-level subtrees onto the backbone. The branch lengths in the backbone and those in the family-level trees are on different orders of magnitude, with the (more or less) equivalent branches in the backbone much longer than in the subtrees. This is presumably because the backbone spans a broader taxon sample, which has greater saturation and homoplasy in the alignment, which is accommodated by lengthening branches. How to scale these relative to one another is a challenge to be addressed empirically by testing different exemplar selections and rescaling rules.
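
A simplified sketch of grafting a single family subtree onto the backbone with dendropy: locate the clade spanned by that family's two exemplars and swap in the full family tree. The exemplar labels and file names are hypothetical, and the branch-length rescaling discussed above is deliberately not attempted:

```python
import dendropy

ns = dendropy.TaxonNamespace()
backbone = dendropy.Tree.get(path="backbone-rooted.tre", schema="newick", taxon_namespace=ns)
subtree = dendropy.Tree.get(path="Nymphalidae-rooted.tre", schema="newick", taxon_namespace=ns)  # hypothetical family

exemplar_labels = ["EX1", "EX2"]  # hypothetical exemplar tip labels for this family
mrca = backbone.mrca(taxon_labels=exemplar_labels)

# Replace the two-exemplar clade in the backbone with the full family subtree
parent = mrca.parent_node
parent.remove_child(mrca)
parent.add_child(subtree.seed_node)
backbone.write(path="grafted.tre", schema="newick")
```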