The Python scripts in this repository were used to test for significant differences in branch length between paralogs and to identify asymmetric evolution. The method was applied to all gene families across the Tree of Life in the PANTHER database. The repository also includes code used to analyse gene structures and gene expression profiles. Each notebook starts with an explanatory markdown cell and includes code comments to help users replicate the analysis.
Used to compute the expected branch lengths under a simple evolutionary model. This provides a way to identify unexpectedly long branches, so that the Least Diverged Orthologue (LDO) hypothesis can be tested.
The input data was downloaded from the PANTHER database:
- download: `wget http://data.pantherdb.org/ftp/panther_library/18.0/PANTHER18.0_hmmscoring.tgz`
- extract only the tree files: `tar -zxvf PANTHER18.0_hmmscoring.tgz target/famlib/rel/PANTHER18.0_altVersion/hmmscoring/PANTHER18.0/books/PTHR*/tree.tree`
- rename to `trees/PTHR*.tree`
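The rename step is not spelled out above, so here is a minimal sketch of one way to do it. The function name `collect_trees` and the choice to copy rather than move the files are assumptions made for illustration, not the repository's own code.

```python
from pathlib import Path
import shutil


def collect_trees(extract_root, out_dir="trees"):
    """Copy every books/PTHR*/tree.tree under extract_root to out_dir/PTHR*.tree.

    Hypothetical helper: the family ID is taken from the name of the
    directory that holds each tree.tree file.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    # "**" matches any depth, so the long extracted prefix need not be typed out.
    for tree in sorted(Path(extract_root).glob("**/books/PTHR*/tree.tree")):
        family = tree.parent.name  # e.g. PTHR10000
        dest = out / f"{family}.tree"
        shutil.copyfile(tree, dest)
        copied.append(dest)
    return copied
```

After the `tar` command above, calling `collect_trees("target")` from the directory where the archive was extracted should populate `trees/` with one file per family.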
The species tree was downloaded using the PANTHER API with: scripts/panther_species_tree.py
Used to restrict the branches to those following duplication events. For each branch, the difference from the expected branch length is computed and converted into a z-score. Pairs of branches can then be classified into six categories (p<0.05):
i) normal-normal: neither branch differs significantly from expected
ii) short-short: both branches are significantly shorter than expected
iii) long-long: both branches are significantly longer than expected
iv) normal-long: only one branch is significantly longer than expected
v) short-normal: only one branch is significantly shorter than expected
vi) short-long: one branch is significantly shorter than expected and the other is significantly longer than expected.
Also generates Fig 1, Supp Fig 1, Supp Fig 2
Additionally, identifies the genes for the outgroup test.
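The classification described above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's implementation: it assumes that z-scores are obtained by standardising the observed-minus-expected differences across all branches, and that p<0.05 is applied as a two-sided normal threshold (|z| >= 1.96). All function names are hypothetical.

```python
import statistics

Z_CRIT = 1.96  # assumption: two-sided p < 0.05 under a normal approximation

# Fixed ordering so that each pair gets one of the six canonical labels.
ORDER = {"short": 0, "normal": 1, "long": 2}


def z_scores(observed, expected):
    """Standardise observed-minus-expected branch length differences."""
    diffs = [o - e for o, e in zip(observed, expected)]
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return [(d - mean) / sd for d in diffs]


def label(z, z_crit=Z_CRIT):
    """Label a single branch as short, normal or long."""
    if z <= -z_crit:
        return "short"
    if z >= z_crit:
        return "long"
    return "normal"


def classify_pair(z_a, z_b, z_crit=Z_CRIT):
    """Classify the two branches following a duplication into one category."""
    first, second = sorted((label(z_a, z_crit), label(z_b, z_crit)), key=ORDER.get)
    return f"{first}-{second}"
```

For example, `classify_pair(2.5, 0.3)` gives `"normal-long"` and `classify_pair(-2.1, 2.4)` gives `"short-long"`, whichever order the two branches are passed in.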
Used to download the structure data and compute the structural alignment.
This document contains the code that downloads the available expression data from the Bgee database (https://www.bgee.org/) and reformats all the expression data used.
The plant expression data was downloaded from: https://expression.plant.tools/
Generates genome mapping tables for each species. This is necessary because PANTHER genomes are imported from UniProt RPs, whereas Bgee uses Ensembl data directly.
This notebook contains the code to run the inparalogue pairwise Pearson's correlation and tissue specificity (
This notebook contains the code to compute tissue specificity scores (arcsinh
function --
This notebook identifies, for each branch in the species tree, the relevant species to use as outgroups.
The LDO and MDO are then compared to the outgroup gene using both PCC and tau analyses.
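As a rough illustration of the two measures named above, the sketch below computes a Pearson correlation coefficient (PCC) between two expression profiles and a tau score for a single profile. It assumes that "tau" refers to the commonly used tau tissue-specificity index and that expression values are non-negative and already normalised. The function names are hypothetical and the notebooks may compute these differently.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def tau(expr):
    """Tau index: 0 for uniform expression, 1 for a single-tissue gene."""
    n = len(expr)
    top = max(expr)
    if n < 2 or top <= 0:
        return float("nan")  # undefined without expression or with one tissue
    return sum(1 - value / top for value in expr) / (n - 1)
```

A gene expressed in only one of three tissues, such as `tau([0, 0, 10])`, scores 1.0, while a uniformly expressed gene, such as `tau([5, 5, 5])`, scores 0.0.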
This notebook contains all the code used to analyze the data and generate the plots presented in the paper.
This folder contains the modules used to parse the panther trees in step_1.
Scripts used to analyse the data and download specific datasets.
Scripts used in step_3.
The scripts used for this project have only been tested in an Ubuntu 24.04 environment.
Several external tools are called by these scripts and notebooks at various steps.
Installation instructions and dependencies for the software used in this project can be found at the following locations:
AlphaFold: [https://alphafold.ebi.ac.uk/] (v4)
foldseek: [https://github.com/steineggerlab/foldseek] (v8.ef4e960)