Pipeline for the selection of canonical proteins for reference proteomes

ortho2tree

The UniProt Reference Proteomes dataset seeks to provide complete proteomes for an evolutionarily diverse, less redundant set of organisms.

As higher eukaryotes often encode multiple isoforms of a protein from a single gene, the Reference Proteome pipeline selects a single representative (‘canonical’) sequence. UniProt identifies canonical isoforms using a ‘Gene-Centric’ approach: proteins are grouped by gene-identifier and for each gene a single protein sequence is chosen.

For unreviewed (UniProtKB/TrEMBL) protein sequences (and for some reviewed sequences), the longest sequence in the Gene-Centric group is usually chosen as canonical. This can create inconsistencies, selecting canonical sequences with dramatically different lengths for orthologous genes.
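
The effect of the longest-sequence rule can be illustrated with a minimal sketch. The records, accessions and lengths below are invented for illustration, and the grouping uses plain Python rather than anything taken from the UniProt or ortho2tree code:

```python
from collections import defaultdict

# Hypothetical records: (organism, gene identifier, accession, sequence length).
# The accessions and lengths are invented for illustration only.
records = [
    ("human", "GENE1", "ACC_H1", 410),
    ("human", "GENE1", "ACC_H2", 655),
    ("mouse", "Gene1", "ACC_M1", 408),
    ("mouse", "Gene1", "ACC_M2", 395),
]


def longest_per_gene(rows):
    """Group proteins by (organism, gene) and keep the longest one per group."""
    groups = defaultdict(list)
    for organism, gene, accession, length in rows:
        groups[(organism, gene)].append((length, accession))
    # max() on (length, accession) tuples picks the longest sequence;
    # ties are broken by accession so the result is deterministic.
    return {key: max(members) for key, members in groups.items()}


canonicals = longest_per_gene(records)
for (organism, gene), (length, accession) in sorted(canonicals.items()):
    print(organism, gene, accession, length)
```

In this toy data the rule selects a 655-residue human protein and a 408-residue mouse protein for the same orthologous gene, even though a 410-residue human isoform is available. This is the kind of length inconsistency described above.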

The Ortho2tree data pipeline examines Gene-Centric canonical and isoform sequences from sets of orthologous proteins (from PantherDB), builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. Each existing canonical choice is then either confirmed, or a better alternative is proposed.

The pipeline and the underlying analysis are described in the journal article "Improved selection of canonical proteins for reference proteomes".

An overview of the pipeline is shown in this figure: [figure: ortho2tree pipeline overview]

The pipeline can retrieve protein sequences using direct access to the UniProt databases or using the UniProt web API.
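
As a rough illustration of the web API route, the sketch below builds a URL for the public UniProt REST service and parses the FASTA text it returns. It is an assumption that this endpoint is suitable here; it is not necessarily the endpoint or the method ortho2tree itself uses, and the function names are hypothetical.

```python
from urllib.request import urlopen

# Public UniProt REST endpoint returning one UniProtKB entry in FASTA format.
# Assumption: ortho2tree may use a different endpoint or query style.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession):
    """Return the FASTA download URL for a UniProtKB accession."""
    return UNIPROT_FASTA_URL.format(accession=accession)


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous entry, if any.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def fetch_sequence(accession, timeout=30):
    """Download and parse one entry (requires network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling `fetch_sequence("P12345")` would return a one-element list with the entry's header and sequence; the parsing step can be exercised offline with any FASTA string.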

Data processing is done via pandas DataFrames employing vectorized operations, and all the orthogroups can be processed in parallel if multithreading is available. For each orthogroup the pipeline:

  • builds a Multiple Sequence Alignment (via muscle)
  • calculates a gap-based Neighbour-Joining tree (via BioPython using a modified pairwise distance function focused on gaps)
  • scans the tree to identify low-cost clades
  • ranks the best low-cost clades to confirm existing canonicals or suggest replacements
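
To make the idea of a gap-focused distance concrete, here is a minimal sketch of one possible definition: the fraction of informative alignment columns in which exactly one of the two sequences has a gap. This is an assumed, simplified measure written for illustration; it is not the modified distance function used by ortho2tree, and the isoform names and sequences are invented.

```python
def gap_distance(a, b, gap="-"):
    """Fraction of columns where exactly one of two aligned sequences is gapped."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = mismatched = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # column gapped in both sequences carries no information
        compared += 1
        if (x == gap) != (y == gap):
            mismatched += 1  # residue opposite a gap
    return mismatched / compared if compared else 0.0


def lower_triangular_matrix(aligned):
    """Return (names, rows) with rows[i][j] = distance(i, j) for j <= i."""
    names = list(aligned)
    rows = [
        [gap_distance(aligned[names[i]], aligned[names[j]]) for j in range(i + 1)]
        for i in range(len(names))
    ]
    return names, rows


# Invented toy alignment: iso_B carries a four-residue insertion.
alignment = {
    "iso_A": "MKTAYIAKQR----LLG",
    "iso_B": "MKTAYIAKQRQISFLLG",
    "iso_C": "MKSAYIAKQR----LMG",
}
names, rows = lower_triangular_matrix(alignment)
for name, row in zip(names, rows):
    print(name, [round(value, 3) for value in row])
```

Under this definition substitutions do not contribute, so iso_A and iso_C are at distance zero while iso_B is separated from both by its insertion. The lower-triangular layout (diagonal included) is the form that Biopython's `DistanceMatrix` accepts, so a matrix like this could in principle be handed to a Neighbour-Joining constructor.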

Contents of the repository

ortho2tree.py    # main script to use to run the pipeline on the command line
ortho2tree.ipynb # jupyter notebook to run the pipeline interactively
ortho2tree/      # modules folder
requirements.txt # list of needed packages
README.md        # this text
MS_figures_src/  # all of the datafiles and .R code to recreate the figures in the manuscript
test/            # folder containing data ready for a quick test run
test.cfg         # configuration file for the quick test run
qfomam.cfg       # configuration file for the qfomam2022_05 analysis described in the manuscript

INSTALLATION

  • git clone the repository:

git clone https://github.com/g-insana/ortho2tree.git

  • install requirements (virtual environment is optional but recommended) via pip or conda/mamba:

via pip:

cd ortho2tree && python3 -m venv venv_o2t
source venv_o2t/bin/activate
pip3 install -r requirements.txt

via conda or mamba:

cd ortho2tree && mamba create --name ortho2tree --file requirements.txt --channel conda-forge
mamba activate ortho2tree

Note that you also need to install muscle for multiple sequence alignments, either version v3.8.31 or the newer v5.1. Please check ortho2tree/config_muscle.py and update it to match your installation, so that the muscle executable can be found and the correct command format is set for the muscle version used.

e.g. via conda or mamba, for either version:

#EITHER:
mamba install -c bioconda "muscle<=4.0" #3.8.31
#OR:
mamba install -c bioconda 'muscle>=5.0' #5.1
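
The reason the muscle version matters is that the two releases take different command-line options (`-in`/`-out` in v3.8.31, `-align`/`-output` in v5). The sketch below shows one way to build the right command for each; the function names are hypothetical and this is not the content of ortho2tree/config_muscle.py.

```python
import shutil


def find_muscle(name="muscle"):
    """Return the full path of the muscle executable, or None if not on PATH."""
    return shutil.which(name)


def muscle_command(executable, major_version, fasta_in, aligned_out):
    """Build the argument list for an alignment run with muscle v3 or v5."""
    if major_version == 3:
        # muscle v3.8.31 syntax
        return [executable, "-in", fasta_in, "-out", aligned_out]
    if major_version == 5:
        # muscle v5 syntax
        return [executable, "-align", fasta_in, "-output", aligned_out]
    raise ValueError(f"unsupported muscle major version: {major_version}")


print(muscle_command("muscle", 3, "group.fasta", "group.afa"))
print(muscle_command("muscle", 5, "group.fasta", "group.afa"))
```

The returned list can be passed to `subprocess.run(..., check=True)` once `find_muscle()` confirms that the executable is available.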

QUICK TEST TO CHECK INSTALLATION

  • test run of a single group

./ortho2tree.py -set test -id PTHR43715:SF1

  • example of full analysis run of a set

./ortho2tree.py -set test -no_stats

COMMAND LINE USAGE

usage: ortho2tree.py [-h] -set DATASET_NAME [-d] [-nocache] [-no_stats]
                     [-id SINGLE_GROUP [SINGLE_GROUP ...]] [-file LIST_FILENAME]
                     [-sugg SUGG_FILE] [-prevgc PREVGC_FILE] [-outstamp OUTSTAMP]

optional arguments:
  -h, --help            show this help message and exit
  -set DATASET_NAME     set for the analysis. a file SET.cfg should be present
  -d                    print verbose/debug messages
  -nocache              do not use cache, re-create alignments/trees and do not save them
  -no_stats             do not print any stats on the dataframe
  -id SINGLE_GROUP [SINGLE_GROUP ...]
                        to only work on one or few group(s)
  -file LIST_FILENAME   to work on a series of groups, from a file
  -sugg SUGG_FILE       to simulate integration of canonical suggestions reading a previously
                        generated changes file; note that file should be placed in the set main dir
  -prevgc PREVGC_FILE   to integrate previously generated changes file; note that file should be placed
                        in the set main dir
  -outstamp OUTSTAMP    to name and timestamp the output files and the dumps; this overrides the
                        outstamp parameter from the config

    Examples:
       -set=qfomam                                 #will do the analysis on the whole set
       -set=qfomam -id=PTHR19918:SF1               #only for one orthogroup
       -set=qfomam -id=PTHR19918:SF1 PTHR40139:SF1 #only for two orthogroups
       -set=qfomam -file=list_of_ids.txt           #for a series of groups listed in a file
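
When the pipeline is driven from another Python script, the options listed above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of ortho2tree; it only uses flags that appear in the usage text.

```python
def ortho2tree_argv(dataset, ids=(), list_file=None, no_stats=False,
                    nocache=False, outstamp=None, script="./ortho2tree.py"):
    """Build an argument list for ortho2tree.py from keyword options."""
    argv = [script, "-set", dataset]
    if ids:
        argv += ["-id", *ids]          # one or more orthogroup identifiers
    if list_file:
        argv += ["-file", list_file]   # file listing the groups to process
    if no_stats:
        argv.append("-no_stats")
    if nocache:
        argv.append("-nocache")
    if outstamp:
        argv += ["-outstamp", outstamp]
    return argv


print(ortho2tree_argv("test", ids=["PTHR43715:SF1"]))
print(ortho2tree_argv("test", no_stats=True))
```

The resulting list can be run with `subprocess.run(argv, check=True)` from the repository directory.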

CONFIGURATION

Please check the example YAML configuration files provided for the list of parameters, e.g. the test configuration file (test.cfg).

DOCUMENTATION

Please refer to the DOCS.md file for information on how to set up a new analysis and how to interpret the output produced.

Analysis of UP2022_05 QfO mammals

The manuscript "Improved selection of canonical proteins for reference proteomes" (preprint) describes the ortho2tree analysis of eight QfO (Quest for Orthologs) mammalian proteomes, based on UniProtKB data (release UP2022_05).

See the folder MS_figures_src for datafiles and .R code to recreate the figures in the manuscript

To replicate the analysis from the paper:

wget -O qfomam.tar.gz "https://zenodo.org/records/10778115/files/qfomam.tar.gz?download=1"  #retrieve the archive
tar xfz qfomam.tar.gz                                            #uncompress the archive
./ortho2tree.py -set qfomam -id PTHR43715:SF1                    #run a single orthogroup
./ortho2tree.py -set qfomam -outstamp $(date +%y%m%d)            #do the analysis

The Zenodo archive qfomam.tgz (also available from Figshare) contains pre-computed alignments, trees and clades.
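
On systems without wget or tar, the retrieval and extraction steps can be reproduced with the Python standard library. This is a minimal sketch with hypothetical function names; the URL is the one given in the commands above.

```python
import os
import tarfile
from urllib.request import urlretrieve

ARCHIVE_URL = "https://zenodo.org/records/10778115/files/qfomam.tar.gz?download=1"


def download_archive(url=ARCHIVE_URL, destination="qfomam.tar.gz"):
    """Download the archive to a local file and return its path (needs network)."""
    urlretrieve(url, destination)
    return destination


def extract_archive(archive_path, target_dir="."):
    """Extract a gzip-compressed tar archive into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    # Use the safer 'data' extraction filter where the running Python supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir, **kwargs)
    return target_dir
```

Running `extract_archive(download_archive())` from the repository directory corresponds to the wget and tar steps shown above.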

A web interface for filtering and viewing the pdf files (with trees and alignments for each orthogroup) from the result of that analysis (and subsequent ones) is available at fasta.bioch.virginia.edu/ortho2tree

The pdf files, generated whenever canonicals were confirmed or changes were proposed, are available as a Zenodo archive: qfomam_pdf_data.tgz (or alternatively from Figshare).

A script to generate the pdf files is included under the folder pdfcreation/

CITATION

If you find this software useful, please consider citing our paper (pubmed 38130879):

Insana, G., Martin, M.J. & Pearson, W.R.
Improved selection of canonical proteins for reference proteomes
NAR Genomics and Bioinformatics (2024). https://doi.org/10.1093/nargab/lqae066

Bibtex:

@article{10.1093/nargab/lqae066,
    author = {Insana, Giuseppe and Martin, Maria J and Pearson, William R},
    title = "{Improved selection of canonical proteins for reference proteomes}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {6},
    number = {2},
    pages = {lqae066},
    year = {2024},
    month = {06},
    abstract = "{The ‘canonical’ protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95\% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022\_05, ortho2tree proposed 7804 canonical changes for release 2023\_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82\% of ortho2tree proposed-changes agreed with MANE; for confirmed canonicals, 92\% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60\% identical, a group that includes vertebrates and higher plants.}",
    issn = {2631-9268},
    doi = {10.1093/nargab/lqae066},
    url = {https://doi.org/10.1093/nargab/lqae066},
}