The xProtCAS (extracting Protein Conserved Accessible Surfaces) framework is a pipeline for the discovery of clusters of conserved residues on the surface of folded structural modules as a proxy for functional interfaces. The xProtCAS pipeline integrates information on residue solvent accessibility and conservation with topological data on residue proximity in three-dimensional space using graph-based methods to determine proximal clusters of relatively conserved residues on a protein's surface.
- Description
- Software prerequisites
- Installation methods
- Usage
- Input files
- Output files
- License
- Reference
The workflow of the xProtCAS framework includes eight major steps: (i) definition of the autonomous structural modules of a protein; (ii) calculation of the residue-centric accessibility and topology metrics for the structural module; (iii) calculation of the per residue conservation scores; (iv) creation of an edge-weighted directed graph encoding the structural and evolutionary properties for the structural module; (v) calculation of eigenvector centrality scores; (vi) definition of the conserved accessible surfaces using hierarchical clustering; and (vii) scoring and (viii) annotation of the conserved accessible surfaces. This stand-alone version can use either AlphaFold2 models or PDB structures and can define multiple surfaces. For more details, please refer to the paper.
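To make the graph stage concrete, the following is a minimal, self-contained Python sketch in the spirit of steps (iv)-(vi). It is illustrative only: the coordinates and conservation scores are synthetic, the graph is undirected, and the edge weighting is ad hoc, whereas the pipeline itself builds an edge-weighted directed graph as described in the paper.
import networkx as nx
import numpy as np
from scipy.spatial import Delaunay
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
coords = rng.random((30, 3)) * 20  # synthetic stand-in for residue coordinates
conservation = rng.random(30)      # synthetic stand-in for per-residue conservation scores

# (iv) Edge-weighted graph: residues are nodes, Delaunay neighbours are edges,
# and each edge is weighted (ad hoc) by the conservation of the residues it connects.
graph = nx.Graph()
for simplex in Delaunay(coords).simplices:
    for i in simplex:
        for j in simplex:
            if i < j:
                graph.add_edge(int(i), int(j), weight=float(conservation[i] + conservation[j]))

# (v) Eigenvector centrality scores on the weighted graph.
centrality = nx.eigenvector_centrality(graph, weight="weight", max_iter=1000)

# (vi) Hierarchical clustering of the most central residues into spatial patches.
top = sorted(centrality, key=centrality.get, reverse=True)[:10]
labels = AgglomerativeClustering(n_clusters=2).fit_predict(coords[top])
print(dict(zip(top, labels)))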
The pipeline relies on external software/libraries to handle Protein Data Bank files, compute mathematical/geometric features, and run graph algorithms. The following is the list of required libraries and programs, together with the versions on which the pipeline was tested (in parentheses); a sample install command follows the list.
- Python (3.8)
- BioPython (1.78). To parse PDB files.
- NetworkX (2.8). To run graph algorithms (centrality and community detection).
- NumPy (1.20.1). To handle N-dimensional arrays.
- SciPy (1.8.1). To implement the Delaunay triangulation and Mann-Whitney U test.
- Scikit-Learn (0.24.1). To run hierarchical clustering.
- Requests (2.25.1). To send HTTP requests.
- PyMOL (2.4). To create visualisation files of the detected patches (optional).
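If you prefer to install the Python dependencies yourself, the tested versions are all on PyPI and can be installed with a command along these lines (PyMOL is not on PyPI and is usually installed separately, e.g. via conda):
pip install biopython==1.78 networkx==2.8 numpy==1.20.1 scipy==1.8.1 scikit-learn==0.24.1 requests==2.25.1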
Choose one of the following three installation methods; using our prebuilt Docker image is the easiest way to get started.
Install the dependencies listed in the Software prerequisites section and clone the repository to a local directory:
git clone https://github.com/hkotb/xprotcas.git
cd xprotcas/src
Note It is more convenient to make xprotcas/src your current working directory before running the pipeline with Python commands on your local machine, as the default input and output paths are relative.
To build a Docker image from the provided Dockerfile, run the following commands:
git clone https://github.com/hkotb/xprotcas.git
cd xprotcas/
docker build -t xprotcas .
Note When using your built image later, you must refer to it by the name you chose (xprotcas).
To pull our prebuilt image, run the following command:
docker pull hkotb/xprotcas
Note When using the prebuilt image later, you must refer to it by its name (hkotb/xprotcas).
Run the pipeline help to display all arguments and their usage:
python pipeline_starter.py -h
usage: pipeline_starter.py [-h] [--output OUTPUT] [--input INPUT] [--uniprot UNIPROT] [--pdb PDB] [--create_pymol_session {true,false}] [--pdb_file PDB_FILE] [--conservations_file CONSERVATIONS_FILE]
[--split_into_domains {true,false}] [--predicted_aligned_error_file PREDICTED_ALIGNED_ERROR_FILE] [--orthdb_taxon_id {metazoa,qfo,vertebrates,mammalia}]
[--number_of_iterations NUMBER_OF_ITERATIONS] [--log {debug,info,warning,error,critical}] [--slim_server SLIM_SERVER]
Run functional regions detector.
optional arguments:
-h, --help show this help message and exit
--output OUTPUT Path to an output directory where the pipeline results will be saved.
--input INPUT Path to the directory where the input files should be saved.
--uniprot UNIPROT UniProt accession, --uniprot or --pdb should be passed, not both of them.
--pdb PDB PDB id, --uniprot or --pdb should be passed, not both of them.
--create_pymol_session {true,false}
PyMol session will be created if this option is true and PyMol is installed. (default: true)
--pdb_file PDB_FILE The file name of the resolved or predicted structure (in PDB format) (should be saved in the input directory). If it is passed, the pipeline will use it and not download a pdb
file based on uniprot/pdb options.
--conservations_file CONSERVATIONS_FILE
The name of conservation scores file (should be saved in the input directory). If it is passed, the pipeline will use it and not call SLiM RESTful APIs to get the conservation
scores based on uniprot/pdb options.
--split_into_domains {true,false}
If true, the pipeline will split AlphaFold predicted structure into domains. (default: true)
--predicted_aligned_error_file PREDICTED_ALIGNED_ERROR_FILE
The name of AlphaFold predicted aligned error file (should be saved in the input directory). If it is passed, the pipeline will use it instead of trying to download it from
AlphaFold database. It is used in splitting AlphaFold predicted structure into domains.
--orthdb_taxon_id {metazoa,qfo,vertebrates,mammalia}
The search database to find orthologous sequences to the query structure. It is used by SLiM tools to generate conservations. (default: metazoa)
--number_of_iterations NUMBER_OF_ITERATIONS
Maximum number of iterations the pipeline will perform (per each chain/domain). (default: 20)
--log {debug,info,warning,error,critical}
Specify the logging level. (default: info)
--slim_server SLIM_SERVER
SLiM tools server url.
To run the pipeline on UniProt accession Q9Y2M5, use the command:
python pipeline_starter.py --uniprot Q9Y2M5
using Docker:
docker run -t -d --name my_xprotcas_container hkotb/xprotcas
docker exec -it my_xprotcas_container python pipeline_starter.py --uniprot Q9Y2M5
docker cp my_xprotcas_container:/home/submitter/output ./
docker stop my_xprotcas_container
docker rm my_xprotcas_container
Note The run command creates a container, exec runs a command inside the created container, cp copies the output from inside the container to your local machine, and stop and rm stop the container and remove it. We prefer the cp command over mounting local directories as Docker volumes to avoid any permission problems between the host and the container.
To run the pipeline on specific files on your local machine:
python pipeline_starter.py --pdb_file AF-Q9Y2M5-F1-model_v2.pdb --predicted_aligned_error_file AF-Q9Y2M5-F1-predicted_aligned_error_v2.json --conservations_file Q9Y2M5.json
Note Assuming the files are placed in the default input directory, there is no need to pass the --input argument.
using Docker:
docker run -t -d --name my_xprotcas_container hkotb/xprotcas
docker cp ./input my_xprotcas_container:/home/submitter/
docker exec -it my_xprotcas_container python pipeline_starter.py --pdb_file AF-Q9Y2M5-F1-model_v2.pdb --predicted_aligned_error_file AF-Q9Y2M5-F1-predicted_aligned_error_v2.json --conservations_file Q9Y2M5.json
docker cp my_xprotcas_container:/home/submitter/output ./
docker stop my_xprotcas_container
docker rm my_xprotcas_container
Note Here we have added one extra cp step to copy the input directory from your local machine to the container before executing the command.
To run the pipeline on UniProt accession Q9Y2M5 but with a specific conservations file from your local machine:
python pipeline_starter.py --uniprot Q9Y2M5 --conservations_file Q9Y2M5.json
using Docker:
docker run -t -d --name my_xprotcas_container hkotb/xprotcas
docker cp ./input my_xprotcas_container:/home/submitter/
docker exec -it my_xprotcas_container python pipeline_starter.py --uniprot Q9Y2M5 --conservations_file Q9Y2M5.json
docker cp my_xprotcas_container:/home/submitter/output ./
docker stop my_xprotcas_container
docker rm my_xprotcas_container
To run the pipeline on PDB ID 6GY5:
python pipeline_starter.py --pdb 6GY5
using Docker:
docker run -t -d --name my_xprotcas_container hkotb/xprotcas
docker exec -it my_xprotcas_container python pipeline_starter.py --pdb 6GY5
docker cp my_xprotcas_container:/home/submitter/output ./
docker stop my_xprotcas_container
docker rm my_xprotcas_container
To run the four previous examples in sequence using Docker:
docker run -t -d --name my_xprotcas_container hkotb/xprotcas
docker cp ./input my_xprotcas_container:/home/submitter/
docker exec -it my_xprotcas_container python pipeline_starter.py --uniprot Q9Y2M5
docker exec -it my_xprotcas_container python pipeline_starter.py --pdb_file AF-Q9Y2M5-F1-model_v2.pdb --predicted_aligned_error_file AF-Q9Y2M5-F1-predicted_aligned_error_v2.json --conservations_file Q9Y2M5.json
docker exec -it my_xprotcas_container python pipeline_starter.py --uniprot Q9Y2M5 --conservations_file Q9Y2M5.json
docker exec -it my_xprotcas_container python pipeline_starter.py --pdb 6GY5
docker cp my_xprotcas_container:/home/submitter/output ./
docker stop my_xprotcas_container
docker rm my_xprotcas_container
The pipeline will fetch the input files automatically, but you can optionally provide them:
.
├── PDB # Downloaded from the PDB or AlphaFold databases
├── Conservations # Created via the SLiM tools RESTful APIs
└── Predicted aligned error # Downloaded from the AlphaFold database
If you want to create your own conservations file instead of using the results returned by the SLiM tools RESTful APIs, use the following format:
{
"data":{
"A":{
"317":0.6376021525986822,
"318":0.8349590754476021,
"319":0.6597751935569642,
.
.
.
"599":0.17295363833060104,
"600":0.21248774408861887,
"601":0.14574453946172294
},
"U":{
"1334":0.6756696043085054,
"1335":0.5544372819953505,
"1336":0.5697273349944153,
.
.
.
"1342":0.1550935920994699,
"1343":0.09108525697019905,
"1344":0.07128042175662261
}
}
}
Where "A" and "U" are the chains in the pdb file. "317":0.6376021525986822 assigns conservation score 0.6376021525986822 to the residue with sequence number 317, etc.
The pipeline will produce the following files:
.
├── *.pdb # The downloaded PDB file (if you didn't use --pdb_file).
├── *.conservations.json # The conservations file created from the SLiM tools RESTful APIs (if you didn't use --conservations_file).
├── *-predicted_aligned_error_v2.json # The downloaded predicted aligned error file (if you used --uniprot, --split_into_domains was true, and you didn't use --predicted_aligned_error_file).
├── merged_data.json # The results, including centrality scores and ranked patches.
└── *.pse # PyMOL session file (if PyMOL is installed and --create_pymol_session is true).
The merged_data.json file is the major output of the pipeline and has the following format:
.
├── chain
│ ├── domain
│ │ ├── "residues"
│ │ │ ├── residue's sequence number
│ │ │ │ ├── "accessibility" # Always has the value 1 as the results contain accessible residues only.
│ │ │ │ ├── "direct_neighbors" # List of residues neighbouring this residue in the Delaunay triangulation.
│ │ │ │ ├── "conservation" # Conservation score of this residue (from the conservations file).
│ │ │ │ ├── "score_1" # The centrality score for this residue on the first iteration.
. . . . .
. . . . .
. . . . .
│ │ │ │ └── "score_X" # The centrality score for this residue on the Xth iteration, where X is the iteration in which this residue was selected into the patch of central residues.
. . . .
. . . .
. . . .
│ │ │ └── residue's sequence number
│ │ ├── "patch_1" # The first patch. This is the patch extracted in the first iteration of the pipeline.
│ │ │ ├── "residues" # List of residues in the first patch.
│ │ │ ├── "patch_conservation_mean" # Mean patch conservation.
│ │ │ ├── "patch_conservation_difference" # The difference between mean patch conservation and mean non-patch conservation.
│ │ │ └── "patch_conservation_pvalue" # p-value of the Mann-Whitney U test of patch conservations to the non-patch conservations.
. . .
. . .
. . .
│ │ └── "patch_Y" # The Yth patch, extracted in the Yth iteration of the pipeline. Y is the last iteration in which the pipeline managed to extract a patch; it may equal --number_of_iterations, but the pipeline can stop iterating earlier, e.g. if the centrality algorithm fails to converge or the remaining graph is empty.
{
"A":{
"1":{
"residues":{
"318": {
"accessibility": 1,
"direct_neighbors": [
"598",
"551",
"338",
"600",
"553",
"319",
"554"
],
"conservation": 0.8349590754476024,
"score_1": 0.00033068561774767614,
"score_2": 0.003976137444726946,
"score_3": 0.004157688400901547,
"score_4": 0.03592842084943813,
"score_5": 0.01930988819468361,
"score_6": 0.19329758289234808
},
.
.
.
"599": {
"accessibility": 1,
"direct_neighbors": [
"387",
"365",
"600",
"319",
"321"
],
"conservation": 0.17295363833060104,
"score_1": 0.00010753379553353566,
"score_2": 0.0022105078526197678,
"score_3": 0.001117286314411295,
"score_4": 0.013368541320983066,
"score_5": 0.0013258888042676014,
"score_6": 0.01427290268982088,
"score_7": 0.05193837060072906,
"score_8": 1.6861154589542962e-7,
"score_9": 0.00037484282552134704,
"score_10": 0.000017629096441159546,
"score_11": 0.04617722058371434,
"score_12": -1.2681081705743956e-15,
"score_13": 1.7241949030324695e-15,
"score_14": 0.2616932881394638
},
"600": {
"accessibility": 1,
"direct_neighbors": [
"599",
"362",
"365",
"318",
"598",
"319"
],
"conservation": 0.21248774408861887,
"score_1": 0.0000761960316721797,
"score_2": 0.0016847472149046766,
"score_3": 0.000997893003956213,
"score_4": 0.01343725373983663,
"score_5": 0.004481525100512187,
"score_6": 0.04512572092482299,
"score_7": 0.04099458733679399,
"score_8": 2.031085956839882e-7,
"score_9": 0.0004508437188904322,
"score_10": 0.000021196411712485968,
"score_11": 0.055361193620167434,
"score_12": -1.4889439728584647e-15,
"score_13": 1.9990920480167914e-15,
"score_14": 0.3074308020579626
}
},
"patch_1": {
"residues": [
"405",
"421",
"499",
.
.
.
"562",
"592",
"326"
],
"patch_conservation_mean": 0.9575961426660872,
"patch_conservation_difference": 0.30507939164836484,
"patch_conservation_pvalue": 8.596211261183572e-7
},
.
.
.
"patch_20": {
"residues": [
"508",
"506",
"507"
],
"patch_conservation_mean": 0.187915590231306,
"patch_conservation_difference": -0.4941059262313974,
"patch_conservation_pvalue": 0.9934569500920662
}
},
"2":{
"residues":{
"41": {
"accessibility": 1,
"direct_neighbors": [
"43",
"42"
],
"conservation": 0.31485642770736555,
"score_1": 4.095330317951117e-9,
"score_2": 0.000004357793466783931,
"score_3": 0.000015317220742269545,
"score_4": 0.0000010318277367862613,
"score_5": 0.00000546114369877771,
"score_6": 0.000028724474458458038,
"score_7": -7.920020924280168e-18,
"score_8": 0.0006733410768030233,
"score_9": 0.000003798160113791288,
"score_10": 5.376602530989818e-18,
"score_11": 0.003945267490756683,
"score_12": -8.031149324261881e-17,
"score_13": -8.451131673320225e-17,
"score_14": -6.493675595735365e-16,
"score_15": -2.0968819501682493e-15,
"score_16": -4.76131098384193e-16,
"score_17": 0.47769033067499245
},
.
.
.
"312": {
"accessibility": 1,
"direct_neighbors": [
"222",
"221",
"310",
"309",
"311",
"223"
],
"conservation": 0.8056384639036711,
"score_1": 0.05457085596700717
}
},
"patch_1": {
"residues": [
"295",
"294",
.
.
.
"281",
"312"
],
"patch_conservation_mean": 0.8518427508531001,
"patch_conservation_difference": 0.21933607974408698,
"patch_conservation_pvalue": 3.1043509158711496e-7
},
.
.
.
"patch_20": {
"residues": [
"56",
"57",
"60",
"53",
"58",
"54"
],
"patch_conservation_mean": 0.21788566503943327,
"patch_conservation_difference": -0.46527161683305185,
"patch_conservation_pvalue": 0.9998017813668787
}
}
}
}
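As a minimal sketch (assuming the default output location and the structure shown above), the results can be loaded and the top-ranked patch of each chain and domain summarised as follows:
import json

# Load the pipeline results; the path assumes the default output directory.
with open("output/merged_data.json") as handle:
    merged = json.load(handle)

for chain, domains in merged.items():
    for domain, content in domains.items():
        patch = content.get("patch_1")  # patches are ranked, so patch_1 is the top-ranked one
        if patch is None:
            continue
        print(f"chain {chain}, domain {domain}: {len(patch['residues'])} residues, "
              f"mean conservation {patch['patch_conservation_mean']:.3f}, "
              f"p-value {patch['patch_conservation_pvalue']:.2e}")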
The *.pse file can be loaded with PyMOL. It enables the visualisation of all the detected patches with their ranks.
Note In this example, chain A is split into two domains, A1 and A2. A.excluded contains the excluded residues, either because they are in the protein's core or because they are disordered. Red colouring on the 3D structure represents high scores, while blue represents average scores.
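To open the session from a terminal (assuming PyMOL is installed and on your PATH; substitute the actual session file name produced in your output directory):
pymol output/<session_name>.pse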
This source code is licensed under the MIT license found in the LICENSE
file in the root directory of this source tree.
If you find the pipeline useful in your research, we ask that you cite our paper:
@Article{biom13060906,
AUTHOR = {Kotb, Hazem M. and Davey, Norman E.},
TITLE = {xProtCAS: A Toolkit for Extracting Conserved Accessible Surfaces from Protein Structures},
JOURNAL = {Biomolecules},
VOLUME = {13},
YEAR = {2023},
NUMBER = {6},
ARTICLE-NUMBER = {906},
URL = {https://www.mdpi.com/2218-273X/13/6/906},
ISSN = {2218-273X},
DOI = {10.3390/biom13060906}
}