
Pocket representation #103

Closed
4 tasks done
jyaacoub opened this issue May 31, 2024 · 1 comment · Fixed by #135
Labels: enhancement (New feature or request), high priority


jyaacoub commented May 31, 2024

Pocket-only representation

To avoid building entirely separate datasets for the pocket representation, this implementation should just get the index positions of our binding-pocket residues and then apply a mask to the original graph (similar to how it is done with the `dropout_node` function in PyTorch Geometric).

Task list:

KLIFS Database

This is used by KBDNet to get the binding pockets for Davis and Kiba (the pocket is 85 residues long).

The sequence returned by KLIFS is not contiguous and contains only the relevant pocket residues.

  • So to match it up with our structures, we need to run a sequence alignment and then get the index positions of those residues.

Once we have the list of index positions for the binding-pocket amino acids, we can modify our existing graph by applying a mask.

  • This way we don't need to build entirely separate databases; we can just apply the mask before inference.
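The masking step above can be sketched in plain Python (toy graph and hypothetical pocket indices; in the real dataset this would operate on the PyTorch Geometric `edge_index` tensor):

```python
# Sketch of applying a pocket mask to an existing residue graph.
# Toy numbers: 6 residues with chain contacts; pocket indices are hypothetical
# and would come from aligning the KLIFS pocket sequence to our structure.
pocket_idx = [1, 2, 4]                             # binding-pocket residues
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]   # toy contact edges

keep = set(pocket_idx)
# 1. Keep only edges whose endpoints are both inside the pocket
kept_edges = [(u, v) for u, v in edges if u in keep and v in keep]
# 2. Renumber nodes so edge indices stay in-bounds after masking
new_id = {old: new for new, old in enumerate(pocket_idx)}
pocket_edges = [(new_id[u], new_id[v]) for u, v in kept_edges]

print(pocket_edges)  # [(0, 1)] -- only the 1-2 contact survives
```

The renumbering in step 2 matters: after masking, stale edge indices would point past the new node count and trigger out-of-bounds errors.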

Getting pockets for Kiba

  1. Using the UniProt ID, we can get the pocket from the KLIFS database via the /kinase_ID API.
    1. For example: https://klifs.net/api/kinase_ID?kinase_name=O00141&species=HUMAN returns the kinase entry (screenshot omitted).
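The query above can be wrapped in a small helper (the endpoint is the one shown; the helper name is ours, and the response fields would need checking against the KLIFS docs):

```python
from urllib.parse import urlencode

KLIFS_API = "https://klifs.net/api/kinase_ID"

def klifs_kinase_url(kinase_name: str, species: str = "HUMAN") -> str:
    """Build the KLIFS /kinase_ID query URL (endpoint from the example above)."""
    return f"{KLIFS_API}?{urlencode({'kinase_name': kinase_name, 'species': species})}"

print(klifs_kinase_url("O00141"))
# https://klifs.net/api/kinase_ID?kinase_name=O00141&species=HUMAN

# Fetching (requires network; JSON field names assumed):
# import requests
# entry = requests.get(klifs_kinase_url("O00141")).json()[0]
```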

Getting pockets for Davis:

Same as for Kiba, but we use the raw gene name (after removing any mutation or phosphorylation information): ABL1(F317I)p -> ABL1
1. For example: https://klifs.net/api/kinase_ID?kinase_name=ABL1&species=HUMAN returns the kinase entry (screenshot omitted).
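The gene-name cleanup could look like this (patterns assumed from the ABL1(F317I)p example; other Davis naming quirks may need handling):

```python
import re

def clean_gene_name(raw: str) -> str:
    """Strip mutation '(...)' and a trailing phosphorylation 'p' marker from a
    Davis gene name. Pattern assumed from the ABL1(F317I)p -> ABL1 example;
    note a kinase name legitimately ending in 'p' would need special-casing."""
    name = re.sub(r'\(.*?\)', '', raw)  # drop mutation info, e.g. (F317I)
    name = re.sub(r'p$', '', name)      # drop trailing phosphorylation marker
    return name

print(clean_gene_name('ABL1(F317I)p'))  # ABL1
```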

However, for mutated genes we must be careful with the sequence alignment, and must follow this procedure to get the right amino-acid index positions:

  1. Reverting the mutation
  2. Perform alignment
  3. Extract index positions

Then we just use these positions for our mask on the original (mutated) graph.


jyaacoub commented Aug 6, 2024

Building the pocket dataset

Assumes that we have a normal dataset built already.

1. Get and mask pockets with KLIFS

This should be done first on the login node, since it queries the KLIFS database for the sequences and caches them locally. Then we can use the `skip_download` arg to run this on the compute node for the rest of the datasets.

```python
# Building pocket datasets:
from src.utils.pocket_alignment import pocket_dataset_full

data_dir = '/cluster/home/t122995uhn/projects/data/'
db_type = ['kiba', 'davis']
db_feat = ['nomsa_binary_original_binary', 'nomsa_aflow_original_binary',
           'nomsa_binary_gvp_binary',      'nomsa_aflow_gvp_binary']

for t in db_type:
    for f in db_feat:
        print(f'\n---{t}-{f}---\n')
        dataset_dir = f"{data_dir}/DavisKibaDataset/{t}/{f}/full"
        save_dir    = f"{data_dir}/v131/DavisKibaDataset/{t}/{f}/full"

        pocket_dataset_full(
            dataset_dir=dataset_dir,
            pocket_dir=f"{data_dir}/{t}/",
            save_dir=save_dir,
            skip_download=True,
        )
```
        

2. Resplit the database:

```python
import os
import logging
from src.data_prep.init_dataset import create_datasets
from src import cfg

cfg.logger.setLevel(logging.DEBUG)

dbs = [cfg.DATA_OPT.davis, cfg.DATA_OPT.kiba]
splits = ['davis', 'kiba']
splits = ['/cluster/home/t122995uhn/projects/MutDTA/splits/' + s for s in splits]
print(splits)

for split, db in zip(splits, dbs):
    print('\n', split, db)
    create_datasets(db,
                    feat_opt=cfg.PRO_FEAT_OPT.nomsa,
                    edge_opt=[cfg.PRO_EDGE_OPT.binary, cfg.PRO_EDGE_OPT.aflow],
                    ligand_features=[cfg.LIG_FEAT_OPT.original, cfg.LIG_FEAT_OPT.gvp],
                    ligand_edges=cfg.LIG_EDGE_OPT.binary, overwrite=False,
                    k_folds=5,
                    test_prots_csv=f'{split}/test.csv',
                    val_prots_csv=[f'{split}/val{i}.csv' for i in range(5)])
                    # data_root=os.path.abspath('../data/test/'))
```

3. Test inference

```python
from src import cfg
from src.utils.loader import Loader

# db2 = Loader.load_dataset(cfg.DATA_OPT.davis,
#                           cfg.PRO_FEAT_OPT.nomsa, cfg.PRO_EDGE_OPT.aflow,
#                           path='/cluster/home/t122995uhn/projects/data/',
#                           subset="full")

db2 = Loader.load_DataLoaders(cfg.DATA_OPT.davis,
                              cfg.PRO_FEAT_OPT.nomsa, cfg.PRO_EDGE_OPT.aflow,
                              path='/cluster/home/t122995uhn/projects/data/v131',
                              training_fold=0,
                              batch_train=2)

# Grab a single batch from the training loader
for b2 in db2['train']:
    break

m = Loader.init_model(cfg.MODEL_OPT.DG, cfg.PRO_FEAT_OPT.nomsa, cfg.PRO_EDGE_OPT.aflow,
                      dropout=0.3480, output_dim=256)

# Forward pass on the pocket batch
m(b2['protein'], b2['ligand'])
```

jyaacoub added a commit that referenced this issue Aug 7, 2024
… index renumbering #103

- Had to make some modifications, since the edge index needs to be updated after applying the mask so that it still points to the right nodes and we don't get an IndexError for being out of bounds.

- Also fixed an error due to not removing all proteins without pocket sequences (line 216 saved the old dataset instead of the new one).

- Successfully built pocket datasets for davis and kiba

#131 #103
jyaacoub added a commit that referenced this issue Aug 7, 2024
jyaacoub added a commit that referenced this issue Aug 12, 2024
Aflow still underperforms here...