
Updated results for distribution drift issue #131

Closed
3 of 4 tasks
jyaacoub opened this issue Aug 2, 2024 · 3 comments · Fixed by #132
Labels
main hurdle/issue: This is an issue that was a pivotal moment during the project.

Comments

@jyaacoub
Owner

jyaacoub commented Aug 2, 2024

There is a significant distribution drift due to the new training split we had created to exclude any proteins in OncoKB...

  • This is evident from the drastic difference in performance when training with this split vs. training on a random split.
  • Most evident in the PDBbind dataset.
  • get distributional stats to provide additional support for this claim
    • Via sequence identity?

TODOs:

  • build new dataset with a random, but unified test set (any overlapping proteins across datasets must land on the same side of the split: if a protein is in one dataset's test set, it must also be in the others' test sets). A minimal sketch of this is included after this list.
  • Identify the list of ligand-protein pairs from oncoKB that are not in the training set, which can be used for Mutagenesis analysis.
  • profit??
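
For the unified test set TODO above, a minimal sketch of the idea is below. It assumes each dataset's cleaned_XY.csv has a prot_id column; the paths, seed, and 10% test fraction are placeholders, not the actual values used. Sampling test proteins once over the union of all datasets guarantees that a protein shared between datasets always lands on the same side of the split.

#%%
# Minimal sketch of a unified random test split (illustrative paths/fractions).
import random
import pandas as pd

datasets = {name: pd.read_csv(f'/path/to/{name}/cleaned_XY.csv')  # hypothetical paths
            for name in ['pdbbind', 'davis', 'kiba']}

# sample test proteins once over the union of all datasets so that a protein
# shared across datasets always ends up on the same side of the split
all_prots = sorted(set().union(*(set(df.prot_id) for df in datasets.values())))
random.seed(0)
test_prots = set(random.sample(all_prots, k=int(0.1 * len(all_prots))))

for name, df in datasets.items():
    test_mask = df.prot_id.isin(test_prots)
    df[test_mask].to_csv(f'/path/to/{name}/test.csv', index=False)
    df[~test_mask].to_csv(f'/path/to/{name}/train_val.csv', index=False)
    print(name, 'train/val:', (~test_mask).sum(), 'test:', test_mask.sum())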
@jyaacoub jyaacoub added the "main hurdle/issue" label Aug 2, 2024
@jyaacoub
Owner Author

jyaacoub commented Aug 2, 2024

Get test OncoKBs

PDBbind

Using both UniProt ID and PDB ID:
[screenshot of the resulting overlap omitted]
Filtering by both gene name and drug name is difficult since PDBbind uses short-form ligand codes like "NLG" while OncoKB uses the full drug name, and each drug has multiple aliases, so the short-form name might not match. We would have to check against all possible aliases to be sure we had the right drug-target pair.
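
If drug-level matching is ever needed, one hypothetical way to handle the alias problem is to normalize both names through a single alias-to-canonical lookup before comparing. The drug_aliases.csv table and its columns below are assumptions (no such mapping exists in this repo) and would have to be built from a synonym source such as PubChem or ChEMBL.

# Hypothetical sketch of alias-aware drug matching; 'drug_aliases.csv' with
# 'alias' and 'canonical' columns is assumed, not an existing file.
import pandas as pd

aliases = pd.read_csv('drug_aliases.csv')
alias_to_canonical = {a.strip().lower(): c.strip().lower()
                      for a, c in zip(aliases['alias'], aliases['canonical'])}

def same_drug(pdbbind_ligand: str, oncokb_drug: str) -> bool:
    """True if both names resolve to the same canonical drug."""
    a = alias_to_canonical.get(pdbbind_ligand.strip().lower(), pdbbind_ligand.strip().lower())
    b = alias_to_canonical.get(oncokb_drug.strip().lower(), oncokb_drug.strip().lower())
    return a == b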

Code

#%%
import pandas as pd

def get_test_oncokbs(train_df=pd.read_csv('/cluster/home/t122995uhn/projects/data/test/PDBbindDataset/nomsa_binary_original_binary/full/cleaned_XY.csv'),
                     biomart_fp='/cluster/home/t122995uhn/projects/data/tcga/mart_export.tsv', 
                     oncokb_fp='/cluster/home/t122995uhn/projects/downloads/oncoKB_DrugGenePairList.csv'):
    """Return the OncoKB drug-gene pairs whose genes are NOT covered by train_df."""
    # Get gene names for PDBbind from the BioMart export (UniProt/PDB ID -> gene name)
    dfbm = pd.read_csv(biomart_fp, sep='\t')
    dfbm['PDB ID'] = dfbm['PDB ID'].str.lower()
    train_df.reset_index(names='idx', inplace=True)

    df_uni = train_df.merge(dfbm, how='inner', left_on='prot_id', right_on='UniProtKB/Swiss-Prot ID')
    df_pdb = train_df.merge(dfbm, how='inner', left_on='code', right_on='PDB ID')

    # identifying overlap with oncoKB
    # df_all will have duplicate entries for entries with multiple gene names...
    df_all = pd.concat([df_uni, df_pdb]).drop_duplicates(['idx', 'Gene name'])[['idx', 'code', 'Gene name']]

    dfkb = pd.read_csv(oncokb_fp)
    df_all_kb = df_all.merge(dfkb.drop_duplicates('gene'), left_on='Gene name', right_on='gene', how='inner')

    trained_genes = set(df_all_kb.gene)

    # Identify OncoKB genes that were never seen in training
    return dfkb[~dfkb['gene'].isin(trained_genes)]


train_df = pd.read_csv('/cluster/home/t122995uhn/projects/data/test/PDBbindDataset/nomsa_binary_original_binary/train0/cleaned_XY.csv')
val_df = pd.read_csv('/cluster/home/t122995uhn/projects/data/test/PDBbindDataset/nomsa_binary_original_binary/val0/cleaned_XY.csv')

train_df = pd.concat([train_df, val_df])

get_test_oncokbs(train_df=train_df)

Davis

jyaacoub added a commit that referenced this issue Aug 2, 2024
To limit distribution drift issues mentioned in #131.
jyaacoub added a commit that referenced this issue Aug 2, 2024
revert(splits): use random split to resolve distribution drift (#131)
@jyaacoub
Owner Author

jyaacoub commented Aug 2, 2024

SUMMARY (see below for stats on distributions - oncokb vs random split dataset):

The distributions look visually different in terms of highly targeted proteins, but when running a similarity scoring algorithm to quantify the difference between the two distributions (random split vs OncoKB split), there was no real difference; this could, however, be a fault of the scoring algorithm.

@jyaacoub jyaacoub reopened this Aug 2, 2024
@jyaacoub jyaacoub changed the title from "Distribution Drift issue with training and test dataset" to "Updated results for distribution drift issue" Aug 2, 2024
jyaacoub added a commit that referenced this issue Aug 7, 2024
… index renumbering #103

- Had to make some modifications since the edge index needs to be updated after applying the mask so that it still points to the right nodes and we don't get an "IndexError" for being out of bounds (a minimal sketch of this re-indexing is included after this commit note).

- Also fixed an error caused by not removing all proteins without pocket sequences (line 216 saved the old dataset instead of the new one).

- Successfully built pocket datasets for davis and kiba

#131 #103
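
For reference, the edge-index fix described in that commit boils down to remapping old node indices to new contiguous ones after the mask is applied. The sketch below assumes a PyTorch-Geometric-style [2, E] edge_index tensor and is illustrative rather than the actual dataset code.

# Minimal sketch of re-indexing an edge_index after a node mask is applied.
import torch

def apply_node_mask(x: torch.Tensor, edge_index: torch.Tensor, mask: torch.Tensor):
    """Keep only nodes where mask is True and renumber edge_index to match."""
    # map old node indices -> new contiguous indices (-1 marks dropped nodes)
    new_idx = torch.full((x.size(0),), -1, dtype=torch.long)
    new_idx[mask] = torch.arange(int(mask.sum()))

    # drop edges touching a removed node, then remap the survivors;
    # skipping this remap is what causes the out-of-bounds IndexError
    keep = mask[edge_index[0]] & mask[edge_index[1]]
    return x[mask], new_idx[edge_index[:, keep]]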
@jyaacoub
Owner Author

jyaacoub commented Aug 8, 2024

stats on distribution differences (oncokb split vs random split):

Small difference in the % of heavily targeted proteins that appear in the test set (the OncoKB test set had fewer heavily targeted proteins).

  • Likely because we hand-picked the heavily targeted OncoKB proteins, so the remaining proteins used to fill out the test set were lightly targeted, and the majority of the test set ended up being lightly targeted.

Distribution of the test splits:

Poor performance is likely due to the higher diversity of the OncoKB test set: the model has to be good at predicting a larger range of proteins, which makes it a harder set to be evaluated on.

Details

Full distribution:

We can see that the OncoKB test split has far more lowly targeted proteins in the test set, which are naturally harder for the model to be evaluated on because of the wider range of proteins the model has to be good at.

This is counterintuitive, but the reason is that we hand-picked the more heavily targeted proteins for the test set, and because we want the test sets to remain the same size, the splitting code then picked lowly targeted proteins to reach that size constraint. The random split, in contrast, has a more even distribution that more closely mimics the training split.
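
A rough sketch of that fill step is below, assuming a per-protein frequency ranking; the hand_picked set, target size, and rarest-first selection order are assumptions for illustration, not the actual splitting code.

# Rough sketch of the test-set fill step (illustrative, not the actual code).
import pandas as pd

def fill_test_set(df: pd.DataFrame, hand_picked: set, target_size: int) -> pd.DataFrame:
    """Start from the hand-picked (heavily targeted) proteins, then add the
    least-frequent remaining proteins until the test set reaches target_size."""
    test = df[df.prot_id.isin(hand_picked)]
    # remaining proteins, rarest first - this is why the OncoKB test split
    # ends up dominated by lowly targeted proteins
    freq = df[~df.prot_id.isin(hand_picked)].prot_id.value_counts()
    for prot in freq.index[::-1]:
        if len(test) >= target_size:
            break
        test = pd.concat([test, df[df.prot_id == prot]])
    return test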
[figure: full distribution of per-protein frequencies, OncoKB vs random test split]

Limit to top n frequent proteins

This really shows the impact of hand-picking the proteins for the test set; the hand-picked proteins appear as the green spikes from the OncoKB test split.
[figures: same comparison limited to the top n most frequent proteins, at increasing minimum frequency thresholds]

Code

#%%
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

new = '/cluster/home/t122995uhn/projects/splits/new/pdbbind/'

train_df = pd.concat([pd.read_csv(f'{new}train0.csv'), 
                      pd.read_csv(f'{new}val0.csv')], axis=0)
test_df = pd.read_csv(f'{new}test.csv')

all_df = pd.concat([train_df, test_df], axis=0)
print(len(all_df))


#%%
old = '/cluster/home/t122995uhn/projects/splits/old/pdbbind/'
old_test_df = pd.read_csv(f'{old}test.csv')
old_train_df = all_df[~all_df['code'].isin(old_test_df['code'])]

# %%
# this will give us an estimate to how well targeted the training proteins are vs the test proteins
def proteins_targeted(train_df, test_df, split='new', min_freq=0, normalized=False):
    # protein count comparison (number of diverse proteins)
    plt.figure(figsize=(18,8))
    # x-axis is the normalized frequency, y-axis is the number of proteins that have that frequency (also normalized)
    vc = train_df.prot_id.value_counts()
    vc = vc[vc > min_freq]
    train_counts = list(vc/len(train_df)) if normalized else vc.values  # normalize by its own split size
    vc = test_df.prot_id.value_counts()
    vc = vc[vc > min_freq]
    test_counts = list(vc/len(test_df)) if normalized else vc.values

    sns.histplot(train_counts, 
                bins=50, stat='density', color='green', alpha=0.4)
    sns.histplot(test_counts, 
                bins=50,stat='density', color='blue', alpha=0.4)

    sns.kdeplot(train_counts, color='green', alpha=0.8)
    sns.kdeplot(test_counts, color='blue', alpha=0.8)

    plt.xlabel(f"{'normalized ' if normalized else ''} frequency")
    plt.ylabel("normalized number of proteins with that frequency")
    plt.title(f"Targeted differences for {split} split{f' (> {min_freq})' if min_freq else ''}")
    if not normalized:
        plt.xlim(-8,100)

# proteins_targeted(old_train_df, old_test_df, split='oncoKB')
# plt.show()
# proteins_targeted(train_df, test_df, split='random')
# plt.show()


proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test')
plt.show()
proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test', min_freq=5)
plt.show()
proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test', min_freq=10)
plt.show()
proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test', min_freq=15)
plt.show()
proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test', min_freq=20)
plt.show()

Similarity scores between training and testing

No meaningful difference was found: -5.5718 (random split) vs -5.914 (OncoKB split)

Details

Code

#%%
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

new = '/cluster/home/t122995uhn/projects/splits/new/pdbbind/'

train_df = pd.concat([pd.read_csv(f'{new}train0.csv'), 
                      pd.read_csv(f'{new}val0.csv')], axis=0)
test_df = pd.read_csv(f'{new}test.csv')

all_df = pd.concat([train_df, test_df], axis=0)
print(len(all_df))


#%%
old = '/cluster/home/t122995uhn/projects/splits/old/pdbbind/'
old_test_df = pd.read_csv(f'{old}test.csv')
old_train_df = all_df[~all_df['code'].isin(old_test_df['code'])]
# %%
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import substitution_matrices

from tqdm import tqdm
import random

def get_group_similarity(group1, group2):
    # Choose a substitution matrix (e.g., BLOSUM62)
    matrix = substitution_matrices.load("BLOSUM62")

    # Define gap penalties
    gap_open = -10
    gap_extend = -0.5

    # Function to calculate pairwise similarity score
    def calculate_similarity(seq1, seq2):
        alignments = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
        return alignments[0][2]  # Return the score of the best alignment

    # Compute pairwise similarity between all sequences in group1 and group2
    similarity_scores = []
    for seq1 in group1:
        for seq2 in group2:
            score = calculate_similarity(seq1, seq2)
            similarity_scores.append(score)

    # Calculate the average similarity score
    average_similarity = sum(similarity_scores) / len(similarity_scores)
    return similarity_scores, average_similarity


# sample 5 sequences at random from each split, 100 times
train_seq = old_train_df.prot_seq.drop_duplicates().to_list()
test_seq = old_test_df.prot_seq.drop_duplicates().to_list()
sample_size = 5
trials = 100

est_similarity = 0
for _ in tqdm(range(trials)):
    _, avg = get_group_similarity(random.sample(train_seq, sample_size), 
                                  random.sample(test_seq, sample_size))
    est_similarity += avg

print(est_similarity/trials)  # average similarity over all trials

@jyaacoub jyaacoub pinned this issue Sep 15, 2024