
Updated results for distribution drift issue #131

Closed
3 of 4 tasks
jyaacoub opened this issue Aug 2, 2024 · 3 comments · Fixed by #132
Labels
main hurdle/issue: This is an issue that was a pivotal moment during the project.

Comments

@jyaacoub
Owner

jyaacoub commented Aug 2, 2024

There is a significant distribution drift due to the new training split we had created to exclude any proteins in OncoKB...

  • This is evident from the drastic difference in performance when training with this split vs. training on a random split.
  • Most evident in the PDBbind dataset.
  • get distributional stats to provide additional support for this claim
    • Via sequence identity?

TODOs:

  • build new dataset with a random, but unified test set (any overlapping proteins across datasets must land on the same side of the split: if a protein is in one dataset's test set, it must also be in the others' test sets). A minimal sketch of this is included after this list.
  • Identify the list of ligand-protein pairs from oncoKB that are not in the training set, which can be used for Mutagenesis analysis.
  • profit??
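
For the unified test set TODO above, a minimal sketch of the idea is below. It assumes each dataset's cleaned_XY.csv has a prot_id column; the paths, seed, and 10% test fraction are placeholders, not the actual values used. Sampling test proteins once over the union of all datasets guarantees that a protein shared between datasets always lands on the same side of the split.

#%%
# Minimal sketch of a unified random test split (illustrative paths/fractions).
import random
import pandas as pd

datasets = {name: pd.read_csv(f'/path/to/{name}/cleaned_XY.csv')  # hypothetical paths
            for name in ['pdbbind', 'davis', 'kiba']}

# sample test proteins once over the union of all datasets so that a protein
# shared across datasets always ends up on the same side of the split
all_prots = sorted(set().union(*(set(df.prot_id) for df in datasets.values())))
random.seed(0)
test_prots = set(random.sample(all_prots, k=int(0.1 * len(all_prots))))

for name, df in datasets.items():
    test_mask = df.prot_id.isin(test_prots)
    df[test_mask].to_csv(f'/path/to/{name}/test.csv', index=False)
    df[~test_mask].to_csv(f'/path/to/{name}/train_val.csv', index=False)
    print(name, 'train/val:', (~test_mask).sum(), 'test:', test_mask.sum())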
@jyaacoub jyaacoub added the "main hurdle/issue" label Aug 2, 2024
@jyaacoub
Owner Author

jyaacoub commented Aug 2, 2024

Get test OncoKBs

PDBbind

Using both UniProt ID and PDB ID:
[screenshot of the resulting overlap omitted]
Filtering by both gene name and drug name is difficult since PDBbind uses short-form ligand codes like "NLG" while OncoKB uses the full drug name, and each drug has multiple aliases, so the short-form name might not match. We would have to check against all possible aliases to be sure we had the right drug-target pair.
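
If drug-level matching is ever needed, one hypothetical way to handle the alias problem is to normalize both names through a single alias-to-canonical lookup before comparing. The drug_aliases.csv table and its columns below are assumptions (no such mapping exists in this repo) and would have to be built from a synonym source such as PubChem or ChEMBL.

# Hypothetical sketch of alias-aware drug matching; 'drug_aliases.csv' with
# 'alias' and 'canonical' columns is assumed, not an existing file.
import pandas as pd

aliases = pd.read_csv('drug_aliases.csv')
alias_to_canonical = {a.strip().lower(): c.strip().lower()
                      for a, c in zip(aliases['alias'], aliases['canonical'])}

def same_drug(pdbbind_ligand: str, oncokb_drug: str) -> bool:
    """True if both names resolve to the same canonical drug."""
    a = alias_to_canonical.get(pdbbind_ligand.strip().lower(), pdbbind_ligand.strip().lower())
    b = alias_to_canonical.get(oncokb_drug.strip().lower(), oncokb_drug.strip().lower())
    return a == b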

Code

#%%
import pandas as pd

def get_test_oncokbs(train_df=pd.read_csv('/cluster/home/t122995uhn/projects/data/test/PDBbindDataset/nomsa_binary_original_binary/full/cleaned_XY.csv'),
                     biomart_fp='/cluster/home/t122995uhn/projects/data/tcga/mart_export.tsv', 
                     oncokb_fp='/cluster/home/t122995uhn/projects/downloads/oncoKB_DrugGenePairList.csv'):
    """Return the OncoKB drug-gene pairs whose genes are NOT covered by train_df."""
    # Get gene names for PDBbind from the BioMart export (UniProt/PDB ID -> gene name)
    dfbm = pd.read_csv(biomart_fp, sep='\t')
    dfbm['PDB ID'] = dfbm['PDB ID'].str.lower()
    train_df.reset_index(names='idx', inplace=True)

    df_uni = train_df.merge(dfbm, how='inner', left_on='prot_id', right_on='UniProtKB/Swiss-Prot ID')
    df_pdb = train_df.merge(dfbm, how='inner', left_on='code', right_on='PDB ID')

    # identifying overlap with oncoKB
    # df_all will have duplicate entries for entries with multiple gene names...
    df_all = pd.concat([df_uni, df_pdb]).drop_duplicates(['idx', 'Gene name'])[['idx', 'code', 'Gene name']]

    dfkb = pd.read_csv(oncokb_fp)
    df_all_kb = df_all.merge(dfkb.drop_duplicates('gene'), left_on='Gene name', right_on='gene', how='inner')

    trained_genes = set(df_all_kb.gene)

    # Identify OncoKB genes that were never seen in training
    return dfkb[~dfkb['gene'].isin(trained_genes)]


train_df = pd.read_csv('/cluster/home/t122995uhn/projects/data/test/PDBbindDataset/nomsa_binary_original_binary/train0/cleaned_XY.csv')
val_df = pd.read_csv('/cluster/home/t122995uhn/projects/data/test/PDBbindDataset/nomsa_binary_original_binary/val0/cleaned_XY.csv')

train_df = pd.concat([train_df, val_df])

get_test_oncokbs(train_df=train_df)

Davis

jyaacoub added a commit that referenced this issue Aug 2, 2024
To limit distribution drift issues mentioned in #131.
jyaacoub added a commit that referenced this issue Aug 2, 2024
revert(splits): use random split to resolve distribution drift (#131)
@jyaacoub
Owner Author

jyaacoub commented Aug 2, 2024

SUMMARY (see below for stats on distributions - oncokb vs random split dataset):

The distributions look visually different in terms of highly targeted proteins, but when running a similarity scoring algorithm to quantify the difference between the two distributions (random split vs OncoKB split), there was no real difference; this could, however, be a fault of the scoring algorithm.

@jyaacoub jyaacoub reopened this Aug 2, 2024
@jyaacoub jyaacoub changed the title from "Distribution Drift issue with training and test dataset" to "Updated results for distribution drift issue" Aug 2, 2024
jyaacoub added a commit that referenced this issue Aug 7, 2024
… index renumbering #103

- Had to make some modifications since the edge index needs to be updated after applying the mask so that it still points to the right nodes and we don't get an "IndexError" for being out of bounds (a minimal sketch of this re-indexing is included after this commit note).

- Also fixed an error caused by not removing all proteins without pocket sequences (line 216 saved the old dataset instead of the new one).

- Successfully built pocket datasets for davis and kiba

#131 #103
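
For reference, the edge-index fix described in that commit boils down to remapping old node indices to new contiguous ones after the mask is applied. The sketch below assumes a PyTorch-Geometric-style [2, E] edge_index tensor and is illustrative rather than the actual dataset code.

# Minimal sketch of re-indexing an edge_index after a node mask is applied.
import torch

def apply_node_mask(x: torch.Tensor, edge_index: torch.Tensor, mask: torch.Tensor):
    """Keep only nodes where mask is True and renumber edge_index to match."""
    # map old node indices -> new contiguous indices (-1 marks dropped nodes)
    new_idx = torch.full((x.size(0),), -1, dtype=torch.long)
    new_idx[mask] = torch.arange(int(mask.sum()))

    # drop edges touching a removed node, then remap the survivors;
    # skipping this remap is what causes the out-of-bounds IndexError
    keep = mask[edge_index[0]] & mask[edge_index[1]]
    return x[mask], new_idx[edge_index[:, keep]]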
@jyaacoub
Owner Author

jyaacoub commented Aug 8, 2024

stats on distribution differences (oncokb split vs random split):

Small difference in the % of heavily targeted proteins that appear in the test set (the OncoKB test set had fewer heavily targeted proteins).

  • Likely because we hand-picked the heavily targeted OncoKB proteins, so the remaining proteins used to fill out the test set were lightly targeted, and the majority of the test set ended up being lightly targeted.

Distribution of the test splits:

Poor performance is likely due to the higher diversity of the OncoKB test set: the model has to be good at predicting a larger range of proteins, which makes it a harder set to be evaluated on.

Details

Full distribution:

We can see that the OncoKB test split has far more lowly targeted proteins in the test set, which are naturally harder for the model to be evaluated on because of the wider range of proteins the model has to be good at.

This is counterintuitive, but the reason is that we hand-picked the more heavily targeted proteins for the test set, and because we want the test sets to remain the same size, the splitting code then picked lowly targeted proteins to reach that size constraint. The random split, in contrast, has a more even distribution that more closely mimics the training split.
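
A rough sketch of that fill step is below, assuming a per-protein frequency ranking; the hand_picked set, target size, and rarest-first selection order are assumptions for illustration, not the actual splitting code.

# Rough sketch of the test-set fill step (illustrative, not the actual code).
import pandas as pd

def fill_test_set(df: pd.DataFrame, hand_picked: set, target_size: int) -> pd.DataFrame:
    """Start from the hand-picked (heavily targeted) proteins, then add the
    least-frequent remaining proteins until the test set reaches target_size."""
    test = df[df.prot_id.isin(hand_picked)]
    # remaining proteins, rarest first - this is why the OncoKB test split
    # ends up dominated by lowly targeted proteins
    freq = df[~df.prot_id.isin(hand_picked)].prot_id.value_counts()
    for prot in freq.index[::-1]:
        if len(test) >= target_size:
            break
        test = pd.concat([test, df[df.prot_id == prot]])
    return test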
[figure: full distribution of per-protein frequencies, OncoKB vs random test split]

Limit to top n frequent proteins

This really shows the impact of hand-picking the proteins for the test set; the hand-picked proteins appear as the green spikes from the OncoKB test split.
[figures: same comparison limited to the top n most frequent proteins, at increasing minimum frequency thresholds]

Code

#%%
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

new = '/cluster/home/t122995uhn/projects/splits/new/pdbbind/'

train_df = pd.concat([pd.read_csv(f'{new}train0.csv'), 
                      pd.read_csv(f'{new}val0.csv')], axis=0)
test_df = pd.read_csv(f'{new}test.csv')

all_df = pd.concat([train_df, test_df], axis=0)
print(len(all_df))


#%%
old = '/cluster/home/t122995uhn/projects/splits/old/pdbbind/'
old_test_df = pd.read_csv(f'{old}test.csv')
old_train_df = all_df[~all_df['code'].isin(old_test_df['code'])]

# %%
# this will give us an estimate to how well targeted the training proteins are vs the test proteins
def proteins_targeted(train_df, test_df, split='new', min_freq=0, normalized=False):
    # protein count comparison (number of diverse proteins)
    plt.figure(figsize=(18,8))
    # x-axis is the normalized frequency, y-axis is the number of proteins that have that frequency (also normalized)
    vc = train_df.prot_id.value_counts()
    vc = vc[vc > min_freq]
    train_counts = list(vc/len(train_df)) if normalized else vc.values  # normalize by its own split size
    vc = test_df.prot_id.value_counts()
    vc = vc[vc > min_freq]
    test_counts = list(vc/len(test_df)) if normalized else vc.values

    sns.histplot(train_counts, 
                bins=50, stat='density', color='green', alpha=0.4)
    sns.histplot(test_counts, 
                bins=50,stat='density', color='blue', alpha=0.4)

    sns.kdeplot(train_counts, color='green', alpha=0.8)
    sns.kdeplot(test_counts, color='blue', alpha=0.8)

    plt.xlabel(f"{'normalized ' if normalized else ''} frequency")
    plt.ylabel("normalized number of proteins with that frequency")
    plt.title(f"Targeted differences for {split} split{f' (> {min_freq})' if min_freq else ''}")
    if not normalized:
        plt.xlim(-8,100)

# proteins_targeted(old_train_df, old_test_df, split='oncoKB')
# plt.show()
# proteins_targeted(train_df, test_df, split='random')
# plt.show()


proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test')
plt.show()
proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test', min_freq=5)
plt.show()
proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test', min_freq=10)
plt.show()
proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test', min_freq=15)
plt.show()
proteins_targeted(old_test_df, test_df, split='oncoKB(green) vs random(blue) test', min_freq=20)
plt.show()

Similarity scores between training and testing

No meaningful difference was found: -5.5718 (random split) vs -5.914 (OncoKB split)

Details

Code

#%%
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

new = '/cluster/home/t122995uhn/projects/splits/new/pdbbind/'

train_df = pd.concat([pd.read_csv(f'{new}train0.csv'), 
                      pd.read_csv(f'{new}val0.csv')], axis=0)
test_df = pd.read_csv(f'{new}test.csv')

all_df = pd.concat([train_df, test_df], axis=0)
print(len(all_df))


#%%
old = '/cluster/home/t122995uhn/projects/splits/old/pdbbind/'
old_test_df = pd.read_csv(f'{old}test.csv')
old_train_df = all_df[~all_df['code'].isin(old_test_df['code'])]
# %%
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import substitution_matrices

from tqdm import tqdm
import random

def get_group_similarity(group1, group2):
    # Choose a substitution matrix (e.g., BLOSUM62)
    matrix = substitution_matrices.load("BLOSUM62")

    # Define gap penalties
    gap_open = -10
    gap_extend = -0.5

    # Function to calculate pairwise similarity score
    def calculate_similarity(seq1, seq2):
        alignments = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
        return alignments[0][2]  # Return the score of the best alignment

    # Compute pairwise similarity between all sequences in group1 and group2
    similarity_scores = []
    for seq1 in group1:
        for seq2 in group2:
            score = calculate_similarity(seq1, seq2)
            similarity_scores.append(score)

    # Calculate the average similarity score
    average_similarity = sum(similarity_scores) / len(similarity_scores)
    return similarity_scores, average_similarity


# sample 5 sequences at random from each split, 100 times
train_seq = old_train_df.prot_seq.drop_duplicates().to_list()
test_seq = old_test_df.prot_seq.drop_duplicates().to_list()
sample_size = 5
trials = 100

est_similarity = 0
for _ in tqdm(range(trials)):
    _, avg = get_group_similarity(random.sample(train_seq, sample_size), 
                                  random.sample(test_seq, sample_size))
    est_similarity += avg

print(est_similarity/trials)  # average similarity over all trials

@jyaacoub jyaacoub pinned this issue Sep 15, 2024