Unify cross validation splits to use consistent sets #113

Closed · Tracked by #112 · Fixed by #117

jyaacoub opened this issue Jul 2, 2024 · 5 comments
Comments

jyaacoub (Owner) commented Jul 2, 2024

Rebuilding existing datasets is unnecessary, since XY.csv is what `__getitem__` uses to retrieve items:

```python
def __getitem__(self, idx) -> dict:
    row = self.df.iloc[idx]  # WARNING: idx must be a list in future versions of pandas since scalar use here is deprecated
    code = row.name
    prot_id = row['prot_id']
    lig_seq = row['SMILE']
```

We need to define a new resplit function that takes the dataset, deletes all the old train, test, and val subsets, and replaces them with new subsets defined by the following constraints:

  • New test set must contain all proteins from the test_gene_names.csv file that was used for existing analyses (TCGA analysis #95, Platinum analysis #94)
  • Test set must also include at least 2-3 "heavily targeted" proteins so that we can do a deeper analysis focusing on just those proteins. This resource should help identify such heavily targeted proteins by counting how many times each one appears in the DataFrame built from the interactions.tsv file (see the sketch after this list).
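A rough sketch of that counting step (purely illustrative; the column name 'prot_id' and the file location are assumptions, not the confirmed schema of interactions.tsv):

```python
import pandas as pd

# count how often each protein appears in the interactions file; the most
# frequent ones are candidates for the "heavily targeted" proteins
interactions = pd.read_csv('interactions.tsv', sep='\t')
target_counts = interactions['prot_id'].value_counts()  # 'prot_id' is an assumed column name
print(target_counts.head(10))
```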
jyaacoub (Owner) commented Jul 2, 2024

resplit function:

Should be defined in MutDTA/src/train_test/splitting.py

  • Takes as input the target dataset path (or a list of options that define that dataset), and a list defining the splits for all 5 folds + 1 test set.
  • Deletes the existing splits.
  • Builds the new splits (this is already handled by Dataset.save_subset()); a rough sketch is given below.
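A minimal sketch of what that could look like, assuming the subset folders sit alongside full/ and that Dataset.save_subset() accepts an index of keys plus a subset name (the real signatures may differ):

```python
import os
import shutil

import pandas as pd

from src.utils.loader import Loader

def resplit(dataset_path: str, split_csvs: dict):
    """Replace a dataset's existing splits with the ones defined in split_csvs.

    split_csvs maps subset names (e.g. 'train0'..'train4', 'val0'..'val4', 'test')
    to CSV files whose index matches the dataset's XY.csv index.
    """
    # delete any existing train/val/test subset folders (assumed to sit next to full/)
    for d in os.listdir(dataset_path):
        if d.startswith(('train', 'val', 'test')):
            shutil.rmtree(os.path.join(dataset_path, d))

    # load the untouched "full" dataset and rebuild each subset from the CSVs
    db = Loader.load_dataset(os.path.join(dataset_path, 'full'))
    for name, csv_path in split_csvs.items():
        idx = pd.read_csv(csv_path, index_col=0).index
        db.save_subset(idx, name)  # assumed: save_subset handles writing the subset out
    return db
```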

jyaacoub added a commit that referenced this issue Jul 3, 2024
jyaacoub added a commit that referenced this issue Jul 3, 2024
Example usage:
```python
#%% now based on this test set we can create the splits that will be used for all models
# 5-fold cross validation + test set
import pandas as pd
from src import cfg
from src.train_test.splitting import balanced_kfold_split
from src.utils.loader import Loader

test_df = pd.read_csv('/home/jean/projects/data/splits/davis_test_genes_oncoG.csv')
test_prots = set(test_df.prot_id)

db = Loader.load_dataset(f'{cfg.DATA_ROOT}/DavisKibaDataset/davis/nomsa_binary_original_binary/full/')

#%%
train, val, test = balanced_kfold_split(db,
                k_folds=5, test_split=0.1, val_split=0.1,
                test_prots=test_prots, random_seed=0, verbose=True
                )

#%%
db.save_subset_folds(train, 'train')
db.save_subset_folds(val, 'val')
db.save_subset(test, 'test')
```
jyaacoub (Owner) commented Jul 3, 2024

What's left:

  • Define the resplit function, which accepts the list of CSVs defining our split and re-splits the target dataset using its "full" db.
  • Optionally, add a wrapper on top of this function that takes as input a path to the dataset it wants to be "like" (a possible shape is sketched below).
  • Define test sets for Kiba and PDBbind with the OncoKB file.
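A possible shape for that wrapper (illustrative only; it assumes the reference dataset keeps each split as a <subset>/XY.csv folder next to full/, and it reuses the resplit sketch above):

```python
import glob
import os

def resplit_like(target_dataset_path: str, like_dataset_path: str):
    """Re-split target_dataset_path using the split CSVs of another dataset."""
    split_csvs = {}
    for csv_path in glob.glob(os.path.join(like_dataset_path, '*', 'XY.csv')):
        name = os.path.basename(os.path.dirname(csv_path))  # e.g. 'train0', 'val2', 'test'
        if name != 'full':  # 'full' is the untouched dataset, not a split
            split_csvs[name] = csv_path
    return resplit(target_dataset_path, split_csvs)
```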

jyaacoub (Owner) commented Jul 5, 2024

Test split for kiba

Test set size goal of $118083 \times 0.1 \approx 11808$

  • This means all the proteins from the davis test set can be added to the kiba test set
    • Combining the entire old kiba test set and the davis genes gives us a total of 6028 entries
  • The next step is to add more proteins from OncoKB (we need $11808-6028=5780 \text{ entries}$)
  • From the remaining matching OncoKB genes we can just add all of them, since they only give us an additional 3680 entries (we still need $11808-6028-3680=2100 \text{ entries}$).
  • For the last 2100 we just randomly sample until we arrive at the final test set below:
    [screenshot of the final kiba test DataFrame]
Code:

```python
# %%
import pandas as pd
import logging
DATA_ROOT = '../data'
biom_df = pd.read_csv(f'{DATA_ROOT}/tcga/mart_export.tsv', sep='\t')
biom_df.rename({'Gene name': 'gene'}, axis=1, inplace=True)

# %% Specific to kiba:
kiba_df = pd.read_csv(f'{DATA_ROOT}/DavisKibaDataset/kiba/nomsa_binary_original_binary/full/XY.csv')
kiba_df = kiba_df.merge(biom_df.drop_duplicates('UniProtKB/Swiss-Prot ID'), 
              left_on='prot_id', right_on="UniProtKB/Swiss-Prot ID", how='left')
kiba_df.drop(['PDB ID', 'UniProtKB/Swiss-Prot ID'], axis=1, inplace=True)

if kiba_df.gene.isna().sum() != 0: logging.warning("Some proteins failed to get their gene names!")

# %% making sure to add any matching davis prots to the kiba test set
davis_df = pd.read_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/davis_test.csv')
davis_test_prots = set(davis_df.prot_id.str.split('(').str[0])
kiba_davis_gene_overlap = kiba_df[kiba_df.gene.isin(davis_test_prots)].gene.value_counts()
print("Total # of gene overlap with davis TEST set:", len(kiba_davis_gene_overlap))
print("                       # of entries in kiba:", kiba_davis_gene_overlap.sum())

# Starting off with davis test set as the initial test set:
kiba_test_df = kiba_df[kiba_df.gene.isin(davis_test_prots)]

# %% using previous kiba test db:
kiba_test_old_df = pd.read_csv('/cluster/home/t122995uhn/projects/downloads/test_prots_gene_names.csv')
kiba_test_old_df = kiba_test_old_df[kiba_test_old_df['db'] == 'kiba']
kiba_test_old_prots = set(kiba_test_old_df.gene_name)

kiba_test_df = pd.concat([kiba_test_df, kiba_df[kiba_df.gene.isin(kiba_test_old_prots)]], axis=0).drop_duplicates(['prot_id', 'lig_id'])
print("Combined kiba test set with davis matching genes size:", len(kiba_test_df))

#%% NEXT STEP IS TO ADD MORE PROTS FROM ONCOKB IF AVAILABLE.
onco_df = pd.read_csv("/cluster/home/t122995uhn/projects/downloads/oncoKB_DrugGenePairList.csv")

kiba_join_onco = set(kiba_test_df.merge(onco_df.drop_duplicates("gene"), on="gene", how="left")['gene'])

#%%
remaining_onco = onco_df[~onco_df.gene.isin(kiba_join_onco)].drop_duplicates('gene')

# match with remaining kiba:
remaining_onco_kiba_df = kiba_df.merge(remaining_onco, on='gene', how="inner")
counts = remaining_onco_kiba_df.value_counts('gene')
print(counts)
# this gives us 3680 which still falls short of our 11,808 goal for the test set size
print("total entries in kiba with remaining (not already in test set) onco genes", counts.sum()) 


#%%
# drop_duplicates is redundant but just in case.
kiba_test_df = pd.concat([kiba_test_df, remaining_onco_kiba_df], axis=0).drop_duplicates(['prot_id', 'lig_id']) 
print("Combined kiba test set with remaining OncoKB genes:", len(kiba_test_df))

# %% For the remaining 2100 entries we will just choose those randomly until we reach our target of 11808 entries
# code is from balanced_kfold_split function
from collections import Counter
import numpy as np

# Get size for each dataset and indices
dataset_size = len(kiba_df)
test_size = int(0.1 * dataset_size) # 11808
indices = list(range(dataset_size))

# getting counts for each unique protein
prot_counts = kiba_df['prot_id'].value_counts().to_dict()
prots = list(prot_counts.keys())
np.random.shuffle(prots)

# manually selected prots:
test_prots = set(kiba_test_df.prot_id)
# increment count by number of samples in test_prots
count = sum([prot_counts[p] for p in test_prots])

#%%
## Sampling remaining proteins for test set (if we are under the test_size) 
for p in prots: # O(k); k = number of proteins
    if count + prot_counts[p] < test_size:
        test_prots.add(p)
        count += prot_counts[p]

additional_prots = test_prots - set(kiba_test_df.prot_id)
print('additional prot_ids to add:', len(additional_prots))
print('                     count:', count)

#%% ADDING FINAL PROTS
rand_sample_df = kiba_df[kiba_df.prot_id.isin(additional_prots)]
kiba_test_df = pd.concat([kiba_test_df, rand_sample_df], axis=0).drop_duplicates(['prot_id', 'lig_id'])

kiba_test_df.drop(['cancerType', 'drug'], axis=1, inplace=True)
print('final test dataset for kiba:')
kiba_test_df

#%% saving
kiba_test_df.to_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/kiba_test.csv', index=False)
```

The same code also lives in MutDTA/playground.py (lines 1 to 95 at c7fdc86).

jyaacoub added a commit that referenced this issue Jul 5, 2024
jyaacoub (Owner) commented Jul 6, 2024

Test set for PDBbind

Test set size goal of $16265 \times 0.1 \approx 1626$

Initial stats after getting gene names by matching with biomart:

```
                            match on PDB ID: 1120
                           match on prot_id: 1039

Combined match (not accounting for aliases): 1216
 pdb_df.gene_x.combine_first(pdb_df.gene_y): 1138

           num genes where gene_x != gene_y: 237

   Total number of entries with a gene name: 8624/16265
```

  • Number of entries after merging gene names with the kiba test set: 171
    • Number of genes: 13
  • Total # of gene overlap with the davis TEST set: 6
    • Entries in pdb: 60
    • This entirely overlaps with kiba, so there is no change in test set size.
  • Adding the remaining matching OncoKB proteins gives us an additional 93 entries, for a total of 264 entries.
  • The remaining $1626-264=1362$ entries are randomly sampled, landing at a final test dataset with 1603 entries (slightly under the 1626 goal, since whole proteins are added at a time and the loop stops before exceeding the target).

MutDTA/playground.py (lines 1 to 124 at 256563c):

```python
# %%
import pandas as pd
import logging
DATA_ROOT = '../data'
biom_df = pd.read_csv(f'{DATA_ROOT}/tcga/mart_export.tsv', sep='\t')
biom_df.rename({'Gene name': 'gene'}, axis=1, inplace=True)
biom_df['PDB ID'] = biom_df['PDB ID'].str.lower()

# %% merge on PDB ID
pdb_df = pd.read_csv(f'{DATA_ROOT}/PDBbindDataset/nomsa_binary_original_binary/full/XY.csv')
pdb_df = pdb_df.merge(biom_df.drop_duplicates('PDB ID'), left_on='code', right_on="PDB ID", how='left')
pdb_df.drop(['PDB ID', 'UniProtKB/Swiss-Prot ID'], axis=1, inplace=True)

# %% merge on prot_id: - gene_y
pdb_df = pdb_df.merge(biom_df.drop_duplicates('UniProtKB/Swiss-Prot ID'),
                      left_on='prot_id', right_on="UniProtKB/Swiss-Prot ID", how='left')
pdb_df.drop(['PDB ID', 'UniProtKB/Swiss-Prot ID'], axis=1, inplace=True)

#%%
biom_pdb_match_on_pdbID = pdb_df.gene_x.dropna().drop_duplicates()
print(' match on PDB ID:', len(biom_pdb_match_on_pdbID))
biom_pdb_match_on_prot_id = pdb_df.gene_y.dropna().drop_duplicates()
print(' match on prot_id:', len(biom_pdb_match_on_prot_id))

biom_concat = pd.concat([biom_pdb_match_on_pdbID, biom_pdb_match_on_prot_id]).drop_duplicates()
print('\nCombined match (not accounting for aliases):', len(biom_concat))

# cases where both pdb ID and prot_id match can cause issues if gene_x != gene_y resulting in double counting
# in above concat
pdb_df['gene'] = pdb_df.gene_x.combine_first(pdb_df.gene_y)
print(' pdb_df.gene_x.combine_first(pdb_df.gene_y):', len(pdb_df['gene'].dropna().drop_duplicates()))

# case where we match on prot_id and PDB ID can cause issues with mismatched counts due to
# different names for the gene (e.g.: due to aliases)
print("\n num genes where gene_x != gene_y:",
      len(pdb_df[pdb_df['gene_x'] != pdb_df['gene_y']].dropna().drop_duplicates(['gene_x','gene_y'])))
print(f'\n Total number of entries with a gene name: {len(pdb_df[~pdb_df.gene.isna()])}/{len(pdb_df)}')

# %% matching with kiba gene names as our starting test set
kiba_test_df = pd.read_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/kiba_test.csv')
kiba_test_df = kiba_test_df[['gene']].drop_duplicates()

# only 171 rows from merging with kiba...
pdb_test_df = pdb_df.merge(kiba_test_df, on='gene', how='inner').drop_duplicates(['code', 'SMILE'])
print('Number of entries after merging gene names with kiba test set:', len(pdb_test_df))
print(' Number of genes:', len(pdb_test_df.gene.drop_duplicates()))

# %% adding any davis test set genes
davis_df = pd.read_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/davis_test.csv')
davis_test_prots = set(davis_df.prot_id.str.split('(').str[0])
pdb_davis_gene_overlap = pdb_df[pdb_df.gene.isin(davis_test_prots)].gene.value_counts()
print("Total # of gene overlap with davis TEST set:", len(pdb_davis_gene_overlap))
print(" # of entries in pdb:", pdb_davis_gene_overlap.sum())

pdb_test_df = pd.concat([pdb_test_df, pdb_df[pdb_df.gene.isin(davis_test_prots)]],
                        axis=0).drop_duplicates(['code', 'SMILE'])
print("# of entries in test set after adding davis genes: ", len(pdb_test_df))

#%% CONTINUE TO GET FROM OncoKB:
onco_df = pd.read_csv("/cluster/home/t122995uhn/projects/downloads/oncoKB_DrugGenePairList.csv")
pdb_join_onco = set(pdb_test_df.merge(onco_df.drop_duplicates("gene"), on="gene", how="left")['gene'])

#%%
remaining_onco = onco_df[~onco_df.gene.isin(pdb_join_onco)].drop_duplicates('gene')

# match with remaining pdb:
remaining_onco_pdb_df = pdb_df.merge(remaining_onco, on='gene', how="inner")
counts = remaining_onco_pdb_df.value_counts('gene')
print(counts)
print("total entries in pdb with remaining (not already in test set) onco genes", counts.sum())

# this only gives us 93 entries... so adding it to the rest would only give us 171+93=264 total entries
pdb_test_df = pd.concat([pdb_test_df, remaining_onco_pdb_df], axis=0).drop_duplicates(['code', 'SMILE'])
print("Combined pdb test set with remaining OncoKB genes entries:", len(pdb_test_df))  # 264 only

# %% Random sample to get the rest
# code is from balanced_kfold_split function
from collections import Counter
import numpy as np

# Get size for each dataset and indices
dataset_size = len(pdb_df)
test_size = int(0.1 * dataset_size)  # 1626
indices = list(range(dataset_size))

# getting counts for each unique protein
prot_counts = pdb_df['code'].value_counts().to_dict()
prots = list(prot_counts.keys())
np.random.shuffle(prots)

# manually selected prots:
test_prots = set(pdb_test_df.code)
# increment count by number of samples in test_prots
count = sum([prot_counts[p] for p in test_prots])

#%%
## Sampling remaining proteins for test set (if we are under the test_size)
for p in prots:  # O(k); k = number of proteins
    if count + prot_counts[p] < test_size:
        test_prots.add(p)
        count += prot_counts[p]

additional_prots = test_prots - set(pdb_test_df.code)
print('additional codes to add:', len(additional_prots))
print('                  count:', count)

#%% ADDING FINAL PROTS
rand_sample_df = pdb_df[pdb_df.code.isin(additional_prots)]
pdb_test_df = pd.concat([pdb_test_df, rand_sample_df], axis=0).drop_duplicates(['code'])

pdb_test_df.drop(['cancerType', 'drug'], axis=1, inplace=True)
print('Final test dataset for pdbbind:')
pdb_test_df

#%% saving
pdb_test_df.rename({"gene_x":"gene_matched_on_pdb_id", "gene_y": "gene_matched_on_uniprot_id"}, axis=1, inplace=True)
pdb_test_df.to_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/pdbbind_test.csv', index=False)
```

jyaacoub (Owner) commented Jul 8, 2024

This is resolved by 69add71, and we now have constant validation sets for each CV training run.

jyaacoub closed this as completed Jul 8, 2024
jyaacoub added a commit that referenced this issue Jul 9, 2024
still need to train esm variants to complete #113 for davis
jyaacoub linked a pull request Jul 9, 2024 that will close this issue
jyaacoub added a commit that referenced this issue Jul 9, 2024
jyaacoub added a commit that referenced this issue Jul 10, 2024
jyaacoub added a commit that referenced this issue Jul 10, 2024
Since the whole point of v115 is to compare the performance against aflow in an equal playing field. #113 #115
jyaacoub mentioned this issue Jul 10, 2024
jyaacoub added a commit that referenced this issue Jul 17, 2024
Unused parameters due to inheriting from DGraphDTA but not using the forward_pro method
jyaacoub added a commit that referenced this issue Jul 18, 2024
jyaacoub added a commit that referenced this issue Jul 22, 2024
jyaacoub mentioned this issue Jul 22, 2024
jyaacoub added a commit that referenced this issue Jul 24, 2024
jyaacoub added a commit that referenced this issue Jul 24, 2024
jyaacoub added a commit that referenced this issue Jul 29, 2024
jyaacoub added a commit that referenced this issue Aug 2, 2024
To limit distribution drift issues mentioned in #131.