Unify cross validation splits to use consistent sets #113

Closed · Tracked by #112 · Fixed by #117

jyaacoub opened this issue Jul 2, 2024 · 5 comments
Comments

jyaacoub (Owner) commented Jul 2, 2024

Rebuilding existing datasets is unnecessary, since XY.csv is what `__getitem__` uses to retrieve items:

```python
def __getitem__(self, idx) -> dict:
    row = self.df.iloc[idx]  # WARNING: idx must be a list in future versions of pandas since scalar use here is deprecated
    code = row.name
    prot_id = row['prot_id']
    lig_seq = row['SMILE']
```

We need to define a new resplit function that takes the dataset, deletes all the old train, test, and val subsets, and replaces them with new subsets defined by the following constraints:

  • New test set must contain all proteins from the test_gene_names.csv file that was used for existing analyses (TCGA analysis #95, Platinum analysis #94)
  • Test set must also include at least 2-3 "heavily targeted" proteins so that we can do a deeper analysis focusing on just those proteins. This resource should help identify such heavily targeted proteins by counting how many times each one appears in the DataFrame built from the interactions.tsv file (see the sketch after this list).
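A rough sketch of that counting step (purely illustrative; the column name 'prot_id' and the file location are assumptions, not the confirmed schema of interactions.tsv):

```python
import pandas as pd

# count how often each protein appears in the interactions file; the most
# frequent ones are candidates for the "heavily targeted" proteins
interactions = pd.read_csv('interactions.tsv', sep='\t')
target_counts = interactions['prot_id'].value_counts()  # 'prot_id' is an assumed column name
print(target_counts.head(10))
```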
jyaacoub (Owner) commented Jul 2, 2024

resplit function:

Should be defined in MutDTA/src/train_test/splitting.py

  • Takes as input the target dataset path (or a list of options that define that dataset), and a list defining the splits for all 5 folds + 1 test set.
  • Deletes the existing splits.
  • Builds the new splits (this is already handled by Dataset.save_subset()); a rough sketch is given below.
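A minimal sketch of what that could look like, assuming the subset folders sit alongside full/ and that Dataset.save_subset() accepts an index of keys plus a subset name (the real signatures may differ):

```python
import os
import shutil

import pandas as pd

from src.utils.loader import Loader

def resplit(dataset_path: str, split_csvs: dict):
    """Replace a dataset's existing splits with the ones defined in split_csvs.

    split_csvs maps subset names (e.g. 'train0'..'train4', 'val0'..'val4', 'test')
    to CSV files whose index matches the dataset's XY.csv index.
    """
    # delete any existing train/val/test subset folders (assumed to sit next to full/)
    for d in os.listdir(dataset_path):
        if d.startswith(('train', 'val', 'test')):
            shutil.rmtree(os.path.join(dataset_path, d))

    # load the untouched "full" dataset and rebuild each subset from the CSVs
    db = Loader.load_dataset(os.path.join(dataset_path, 'full'))
    for name, csv_path in split_csvs.items():
        idx = pd.read_csv(csv_path, index_col=0).index
        db.save_subset(idx, name)  # assumed: save_subset handles writing the subset out
    return db
```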

jyaacoub added a commit that referenced this issue Jul 3, 2024
jyaacoub added a commit that referenced this issue Jul 3, 2024
Example usage:
```python
#%% now based on this test set we can create the splits that will be used for all models
# 5-fold cross validation + test set
import pandas as pd
from src import cfg
from src.train_test.splitting import balanced_kfold_split
from src.utils.loader import Loader

test_df = pd.read_csv('/home/jean/projects/data/splits/davis_test_genes_oncoG.csv')
test_prots = set(test_df.prot_id)

db = Loader.load_dataset(f'{cfg.DATA_ROOT}/DavisKibaDataset/davis/nomsa_binary_original_binary/full/')

#%%
train, val, test = balanced_kfold_split(db,
                k_folds=5, test_split=0.1, val_split=0.1,
                test_prots=test_prots, random_seed=0, verbose=True
                )

#%%
db.save_subset_folds(train, 'train')
db.save_subset_folds(val, 'val')
db.save_subset(test, 'test')
```
jyaacoub (Owner) commented Jul 3, 2024

What's left:

  • Define the resplit function, which accepts the list of CSVs defining our split and re-splits the target dataset using its "full" db.
  • Optionally, add a wrapper on top of this function that takes as input a path to the dataset it wants to be "like" (a possible shape is sketched below).
  • Define test sets for Kiba and PDBbind with the OncoKB file.
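A possible shape for that wrapper (illustrative only; it assumes the reference dataset keeps each split as a <subset>/XY.csv folder next to full/, and it reuses the resplit sketch above):

```python
import glob
import os

def resplit_like(target_dataset_path: str, like_dataset_path: str):
    """Re-split target_dataset_path using the split CSVs of another dataset."""
    split_csvs = {}
    for csv_path in glob.glob(os.path.join(like_dataset_path, '*', 'XY.csv')):
        name = os.path.basename(os.path.dirname(csv_path))  # e.g. 'train0', 'val2', 'test'
        if name != 'full':  # 'full' is the untouched dataset, not a split
            split_csvs[name] = csv_path
    return resplit(target_dataset_path, split_csvs)
```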

jyaacoub (Owner) commented Jul 5, 2024

Test split for kiba

Test set size goal of $118083 \times 0.1 \approx 11808$

  • This means all the proteins from the davis test set can be added to the kiba test set
    • Combining the entire old kiba test set and the davis genes gives us a total of 6028 entries
  • The next step is to add more proteins from OncoKB (we need $11808-6028=5780 \text{ entries}$)
  • From the remaining matching OncoKB genes we can just add all of them, since they only give us an additional 3680 entries (we still need $11808-6028-3680=2100 \text{ entries}$).
  • For the last 2100 we just randomly sample until we arrive at the final test set below:
    [screenshot of the final kiba test DataFrame]
Code:

```python
# %%
import pandas as pd
import logging
DATA_ROOT = '../data'
biom_df = pd.read_csv(f'{DATA_ROOT}/tcga/mart_export.tsv', sep='\t')
biom_df.rename({'Gene name': 'gene'}, axis=1, inplace=True)

# %% Specific to kiba:
kiba_df = pd.read_csv(f'{DATA_ROOT}/DavisKibaDataset/kiba/nomsa_binary_original_binary/full/XY.csv')
kiba_df = kiba_df.merge(biom_df.drop_duplicates('UniProtKB/Swiss-Prot ID'), 
              left_on='prot_id', right_on="UniProtKB/Swiss-Prot ID", how='left')
kiba_df.drop(['PDB ID', 'UniProtKB/Swiss-Prot ID'], axis=1, inplace=True)

if kiba_df.gene.isna().sum() != 0: logging.warning("Some proteins failed to get their gene names!")

# %% making sure to add any matching davis prots to the kiba test set
davis_df = pd.read_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/davis_test.csv')
davis_test_prots = set(davis_df.prot_id.str.split('(').str[0])
kiba_davis_gene_overlap = kiba_df[kiba_df.gene.isin(davis_test_prots)].gene.value_counts()
print("Total # of gene overlap with davis TEST set:", len(kiba_davis_gene_overlap))
print("                       # of entries in kiba:", kiba_davis_gene_overlap.sum())

# Starting off with davis test set as the initial test set:
kiba_test_df = kiba_df[kiba_df.gene.isin(davis_test_prots)]

# %% using previous kiba test db:
kiba_test_old_df = pd.read_csv('/cluster/home/t122995uhn/projects/downloads/test_prots_gene_names.csv')
kiba_test_old_df = kiba_test_old_df[kiba_test_old_df['db'] == 'kiba']
kiba_test_old_prots = set(kiba_test_old_df.gene_name)

kiba_test_df = pd.concat([kiba_test_df, kiba_df[kiba_df.gene.isin(kiba_test_old_prots)]], axis=0).drop_duplicates(['prot_id', 'lig_id'])
print("Combined kiba test set with davis matching genes size:", len(kiba_test_df))

#%% NEXT STEP IS TO ADD MORE PROTS FROM ONCOKB IF AVAILABLE.
onco_df = pd.read_csv("/cluster/home/t122995uhn/projects/downloads/oncoKB_DrugGenePairList.csv")

kiba_join_onco = set(kiba_test_df.merge(onco_df.drop_duplicates("gene"), on="gene", how="left")['gene'])

#%%
remaining_onco = onco_df[~onco_df.gene.isin(kiba_join_onco)].drop_duplicates('gene')

# match with remaining kiba:
remaining_onco_kiba_df = kiba_df.merge(remaining_onco, on='gene', how="inner")
counts = remaining_onco_kiba_df.value_counts('gene')
print(counts)
# this gives us 3680 which still falls short of our 11,808 goal for the test set size
print("total entries in kiba with remaining (not already in test set) onco genes", counts.sum()) 


#%%
# drop_duplicates is redundant but just in case.
kiba_test_df = pd.concat([kiba_test_df, remaining_onco_kiba_df], axis=0).drop_duplicates(['prot_id', 'lig_id']) 
print("Combined kiba test set with remaining OncoKB genes:", len(kiba_test_df))

# %% For the remaining 2100 entries we will just choose those randomly until we reach our target of 11808 entries
# code is from balanced_kfold_split function
from collections import Counter
import numpy as np

# Get size for each dataset and indices
dataset_size = len(kiba_df)
test_size = int(0.1 * dataset_size) # 11808
indices = list(range(dataset_size))

# getting counts for each unique protein
prot_counts = kiba_df['prot_id'].value_counts().to_dict()
prots = list(prot_counts.keys())
np.random.shuffle(prots)

# manually selected prots:
test_prots = set(kiba_test_df.prot_id)
# increment count by number of samples in test_prots
count = sum([prot_counts[p] for p in test_prots])

#%%
## Sampling remaining proteins for test set (if we are under the test_size) 
for p in prots: # O(k); k = number of proteins
    if count + prot_counts[p] < test_size:
        test_prots.add(p)
        count += prot_counts[p]

additional_prots = test_prots - set(kiba_test_df.prot_id)
print('additional prot_ids to add:', len(additional_prots))
print('                     count:', count)

#%% ADDING FINAL PROTS
rand_sample_df = kiba_df[kiba_df.prot_id.isin(additional_prots)]
kiba_test_df = pd.concat([kiba_test_df, rand_sample_df], axis=0).drop_duplicates(['prot_id', 'lig_id'])

kiba_test_df.drop(['cancerType', 'drug'], axis=1, inplace=True)
print('final test dataset for kiba:')
kiba_test_df

#%% saving
kiba_test_df.to_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/kiba_test.csv', index=False)
```

The same code also lives in MutDTA/playground.py (lines 1 to 95 at c7fdc86).

jyaacoub added a commit that referenced this issue Jul 5, 2024
jyaacoub (Owner) commented Jul 6, 2024

Test set for PDBbind

Test set size goal of $16265 \times 0.1 \approx 1626$

Initial stats after getting gene names by matching with biomart:

```
                            match on PDB ID: 1120
                           match on prot_id: 1039

Combined match (not accounting for aliases): 1216
 pdb_df.gene_x.combine_first(pdb_df.gene_y): 1138

           num genes where gene_x != gene_y: 237

   Total number of entries with a gene name: 8624/16265
```

  • Number of entries after merging gene names with the kiba test set: 171
    • Number of genes: 13
  • Total # of gene overlap with the davis TEST set: 6
    • Entries in pdb: 60
    • This entirely overlaps with kiba, so there is no change in test set size.
  • Adding the remaining matching OncoKB proteins gives us an additional 93 entries, for a total of 264 entries.
  • The remaining $1626-264=1362$ entries are randomly sampled, landing at a final test dataset with 1603 entries (slightly under the 1626 goal, since whole proteins are added at a time and the loop stops before exceeding the target).

MutDTA/playground.py (lines 1 to 124 at 256563c):

```python
# %%
import pandas as pd
import logging
DATA_ROOT = '../data'
biom_df = pd.read_csv(f'{DATA_ROOT}/tcga/mart_export.tsv', sep='\t')
biom_df.rename({'Gene name': 'gene'}, axis=1, inplace=True)
biom_df['PDB ID'] = biom_df['PDB ID'].str.lower()

# %% merge on PDB ID
pdb_df = pd.read_csv(f'{DATA_ROOT}/PDBbindDataset/nomsa_binary_original_binary/full/XY.csv')
pdb_df = pdb_df.merge(biom_df.drop_duplicates('PDB ID'), left_on='code', right_on="PDB ID", how='left')
pdb_df.drop(['PDB ID', 'UniProtKB/Swiss-Prot ID'], axis=1, inplace=True)

# %% merge on prot_id: - gene_y
pdb_df = pdb_df.merge(biom_df.drop_duplicates('UniProtKB/Swiss-Prot ID'),
                      left_on='prot_id', right_on="UniProtKB/Swiss-Prot ID", how='left')
pdb_df.drop(['PDB ID', 'UniProtKB/Swiss-Prot ID'], axis=1, inplace=True)

#%%
biom_pdb_match_on_pdbID = pdb_df.gene_x.dropna().drop_duplicates()
print(' match on PDB ID:', len(biom_pdb_match_on_pdbID))
biom_pdb_match_on_prot_id = pdb_df.gene_y.dropna().drop_duplicates()
print(' match on prot_id:', len(biom_pdb_match_on_prot_id))

biom_concat = pd.concat([biom_pdb_match_on_pdbID, biom_pdb_match_on_prot_id]).drop_duplicates()
print('\nCombined match (not accounting for aliases):', len(biom_concat))

# cases where both pdb ID and prot_id match can cause issues if gene_x != gene_y resulting in double counting
# in above concat
pdb_df['gene'] = pdb_df.gene_x.combine_first(pdb_df.gene_y)
print(' pdb_df.gene_x.combine_first(pdb_df.gene_y):', len(pdb_df['gene'].dropna().drop_duplicates()))

# case where we match on prot_id and PDB ID can cause issues with mismatched counts due to
# different names for the gene (e.g.: due to aliases)
print("\n num genes where gene_x != gene_y:",
      len(pdb_df[pdb_df['gene_x'] != pdb_df['gene_y']].dropna().drop_duplicates(['gene_x','gene_y'])))
print(f'\n Total number of entries with a gene name: {len(pdb_df[~pdb_df.gene.isna()])}/{len(pdb_df)}')

# %% matching with kiba gene names as our starting test set
kiba_test_df = pd.read_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/kiba_test.csv')
kiba_test_df = kiba_test_df[['gene']].drop_duplicates()

# only 171 rows from merging with kiba...
pdb_test_df = pdb_df.merge(kiba_test_df, on='gene', how='inner').drop_duplicates(['code', 'SMILE'])
print('Number of entries after merging gene names with kiba test set:', len(pdb_test_df))
print(' Number of genes:', len(pdb_test_df.gene.drop_duplicates()))

# %% adding any davis test set genes
davis_df = pd.read_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/davis_test.csv')
davis_test_prots = set(davis_df.prot_id.str.split('(').str[0])
pdb_davis_gene_overlap = pdb_df[pdb_df.gene.isin(davis_test_prots)].gene.value_counts()
print("Total # of gene overlap with davis TEST set:", len(pdb_davis_gene_overlap))
print(" # of entries in pdb:", pdb_davis_gene_overlap.sum())

pdb_test_df = pd.concat([pdb_test_df, pdb_df[pdb_df.gene.isin(davis_test_prots)]],
                        axis=0).drop_duplicates(['code', 'SMILE'])
print("# of entries in test set after adding davis genes: ", len(pdb_test_df))

#%% CONTINUE TO GET FROM OncoKB:
onco_df = pd.read_csv("/cluster/home/t122995uhn/projects/downloads/oncoKB_DrugGenePairList.csv")
pdb_join_onco = set(pdb_test_df.merge(onco_df.drop_duplicates("gene"), on="gene", how="left")['gene'])

#%%
remaining_onco = onco_df[~onco_df.gene.isin(pdb_join_onco)].drop_duplicates('gene')

# match with remaining pdb:
remaining_onco_pdb_df = pdb_df.merge(remaining_onco, on='gene', how="inner")
counts = remaining_onco_pdb_df.value_counts('gene')
print(counts)
print("total entries in pdb with remaining (not already in test set) onco genes", counts.sum())

# this only gives us 93 entries... so adding it to the rest would only give us 171+93=264 total entries
pdb_test_df = pd.concat([pdb_test_df, remaining_onco_pdb_df], axis=0).drop_duplicates(['code', 'SMILE'])
print("Combined pdb test set with remaining OncoKB genes entries:", len(pdb_test_df))  # 264 only

# %% Random sample to get the rest
# code is from balanced_kfold_split function
from collections import Counter
import numpy as np

# Get size for each dataset and indices
dataset_size = len(pdb_df)
test_size = int(0.1 * dataset_size)  # 1626
indices = list(range(dataset_size))

# getting counts for each unique protein
prot_counts = pdb_df['code'].value_counts().to_dict()
prots = list(prot_counts.keys())
np.random.shuffle(prots)

# manually selected prots:
test_prots = set(pdb_test_df.code)
# increment count by number of samples in test_prots
count = sum([prot_counts[p] for p in test_prots])

#%%
## Sampling remaining proteins for test set (if we are under the test_size)
for p in prots:  # O(k); k = number of proteins
    if count + prot_counts[p] < test_size:
        test_prots.add(p)
        count += prot_counts[p]

additional_prots = test_prots - set(pdb_test_df.code)
print('additional codes to add:', len(additional_prots))
print('                  count:', count)

#%% ADDING FINAL PROTS
rand_sample_df = pdb_df[pdb_df.code.isin(additional_prots)]
pdb_test_df = pd.concat([pdb_test_df, rand_sample_df], axis=0).drop_duplicates(['code'])

pdb_test_df.drop(['cancerType', 'drug'], axis=1, inplace=True)
print('Final test dataset for pdbbind:')
pdb_test_df

#%% saving
pdb_test_df.rename({"gene_x":"gene_matched_on_pdb_id", "gene_y": "gene_matched_on_uniprot_id"}, axis=1, inplace=True)
pdb_test_df.to_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/pdbbind_test.csv', index=False)
```

jyaacoub (Owner) commented Jul 8, 2024

This is resolved by 69add71, and we now have constant validation sets for each CV training run.

jyaacoub closed this as completed Jul 8, 2024
jyaacoub added a commit that referenced this issue Jul 9, 2024
still need to train esm variants to complete #113 for davis
jyaacoub linked a pull request Jul 9, 2024 that will close this issue
jyaacoub added a commit that referenced this issue Jul 9, 2024
jyaacoub added a commit that referenced this issue Jul 10, 2024
jyaacoub added a commit that referenced this issue Jul 10, 2024
Since the whole point of v115 is to compare the performance against aflow in an equal playing field. #113 #115
jyaacoub mentioned this issue Jul 10, 2024
jyaacoub added a commit that referenced this issue Jul 17, 2024
Unused parameters due to inheriting from DGraphDTA but not using the forward_pro method
jyaacoub added a commit that referenced this issue Jul 18, 2024
jyaacoub added a commit that referenced this issue Jul 22, 2024
jyaacoub mentioned this issue Jul 22, 2024
jyaacoub added a commit that referenced this issue Jul 24, 2024
jyaacoub added a commit that referenced this issue Jul 24, 2024
jyaacoub added a commit that referenced this issue Jul 29, 2024
jyaacoub added a commit that referenced this issue Aug 2, 2024
To limit distribution drift issues mentioned in #131.