Retrain models on aflow subset to compare against it. #115

Closed · Tracked by #112 · Fixed by #118
jyaacoub opened this issue Jul 5, 2024 · 2 comments

jyaacoub commented Jul 5, 2024

For an "aflow" subset I think it would be best to have it be seperate from all the other runs we are doing, similar to how we now have the results/v113 path for #113 unified CV sets we should have a results/v115 where we store the retrained models on a smaller subset defined by the splits in aflow directories created.


jyaacoub commented Jul 8, 2024

For this we simply apply the resplit function defined below, with the second argument, split_files, set to the directory path of the respective aflow version of that dataset (e.g., "nomsa_aflow_original_binary" for "nomsa_binary_original_binary"). A short usage sketch follows the function.

import os
import pandas as pd
# BaseDataset is the project's dataset base class (imported from the repo's data module).

def resplit(dataset: str | BaseDataset, split_files: dict | str = None, **kwargs):
    """
    1. Takes as input the target dataset path or dataset object, and a dict defining the 6 splits
       for all 5 folds + 1 test set.
        - A decorator automatically converts a dataset path into a dataset object.
        - split_files should be a dict of csv files, each containing the proteins for a split,
          where the keys are: val0, val1, val2, val3, val4, test.
        - Training sets are built from the remaining proteins (i.e., proteins not in any of the
          val/test sets).
    2. Deletes existing splits.
    3. Builds new splits using Dataset.save_subset().

    Args:
        dataset (str | BaseDataset): path to the full dataset directory, or a dataset object.
        split_files (dict | str, optional): dictionary of csvs for each of the 5 folds + the test
            set, where keys are val0, val1, val2, val3, val4, test and values are paths to csvs
            with a "prot_id" column. OR a path to another dataset directory whose split you want
            to match, in which case the csvs are extracted from it. Defaults to None.

    Raises:
        ValueError: no split_files provided.
        ValueError: split_files must contain 6 files for the 5 folds and the test set.
        ValueError: a split file does not exist.

    Returns:
        BaseDataset: dataset object for the "full" dataset.
    """
    if split_files is None:
        raise ValueError('split_files must be provided')

    # If another dataset directory is given, pull its per-split csvs.
    if isinstance(split_files, str):
        csv_files = {}
        for split in ['test'] + [f'val{i}' for i in range(5)]:
            csv_files[split] = f'{split_files}/{split}/cleaned_XY.csv'
        split_files = csv_files
        print('Using split files from:', split_files)

    # Check that split files exist and are in the correct format.
    assert 'test' in split_files, 'Missing test csv from split files.'
    if len(split_files) != 6:
        raise ValueError('split_files must contain 6 files for the 5 folds and test set')
    for f in split_files.values():
        if not os.path.exists(f):
            raise ValueError(f'{f} does not exist')

    # Getting indices for each split based on dataset.df
    split_files = split_files.copy()
    test_prots = set(pd.read_csv(split_files['test'])['prot_id'])
    test_idxs = [i for i in range(len(dataset.df)) if dataset.df.iloc[i]['prot_id'] in test_prots]
    dataset.save_subset(test_idxs, 'test')
    del split_files['test']

    # Building the folds
    for k, v in split_files.items():
        prots = set(pd.read_csv(v)['prot_id'])
        val_idxs = [i for i in range(len(dataset.df)) if dataset.df.iloc[i]['prot_id'] in prots]
        dataset.save_subset(val_idxs, k)

        # Build the training set for this fold from all proteins not in its val set or the test set.
        idxs = set(val_idxs + test_idxs)
        train_idxs = [i for i in range(len(dataset.df)) if i not in idxs]
        dataset.save_subset(train_idxs, k.replace('val', 'train'))

    return dataset
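
A minimal usage sketch, assuming the decorator mentioned in the docstring converts a path string into a dataset object; the data/ prefixes and splits/ csv paths are hypothetical, only the dataset directory names come from the comment above:

```python
# Hypothetical paths for illustration; only the directory names are from this issue.
FULL_DATASET_DIR  = 'data/nomsa_binary_original_binary'   # dataset to re-split (assumed location)
AFLOW_DATASET_DIR = 'data/nomsa_aflow_original_binary'    # aflow version whose splits we match (assumed location)

# Option 1: pass the aflow dataset directory; resplit reads
# {AFLOW_DATASET_DIR}/test/cleaned_XY.csv and {AFLOW_DATASET_DIR}/val0..val4/cleaned_XY.csv from it.
dataset = resplit(FULL_DATASET_DIR, split_files=AFLOW_DATASET_DIR)

# Option 2: pass the per-split csvs explicitly (each csv must have a "prot_id" column).
split_csvs = {'test': 'splits/test.csv',                   # hypothetical csv paths
              **{f'val{i}': f'splits/val{i}.csv' for i in range(5)}}
dataset = resplit(FULL_DATASET_DIR, split_files=split_csvs)
```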

jyaacoub added a commit that referenced this issue Jul 9, 2024
Allows us to get exactly the same prots from another dataset, like the alphaflow dataset, which is limited due to longer proteins.
jyaacoub added a commit that referenced this issue Jul 10, 2024
Since the whole point of v115 is to compare performance against aflow on a level playing field. #113 #115
@jyaacoub
Copy link
Owner Author

jyaacoub commented Jul 10, 2024

Conclusions don't change much:

Retrained model performance:

[image]

Previous non-aflow subset performance:

[image]

jyaacoub linked a pull request on Jul 10, 2024 that will close this issue.