Retrain models on aflow subset to compare against it. #115

Closed · Tracked by #112 · Fixed by #118
jyaacoub opened this issue Jul 5, 2024 · 2 comments

jyaacoub commented Jul 5, 2024

For an "aflow" subset I think it would be best to have it be seperate from all the other runs we are doing, similar to how we now have the results/v113 path for #113 unified CV sets we should have a results/v115 where we store the retrained models on a smaller subset defined by the splits in aflow directories created.


jyaacoub commented Jul 8, 2024

For this we simply apply the resplit function defined below, with the second argument, split_files, set to the directory path of the respective aflow version of that dataset (e.g., "nomsa_aflow_original_binary" for "nomsa_binary_original_binary"). A short usage sketch follows the function.

import os
import pandas as pd
# BaseDataset is the project's dataset base class (imported from the repo's data module).

def resplit(dataset: str | BaseDataset, split_files: dict | str = None, **kwargs):
    """
    1. Takes as input the target dataset path or dataset object, and a dict defining the 6 splits
       for all 5 folds + 1 test set.
        - A decorator automatically converts a dataset path into a dataset object.
        - split_files should be a dict of csv files, each containing the proteins for a split,
          where the keys are: val0, val1, val2, val3, val4, test.
        - Training sets are built from the remaining proteins (i.e., proteins not in any of the
          val/test sets).
    2. Deletes existing splits.
    3. Builds new splits using Dataset.save_subset().

    Args:
        dataset (str | BaseDataset): path to the full dataset directory, or a dataset object.
        split_files (dict | str, optional): dictionary of csvs for each of the 5 folds + the test
            set, where keys are val0, val1, val2, val3, val4, test and values are paths to csvs
            with a "prot_id" column. OR a path to another dataset directory whose split you want
            to match, in which case the csvs are extracted from it. Defaults to None.

    Raises:
        ValueError: no split_files provided.
        ValueError: split_files must contain 6 files for the 5 folds and the test set.
        ValueError: a split file does not exist.

    Returns:
        BaseDataset: dataset object for the "full" dataset.
    """
    if split_files is None:
        raise ValueError('split_files must be provided')

    # If another dataset directory is given, pull its per-split csvs.
    if isinstance(split_files, str):
        csv_files = {}
        for split in ['test'] + [f'val{i}' for i in range(5)]:
            csv_files[split] = f'{split_files}/{split}/cleaned_XY.csv'
        split_files = csv_files
        print('Using split files from:', split_files)

    # Check that split files exist and are in the correct format.
    assert 'test' in split_files, 'Missing test csv from split files.'
    if len(split_files) != 6:
        raise ValueError('split_files must contain 6 files for the 5 folds and test set')
    for f in split_files.values():
        if not os.path.exists(f):
            raise ValueError(f'{f} does not exist')

    # Getting indices for each split based on dataset.df
    split_files = split_files.copy()
    test_prots = set(pd.read_csv(split_files['test'])['prot_id'])
    test_idxs = [i for i in range(len(dataset.df)) if dataset.df.iloc[i]['prot_id'] in test_prots]
    dataset.save_subset(test_idxs, 'test')
    del split_files['test']

    # Building the folds
    for k, v in split_files.items():
        prots = set(pd.read_csv(v)['prot_id'])
        val_idxs = [i for i in range(len(dataset.df)) if dataset.df.iloc[i]['prot_id'] in prots]
        dataset.save_subset(val_idxs, k)

        # Build the training set for this fold from all proteins not in its val set or the test set.
        idxs = set(val_idxs + test_idxs)
        train_idxs = [i for i in range(len(dataset.df)) if i not in idxs]
        dataset.save_subset(train_idxs, k.replace('val', 'train'))

    return dataset
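
A minimal usage sketch, assuming the decorator mentioned in the docstring converts a path string into a dataset object; the data/ prefixes and splits/ csv paths are hypothetical, only the dataset directory names come from the comment above:

```python
# Hypothetical paths for illustration; only the directory names are from this issue.
FULL_DATASET_DIR  = 'data/nomsa_binary_original_binary'   # dataset to re-split (assumed location)
AFLOW_DATASET_DIR = 'data/nomsa_aflow_original_binary'    # aflow version whose splits we match (assumed location)

# Option 1: pass the aflow dataset directory; resplit reads
# {AFLOW_DATASET_DIR}/test/cleaned_XY.csv and {AFLOW_DATASET_DIR}/val0..val4/cleaned_XY.csv from it.
dataset = resplit(FULL_DATASET_DIR, split_files=AFLOW_DATASET_DIR)

# Option 2: pass the per-split csvs explicitly (each csv must have a "prot_id" column).
split_csvs = {'test': 'splits/test.csv',                   # hypothetical csv paths
              **{f'val{i}': f'splits/val{i}.csv' for i in range(5)}}
dataset = resplit(FULL_DATASET_DIR, split_files=split_csvs)
```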

jyaacoub added a commit that referenced this issue Jul 9, 2024
Allows us to get exactly the same prots from another dataset, like the alphaflow dataset, which is limited due to longer proteins.
jyaacoub added a commit that referenced this issue Jul 10, 2024
Since the whole point of v115 is to compare performance against aflow on a level playing field. #113 #115
@jyaacoub
Copy link
Owner Author

jyaacoub commented Jul 10, 2024

Conclusions don't change much:

Retrained model performance:

[image]

Previous non-aflow subset performance:

[image]

jyaacoub linked a pull request on Jul 10, 2024 that will close this issue.