Add CalmPropertyDataset & CalmLinearProbe Callbacks #34
base: main
Conversation
taylormjs commented on Feb 20, 2025
- Add CalmPropertyDataset, which downloads from the CaLM GitHub repo and from Zenodo (after putting the GO data on Zenodo)
- Add CalmLinearProbeCallback, which randomly subsamples large datasets down to 3,000 examples (see the sketch after this list)
- Add tests for both of these
- Update LinearProbeCallback to handle multilabel tasks -- CaLM has four multilabel tasks (function & localization)
- Update __init__s accordingly
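
Roughly, the subsampling behaves like the following sketch (illustrative only; the function and variable names here are not taken from the PR):

import numpy as np
from torch.utils.data import Dataset, Subset

def subsample(dataset: Dataset, max_samples: int = 3000, seed: int = 42) -> Dataset:
    # Keep small datasets intact; draw a fixed-seed random subset from large ones.
    # Assumes a map-style dataset with a defined __len__.
    if len(dataset) <= max_samples:
        return dataset
    rng = np.random.RandomState(seed)
    indices = rng.choice(len(dataset), size=max_samples, replace=False)
    return Subset(dataset, indices.tolist())
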
        test_size: float = 0.2,
        max_samples: int = 3000,
        seed: int = 42,
    ):
can you add a docstring, especially for explaining tasks and species?
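
For instance, something in this direction (the parameter semantics described below are guesses for illustration, not taken from the PR):

class CalmLinearProbeCallback:
    """Linear-probe evaluation on CaLM property-prediction tasks.

    Parameters
    ----------
    tasks : optional sequence of task names
        Which CaLM tasks to evaluate; None means all tasks.  # assumed semantics
    species : optional sequence of species names
        For species-specific tasks, restrict evaluation to these species.  # assumed semantics
    test_size : float
        Fraction of each task's data held out for the probe's test split.
    max_samples : int
        Datasets larger than this are randomly subsampled down to this size.
    seed : int
        Random seed used for splitting and subsampling.
    """
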
    ) -> Tuple[CalmPropertyDataset, CalmPropertyDataset]:
        """Create train/test splits for a given task."""

        rng = np.random.RandomState(self.seed)  # TODO - seed everything fn
hmm, does this work better or equivalently to lightning's seed_everything?
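
For reference, the Lightning utility looks like this (a minimal sketch; whether it is preferable here is the open question above):

from lightning.pytorch import seed_everything

# Seeds Python's random module, NumPy's global RNG, and torch (CPU + CUDA) in one call.
# np.random.RandomState(seed) instead creates an isolated generator, which keeps the
# train/test split reproducible regardless of how much other code consumes randomness.
seed_everything(42, workers=True)
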
from torch import Tensor
from torch.utils.data import DataLoader
from torchmetrics import AUROC, Accuracy, F1Score, MeanSquaredError, R2Score, SpearmanCorrCoef

-TaskType = Literal["regression", "binary", "multiclass"]
+TaskType = Literal["regression", "binary", "multiclass", "multilabel"]
is multilabel different from multiclass?
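
(For context: multiclass assigns exactly one of N classes per sample, whereas multilabel allows each sample to carry several independent binary labels -- e.g. a protein annotated with multiple GO terms -- which is why the CaLM function and localization tasks need it. A small torchmetrics sketch of the difference:)

import torch
from torchmetrics import Accuracy

# Multiclass: each sample belongs to exactly one of num_classes classes.
multiclass_acc = Accuracy(task="multiclass", num_classes=4)
print(multiclass_acc(torch.tensor([0, 2, 1]), torch.tensor([0, 2, 3])))

# Multilabel: each sample has an independent 0/1 target for every label.
multilabel_acc = Accuracy(task="multilabel", num_labels=4)
preds = torch.tensor([[0.9, 0.1, 0.8, 0.2], [0.2, 0.7, 0.4, 0.9]])
targets = torch.tensor([[1, 0, 1, 0], [0, 1, 0, 1]])
print(multilabel_acc(preds, targets))
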
CALM_DATA_GITHUB_URL = "https://raw.githubusercontent.com/oxpig/CaLM/main/data"
FUNCTION_ZENODO_BASE_URL = (
    "https://zenodo.org/records/14890750/files"  # Gene Ontology datasets processed & uploaded on Zenodo
switch to HF
I was going to write something similar about CALM_DATA_GITHUB_URL -- is this a subset of the calm dataset from Nathan's HF account?
if it's in HF, we can also load it more easily with datasets once we switch
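
e.g. something like this once the files are on the Hub (the repo id below is a placeholder, not the real location):

from datasets import load_dataset

# Placeholder repo id -- the actual Hub dataset name/config is still to be decided.
ds = load_dataset("some-org/calm-property-data", split="train")
print(ds[0])
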
        self.data = pd.read_csv(file_path)

        if columns is None:
            if task == Task.FUNCTION_BP:
maybe a match case here to handle the logic?
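
e.g. a sketch along these lines (assumes Task is imported; every Task member except FUNCTION_BP, and all column names, are made up for illustration):

def default_columns(task: Task) -> list[str]:
    # Map each task to the columns read from its CSV; the wildcard handles single-target tasks.
    match task:
        case Task.FUNCTION_BP | Task.FUNCTION_CC | Task.FUNCTION_MF:
            return ["sequence", "go_terms"]
        case Task.LOCALIZATION:
            return ["sequence", "localization"]
        case _:
            return ["sequence", "target"]
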
@@ -50,46 +43,18 @@ def __init__(
        self._mask_percentage = mask_percentage
        self.max_length = max_length

        # TODO zadorozk: currently only accepts one tokenizer at a time
undo delete
    ],
}

# File hashes to check that upstream data files haven't been changed. Makes data download cleaner
One thing I don't love about hashes is that pooch currently just errors out when the files are different. Maybe we'd want to add custom logic where it just downloads new files instead. Could be something for a future MR though
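
Something like this could work as the custom logic (a rough sketch; it assumes pooch raises ValueError on a hash mismatch and glosses over cache details):

import os
import pooch

def fetch_calm_file(url: str, fname: str, known_hash: str, cache_dir: str) -> str:
    try:
        # Normal path: download (or reuse the cached copy) and verify against the recorded hash.
        return pooch.retrieve(url=url, fname=fname, known_hash=known_hash, path=cache_dir)
    except ValueError:
        # Upstream file changed and no longer matches the recorded hash:
        # drop the stale cached copy (if any) and accept the new file unverified.
        stale_path = os.path.join(cache_dir, fname)
        if os.path.exists(stale_path):
            os.remove(stale_path)
        return pooch.retrieve(url=url, fname=fname, known_hash=None, path=cache_dir)
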
from torch import Tensor
from torch.utils.data import Dataset

from lobster.constants._calm_tasks import (
generally not the best practice to import from modules with an underscore. If they are re-exported in constants/__init__.py, we can just do from lobster.constants
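
i.e. something like this (which symbols get re-exported is an assumption):

# lobster/constants/__init__.py -- re-export the private module's public names
from lobster.constants._calm_tasks import CALM_TASKS, Task  # noqa: F401

# then, in the dataset module, import from the public package instead of the private module
from lobster.constants import CALM_TASKS, Task
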
@@ -35,8 +28,8 @@ def __init__(
        eps: float = 1e-12,
        num_training_steps: int = 10_000,
        num_warmup_steps: int = 1_000,
        tokenizer: Union[str, PreTrainedTokenizer, PreTrainedTokenizerFast] = "amino_acid_tokenizer",
why was this deleted?
            tokenizer_transform_class = PmlmTokenizerTransform

        elif tokenizer == "amino_acid_tokenizer":
            tokenizer = AminoAcidTokenizerFast()
we could probably get rid of tokenizer inside the model altogether (and also sequence_to_latents?) as it's not needed for training -- what do you think @ncfrey?
otherwise we should probably keep the option to provide a custom tokenizer
nice!