
Add CalmPropertyDataset & CalmLinearProbe Callbacks #34

Open · wants to merge 7 commits into base: main
Conversation

taylormjs (Collaborator)

  • Add CalmPropertyDataset, which downloads from the CaLM GitHub repo and from Zenodo (after putting the GO data on Zenodo)
  • Add CalmLinearProbeCallback, which randomly subsamples large datasets down to 3,000 examples
  • Add tests for both of these
  • Update LinearProbeCallback to handle multilabel tasks -- CaLM has four multilabel tasks (function & localization)
  • Update the __init__s accordingly

test_size: float = 0.2,
max_samples: int = 3000,
seed: int = 42,
):
Collaborator:
can you add a docstring, especially explaining `tasks` and `species`?
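For reference, a rough sketch of what that could look like — the `tasks` / `species` type hints and descriptions are my guess at the semantics, not the final wording:

```python
from collections.abc import Sequence


def __init__(
    self,
    tasks: Sequence[str] | None = None,    # assumed annotation: which CaLM tasks to probe
    species: Sequence[str] | None = None,  # assumed annotation: species for species-specific tasks
    test_size: float = 0.2,
    max_samples: int = 3000,
    seed: int = 42,
):
    """Linear-probe callback over the CaLM property-prediction tasks.

    Parameters
    ----------
    tasks:
        CaLM tasks to evaluate (e.g. the GO function tasks and localization).
        If None, all supported tasks are used.
    species:
        Species to include for species-specific tasks. If None, all species are used.
    test_size:
        Fraction of each dataset held out as the probe's test split.
    max_samples:
        Datasets larger than this are randomly subsampled down to this many examples.
    seed:
        Seed for the subsampling / splitting RNG.
    """
```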

) -> Tuple[CalmPropertyDataset, CalmPropertyDataset]:
    """Create train/test splits for a given task."""

    rng = np.random.RandomState(self.seed)  # TODO - seed everything fn
Collaborator:
hmm, does this work better than, or equivalently to, Lightning's `seed_everything`?
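For comparison, just a sketch: `seed_everything` sets the global `random` / NumPy / torch seeds, whereas a local `RandomState` scopes the randomness to this split and leaves global state alone — which might actually be preferable inside a callback.

```python
import numpy as np
from lightning.pytorch import seed_everything

# Option A: global seeding -- seeds Python's random, NumPy, and torch in one call.
seed_everything(42, workers=True)

# Option B: a local RNG scoped to this split; global RNG state is untouched.
rng = np.random.RandomState(42)
subsample = rng.permutation(10_000)[:3_000]  # e.g. subsample indices down to max_samples
```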

from torch import Tensor
from torch.utils.data import DataLoader
from torchmetrics import AUROC, Accuracy, F1Score, MeanSquaredError, R2Score, SpearmanCorrCoef

-TaskType = Literal["regression", "binary", "multiclass"]
+TaskType = Literal["regression", "binary", "multiclass", "multilabel"]
Collaborator:
is multilabel different from multiclass?
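It is — multiclass assigns each example exactly one of N mutually exclusive classes, while multilabel lets each example carry any subset of N labels at once (which is what the GO function and localization tasks need). A small torchmetrics sketch of the difference; the shapes and label counts are made up:

```python
import torch
from torchmetrics import Accuracy

# Multiclass: each example belongs to exactly one of 5 classes.
multiclass_acc = Accuracy(task="multiclass", num_classes=5)
preds = torch.tensor([1, 0, 4])            # one class index per example
target = torch.tensor([1, 2, 4])
print(multiclass_acc(preds, target))

# Multilabel: each example can carry any subset of 5 labels (e.g. multiple GO terms).
multilabel_acc = Accuracy(task="multilabel", num_labels=5)
preds = torch.tensor([[0.9, 0.1, 0.8, 0.2, 0.6],
                      [0.2, 0.7, 0.4, 0.9, 0.1]])   # per-label probabilities
target = torch.tensor([[1, 0, 1, 0, 1],
                       [0, 1, 0, 1, 0]])            # binary indicator per label
print(multilabel_acc(preds, target))
```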


CALM_DATA_GITHUB_URL = "https://raw.githubusercontent.com/oxpig/CaLM/main/data"
FUNCTION_ZENODO_BASE_URL = (
"https://zenodo.org/records/14890750/files" # Gene Ontology datasets processed & uploaded on Zenodo
Collaborator:
switch to HF

Collaborator:
I was going to write something similar about CALM_DATA_GITHUB_URL -- is this a subset of the CaLM dataset from Nathan's HF account?

Collaborator:
if it's in HF, we can also load it more easily with datahaha once we switch
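For what it's worth, once the files live on the Hub the loading side is a one-liner with `datasets`; the repo id and config name below are placeholders:

```python
from datasets import load_dataset

# Placeholder repo id / config -- whatever the CaLM property data ends up being published under.
ds = load_dataset("some-org/calm-property-tasks", "meltome", split="train")
df = ds.to_pandas()  # drops back into the same pandas handling as the current pd.read_csv path
```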

self.data = pd.read_csv(file_path)

if columns is None:
    if task == Task.FUNCTION_BP:
Collaborator:
maybe a match case here to handle the logic?
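Something like this could work — a sketch only; the enum members beyond `FUNCTION_BP` and the column names are placeholders for whatever the real defaults are:

```python
if columns is None:
    match task:
        case Task.FUNCTION_BP | Task.FUNCTION_CC | Task.FUNCTION_MF:  # GO tasks (assumed members)
            columns = ["sequence", "go_terms"]        # placeholder column names
        case Task.LOCALIZATION:                       # assumed member
            columns = ["sequence", "localization"]    # placeholder column names
        case _:
            columns = ["sequence", "target"]          # generic fallback
```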

@@ -50,46 +43,18 @@ def __init__(
self._mask_percentage = mask_percentage
self.max_length = max_length

# TODO zadorozk: currently only accepts one tokenizer at a time
Collaborator:
undo delete

],
}

# File hashes to check that upstream data files haven't changed. Makes the data download cleaner
Collaborator:
One thing I don't love about hashes is that pooch currently just errors out when the files are different. Maybe we'd want to add custom logic that just downloads the new files instead. Could be something for a future MR, though.
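For context, this is the behaviour in question — `pooch.retrieve` compares the downloaded file against `known_hash` and raises on mismatch rather than re-fetching, so any upstream edit becomes a hard error. The file path and hash below are placeholders:

```python
import pooch

CALM_DATA_GITHUB_URL = "https://raw.githubusercontent.com/oxpig/CaLM/main/data"

# Hash-checked download: pooch raises if the file no longer matches known_hash,
# instead of transparently accepting the new upstream version.
path = pooch.retrieve(
    url=f"{CALM_DATA_GITHUB_URL}/meltome/meltome_data.csv",  # placeholder file path
    known_hash="sha256:" + "0" * 64,                         # placeholder hash
    progressbar=True,
)
```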

from torch import Tensor
from torch.utils.data import Dataset

from lobster.constants._calm_tasks import (
karinazad (Collaborator), Feb 20, 2025:

generally not the best practice to import from modules with an underscore. If they're exported in constants/__init__.py, we can just do `from lobster.constants import ...`
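i.e. re-export them from the package `__init__` and import from there — a sketch, assuming `_calm_tasks` defines these names:

```python
# lobster/constants/__init__.py
from ._calm_tasks import CALM_DATA_GITHUB_URL, FUNCTION_ZENODO_BASE_URL, Task

__all__ = ["CALM_DATA_GITHUB_URL", "FUNCTION_ZENODO_BASE_URL", "Task"]

# call sites then import from the public package instead of the private module:
# from lobster.constants import Task
```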

@@ -35,8 +28,8 @@ def __init__(
eps: float = 1e-12,
num_training_steps: int = 10_000,
num_warmup_steps: int = 1_000,
tokenizer: Union[str, PreTrainedTokenizer, PreTrainedTokenizerFast] = "amino_acid_tokenizer",
Collaborator:
why was this deleted?

    tokenizer_transform_class = PmlmTokenizerTransform

elif tokenizer == "amino_acid_tokenizer":
    tokenizer = AminoAcidTokenizerFast()
Collaborator:
we could probably get rid of tokenizer inside the model altogether (and also sequence_to_latents?) as it's not needed for training

what do you think @ncfrey

otherwise we should probably keep the option to provide a custom tokenizer
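If we do keep it, one way to preserve the custom-tokenizer option is to accept either a registered name or an already-constructed tokenizer, roughly like this (a sketch; the import path for AminoAcidTokenizerFast is assumed):

```python
from typing import Union

from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast

from lobster.tokenization import AminoAcidTokenizerFast  # assumed import path


def resolve_tokenizer(
    tokenizer: Union[str, PreTrainedTokenizer, PreTrainedTokenizerFast],
) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
    """Accept either a registered tokenizer name or a user-supplied tokenizer instance."""
    if isinstance(tokenizer, (PreTrainedTokenizer, PreTrainedTokenizerFast)):
        return tokenizer  # custom tokenizer passed in directly
    if tokenizer == "amino_acid_tokenizer":
        return AminoAcidTokenizerFast()  # default, as in the current branch
    raise ValueError(f"Unknown tokenizer: {tokenizer!r}")
```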

Collaborator:
nice!
