
Add CalmPropertyDataset & CalmLinearProbe Callbacks #34

Open · wants to merge 7 commits into base: main
Conversation

taylormjs (Collaborator)

  • Add CalmPropertyDataset, which downloads from the CaLM GitHub repo and from Zenodo (after putting the GO data on Zenodo)
  • Add CalmLinearProbeCallback, which randomly subsamples large datasets down to 3,000 examples
  • Add tests for both of these
  • Update LinearProbeCallback to handle multilabel tasks -- CaLM has four multilabel tasks (function & localization)
  • Update the __init__s accordingly

test_size: float = 0.2,
max_samples: int = 3000,
seed: int = 42,
):
Collaborator:
can you add a docstring, especially explaining `tasks` and `species`?
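For reference, a rough sketch of what that could look like — the `tasks` / `species` type hints and descriptions are my guess at the semantics, not the final wording:

```python
from collections.abc import Sequence


def __init__(
    self,
    tasks: Sequence[str] | None = None,    # assumed annotation: which CaLM tasks to probe
    species: Sequence[str] | None = None,  # assumed annotation: species for species-specific tasks
    test_size: float = 0.2,
    max_samples: int = 3000,
    seed: int = 42,
):
    """Linear-probe callback over the CaLM property-prediction tasks.

    Parameters
    ----------
    tasks:
        CaLM tasks to evaluate (e.g. the GO function tasks and localization).
        If None, all supported tasks are used.
    species:
        Species to include for species-specific tasks. If None, all species are used.
    test_size:
        Fraction of each dataset held out as the probe's test split.
    max_samples:
        Datasets larger than this are randomly subsampled down to this many examples.
    seed:
        Seed for the subsampling / splitting RNG.
    """
```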

) -> Tuple[CalmPropertyDataset, CalmPropertyDataset]:
    """Create train/test splits for a given task."""

    rng = np.random.RandomState(self.seed)  # TODO - seed everything fn
Collaborator:
hmm, does this work better than, or equivalently to, Lightning's `seed_everything`?
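For comparison, just a sketch: `seed_everything` sets the global `random` / NumPy / torch seeds, whereas a local `RandomState` scopes the randomness to this split and leaves global state alone — which might actually be preferable inside a callback.

```python
import numpy as np
from lightning.pytorch import seed_everything

# Option A: global seeding -- seeds Python's random, NumPy, and torch in one call.
seed_everything(42, workers=True)

# Option B: a local RNG scoped to this split; global RNG state is untouched.
rng = np.random.RandomState(42)
subsample = rng.permutation(10_000)[:3_000]  # e.g. subsample indices down to max_samples
```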

from torch import Tensor
from torch.utils.data import DataLoader
from torchmetrics import AUROC, Accuracy, F1Score, MeanSquaredError, R2Score, SpearmanCorrCoef

-TaskType = Literal["regression", "binary", "multiclass"]
+TaskType = Literal["regression", "binary", "multiclass", "multilabel"]
Collaborator:
is multilabel different from multiclass?
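It is — multiclass assigns each example exactly one of N mutually exclusive classes, while multilabel lets each example carry any subset of N labels at once (which is what the GO function and localization tasks need). A small torchmetrics sketch of the difference; the shapes and label counts are made up:

```python
import torch
from torchmetrics import Accuracy

# Multiclass: each example belongs to exactly one of 5 classes.
multiclass_acc = Accuracy(task="multiclass", num_classes=5)
preds = torch.tensor([1, 0, 4])            # one class index per example
target = torch.tensor([1, 2, 4])
print(multiclass_acc(preds, target))

# Multilabel: each example can carry any subset of 5 labels (e.g. multiple GO terms).
multilabel_acc = Accuracy(task="multilabel", num_labels=5)
preds = torch.tensor([[0.9, 0.1, 0.8, 0.2, 0.6],
                      [0.2, 0.7, 0.4, 0.9, 0.1]])   # per-label probabilities
target = torch.tensor([[1, 0, 1, 0, 1],
                       [0, 1, 0, 1, 0]])            # binary indicator per label
print(multilabel_acc(preds, target))
```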


CALM_DATA_GITHUB_URL = "https://raw.githubusercontent.com/oxpig/CaLM/main/data"
FUNCTION_ZENODO_BASE_URL = (
"https://zenodo.org/records/14890750/files" # Gene Ontology datasets processed & uploaded on Zenodo
Collaborator:
switch to HF

Collaborator:
I was going to write something similar about CALM_DATA_GITHUB_URL -- is this a subset of the CaLM dataset from Nathan's HF account?

Collaborator:
if it's in HF, we can also load it more easily with datahaha once we switch
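For what it's worth, once the files live on the Hub the loading side is a one-liner with `datasets`; the repo id and config name below are placeholders:

```python
from datasets import load_dataset

# Placeholder repo id / config -- whatever the CaLM property data ends up being published under.
ds = load_dataset("some-org/calm-property-tasks", "meltome", split="train")
df = ds.to_pandas()  # drops back into the same pandas handling as the current pd.read_csv path
```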

self.data = pd.read_csv(file_path)

if columns is None:
    if task == Task.FUNCTION_BP:
Collaborator:
maybe a match case here to handle the logic?
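Something like this could work — a sketch only; the enum members beyond `FUNCTION_BP` and the column names are placeholders for whatever the real defaults are:

```python
if columns is None:
    match task:
        case Task.FUNCTION_BP | Task.FUNCTION_CC | Task.FUNCTION_MF:  # GO tasks (assumed members)
            columns = ["sequence", "go_terms"]        # placeholder column names
        case Task.LOCALIZATION:                       # assumed member
            columns = ["sequence", "localization"]    # placeholder column names
        case _:
            columns = ["sequence", "target"]          # generic fallback
```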

@@ -50,46 +43,18 @@ def __init__(
self._mask_percentage = mask_percentage
self.max_length = max_length

# TODO zadorozk: currently only accepts one tokenizer at a time
Collaborator:
undo delete

],
}

# File hashes to check that upstream data files haven't changed. Makes the data download cleaner
Collaborator:
One thing I don't love about hashes is that pooch currently just errors out when the files are different. Maybe we'd want to add custom logic that just downloads the new files instead. Could be something for a future MR, though.
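For context, this is the behaviour in question — `pooch.retrieve` compares the downloaded file against `known_hash` and raises on mismatch rather than re-fetching, so any upstream edit becomes a hard error. The file path and hash below are placeholders:

```python
import pooch

CALM_DATA_GITHUB_URL = "https://raw.githubusercontent.com/oxpig/CaLM/main/data"

# Hash-checked download: pooch raises if the file no longer matches known_hash,
# instead of transparently accepting the new upstream version.
path = pooch.retrieve(
    url=f"{CALM_DATA_GITHUB_URL}/meltome/meltome_data.csv",  # placeholder file path
    known_hash="sha256:" + "0" * 64,                         # placeholder hash
    progressbar=True,
)
```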

from torch import Tensor
from torch.utils.data import Dataset

from lobster.constants._calm_tasks import (
karinazad (Collaborator), Feb 20, 2025:

generally not the best practice to import from modules with an underscore. If they're exported in constants/__init__.py, we can just do `from lobster.constants import ...`
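i.e. re-export them from the package `__init__` and import from there — a sketch, assuming `_calm_tasks` defines these names:

```python
# lobster/constants/__init__.py
from ._calm_tasks import CALM_DATA_GITHUB_URL, FUNCTION_ZENODO_BASE_URL, Task

__all__ = ["CALM_DATA_GITHUB_URL", "FUNCTION_ZENODO_BASE_URL", "Task"]

# call sites then import from the public package instead of the private module:
# from lobster.constants import Task
```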

@@ -35,8 +28,8 @@ def __init__(
eps: float = 1e-12,
num_training_steps: int = 10_000,
num_warmup_steps: int = 1_000,
tokenizer: Union[str, PreTrainedTokenizer, PreTrainedTokenizerFast] = "amino_acid_tokenizer",
Collaborator:
why was this deleted?

    tokenizer_transform_class = PmlmTokenizerTransform

elif tokenizer == "amino_acid_tokenizer":
    tokenizer = AminoAcidTokenizerFast()
Collaborator:
we could probably get rid of tokenizer inside the model altogether (and also sequence_to_latents?) as it's not needed for training

what do you think @ncfrey

otherwise we should probably keep the option to provide a custom tokenizer
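If we do keep it, one way to preserve the custom-tokenizer option is to accept either a registered name or an already-constructed tokenizer, roughly like this (a sketch; the import path for AminoAcidTokenizerFast is assumed):

```python
from typing import Union

from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast

from lobster.tokenization import AminoAcidTokenizerFast  # assumed import path


def resolve_tokenizer(
    tokenizer: Union[str, PreTrainedTokenizer, PreTrainedTokenizerFast],
) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
    """Accept either a registered tokenizer name or a user-supplied tokenizer instance."""
    if isinstance(tokenizer, (PreTrainedTokenizer, PreTrainedTokenizerFast)):
        return tokenizer  # custom tokenizer passed in directly
    if tokenizer == "amino_acid_tokenizer":
        return AminoAcidTokenizerFast()  # default, as in the current branch
    raise ValueError(f"Unknown tokenizer: {tokenizer!r}")
```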

Collaborator:
nice!
