ESM2 Finetuning refactor #574

farhadrgh · 2025-01-06T18:04:52Z

Description

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

CI Pipeline Configuration

Configure CI behavior by checking relevant boxes below. This will automatically apply labels.

SKIP_CI - Skip all continuous integration tests
INCLUDE_NOTEBOOKS_TESTS - Execute notebook validation tests in pytest

Note

By default, the notebook validation tests are skipped unless explicitly enabled.

Usage

TODO: Add code snippet

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly
I have added/updated tests as needed
All existing tests pass successfully

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

farhadrgh · 2025-01-06T18:05:23Z

Ported over the changes from #546

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

farhadrgh · 2025-01-09T22:28:28Z

/build-ci

sichu2023 · 2025-01-13T17:04:20Z

sub-packages/bionemo-esm2/src/bionemo/esm2/model/finetune/finetune_regressor.py

@@ -180,12 +180,12 @@ def get_loss_reduction_class(self) -> Type[RegressorLossReduction]:
        return RegressorLossReduction


-class InMemorySingleValueDataset(Dataset):
+class InMemorySingleValueDataset(InMemoryCSVDataset):


Should we move this into dataset.py or anywhere under data?

sichu2023 · 2025-01-13T17:05:47Z

sub-packages/bionemo-esm2/src/bionemo/esm2/model/finetune/finetune_token_classifier.py

@@ -205,12 +190,12 @@ def get_loss_reduction_class(self) -> Type[ClassifierLossReduction]:
        return ClassifierLossReduction


-class InMemoryPerTokenValueDataset(Dataset):
+class InMemoryPerTokenValueDataset(InMemoryCSVDataset):


Similarly, should we move this under dataset.py? For example, esm2/data/finetune/dataset.py or similar.

sichu2023 · 2025-01-13T17:13:26Z

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py

+    dataset_class_options: Dict[str, Type[InMemoryCSVDataset]] = SUPPORTED_DATASETS
+
+    def dataset_class_type(desc: str) -> Type[InMemoryCSVDataset]:
+        try:
+            return dataset_class_options[desc]
+        except KeyError:
+            raise argparse.ArgumentTypeError(
+                f"Do not recognize key {desc}, valid options are: {dataset_class_options.keys()}"


@pstjohn took a similar approach, inheriting from both str and Enum to streamline argument parsing:

https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-llm/src/bionemo/llm/model/biobert/transformer_specs.py#L47
https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py#L586
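For readers unfamiliar with the pattern, here is a minimal sketch of a str-plus-Enum argument type. The class name `DatasetChoice`, its member names, and the helper `parse_dataset_choice` are hypothetical and are not taken from the linked files; the `--dataset-class` flag name is only inferred from `args.dataset_class` elsewhere in this PR.

```python
import argparse
from enum import Enum
from typing import List


class DatasetChoice(str, Enum):
    """Hypothetical option names, used only for illustration."""

    single_value = "single_value"
    per_token_value = "per_token_value"


def parse_dataset_choice(argv: List[str]) -> DatasetChoice:
    """Parse a dataset option from argv using the enum itself as the argparse type."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset-class",
        # argparse calls DatasetChoice(<raw string>), which is a lookup by value.
        type=DatasetChoice,
        choices=list(DatasetChoice),
        default=DatasetChoice.single_value,
    )
    return parser.parse_args(argv).dataset_class
```

Because the members are also strings, the parsed value compares equal to its plain string form. An unrecognized value raises `ValueError` inside the enum lookup, which argparse turns into a usage error, so no hand-written `ArgumentTypeError` is needed.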

sichu2023 · 2025-01-13T17:41:46Z

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py

+    config_class: Type[BioBertConfig] = ESM2FineTuneSeqConfig,
+    metric_tracker: Callback | None = None,
+    overlap_grad_reduce: bool = True,
+    overlap_param_gather: bool = False,  # TODO waiting for a NeMo fix


Updated in https://github.com/NVIDIA/bionemo-framework/pull/582/files

sichu2023 · 2025-01-13T17:42:03Z

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py

+        dataset_class=args.dataset_class,
+        config_class=args.config_class,
+        overlap_grad_reduce=not args.no_overlap_grad_reduce,
+        overlap_param_gather=args.overlap_param_gather,


Updated in https://github.com/NVIDIA/bionemo-framework/pull/582/files

sichu2023 · 2025-01-13T17:42:23Z

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py

+    parser.add_argument(
+        "--overlap-param-gather",
+        action="store_true",
+        default=False,
+    )  # TODO waiting for a NeMo fix


Updated in https://github.com/NVIDIA/bionemo-framework/pull/582/files
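As a rough illustration of how the two overlap options in these hunks reach the training call, the sketch below pairs an opt-out flag with an opt-in flag. The `--no-overlap-grad-reduce` flag name is an assumption inferred from `args.no_overlap_grad_reduce` in the call site, and `parse_overlap_flags` is a hypothetical helper, not a function in this PR.

```python
import argparse
from typing import Dict, List


def parse_overlap_flags(argv: List[str]) -> Dict[str, bool]:
    """Map the two command-line flags to the keyword arguments used at the call site."""
    parser = argparse.ArgumentParser()
    # Assumed flag name, inferred from `args.no_overlap_grad_reduce`.
    # Opt-out: gradient-reduce overlap stays on unless this flag is passed.
    parser.add_argument("--no-overlap-grad-reduce", action="store_true", default=False)
    # Flag as shown in the diff. Opt-in: parameter-gather overlap stays off by default.
    parser.add_argument("--overlap-param-gather", action="store_true", default=False)
    args = parser.parse_args(argv)
    return {
        "overlap_grad_reduce": not args.no_overlap_grad_reduce,
        "overlap_param_gather": args.overlap_param_gather,
    }
```

With no flags, this yields `overlap_grad_reduce=True` and `overlap_param_gather=False`, matching the defaults in the function signature shown earlier in the review.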

farhadrgh added 3 commits January 6, 2025 10:01

refactor datasets

22bb887

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

refactor datasets

4848f9c

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

add finetune script

db955ef

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

farhadrgh and others added 9 commits January 6, 2025 12:52

fix typing

4cac6d7

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

simplify, use 650m

900ef00

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

Merge branch 'main' into farhadr/ft_refactor

d7d526b

executable finetune_esm2

5e2dd4b

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

typing

4577bc2

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

resolve conflicts

9154c56

add finetune notebook

8fc64a3

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

move test

087341a

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

deprecate old example

3eb8740

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

farhadrgh marked this pull request as ready for review January 9, 2025 17:58

farhadrgh requested review from jstjohn, malcolmgreaves, skothenhill-nv, ohadmo, pstjohn, trvachov and dorotat-nv as code owners January 9, 2025 17:58

farhadrgh and others added 4 commits January 9, 2025 10:49

update tests

ca2a89a

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

add unit tests

1efd34c

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

update brev.dev link

884bbf9

Signed-off-by: Farhad Ramezanghorbani <[email protected]>

Merge branch 'main' into farhadr/ft_refactor

b93ed37

farhadrgh requested a review from sichu2023 January 13, 2025 16:25

sichu2023 reviewed Jan 13, 2025
