Merge branch 'main' into jstjohn/process_dataset_perf

jstjohn authored Nov 6, 2024
2 parents b2fd9f8 + 6773d00 commit 2e1078d
Showing 28 changed files with 2,807 additions and 23 deletions.
118 changes: 117 additions & 1 deletion README.md
@@ -155,7 +155,19 @@ git tag MY-VERSION-TAG
uv build /sub-packages/bionemo-core
TWINE_PASSWORD="<pypi pass>" TWINE_USERNAME="<pypi user>" uvx twine upload /sub-packages/bionemo-core/dist/*
```
## Pydantic Configuration

BioNeMo 2 provides two entrypoints for models: one based on argparse and one based on Pydantic. Both are documented in the `Models` section below.
Pydantic-based configuration is designed to accept a JSON configuration file as input, along with context-specific arguments (e.g., should we resume from an existing checkpoint?). These JSON configs are run through a Pydantic validator, referred to here as `MainConfig`. This config is composed of several other Pydantic models; see the class definition for details. To pre-populate a config with reasonable defaults for various standard models, we provide 'recipes': simple methods that instantiate the config object and then serialize it to a JSON configuration file. You can either submit the resulting file directly or first modify its parameters to fit your use case; Weights and Biases settings, devices, precision, and dataset options are all common things to modify. You would then submit this config for training.
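
As a rough sketch of that recipe-and-validation flow (the `MainConfig` name comes from the description above; the fields, sub-configs, and helper functions below are illustrative assumptions, not the actual BioNeMo schema):

```python
# Illustrative sketch of the recipe -> JSON -> validated-config flow described above.
# Field names and sub-configs are assumptions, not the actual BioNeMo MainConfig schema.
from pydantic import BaseModel


class TrainingConfig(BaseModel):
    """Hypothetical sub-config; the real MainConfig is composed of several such models."""

    devices: int = 1
    precision: str = "bf16-mixed"
    num_dataset_workers: int = 1


class MainConfig(BaseModel):
    experiment_name: str = "default_experiment"
    training_config: TrainingConfig = TrainingConfig()


def recipe_8m(dest: str) -> None:
    """A 'recipe': instantiate reasonable defaults and serialize them to JSON."""
    with open(dest, "w") as f:
        f.write(MainConfig().model_dump_json(indent=2))


def load_config(path: str) -> MainConfig:
    """Training entrypoints re-validate the (possibly hand-edited) JSON before running."""
    with open(path) as f:
        return MainConfig.model_validate_json(f.read())


if __name__ == "__main__":
    recipe_8m("my_config.json")             # write defaults to disk
    config = load_config("my_config.json")  # edit the JSON by hand, then validate and train
```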

The recipe and training workflows are packaged as executables when esm2 or geneformer is installed with pip. The commands appear as:

```bash
bionemo-geneformer-recipe
bionemo-esm2-recipe
bionemo-geneformer-train
bionemo-esm2-train
```

## Models
### ESM-2
@@ -198,6 +210,62 @@ python \
--restore-from-checkpoint-path ${ESM2_650M_CKPT}
```

##### Running with Pydantic configs

Alternatively, we provide a validated and serialized configuration file entrypoint for executing the same workflow. Recipes
are available for 8m, 650m, and 3b ESM2 models. You may select which preset config to use by setting the `--recipe` parameter.

```bash
# Note: the fastest Transformer Engine environment variables in our testing (NVTE_FUSED_ATTN=1, NVTE_FLASH_ATTN=0) are set in the training step below
TEST_DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source $MY_DATA_SOURCE); \
bionemo-esm2-recipe \
--train-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet \
--train-database-path ${TEST_DATA_DIR}/2024_03_sanity/train_sanity.db \
--valid-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/valid_clusters.parquet \
--valid-database-path ${TEST_DATA_DIR}/2024_03_sanity/validation.db \
--result-dir ./results \
--dest my_config.json \
--recipe 8m
```

> ⚠️ **IMPORTANT:** Inspect and edit the contents of the generated `my_config.json` as you see fit.
>
> NOTE: To pretrain from an existing checkpoint, pass the checkpoint path to the recipe command via `--initial-ckpt-path`. This populates the JSON with the correct field so that pretraining is initialized from that checkpoint.

To submit a training job with the resulting config, first update the JSON file with any additional execution parameters
of your choosing: number of devices, workers, steps, etc. Then invoke our training entrypoint, which needs
three things:

- A configuration file: the JSON produced by the previous step.
- A model config type: in this case, the pretraining config. This validates the arguments in the config JSON against
  those required for pretraining. Alternatively, fine-tuning configs with custom task heads may be specified here,
  which allows mixing and matching data modules with various tasks.
- A data config type: this specifies how to parse, validate, and prepare the DataModule. It may change depending on the
  task; for example, pretraining ESM-2 uses a protein-cluster-oriented sampling method, while for inference or
  fine-tuning a pretrained model a simple FASTA file may be sufficient. There is a one-to-one relationship between
  DataConfig types and DataModule types (sketched below).
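
A rough picture of that one-to-one pairing (the class and method names below, such as `construct_data_module`, are illustrative assumptions rather than the actual `bionemo` classes):

```python
# Illustrative sketch of the one-to-one DataConfig <-> DataModule pairing.
# Names, fields, and methods are assumptions for illustration, not the real bionemo classes.
from typing import Generic, TypeVar

from pydantic import BaseModel

DataModuleT = TypeVar("DataModuleT")


class ClusterDataModule:
    """Stand-in for a DataModule that samples by protein cluster (as in ESM-2 pretraining)."""

    def __init__(self, train_cluster_path: str, train_database_path: str):
        self.train_cluster_path = train_cluster_path
        self.train_database_path = train_database_path


class DataConfig(BaseModel, Generic[DataModuleT]):
    """Base config: validates its fields and builds exactly one DataModule type."""

    def construct_data_module(self) -> DataModuleT:
        raise NotImplementedError


class ClusterDataConfig(DataConfig[ClusterDataModule]):
    train_cluster_path: str
    train_database_path: str

    def construct_data_module(self) -> ClusterDataModule:
        # The validated fields map directly onto the paired DataModule's constructor.
        return ClusterDataModule(self.train_cluster_path, self.train_database_path)
```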

> ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config JSON and populate it with your WandB details.

```bash
export NVTE_FUSED_ATTN=1
export NVTE_FLASH_ATTN=0
bionemo-esm2-train \
--data-config-t bionemo.esm2.run.config_models.ESM2DataConfig \
--model-config-t bionemo.esm2.run.config_models.ExposedESM2PretrainConfig \
--config my_config.json
```

> NOTE: both `--data-config-t` and `--model-config-t` have default values corresponding to `ESM2DataConfig` and `ExposedESM2PretrainConfig`.

The data config and model config types may also refer to types defined locally by the user. As long as Python can import
the specified dotted path, it may be used. For example, you may have a custom Dataset/DataModule that you would like to
mix with an existing recipe: define a DataConfig class with the generic parameter set to your DataModule type, and then
pass that config type to the training entrypoint.
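
Because resolution happens through a plain dotted import path, a user-local config type works the same way as a shipped one. A minimal sketch of that mechanism (the resolver below is an assumption for illustration, not the actual entrypoint code):

```python
# Minimal sketch of dotted-path resolution; an assumed mechanism for illustration,
# not the actual bionemo-esm2-train implementation.
import importlib


def resolve_config_type(dotted_path: str) -> type:
    """Turn 'my_package.my_configs.MyDataConfig' into the class it names."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)  # works for bionemo.* and user packages alike
    return getattr(module, class_name)


# Usage: anything importable on PYTHONPATH can be passed to --data-config-t / --model-config-t,
# e.g. resolve_config_type("bionemo.esm2.run.config_models.ESM2DataConfig")
# or   resolve_config_type("my_project.configs.MyCustomDataConfig")  # hypothetical user type
```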



### Geneformer
#### Running

@@ -221,7 +289,7 @@ train_geneformer \
--micro-batch-size 2
```

To fine-tune, you need to specify a different combination of model and loss. Pass the checkpoint path output by the previous step as `--restore-from-checkpoint-path`, and also change
`--training-model-config-class` to the newly created model-config-class.

While no CLI option currently exists to hot-swap in different data modules and processing functions, you could
@@ -247,6 +315,54 @@ train_geneformer \
--restore-from-checkpoint-path results/test_experiment/dev/checkpoints/test_experiment--val_loss=4.3506-epoch=1-last
```

##### Running with Pydantic configs
Alternatively, we provide a validated and serialized configuration file entrypoint for executing the same workflow. Recipes
are available for the 10m and 106m Geneformer models. Additionally, we provide an example fine-tuning recipe where the
objective is to 'regress' on token IDs rather than use the traditional masked-language-model approach. In practice, you
will likely need to implement your own DataModule, DataConfig, and fine-tuning model; you can use the same overall
approach, but with customizations for your task.
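
As a toy illustration of what 'regressing on token IDs' means, here is a generic PyTorch sketch of such an objective (this is not the actual BioNeMo fine-tuning head or example recipe code):

```python
# Toy sketch of a regression objective on token IDs; illustrative only, not the
# actual BioNeMo fine-tuning head or loss implementation.
import torch
import torch.nn as nn


class TokenRegressionHead(nn.Module):
    """Projects per-token hidden states to a single scalar per token."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size] -> [batch, seq_len]
        return self.proj(hidden_states).squeeze(-1)


# Instead of a cross-entropy masked-LM loss, the example objective is (roughly)
# an MSE between per-token predictions and the token IDs themselves.
hidden = torch.randn(2, 8, 256)            # stand-in for encoder outputs
token_ids = torch.randint(0, 100, (2, 8))  # stand-in for input token IDs
head = TokenRegressionHead(hidden_size=256)
loss = nn.functional.mse_loss(head(hidden), token_ids.float())
```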


```bash
TEST_DATA_DIR=$(download_bionemo_data single_cell/testdata-20240506 --source $MY_DATA_SOURCE); \
bionemo-geneformer-recipe \
--recipe 10m-pretrain \
--dest my_config.json \
--data-path ${TEST_DATA_DIR}/cellxgene_2023-12-15_small/processed_data \
--result-dir ./results
```
> ⚠️ **IMPORTANT:** Inspect and edit the contents of the generated `my_config.json` as you see fit.
>
> NOTE: To pretrain from an existing checkpoint, pass the checkpoint path to the recipe command via `--initial-ckpt-path`. This populates the JSON with the correct field so that pretraining is initialized from that checkpoint.

To submit a training job with the resulting config, first update the JSON file with any additional execution parameters
of your choosing: number of devices, workers, steps, etc. Then invoke our training entrypoint, which needs
three things:

- A configuration file: the JSON produced by the previous step.
- A model config type: in this case, the pretraining config. This validates the arguments in the config JSON against
  those required for pretraining. Alternatively, fine-tuning configs with custom task heads may be specified here,
  which allows mixing and matching data modules with various tasks.
- A data config type: this specifies how to parse, validate, and prepare the DataModule. It may change depending on the
  task; for example, while fine-tuning you may want to use a custom Dataset/DataModule that includes PERTURB-seq data. In
  that case, the default pretraining DataConfig and DataModule will be insufficient. See ESM-2 for additional example use cases.

> ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config JSON and populate it with your WandB details.

```bash
bionemo-geneformer-train \
--data-config-t bionemo.geneformer.run.config_models.GeneformerPretrainingDataConfig \
--model-config-t bionemo.geneformer.run.config_models.ExposedGeneformerPretrainConfig \
--config my_config.json
```

> NOTE: both `--data-config-t` and `--model-config-t` have default values corresponding to `GeneformerPretrainingDataConfig` and `ExposedGeneformerPretrainConfig`.

The data config and model config types may also refer to types defined locally by the user. As long as Python can import
the specified dotted path, it may be used. For example, you may have a custom Dataset/DataModule that you would like to
mix with an existing recipe: define a DataConfig class with the generic parameter set to your DataModule type, and then
pass that config type to the training entrypoint.



## Updating License Header on Python Files
12 changes: 6 additions & 6 deletions scripts/protein/esm2/esm2_pretrain.py
@@ -30,12 +30,12 @@
from bionemo.esm2.data.datamodule import ESMDataModule
from bionemo.esm2.data.dataset import RandomMaskStrategy
from bionemo.esm2.data.tokenizer import get_tokenizer
from bionemo.esm2.model.lr_scheduler import WarmupAnnealDecayHoldScheduler
from bionemo.llm.lightning import PerplexityLoggingCallback
from bionemo.llm.model.biobert.lightning import biobert_lightning_module
from bionemo.llm.model.biobert.model import BiobertSpecOption
from bionemo.llm.model.lr_scheduler import WarmupAnnealDecayHoldScheduler
from bionemo.llm.utils.datamodule_utils import float_or_int_or_none, infer_global_batch_size
from bionemo.llm.utils.logger_utils import WandbLoggerOptions, setup_nemo_lightning_logger
from bionemo.llm.utils.logger_utils import WandbConfig, setup_nemo_lightning_logger


__all__: Sequence[str] = ("main", "parser")
@@ -147,10 +147,10 @@ def main(

# for wandb integration
# Please refer to https://pytorch-lightning.readthedocs.io/en/0.7.6/api/pytorch_lightning.loggers.html"
wandb_options: Optional[WandbLoggerOptions] = (
wandb_config: Optional[WandbConfig] = (
None
if wandb_project is None
else WandbLoggerOptions(
else WandbConfig(
offline=wandb_offline,
project=wandb_project,
entity=wandb_entity,
@@ -203,8 +203,8 @@ def main(
max_seq_length=max_seq_length,
num_workers=num_dataset_workers,
random_mask_strategy=random_mask_strategy,
tokenizer=tokenizer,
)

# Configure the model
esm2_config = ESM2Config(
seq_length=max_seq_length,
@@ -254,7 +254,7 @@ def main(
root_dir=result_dir,
name=experiment_name,
initialize_tensorboard_logger=create_tensorboard_logger,
wandb_kwargs=wandb_options,
wandb_config=wandb_config,
ckpt_callback=checkpoint_callback,
)

1 change: 1 addition & 0 deletions scripts/protein/esm2/test_esm2_pretrain.py
@@ -144,6 +144,7 @@ def test_main_runs(monkeypatch, tmpdir, dummy_protein_dataset, dummy_parquet_tra
def test_val_dataloader_in_main_runs_with_limit_val_batches(
monkeypatch, tmpdir, dummy_protein_dataset, dummy_parquet_train_val_inputs, limit_val_batches
):
# TODO: pydantic.
"""Ensures doesn't run out of validation samples whenever updating limit_val_batches logic.
Args:
97 changes: 97 additions & 0 deletions scripts/protein/esm2/test_pydantic_train.py
@@ -0,0 +1,97 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-Apache2
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import shlex
import subprocess
from pathlib import Path

import pytest
from lightning.fabric.plugins.environments.lightning import find_free_network_port

from bionemo.testing.data.esm2 import create_mock_parquet_train_val_inputs, create_mock_protein_dataset
from bionemo.testing.data.load import load


data_path: Path = load("single_cell/testdata-20240506") / "cellxgene_2023-12-15_small" / "processed_data"


def test_bionemo2_rootdir():
data_error_str = (
"Please download test data with:\n"
"`python scripts/download_artifacts.py --models all --model_dir ./models --data all --data_dir ./ --verbose --source pbss`"
)
assert data_path.exists(), f"Could not find test data directory.\n{data_error_str}"
assert data_path.is_dir(), f"Test data directory is supposed to be a directory.\n{data_error_str}"


@pytest.fixture
def dummy_protein_dataset(tmp_path):
"""Create a mock protein dataset."""
db_file = create_mock_protein_dataset(tmp_path)
return db_file


@pytest.fixture
def dummy_parquet_train_val_inputs(tmp_path):
"""Create a mock protein train and val cluster parquet."""
train_cluster_path, valid_cluster_path = create_mock_parquet_train_val_inputs(tmp_path)
return train_cluster_path, valid_cluster_path


def test_pretrain_pydantic_cli(dummy_protein_dataset, dummy_parquet_train_val_inputs, tmpdir):
result_dir = tmpdir.mkdir("results")
train_cluster_path, valid_cluster_path = dummy_parquet_train_val_inputs

open_port = find_free_network_port()
config = f"{result_dir}/test_config.json"

# Invoke with blocking
cmd_str = f"""bionemo-esm2-recipe --dest {config} --recipe test
--train-database-path {dummy_protein_dataset}
--train-cluster-path {train_cluster_path}
--valid-database-path {dummy_protein_dataset}
--valid-cluster-path {valid_cluster_path}
--result-dir {result_dir}""".strip()

# continue when finished
env = dict(**os.environ) # a local copy of the environment
env["MASTER_PORT"] = str(open_port)
cmd = shlex.split(cmd_str)
result = subprocess.run(
cmd,
cwd=tmpdir,
env=env,
capture_output=True,
)
# Now do pretrain
if result.returncode != 0:
raise Exception(f"Pretrain script failed:\n{cmd_str=}\n{result.stdout=}\n{result.stderr=}")

cmd_str = f"""bionemo-esm2-train --conf {config}""".strip()
env = dict(**os.environ) # a local copy of the environment
open_port = find_free_network_port()
env["MASTER_PORT"] = str(open_port)
cmd = shlex.split(cmd_str)
result = subprocess.run(
cmd,
cwd=tmpdir,
env=env,
capture_output=True,
)
if result.returncode != 0:
raise Exception(f"Pretrain script failed:\n{cmd_str=}\n{result.stdout=}\n{result.stderr=}")
# NOTE this looks a lot like a magic value. But we also could do json.loads(config)['experiment_config']['experiment_name']
assert (result_dir / "default_experiment").exists(), "Could not find test experiment directory."
18 changes: 17 additions & 1 deletion sub-packages/bionemo-core/src/bionemo/core/utils/dtypes.py
@@ -14,7 +14,7 @@
# limitations under the License.


from typing import Literal, Sequence
from typing import Dict, Literal, Sequence

import torch

@@ -24,7 +24,23 @@
"PrecisionTypes",
)


# NOTE(SKH) our precision types are a mess, but we inherit this problem from NeMo and Megatron.
PrecisionTypes = Literal["fp16", "bf16", "fp32", "bf16-mixed", "fp32-mixed", "16-mixed", "fp16-mixed", 16, 32]
precision_to_dtype: Dict[PrecisionTypes, torch.dtype] = {
"fp16": torch.float16,
"bf16": torch.bfloat16,
"fp32": torch.float32,
"16-mixed": torch.float16,
"fp16-mixed": torch.float16,
"bf16-mixed": torch.bfloat16,
"fp32-mixed": torch.float32,
16: torch.float16,
32: torch.float32,
}

# NOTE(SKH) these do not have a perfect 1-1 relationship, but we can use this to serialize/deserialize dtypes in ModelConfigs since it's ultimately converted with precision_to_dtype.
dtype_to_precision: Dict[torch.dtype, PrecisionTypes] = {v: k for k, v in precision_to_dtype.items()}


def get_autocast_dtype(precision: PrecisionTypes) -> torch.dtype:
4 changes: 4 additions & 0 deletions sub-packages/bionemo-esm2/pyproject.toml
@@ -21,6 +21,10 @@ dependencies = [
[tool.setuptools.package-data]
"bionemo.esm2" = ["data/tokenizer/*.json", "data/tokenizer/*.txt"]

[project.scripts]
bionemo-esm2-train = "bionemo.esm2.run.main:main"
bionemo-esm2-recipe = "bionemo.esm2.run.recipes:main"

[tool.setuptools.packages.find]
where = ["src"]
include = ["bionemo.*"]
5 changes: 5 additions & 0 deletions sub-packages/bionemo-esm2/src/bionemo/esm2/data/datamodule.py
@@ -108,6 +108,11 @@ def __init__(
rampup_batch_size=rampup_batch_size,
)

@property
def tokenizer(self) -> tokenizer.BioNeMoESMTokenizer:
"""Returns the tokenizer."""
return self._tokenizer

def setup(self, stage: str = "") -> None:
"""Setup the ESMDataModule.
@@ -24,6 +24,7 @@
from bionemo.esm2.data.tokenizer import BioNeMoESMTokenizer, get_tokenizer
from bionemo.esm2.model.finetune.datamodule import ESM2FineTuneDataModule
from bionemo.esm2.model.finetune.finetune_regressor import ESM2FineTuneSeqConfig, InMemorySingleValueDataset
from bionemo.llm.lightning import batch_collator
from bionemo.llm.model.biobert.lightning import biobert_lightning_module


@@ -57,7 +58,7 @@ def infer_model(
plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
)
module = biobert_lightning_module(config=config, tokenizer=tokenizer)
results = trainer.predict(module, datamodule=data_module)
results = batch_collator(trainer.predict(module, datamodule=data_module))

return results

1 change: 1 addition & 0 deletions sub-packages/bionemo-esm2/src/bionemo/esm2/model/model.py
@@ -323,6 +323,7 @@ class ESM2GenericConfig(BioBertConfig[ESM2ModelT, MegatronLossType]):
return_only_hidden_states: bool = False # return logits

def __post_init__(self):
# TODO, as a validator?
"""Check compatibility between biobert_spec_option and apply_query_key_layer_scaling post initialization."""
super().__post_init__()
if self.biobert_spec_option == BiobertSpecOption.esm2_bert_layer_with_transformer_engine_spec:
14 changes: 14 additions & 0 deletions sub-packages/bionemo-esm2/src/bionemo/esm2/run/__init__.py
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-Apache2
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
