Fully Integrate SCDL into Geneformer (#480)
## Summary
In this PR we refactor the Geneformer `SingleCellDataset` class to integrate the `SingleCellMemMapDataset` (SCDL). The goal is to streamline the dataset class and make it more readable.

## Details
We make the following changes:
- Input Format:
  - `SingleCellDataset` now assumes that the input path to the data is a directory in the `SingleCellMemmap` format.
  - `SingleCellDataModule` now assumes that the train, val, and test input paths are directories in the `SingleCellMemmap` format.
- Get Item:
  - `_get_item()` now leverages the `get_row` function from SCDL, which eliminates the need to store and parse information in `metadata.json`.
- Error Handling for Genes not in the Tokenizer Vocabulary:
  - We add an optional parameter, `bypass_tokenizer_vocab`, to `SingleCellDataset` and `SingleCellDataModule`. It defaults to `False`, so by default we throw an error if a gene ID is not in the tokenizer vocabulary. Users who want to bypass this check can set `bypass_tokenizer_vocab` to `True`.
- Error Handling for Genes with Zero Expression Values:
  - We throw an invalid input error when a cell has no gene expression values (i.e. `sc_dataset.scdl.get_item()` returns `[]` for the gene data value).

## Usage
The main change from a user perspective is that single cell h5ad files (or directories of h5ad files) must be converted to the SingleCellMemmap format.

1) For a single h5ad file, e.g. `data.h5ad`, run the following, where `output_path` is the path the SingleCellMemmap directory should be written to:
`SingleCellMemMapDataset(output_path, data.h5ad)`
2) For a directory of h5ad files, run the `convert_h5ad_to_scdl` script (more information is available in the SCDL ReadMe).

## Testing
We test that the updated `SingleCellDataset` produces the same output as the old dataset on synthetic samples and on samples from the cellxsmall dataset. We also test for Megatron compatibility (as this dataset uses the MultiEpochDatasampler / Epoch Index) and for correct error handling of the cases above. Tests for these changes can be run via:

```shell
pytest -v sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_dataset.py
```

Note that we have also updated the following test files to use the MemMap dataset format and to set `bypass_tokenizer_vocab=True`. The cellxsmall dataset contains a few genes that are not in the HuggingFace tokenizer vocabulary, so these tests would error otherwise:
- `sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py`
- `scripts/singlecell/geneformer/test_train.py`
- `sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_stop_and_go.py`

---------

Signed-off-by: savitha-eng <[email protected]>
Signed-off-by: polinabinder1 <[email protected]>
Co-authored-by: Savitha Srinivasan <[email protected]>
Co-authored-by: polinabinder1 <[email protected]>
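
The two error-handling rules described in the commit message (unknown gene IDs and cells with no expression values) can be illustrated with a small standalone sketch. This is not the code from the PR: `tokenize_gene_ids` and the toy `vocab` dictionary are hypothetical names, and the choice to silently skip unknown genes when `bypass_tokenizer_vocab=True` is an assumption about what "bypass" means here.

```python
from typing import Dict, List, Sequence


def tokenize_gene_ids(
    gene_ids: Sequence[str],
    vocab: Dict[str, int],
    bypass_tokenizer_vocab: bool = False,
) -> List[int]:
    """Map gene IDs to token IDs.

    Hypothetical helper for illustration only, not the BioNeMo implementation.
    """
    if len(gene_ids) == 0:
        # Mirrors the "zero expression values" case: an empty cell is invalid input.
        raise ValueError("Cell has no gene expression values.")

    token_ids: List[int] = []
    for gene_id in gene_ids:
        if gene_id in vocab:
            token_ids.append(vocab[gene_id])
        elif not bypass_tokenizer_vocab:
            # Default behaviour: an unknown gene ID is an error.
            raise ValueError(f"Gene ID {gene_id!r} is not in the tokenizer vocabulary.")
        # Assumption: with bypass_tokenizer_vocab=True, unknown genes are skipped.
    return token_ids


if __name__ == "__main__":
    toy_vocab = {"ENSG0001": 3, "ENSG0002": 4}  # made-up IDs and token values
    print(tokenize_gene_ids(["ENSG0001", "UNKNOWN"], toy_vocab, bypass_tokenizer_vocab=True))
```

Keeping the strict behaviour as the default means a vocabulary mismatch surfaces immediately, while the flag lets test data such as cellxsmall, which contains a few genes outside the tokenizer vocabulary, still be used.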
1 parent e9ed8cf, commit 30527b1

Showing 22 changed files with 1,073 additions and 729 deletions.

    @@ -2,7 +2,6 @@
    docs/site/
    *.nemo
    protein/
    singlecell/
    results/

    # Local configs

docs/docs/user-guide/examples/bionemo-geneformer/geneformer-celltype-classification.ipynb: 549 changes (278 additions, 271 deletions). Large diffs are not rendered by default.

    @@ -5,3 +5,11 @@
      sha256: 7a4237537bf535dfa00301ce8cc7073e0a23d5bc8aa902ad65db9f51b57a6df9 # pragma: allowlist secret
      owner: Polina Binder <[email protected]>
      description: Sample test data for SCDL.

    - tag: sample_scdl_feature_ids
      ngc: nvidia/clara/scdl_sample_test_feature_ids:1.0
      ngc_registry: resource
      pbss: s3://bionemo-ci/test-data/scdl_sample_test_feat_ids.tar.gz
      sha256: 9020ba336dbfe33bddadba26ca0cde49958cbd73c5ad44f0960a5a4837c9db26 # pragma: allowlist secret
      owner: Savitha Srinivasan <[email protected]>
      description: Sample test data for SCDL with feature IDs appended.


    @@ -21,3 +21,11 @@
      sha256: ab038b184de52e53ff7bcea5e01d97d55944c507db88c0495bdf9e5e9e0303a4 # pragma: allowlist secret
      owner: John St John <[email protected]>
      description: Golden values for geneformer QA model.

    - tag: testdata-20241203
      ngc: nvidia/clara/singlecell-testdata:2.0
      ngc_registry: resource
      pbss: "s3://bionemo-ci/test-data/singlecell/singlecell-scdltestdata-20241203.tar.gz"
      sha256: d8e3ea569bc43768c24aa651aff77722df202078415528497c22394046b08cc3 # pragma: allowlist secret
      owner: Savitha Srinivasan <[email protected]>
      description: Test data for single cell models in SCDL Memmap format.

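
Each resource entry in these YAML hunks carries a `sha256` field. Assuming that field is the SHA-256 digest of the archive named in `pbss` / `ngc` (the commit does not say how it is consumed), a downloaded file could be checked roughly as follows. The helper names `sha256_of_file` and `matches_expected` are hypothetical and are not part of BioNeMo.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are never loaded fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_sha256: str) -> bool:
    """Compare a file's digest with the value recorded in a resource entry."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A caller would pass the local path of the downloaded tarball together with the `sha256` string from the matching entry, and treat a `False` result as a corrupted or outdated download.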