diff --git a/sub-packages/bionemo-scdl/README.md b/sub-packages/bionemo-scdl/README.md
index e310f717a0..4adde79b5c 100644
--- a/sub-packages/bionemo-scdl/README.md
+++ b/sub-packages/bionemo-scdl/README.md
@@ -1,16 +1,17 @@
-# Bionemo-scdl: Single Cell Data Loading for Scalable Training of Single Cell Foundation Models.
+# BioNemo-SCDL: Single Cell Data Loading for Scalable Training of Single Cell Foundation Models.
 
 ## Package Overview
 
-Bionemo-scdl provides an independent pytorch-compatible dataset class for single cell data with a consistent API. Bionemo-scdl is developed and maintained by NVIDIA. This package can be run independently from bionemo. It improves upon simple AnnData-based dataset classes in the following ways:
+BioNeMo-SCDL provides an independent pytorch-compatible dataset class for single cell data with a consistent API. BioNeMo-SCDL is developed and maintained by NVIDIA. This package can be run independently from BioNeMo. It improves upon simple AnnData-based dataset classes in the following ways:
 
 - A consistent API across input formats that is promised to be consistent across package versions.
-- Improved performance when loading large datasets.
+- Improved performance when loading large datasets. It allows for loading and fast iteration of large datasets.
+- Ability to use datasets that are much, much larger than memory. This is because the datasets are stored in a numpy memory-mapped format.
+- Additionally, conversion of large (significantly larger than memory) AnnData files into the SCDL format.
 - [Future] Full support for ragged arrays (i.e., datasets with different feature counts; currently only a subset of the API functionality is supported for ragged arrays).
-- Ability to use datasets that are much, much larger than memory.
 - [Future] Support for improved compression.
 
-Bionemo-scdl's API resembles that of AnnData, so code changes are minimal.
+BioNeMo-SCDL's API resembles that of AnnData, so code changes are minimal.
 In most places a simple swap from an attribute to a function is sufficient (i.e., swapping `data.n_obs` for `data.number_of_rows()`).
 
 ## Installation
@@ -35,9 +36,14 @@ from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset
 
 data = SingleCellMemMapDataset("97e_scmm", "hdf5s/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad")
 ```
-This creates a SingleCellMemMapDataset that is stored at 97e_scmm in large, memory-mapped arrays
+
+This creates a `SingleCellMemMapDataset` that is stored at 97e_scmm in large, memory-mapped arrays
 that enables fast access of datasets larger than the available amount of RAM on a system.
 
+If the dataset is large, the AnnData file can be lazy-loaded and then read in based on chunks of rows in a paginated manner. This can be done by setting the parameters when instantiating the `SingleCellMemMapDataset`:
+- `paginated_load_cutoff`, which sets the minimal file size in megabytes at which an AnnData file will be read in in a paginated manner.
+- `load_block_row_size`, which is the number of rows that are read into memory at a given time.
+
 ### Interrogating single cell datasets and exploring the API
 
 ```python
@@ -63,7 +69,7 @@ data structures are stored. However, these structures are not guaranteed to be
 in a valid serialized state during runtime.
 
 Calling the `save` method guarantees the on-disk object is in a valid serialized
-state, at which point the current python process can exit and the object can be
+state, at which point the current python process can exit, and the object can be
 loaded by another process later.
 
 ```python
@@ -86,7 +92,7 @@ reloaded_data = SingleCellMemMapDataset("97e_scmm")
 SCDL implements the required functions of the PyTorch Dataset abstract class.
 You can use PyTorch-compatible DataLoaders to load batches of data from a SCDL class.
 With a batch size of 1 this can be run without a collating function. With a batch size
-greater than 1, there is a collation function (collate_sparse_matrix_batch), that will
+greater than 1, there is a collation function (`collate_sparse_matrix_batch`), that will
 collate several sparse arrays into the CSR (Compressed Sparse Row) torch tensor format.
 
 ```python
@@ -109,16 +115,16 @@ The examples directory contains various examples for utilizing SCDL.
 
 ### Converting existing Cell x Gene data to SCDL
 
-To convert existing AnnData files from CellxGene, you can either write your own
-script using the SCDL API or utilize the convenience script `convert_h5ad_to_scdl`.
+If there are multiple AnnData files, they can be converted into a single `SingleCellMemMapDataset`. If the hdf5 directory has one or more AnnData files, the `SingleCellCollection` class crawls the filesystem to recursively find AnnData files (with the h5ad extension).
 
-This script crawls the filesystem to recursively find AnnData files (with the h5ad extension) and converts them to a single SingleCellMemMapDataset. Here's an example:
+To convert existing AnnData files, you can either write your own script using the SCDL API or utilize the convenience script `convert_h5ad_to_scdl`.
+
+Here's an example:
 
 ```bash
 convert_h5ad_to_scdl --data-path hdf5s --save-path example_dataset
 ```
-
 
 ## Future Work and Roadmap
 
 SCDL is currently in public beta. In the future, expect improvements in data compression
@@ -126,4 +132,4 @@ and data loading performance.
 
 ## LICENSE
 
-Bionemo-scdl has an Apache 2.0 license, as found in the LICENSE file.
+BioNemo-SCDL has an Apache 2.0 license, as found in the LICENSE file.