Polinabinder/scdl document (#390)
I'm adding a few changes to the SCDL documentation to make it a bit clearer. I am also adding an explanation of the paginated loading.

---------

Signed-off-by: polinabinder1 <[email protected]>
polinabinder1 authored Nov 7, 2024
1 parent 0bbea8e commit 51104b8
Showing 1 changed file with 19 additions and 13 deletions: sub-packages/bionemo-scdl/README.md

# BioNeMo-SCDL: Single Cell Data Loading for Scalable Training of Single Cell Foundation Models.

## Package Overview

BioNeMo-SCDL provides an independent, PyTorch-compatible dataset class for single cell data with a consistent API. BioNeMo-SCDL is developed and maintained by NVIDIA. This package can be run independently from BioNeMo. It improves upon simple AnnData-based dataset classes in the following ways:

- A consistent API across input formats that is guaranteed to remain stable across package versions.
- Improved performance when loading and iterating over large datasets.
- Ability to use datasets that are much, much larger than memory, because the data are stored in a NumPy memory-mapped format.
- Conversion of large (significantly larger than memory) AnnData files into the SCDL format.
- [Future] Full support for ragged arrays (i.e., datasets with different feature counts; currently only a subset of the API functionality is supported for ragged arrays).
- [Future] Support for improved compression.

BioNeMo-SCDL's API resembles that of AnnData, so code changes are minimal.
In most places a simple swap from an attribute to a function is sufficient (e.g., swapping `data.n_obs` for `data.number_of_rows()`).
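
A minimal before/after illustration of that swap; `adata` here is an ordinary AnnData object loaded for comparison, and `data` is the SCDL dataset created below:

```python
import anndata

adata = anndata.read_h5ad("hdf5s/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad")
print(adata.n_obs)            # AnnData: attribute access

print(data.number_of_rows())  # SCDL: the equivalent method call
```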

## Installation
```python
from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset
data = SingleCellMemMapDataset("97e_scmm", "hdf5s/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad")

```
This creates a `SingleCellMemMapDataset` that is stored at `97e_scmm` in large, memory-mapped arrays,
which enable fast access to datasets larger than the available amount of RAM on the system.

If the dataset is large, the AnnData file can be lazy-loaded and then read in chunks of rows in a paginated manner. Enable this by setting two parameters when instantiating the `SingleCellMemMapDataset` (see the sketch after this list):
- `paginated_load_cutoff`, which sets the minimum file size, in megabytes, at which an AnnData file is read in a paginated manner.
- `load_block_row_size`, which is the number of rows that are read into memory at a time.
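
A sketch of a paginated load; the cutoff and block-size values below are illustrative, not documented defaults:

```python
data = SingleCellMemMapDataset(
    "97e_scmm",
    "hdf5s/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad",
    paginated_load_cutoff=1000,     # illustrative: files over 1000 MB are lazy-loaded
    load_block_row_size=1_000_000,  # illustrative: rows read into memory per block
)
```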

### Interrogating single cell datasets and exploring the API

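A minimal sketch of inspecting a loaded dataset; `number_of_rows()` is referenced earlier in this README, while the other accessor names are assumptions:

```python
print(data.number_of_rows())       # number of cells (rows)
print(data.number_of_variables())  # assumed accessor: features per row
print(data.number_of_values())     # assumed accessor: total stored values
```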

When you open a SCDL dataset, the backing
data structures are stored on disk. However, these structures are not guaranteed
to be in a valid serialized state during runtime.

Calling the `save` method guarantees the on-disk object is in a valid serialized
state, at which point the current Python process can exit, and the object can be
loaded by another process later.

```python
data.save()

# The current process can now exit; a later process can reload the saved
# object from its directory:
reloaded_data = SingleCellMemMapDataset("97e_scmm")
```
SCDL implements the required functions of the PyTorch Dataset abstract class.
You can use PyTorch-compatible DataLoaders to load batches of data from an SCDL class.
With a batch size of 1, this can be run without a collating function. With a batch size
greater than 1, the provided collation function, `collate_sparse_matrix_batch`, will
collate several sparse arrays into the CSR (Compressed Sparse Row) torch tensor format.

```python
from torch.utils.data import DataLoader

# The import path for the collation function is an assumption here.
from bionemo.scdl.util.torch_dataloader_utils import collate_sparse_matrix_batch

dataloader = DataLoader(data, batch_size=8, shuffle=True,
                        collate_fn=collate_sparse_matrix_batch)
for batch in dataloader:
    pass  # each batch is a torch CSR tensor; pass it to your model here
```

The examples directory contains various examples for utilizing SCDL.

### Converting existing Cell x Gene data to SCDL

Multiple AnnData files can be combined into a single `SingleCellMemMapDataset`: given a directory of hdf5 files, the `SingleCellCollection` class crawls the filesystem to recursively find AnnData files (those with the `.h5ad` extension).

To convert existing AnnData files, you can either write your own script using the SCDL API or use the convenience script `convert_h5ad_to_scdl`. Here's an example:

```bash
convert_h5ad_to_scdl --data-path hdf5s --save-path example_dataset
```
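
For reference, a sketch of the API route using the `SingleCellCollection` class mentioned above; the import path and method names below are assumptions for illustration, not the documented interface:

```python
# All names below except SingleCellCollection itself are assumptions.
from bionemo.scdl.io.single_cell_collection import SingleCellCollection

coll = SingleCellCollection("temp_collection")  # working directory for the collection
coll.load_h5ad_multi("hdf5s/")                  # assumed: crawl and convert each .h5ad file
coll.flatten("example_dataset")                 # assumed: merge into one SingleCellMemMapDataset
```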


## Future Work and Roadmap

SCDL is currently in public beta. In the future, expect improvements in data compression
and data loading performance.

## LICENSE

BioNeMo-SCDL has an Apache 2.0 license, as found in the LICENSE file.
