Polinabinder/scdl document (#390)
I'm adding a few changes to the SCDL documentation to make it a bit clearer. I am also adding an explanation of the paginated loading.

---------

Signed-off-by: polinabinder1 <[email protected]>
polinabinder1 authored Nov 7, 2024
1 parent 0bbea8e commit 51104b8
Showing 1 changed file with 19 additions and 13 deletions: sub-packages/bionemo-scdl/README.md

# BioNeMo-SCDL: Single Cell Data Loading for Scalable Training of Single Cell Foundation Models.

## Package Overview

BioNeMo-SCDL provides an independent, PyTorch-compatible dataset class for single cell data with a consistent API. BioNeMo-SCDL is developed and maintained by NVIDIA. This package can be run independently from BioNeMo. It improves upon simple AnnData-based dataset classes in the following ways:

- A consistent API across input formats that is guaranteed to remain stable across package versions.
- Improved performance when loading and iterating over large datasets.
- Ability to use datasets that are much, much larger than memory, because the data are stored in a NumPy memory-mapped format.
- Conversion of large (significantly larger than memory) AnnData files into the SCDL format.
- [Future] Full support for ragged arrays (i.e., datasets with different feature counts; currently only a subset of the API functionality is supported for ragged arrays).
- [Future] Support for improved compression.

BioNeMo-SCDL's API resembles that of AnnData, so code changes are minimal.
In most places a simple swap from an attribute to a function is sufficient (e.g., swapping `data.n_obs` for `data.number_of_rows()`).
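
A minimal before/after illustration of that swap; `adata` here is an ordinary AnnData object loaded for comparison, and `data` is the SCDL dataset created below:

```python
import anndata

adata = anndata.read_h5ad("hdf5s/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad")
print(adata.n_obs)            # AnnData: attribute access

print(data.number_of_rows())  # SCDL: the equivalent method call
```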

## Installation
```python
from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset
data = SingleCellMemMapDataset("97e_scmm", "hdf5s/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad")

```
This creates a `SingleCellMemMapDataset` that is stored at `97e_scmm` in large, memory-mapped arrays,
which enable fast access to datasets larger than the available amount of RAM on the system.

If the dataset is large, the AnnData file can be lazy-loaded and then read in chunks of rows in a paginated manner. Enable this by setting two parameters when instantiating the `SingleCellMemMapDataset` (see the sketch after this list):
- `paginated_load_cutoff`, which sets the minimum file size, in megabytes, at which an AnnData file is read in a paginated manner.
- `load_block_row_size`, which is the number of rows that are read into memory at a time.
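
A sketch of a paginated load; the cutoff and block-size values below are illustrative, not documented defaults:

```python
data = SingleCellMemMapDataset(
    "97e_scmm",
    "hdf5s/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad",
    paginated_load_cutoff=1000,     # illustrative: files over 1000 MB are lazy-loaded
    load_block_row_size=1_000_000,  # illustrative: rows read into memory per block
)
```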

### Interrogating single cell datasets and exploring the API

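A minimal sketch of inspecting a loaded dataset; `number_of_rows()` is referenced earlier in this README, while the other accessor names are assumptions:

```python
print(data.number_of_rows())       # number of cells (rows)
print(data.number_of_variables())  # assumed accessor: features per row
print(data.number_of_values())     # assumed accessor: total stored values
```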

When you open a SCDL dataset, the backing
data structures are stored on disk. However, these structures are not guaranteed
to be in a valid serialized state during runtime.

Calling the `save` method guarantees the on-disk object is in a valid serialized
state, at which point the current Python process can exit, and the object can be
loaded by another process later.

```python
data.save()

# The current process can now exit; a later process can reload the saved
# object from its directory:
reloaded_data = SingleCellMemMapDataset("97e_scmm")
```
SCDL implements the required functions of the PyTorch Dataset abstract class.
You can use PyTorch-compatible DataLoaders to load batches of data from an SCDL class.
With a batch size of 1, this can be run without a collating function. With a batch size
greater than 1, the provided collation function, `collate_sparse_matrix_batch`, will
collate several sparse arrays into the CSR (Compressed Sparse Row) torch tensor format.

```python
from torch.utils.data import DataLoader

# The import path for the collation function is an assumption here.
from bionemo.scdl.util.torch_dataloader_utils import collate_sparse_matrix_batch

dataloader = DataLoader(data, batch_size=8, shuffle=True,
                        collate_fn=collate_sparse_matrix_batch)
for batch in dataloader:
    pass  # each batch is a torch CSR tensor; pass it to your model here
```

The examples directory contains various examples for utilizing SCDL.

### Converting existing Cell x Gene data to SCDL

Multiple AnnData files can be combined into a single `SingleCellMemMapDataset`: given a directory of hdf5 files, the `SingleCellCollection` class crawls the filesystem to recursively find AnnData files (those with the `.h5ad` extension).

To convert existing AnnData files, you can either write your own script using the SCDL API or use the convenience script `convert_h5ad_to_scdl`. Here's an example:

```bash
convert_h5ad_to_scdl --data-path hdf5s --save-path example_dataset
```
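
For reference, a sketch of the API route using the `SingleCellCollection` class mentioned above; the import path and method names below are assumptions for illustration, not the documented interface:

```python
# All names below except SingleCellCollection itself are assumptions.
from bionemo.scdl.io.single_cell_collection import SingleCellCollection

coll = SingleCellCollection("temp_collection")  # working directory for the collection
coll.load_h5ad_multi("hdf5s/")                  # assumed: crawl and convert each .h5ad file
coll.flatten("example_dataset")                 # assumed: merge into one SingleCellMemMapDataset
```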


## Future Work and Roadmap

SCDL is currently in public beta. In the future, expect improvements in data compression
and data loading performance.

## LICENSE

BioNeMo-SCDL has an Apache 2.0 license, as found in the LICENSE file.
