add tutorial notebooks (PR 7 of N) #10

bkmartinjr · 2024-09-23T21:55:38Z

Add three notebooks (some are partial - functional but need additional verbiage before release).

tutorial_pytorch: demonstrate toy model using a DataPipe
tutorial_lightning: demonstrate toy model using Lightning
tutorial_multiworker: demo sharp edges with distributed/multi-worker usage (e.g., set_epoch)
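
As background for the `set_epoch` sharp edge mentioned above, here is a minimal pure-Python sketch of why an epoch number typically has to be shared across workers. It is an illustration under stated assumptions, not the `tiledbsoma_ml` implementation: `worker_shard` and all of its parameters are hypothetical names, and the idea (derive one permutation from a seed plus the epoch, then give each worker a disjoint slice) mirrors how PyTorch's `DistributedSampler` uses `set_epoch`.

```python
import random
from typing import List


def worker_shard(n_rows: int, worker_id: int, n_workers: int,
                 seed: int, epoch: int) -> List[int]:
    """Return the row ids assigned to one worker for one epoch.

    Every worker derives the same permutation from (seed, epoch) and then
    takes a disjoint strided slice of it. Hypothetical helper for illustration.
    """
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(n_rows))
    rng.shuffle(order)
    return order[worker_id::n_workers]


# Three hypothetical workers sharing the same seed and epoch.
shards = [worker_shard(10, w, n_workers=3, seed=42, epoch=0) for w in range(3)]
covered = sorted(i for shard in shards for i in shard)
print(covered == list(range(10)))  # True
```

Because every worker computes the same permutation, the shards are disjoint and together cover every row exactly once. If the workers disagreed on the epoch (or never advanced it), they would either overlap or replay the same order every epoch.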

aaronwolen

The notebooks are really helpful and ran perfectly for me. I made a few minor comments/suggestions.

aaronwolen · 2024-09-27T18:58:48Z

notebooks/tutorial_pytorch.ipynb

+   "source": [
+    "### `ExperimentAxisQueryIterDataPipe` class explained\n",
+    "\n",
+    "This class provides an implementation of PyTorch's `torchdata` [IterDataPipe interface](https://pytorch.org/data/main/torchdata.datapipes.iter.html), which defines a common mechanism for wrapping and accessing training data from any underlying source. The `ExperimentAxisQueryIterDataPipe` class encapsulates the details of querying and retrieving Census data from a single SOMA `Experiment` and returning it to the caller a NumPy `ndarray` and a Pandas `DataFrame`. Most importantly, it retrieves the data lazily from the Census in batches, avoiding having to load the entire training dataset into memory at once.\n",


Suggested change

"This class provides an implementation of PyTorch's `torchdata` [IterDataPipe interface](https://pytorch.org/data/main/torchdata.datapipes.iter.html), which defines a common mechanism for wrapping and accessing training data from any underlying source. The `ExperimentAxisQueryIterDataPipe` class encapsulates the details of querying and retrieving Census data from a single SOMA `Experiment` and returning it to the caller a NumPy `ndarray` and a Pandas `DataFrame`. Most importantly, it retrieves the data lazily from the Census in batches, avoiding having to load the entire training dataset into memory at once.\n",

"This class provides an implementation of PyTorch's `torchdata` [IterDataPipe interface](https://pytorch.org/data/main/torchdata.datapipes.iter.html), which defines a common mechanism for wrapping and accessing training data from any underlying source. The `ExperimentAxisQueryIterDataPipe` class encapsulates the details of querying and retrieving data from a single SOMA `Experiment` and returning to the caller a NumPy `ndarray` and a Pandas `DataFrame`. Most importantly, it retrieves the data lazily and in batches, avoiding the need to load the entire training dataset into memory at once.\n",

I removed a couple references to the Census because the wording made it sound like this is specific to the Census.
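
For context on the "lazily and in batches" wording, a minimal pure-Python sketch of the general pattern follows. It is not the `tiledbsoma_ml` code: `iter_batches` is a hypothetical helper, and the `range` object stands in for a data source too large to materialize at once.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int = 1) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` rows, consuming `rows` lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(rows)
    while True:
        # Pull only the next batch from the source; nothing else is read yet.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for a large data source: a lazy range of row ids.
batches = list(iter_batches(range(10), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Only one batch is held in memory at a time while iterating, which is the property the notebook text is describing. The default of `batch_size=1` in this sketch yields single-row batches.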

aaronwolen · 2024-09-27T19:00:38Z

notebooks/tutorial_pytorch.ipynb

+    "\n",
+    "To retrieve a subset of the Experiment's data, along either the `obs` or `var` axes, you may specify query filters via the `obs_query` and `var_query` parameters, which are both `soma.AxisQuery` objects.\n",
+    "\n",
+    "The values for the prediction label(s) that you intend to use for training are specified via the `obs_column_names` array.\n",


Suggested change

"The values for the prediction label(s) that you intend to use for training are specified via the `obs_column_names` array.\n",

"The values for the prediction label(s) that you intend to use for training are specified via the `obs_column_names` (or `var_column_names`) array.\n",

aaronwolen · 2024-09-27T19:01:09Z

notebooks/tutorial_pytorch.ipynb

+    "\n",
+    "The values for the prediction label(s) that you intend to use for training are specified via the `obs_column_names` array.\n",
+    "\n",
+    "The `batch_size` allows you to specify the number of obs rows (cells) to be returned by each return PyTorch tensor. You may exclude this parameter if you want single rows (`batch_size=1`).\n",


Suggested change

"The `batch_size` allows you to specify the number of obs rows (cells) to be returned by each return PyTorch tensor. You may exclude this parameter if you want single rows (`batch_size=1`).\n",

"The `batch_size` parameter allows you to specify the number of `obs` rows (i.e., cells) returned in each PyTorch tensor. You may omit this parameter if you want single rows (`batch_size=1`).\n",

aaronwolen · 2024-09-27T19:06:24Z

notebooks/tutorial_pytorch.ipynb

+    "\n",
+    "The `shuffle` flag allows you to randomize the ordering of the training data for each training epoch. Note:\n",
+    "* You should use this flag instead of the `DataLoader` `shuffle` flag, primarily for performance reasons.\n",
+    "* PyTorch's TorchData library provides a [Shuffler](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Shuffler.html) `DataPipe`, which is alternate mechanism one can use to perform shuffling of an `IterableDataset`. However, the `Shuffler` will not \"globally\" randomize the training data, as it only \"locally\" randomizes the ordering of the training data within fixed-size \"windows\". Due to the layout of Census data, a given \"window\" of Census data may be highly homogeneous in terms of its `obs` axis attribute values, and so this shuffling strategy may not provide sufficient randomization for certain types of models."


Suggested change

"* PyTorch's TorchData library provides a [Shuffler](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Shuffler.html) `DataPipe`, which is alternate mechanism one can use to perform shuffling of an `IterableDataset`. However, the `Shuffler` will not \"globally\" randomize the training data, as it only \"locally\" randomizes the ordering of the training data within fixed-size \"windows\". Due to the layout of Census data, a given \"window\" of Census data may be highly homogeneous in terms of its `obs` axis attribute values, and so this shuffling strategy may not provide sufficient randomization for certain types of models."

"* PyTorch's TorchData library provides a [Shuffler](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Shuffler.html) `DataPipe`, which is an alternate mechanism one can use to perform shuffling of an `IterableDataset`. However, the `Shuffler` will not \"globally\" randomize the training data, as it only \"locally\" randomizes the ordering of the training data within fixed-size \"windows\". This is problematic for atlas-style datasets such as Census, where a given \"window\" of data may be highly homogeneous in terms of its `obs` axis attribute values, and so this shuffling strategy may not provide sufficient randomization for certain types of models."

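To make the "local" versus "global" shuffling point concrete, here is a small pure-Python sketch. It is a simplification for illustration only: `window_shuffle` and `global_shuffle` are hypothetical helpers, the labels are made up, and the fixed-window model follows the notebook's description rather than the exact buffering behavior of TorchData's `Shuffler`.

```python
import random
from typing import List, Sequence


def window_shuffle(items: Sequence[str], window: int, seed: int = 0) -> List[str]:
    """Shuffle only within consecutive fixed-size windows (a "local" shuffle)."""
    rng = random.Random(seed)
    out: List[str] = []
    for start in range(0, len(items), window):
        chunk = list(items[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


def global_shuffle(items: Sequence[str], seed: int = 0) -> List[str]:
    """Shuffle across the whole sequence (a "global" shuffle)."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out


# Hypothetical obs labels stored in contiguous runs, as in a sorted atlas.
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

local = window_shuffle(labels, window=4)
# Each window still holds a single label, so a batch of 4 stays homogeneous.
windows = [set(local[i:i + 4]) for i in range(0, len(local), 4)]
print(windows)  # [{'A'}, {'B'}, {'C'}]
print(global_shuffle(labels))
```

When the window is no larger than a run of identical labels, the local shuffle changes nothing a model can see, whereas the global shuffle is free to mix labels across the whole dataset.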
aaronwolen · 2024-09-27T19:07:51Z

notebooks/tutorial_lightning.ipynb

+    "\n",
+    "**Prerequesites**\n",
+    "\n",
+    "Install `tiledbsoma_ml` and `scikit-learn`, for example:\n",


Does pytorch_lightning need to be listed here too?

aaronwolen · 2024-09-27T19:13:45Z

notebooks/tutorial_multiworker.ipynb

+    "# Multi-process training\n",
+    "\n",
+    "Multi-process usage of `tiledbsoma_ml.ExperimentAxisQueryIterDataset` includes both:\n",
+    "* using the `torch.utils.data.DataLoader` with 1 or more worker (ie., with an argument of `n_workers=1` or greater)\n",


Suggested change

"* using the `torch.utils.data.DataLoader` with 1 or more worker (ie., with an argument of `n_workers=1` or greater)\n",

"* using the `torch.utils.data.DataLoader` with 1 or more workers (i.e., with an argument of `num_workers=1` or greater)\n",

bkmartinjr requested review from ryan-williams, aaronwolen and johnkerl September 23, 2024 21:55

bkmartinjr mentioned this pull request Sep 24, 2024

Initial work toward PyTorch data loaders #1

Draft

ryan-williams force-pushed the bkmartinjr/add-shuffling branch from 33f2c06 to ccf373d Compare September 25, 2024 16:35

ryan-williams force-pushed the bkmartinjr/add-notebooks branch 2 times, most recently from 9334e02 to a8c16c6 Compare September 25, 2024 16:40

aaronwolen approved these changes Sep 27, 2024


add tutorial notebooks

74b77eb

ryan-williams force-pushed the bkmartinjr/add-shuffling branch from ccf373d to 1cc3670 Compare October 3, 2024 21:40

ryan-williams force-pushed the bkmartinjr/add-notebooks branch from ee453d4 to 74b77eb Compare October 3, 2024 21:40
