Use .sizes instead of .dims for xr.Dataset/xr.DataArray compatibility #71


Merged: 11 commits merged into xarray-contrib:main on Aug 19, 2022

Conversation

@weiji14 (Member) commented Jul 10, 2022

Removes the need for using `_as_xarray_dataset`, so xr.DataArray inputs are preserved as xr.DataArray objects in the returned output.

Does so by changing `.dims` (which returns a tuple of dimension names on xr.DataArray but a mapping of dimension names to lengths on xr.Dataset) to `.sizes`, which returns the name-to-length mapping for both types and has been available since xarray v0.9.0 (25 January 2017); see also pydata/xarray#921.
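
A minimal illustration of the difference (not part of this PR's diff, just for context):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), name="foo")
ds = da.to_dataset()

print(da.dims)   # ('x', 'y') -- a tuple of dimension names
print(ds.dims)   # a frozen mapping {'x': 2, 'y': 3} of names to lengths
print(da.sizes)  # the same frozen mapping {'x': 2, 'y': 3}, on both types
print(ds.sizes)  # {'x': 2, 'y': 3}
```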

Note that this change is backward incompatible, so an xbatcher v0.2.0 release might be needed. Happy to help out with setting up some of the release infrastructure if needed, e.g. if there's time during the SciPy sprint 😄

Fixes #70

weiji14 added 2 commits July 9, 2022 22:20
weiji14 added commits to weiji14/zen3geo that referenced this pull request Jul 16, 2022:

* 🚧 Walkthrough on creating batches of data

  Initial draft tutorial on creating batches of chipped data from full-size satellite scenes! Will be working with Sentinel-1 GRD GeoTIFFs, let's see how far this will go.

* 💡 Demo XbatcherSlicer to get 512x512 chips from larger scene

  Walkthrough of how to cut up a large satellite scene into multiple smaller chips of size 512 pixels by 512 pixels. Heavy lifting done by xbatcher, which handles slicing along dimensions and overlapping strides. Needed a hacky workaround in XbatcherSlicer to fix a ValueError due to the xarray.DataArray name not being set (though it should be).

* 💚 Install xbatcher for documentation build

  Fix readthedocs build failure because xbatcher was not installed.

* 🗃️ Collate chips into mini-batches

  Finalize tutorial by converting chips from xarray.Dataset to torch.Tensor and stacking them per mini-batch! Debated whether to have the xarray collate function in the codebase, but let's wait for updates on xbatcher's end (xarray-contrib/xbatcher#71). Also renamed the tutorial file from batching to chipping and added more emojis to the intro section.
@maxrjones (Member)

Thanks @weiji14! I agree that returning xarray.DataArray makes sense for that case.

With this change, https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/generators.py#L10-L15 is no longer used and could be removed to maintain code coverage.

The pytorch data loader expects an xarray.Dataset with hardcoded variable names (https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/loaders/torch.py#L55-L58), which causes the tests to fail in this PR. Let me know if you'd like to work on fixing that. If not, two options would be for me to submit a PR with suggested changes to this branch, or to reduce the scope of this PR to not remove the coercion to dataset while still updating .dims to .sizes.

As for release infrastructure, I'll get started on an automated changelog. Issues/PRs for other things you notice missing would be much appreciated 😃

@weiji14 (Member, Author) commented Jul 28, 2022

> https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/generators.py#L10-L15
> is no longer used and could be removed to maintain code coverage.

Good point, done in 50f8b7d

> The pytorch data loader expects an xarray.Dataset with hardcoded variable names
>
> https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/loaders/torch.py#L55-L58
>
> which causes the tests to fail in this PR. Let me know if you'd like to work on fixing that. If not, two options would be for me to submit a PR with suggested changes to this branch or to reduce the scope of this PR to not remove the coercion to dataset, while still updating .dims to .sizes.

Yeah, I noticed the hardcoded values and the reliance on the dataset requiring a name. Needed a workaround at https://github.com/weiji14/zen3geo/blob/71e886d95454de70651cc31ea6dedc33e929145c/zen3geo/datapipes/xbatcher.py#L97-L101. Feel free to push changes to this branch if you've got a good solution.
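
For reference, a minimal sketch (hypothetical, not the actual zen3geo code linked above) of the kind of workaround needed when an input xarray.DataArray has no name set:

```python
import xarray as xr

def ensure_named(dataarray: xr.DataArray, fallback: str = "z") -> xr.DataArray:
    # The loader path expects a named variable, so give the DataArray a
    # placeholder name if one was not set on it.
    return dataarray if dataarray.name is not None else dataarray.rename(fallback)
```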

> As for release infrastructure, I'll get started on an automated changelog. Issues/PRs for other things you notice missing would be much appreciated 😃

👍

@weiji14 (Member, Author) commented Jul 28, 2022

The `TypeError: <class 'numpy.typing._dtype_like._SupportsDType'> is not a generic class` failure is an upstream xarray issue, see pydata/xarray#6818 🙂

@codecov (bot) commented Jul 29, 2022

Codecov Report

Merging #71 (9516190) into main (0ded974) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main       #71   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         5           
  Lines          179       173    -6     
  Branches        40        37    -3     
=========================================
- Hits           179       173    -6     
| Impacted Files | Coverage Δ |
|---|---|
| xbatcher/accessors.py | 100.00% <100.00%> (ø) |
| xbatcher/generators.py | 100.00% <100.00%> (ø) |
| xbatcher/loaders/keras.py | 100.00% <100.00%> (ø) |
| xbatcher/loaders/torch.py | 100.00% <100.00%> (ø) |


@jhamman jhamman closed this Aug 9, 2022
@jhamman jhamman reopened this Aug 9, 2022
@jhamman (Contributor) commented Aug 9, 2022

Closed and re-opened to trigger CI.

@jhamman (Contributor) commented Aug 9, 2022

Looking at this now, I'm wondering how important it is to keep the flexible input type (Dataset and DataArray) in Xbatcher, or more narrowly in the torch dataloader. Is there an argument to be made in favor of only supporting DataArrays?

@weiji14 (Member, Author) commented Aug 9, 2022

> Looking at this now, I'm wondering how important it is to keep the flexible input type (Dataset and DataArray) in Xbatcher, or more narrowly in the torch dataloader. Is there an argument to be made in favor of only supporting DataArrays?

I would argue for keeping both Dataset and DataArray. Yes, xr.DataArray is easier to convert directly into a tensor compared to xr.Dataset, but you'll open up a bunch of issues with type casting if you convert Dataset inputs to DataArray (since data variables can have different dtypes). Oh, and I've got a project that uses xbatcher to slice xr.Dataset objects (it's easier and more efficient to slice one xr.Dataset with many data variables than many xr.DataArrays in a for-loop).
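
A quick illustration of the type-casting issue (not from this PR): converting a multi-variable Dataset into a single DataArray forces every variable onto a common dtype.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "mask": ("x", np.array([0, 1, 1], dtype="uint8")),
        "height": ("x", np.array([1.5, 2.0, 2.5], dtype="float64")),
    }
)
stacked = ds.to_array()  # stacks data variables along a new "variable" dimension
print(stacked.dtype)     # float64 -- the uint8 mask gets upcast
```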

@weiji14 (Member, Author) left a comment


I've managed to make all the unit tests pass, but to be honest, the unit tests were not well written. They seem to only test xarray.DataArray inputs and not xarray.Dataset. Maybe some parametrized tests would be good, but I'm also wary of making this PR too long...

Comment on lines +57 to +58:

    X_batch = self.X_generator[idx].torch.to_tensor()
    y_batch = self.y_generator[idx].torch.to_tensor()

@weiji14 (Member, Author) commented Aug 15, 2022


This change works because the unit tests in test_torch_loaders.py are actually testing xarray.DataArray inputs only, and not xarray.Dataset. Ideally there would be unit tests for both xr.DataArray and xr.Dataset inputs, but this might expand the scope of the Pull Request a bit too much 😅
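
For illustration, a parametrized test along these lines (a hypothetical sketch, not part of this PR) could cover both input types:

```python
import numpy as np
import pytest
import xarray as xr

from xbatcher import BatchGenerator


@pytest.fixture(params=["DataArray", "Dataset"])
def sample(request):
    # Same data exposed either as a named DataArray or a one-variable Dataset.
    da = xr.DataArray(np.random.rand(10, 10), dims=("x", "y"), name="foo")
    return da if request.param == "DataArray" else da.to_dataset()


def test_batch_preserves_input_type(sample):
    # Each generated batch should have the same container type as the input.
    for batch in BatchGenerator(sample, input_dims={"x": 5, "y": 5}):
        assert isinstance(batch, type(sample))
```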

Member


Before this PR, the tests used xarray.DataArray inputs to the batch generator but xarray.Dataset inputs to the dataloaders, since the batches were coerced into datasets. So expecting xarray.DataArray inputs here would be an additional breaking change.

For the unit tests, I opened #83 to keep track of improvements for subsequent PRs.

Member


What are your thoughts on backwards compatibility here @weiji14? My impression is that the hardcoded xarray.Dataset variable names severely restrict the utility of the data loader. So I think this is a worthwhile change, since we'd eventually be forced to break backwards compatibility anyway to allow flexible variable names, and working from an xarray.DataArray implementation is better. But we could add an if/else block for dataset vs. dataarray if it's necessary to maintain the past behavior.
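
A rough sketch of what such a dataset-vs-dataarray branch could look like (assumptions: the `.torch.to_tensor()` accessor from the review diff above, applied per data variable; this is not the merged implementation):

```python
import xarray as xr
import xbatcher  # noqa: F401  -- importing xbatcher registers the .torch accessor


def batch_to_tensor(batch):
    # Hypothetical helper for a loader's __getitem__: accept either container
    # type that the batch generator can return.
    if isinstance(batch, xr.Dataset):
        # Dataset case (past behaviour): one tensor per data variable, so each
        # variable keeps its own dtype.
        return {name: da.torch.to_tensor() for name, da in batch.data_vars.items()}
    # DataArray case (new behaviour): convert directly to a single tensor.
    return batch.torch.to_tensor()
```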

Member Author


Yes, I agree that we should try to be backwards compatible and support both xr.DataArray and xr.Dataset inputs. If you want, I can either:

  1. Add the if/else block here, but write the comprehensive unit tests covering the xr.Dataset/xr.DataArray cases in a separate PR, or
  2. Do the if/else block and unit tests in a follow-up PR, in order to keep this PR small.

Member


+1 for option 2

@weiji14 weiji14 marked this pull request as ready for review August 15, 2022 23:49
@maxrjones maxrjones mentioned this pull request Aug 16, 2022
@maxrjones maxrjones merged commit 714b624 into xarray-contrib:main Aug 19, 2022
@maxrjones (Member)

Thanks for this contribution @weiji14!

@weiji14 weiji14 deleted the dims_to_sizes branch August 19, 2022 17:01
@weiji14 (Member, Author) commented Aug 19, 2022

Awesome, thanks for the review!

Labels: enhancement (New feature or request)
Projects: None yet
Development: successfully merging this pull request may close these issues: "Input xarray.DataArray becomes xarray.Dataset"
3 participants