Use .sizes instead of .dims for xr.Dataset/xr.DataArray compatibility #71


Merged: 11 commits merged into xarray-contrib:main on Aug 19, 2022

Conversation

@weiji14 (Member) commented Jul 10, 2022

Removes the need for using `_as_xarray_dataset`, so xr.DataArray inputs are preserved as xr.DataArray objects in the returned output.

Does so by changing `.dims` (which returns a tuple of dimension names on xr.DataArray but a mapping of dimension names to lengths on xr.Dataset) to `.sizes`, which returns the name-to-length mapping for both types and has been available since xarray v0.9.0 (25 January 2017); see also pydata/xarray#921.
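
A minimal illustration of the difference (not part of this PR's diff, just for context):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), name="foo")
ds = da.to_dataset()

print(da.dims)   # ('x', 'y') -- a tuple of dimension names
print(ds.dims)   # a frozen mapping {'x': 2, 'y': 3} of names to lengths
print(da.sizes)  # the same frozen mapping {'x': 2, 'y': 3}, on both types
print(ds.sizes)  # {'x': 2, 'y': 3}
```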

Note that this change is backward incompatible, so an xbatcher v0.2.0 release might be needed. Happy to help out with setting up some of the release infrastructure if needed, e.g. if there's time during the SciPy sprint 😄

Fixes #70

weiji14 added 2 commits July 9, 2022 22:20
weiji14 added commits to weiji14/zen3geo that referenced this pull request Jul 16, 2022:

* 🚧 Walkthrough on creating batches of data

  Initial draft tutorial on creating batches of chipped data from full-size satellite scenes! Will be working with Sentinel-1 GRD GeoTIFFs, let's see how far this will go.

* 💡 Demo XbatcherSlicer to get 512x512 chips from larger scene

  Walkthrough of how to cut up a large satellite scene into multiple smaller chips of size 512 pixels by 512 pixels. Heavy lifting done by xbatcher, which handles slicing along dimensions and overlapping strides. Needed a hacky workaround in XbatcherSlicer to fix a ValueError due to the xarray.DataArray name not being set (though it should be).

* 💚 Install xbatcher for documentation build

  Fix readthedocs build failure because xbatcher was not installed.

* 🗃️ Collate chips into mini-batches

  Finalize tutorial by converting chips from xarray.Dataset to torch.Tensor and stacking them per mini-batch! Debated whether to have the xarray collate function in the codebase, but let's wait for updates on xbatcher's end (xarray-contrib/xbatcher#71). Also renamed the tutorial file from batching to chipping and added more emojis to the intro section.
@maxrjones (Member)

Thanks @weiji14! I agree that returning xarray.DataArray makes sense for that case.

With this change, https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/generators.py#L10-L15 is no longer used and could be removed to maintain code coverage.

The pytorch data loader expects an xarray.Dataset with hardcoded variable names (https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/loaders/torch.py#L55-L58), which causes the tests to fail in this PR. Let me know if you'd like to work on fixing that. If not, two options would be for me to submit a PR with suggested changes to this branch, or to reduce the scope of this PR to not remove the coercion to dataset while still updating .dims to .sizes.

As for release infrastructure, I'll get started on an automated changelog. Issues/PRs for other things you notice missing would be much appreciated 😃

@weiji14 (Member, Author) commented Jul 28, 2022

> https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/generators.py#L10-L15
> is no longer used and could be removed to maintain code coverage.

Good point, done in 50f8b7d

> The pytorch data loader expects an xarray.Dataset with hardcoded variable names
>
> https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/loaders/torch.py#L55-L58
>
> which causes the tests to fail in this PR. Let me know if you'd like to work on fixing that. If not, two options would be for me to submit a PR with suggested changes to this branch or to reduce the scope of this PR to not remove the coercion to dataset, while still updating .dims to .sizes.

Yeah, I noticed the hardcoded values and the reliance on the dataset requiring a name. Needed a workaround at https://github.com/weiji14/zen3geo/blob/71e886d95454de70651cc31ea6dedc33e929145c/zen3geo/datapipes/xbatcher.py#L97-L101. Feel free to push changes to this branch if you've got a good solution.
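
For reference, a minimal sketch (hypothetical, not the actual zen3geo code linked above) of the kind of workaround needed when an input xarray.DataArray has no name set:

```python
import xarray as xr

def ensure_named(dataarray: xr.DataArray, fallback: str = "z") -> xr.DataArray:
    # The loader path expects a named variable, so give the DataArray a
    # placeholder name if one was not set on it.
    return dataarray if dataarray.name is not None else dataarray.rename(fallback)
```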

> As for release infrastructure, I'll get started on an automated changelog. Issues/PRs for other things you notice missing would be much appreciated 😃

👍

@weiji14 (Member, Author) commented Jul 28, 2022

The `TypeError: <class 'numpy.typing._dtype_like._SupportsDType'> is not a generic class` failure is an upstream xarray issue, see pydata/xarray#6818 🙂

@codecov (bot) commented Jul 29, 2022

Codecov Report

Merging #71 (9516190) into main (0ded974) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main       #71   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         5           
  Lines          179       173    -6     
  Branches        40        37    -3     
=========================================
- Hits           179       173    -6     
| Impacted Files | Coverage Δ |
|---|---|
| xbatcher/accessors.py | 100.00% <100.00%> (ø) |
| xbatcher/generators.py | 100.00% <100.00%> (ø) |
| xbatcher/loaders/keras.py | 100.00% <100.00%> (ø) |
| xbatcher/loaders/torch.py | 100.00% <100.00%> (ø) |


@jhamman jhamman closed this Aug 9, 2022
@jhamman jhamman reopened this Aug 9, 2022
@jhamman (Contributor) commented Aug 9, 2022

Closed and re-opened to trigger CI.

@jhamman (Contributor) commented Aug 9, 2022

Looking at this now, I'm wondering how important it is to keep the flexible input type (Dataset and DataArray) in Xbatcher, or more narrowly in the torch dataloader. Is there an argument to be made in favor of only supporting DataArrays?

@weiji14 (Member, Author) commented Aug 9, 2022

> Looking at this now, I'm wondering how important it is to keep the flexible input type (Dataset and DataArray) in Xbatcher, or more narrowly in the torch dataloader. Is there an argument to be made in favor of only supporting DataArrays?

I would argue for keeping both Dataset and DataArray. Yes, xr.DataArray is easier to convert directly into a tensor compared to xr.Dataset, but you'll open up a bunch of issues with type casting if you convert Dataset inputs to DataArray (since data variables can have different dtypes). Oh, and I've got a project that uses xbatcher to slice xr.Dataset objects (it's easier and more efficient to slice one xr.Dataset with many data variables than many xr.DataArrays in a for-loop).
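
A quick illustration of the type-casting issue (not from this PR): converting a multi-variable Dataset into a single DataArray forces every variable onto a common dtype.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "mask": ("x", np.array([0, 1, 1], dtype="uint8")),
        "height": ("x", np.array([1.5, 2.0, 2.5], dtype="float64")),
    }
)
stacked = ds.to_array()  # stacks data variables along a new "variable" dimension
print(stacked.dtype)     # float64 -- the uint8 mask gets upcast
```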

@weiji14 (Member, Author) left a comment


I've managed to make all the unit tests pass, but to be honest, the unit tests were not well written. They seem to only test xarray.DataArray inputs and not xarray.Dataset. Maybe some parametrized tests would be good, but I'm also wary of making this PR too long...

Comment on lines +57 to +58:

    X_batch = self.X_generator[idx].torch.to_tensor()
    y_batch = self.y_generator[idx].torch.to_tensor()

@weiji14 (Member, Author) commented Aug 15, 2022


This change works because the unit tests in test_torch_loaders.py are actually testing xarray.DataArray inputs only, and not xarray.Dataset. Ideally there would be unit tests for both xr.DataArray and xr.Dataset inputs, but this might expand the scope of the Pull Request a bit too much 😅
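
For illustration, a parametrized test along these lines (a hypothetical sketch, not part of this PR) could cover both input types:

```python
import numpy as np
import pytest
import xarray as xr

from xbatcher import BatchGenerator


@pytest.fixture(params=["DataArray", "Dataset"])
def sample(request):
    # Same data exposed either as a named DataArray or a one-variable Dataset.
    da = xr.DataArray(np.random.rand(10, 10), dims=("x", "y"), name="foo")
    return da if request.param == "DataArray" else da.to_dataset()


def test_batch_preserves_input_type(sample):
    # Each generated batch should have the same container type as the input.
    for batch in BatchGenerator(sample, input_dims={"x": 5, "y": 5}):
        assert isinstance(batch, type(sample))
```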

Member


Before this PR, the tests used xarray.DataArray inputs to the batch generator but xarray.Dataset inputs to the dataloaders, since the batches were coerced into datasets. So expecting xarray.DataArray inputs here would be an additional breaking change.

For the unit tests, I opened #83 to keep track of improvements for subsequent PRs.

Member


What are your thoughts on backwards compatibility here @weiji14? My impression is that the hardcoded xarray.Dataset variable names severely restrict the utility of the data loader. So I think this is a worthwhile change, since we'd eventually be forced to break backwards compatibility anyway to allow flexible variable names, and working from an xarray.DataArray implementation is better. But we could add an if/else block for dataset vs. dataarray if it's necessary to maintain the past behavior.
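
A rough sketch of what such a dataset-vs-dataarray branch could look like (assumptions: the `.torch.to_tensor()` accessor from the review diff above, applied per data variable; this is not the merged implementation):

```python
import xarray as xr
import xbatcher  # noqa: F401  -- importing xbatcher registers the .torch accessor


def batch_to_tensor(batch):
    # Hypothetical helper for a loader's __getitem__: accept either container
    # type that the batch generator can return.
    if isinstance(batch, xr.Dataset):
        # Dataset case (past behaviour): one tensor per data variable, so each
        # variable keeps its own dtype.
        return {name: da.torch.to_tensor() for name, da in batch.data_vars.items()}
    # DataArray case (new behaviour): convert directly to a single tensor.
    return batch.torch.to_tensor()
```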

Member Author


Yes, I agree that we should try to be backwards compatible and support both xr.DataArray and xr.Dataset inputs. If you want, I can either:

  1. Add the if/else block here, but write the comprehensive unit tests covering the xr.Dataset/xr.DataArray cases in a separate PR, or
  2. Do the if/else block and unit tests in a follow-up PR, in order to keep this PR small.

Member


+1 for option 2

@weiji14 weiji14 marked this pull request as ready for review August 15, 2022 23:49
@maxrjones maxrjones mentioned this pull request Aug 16, 2022
@maxrjones maxrjones merged commit 714b624 into xarray-contrib:main Aug 19, 2022
@maxrjones (Member)

Thanks for this contribution @weiji14!

@weiji14 weiji14 deleted the dims_to_sizes branch August 19, 2022 17:01
@weiji14 (Member, Author) commented Aug 19, 2022

Awesome, thanks for the review!

Labels: enhancement (New feature or request)
Projects: None yet
Development: successfully merging this pull request may close these issues: "Input xarray.DataArray becomes xarray.Dataset"
3 participants