Use .sizes instead of .dims for xr.Dataset/xr.DataArray compatibility #71
Removes need for using `_as_xarray_dataset` so xr.DataArray inputs are preserved as xr.DataArray objects on the returned output.
Finalize tutorial by converting chips from xarray.Dataset to torch.Tensor and stacking them per mini-batch! Debated on whether to have the xarray collate function in the codebase, but let's wait for updates on xbatcher's end (xarray-contrib/xbatcher#71). Also renamed the tutorial file from batching to chipping and added more emojis to the intro section.
* 🚧 Walkthrough on creating batches of data: Initial draft tutorial on creating batches of chipped data from full-size satellite scenes! Will be working with Sentinel-1 GRD GeoTIFFs, let's see how far this will go.
* 💡 Demo XbatcherSlicer to get 512x512 chips from larger scene: Walkthrough of how to cut up a large satellite scene into multiple smaller chips of size 512 pixels by 512 pixels. Heavy lifting done by xbatcher, which handles slicing along dimensions and overlapping strides. Needed a hacky workaround in XbatcherSlicer to fix a ValueError due to the xarray.DataArray name not being set (though it should be).
* 💚 Install xbatcher for documentation build: Fix readthedocs build failure because xbatcher was not installed.
* 🗃️ Collate chips into mini-batches: Finalize tutorial by converting chips from xarray.Dataset to torch.Tensor and stacking them per mini-batch! Debated on whether to have the xarray collate function in the codebase, but let's wait for updates on xbatcher's end (xarray-contrib/xbatcher#71). Also renamed the tutorial file from batching to chipping and added more emojis to the intro section.
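The collate step mentioned in the last commit (convert chips to tensors, then stack them per mini-batch) can be sketched roughly as below. This is a minimal sketch: the function name `collate_chips` is made up, and numpy stacking stands in for `torch.stack`; it is not xbatcher's or zen3geo's actual API.

```python
import numpy as np
import xarray as xr

def collate_chips(chips):
    """Hypothetical collate function: turn a list of same-shaped
    xr.DataArray chips into one stacked array with a new leading
    batch dimension (torch.as_tensor could then wrap the result)."""
    return np.stack([chip.data for chip in chips], axis=0)

# Four 512x512 chips become one mini-batch of shape (4, 512, 512).
chips = [xr.DataArray(np.ones((512, 512)), dims=("y", "x")) for _ in range(4)]
batch = collate_chips(chips)
print(batch.shape)  # (4, 512, 512)
```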
Thanks @weiji14! I agree that returning `xr.DataArray` objects is preferable. With this change, https://github.com/pangeo-data/xbatcher/blob/18d94619f8605ffffdf41286cf58904f745e99bb/xbatcher/generators.py#L10-L15 is no longer used and could be removed to maintain code coverage. The pytorch data loader expects an `xr.Dataset`, which causes the tests to fail in this PR. Let me know if you'd like to work on fixing that. If not, two options would be for me to submit a PR with suggested changes to this branch, or to reduce the scope of this PR to not remove the coercion to dataset while still updating `.dims` to `.sizes`. As for release infrastructure, I'll get started on an automated changelog. Issues/PRs for other things you notice missing would be much appreciated 😃
Good point, done in 50f8b7d
Yeah, I noticed the hardcoded values and the reliance on the dataset requiring a name. Needed a workaround at https://github.com/weiji14/zen3geo/blob/71e886d95454de70651cc31ea6dedc33e929145c/zen3geo/datapipes/xbatcher.py#L97-L101. Feel free to push changes to this branch if you've got a good solution.
👍
Codecov Report

```diff
@@           Coverage Diff            @@
##              main      #71   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files            5        5
  Lines          179      173    -6
  Branches        40       37    -3
=========================================
- Hits           179      173    -6
```
Closed and re-opened to trigger CI.
Looking at this now, I'm wondering how important it is to keep the flexible input type (Dataset and DataArray) in Xbatcher, or more narrowly in the torch dataloader? Is there an argument to be made in favor of only supporting DataArrays?
Would argue for keeping both Dataset and DataArray. Yes, xr.DataArray is easier to convert directly into a tensor compared to xr.Dataset, but you'll open up a bunch of issues with type casting if you convert Dataset inputs to DataArray (since data variables can be of different dtypes). Oh, and I've got a project too using xbatcher to slice xr.Dataset objects (it's easier and more efficient to slice one xr.Dataset with many data variables than many xr.DataArrays in a for-loop).
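The type-casting concern can be demonstrated with a small sketch: collapsing a Dataset's variables into one DataArray forces a single common dtype, so an integer variable gets upcast alongside a float one. The variable names `mask` and `elevation` here are made up for illustration.

```python
import numpy as np
import xarray as xr

# A Dataset whose data variables have different dtypes.
ds = xr.Dataset(
    {
        "mask": (("y", "x"), np.zeros((2, 2), dtype="uint8")),
        "elevation": (("y", "x"), np.zeros((2, 2), dtype="float64")),
    }
)

# Collapsing to a single DataArray must pick one dtype for all
# variables, so the uint8 mask is silently upcast to float64.
stacked = ds.to_array()
print(stacked.dtype)  # float64
print(stacked.shape)  # (2, 2, 2), with a new leading "variable" dimension
```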
Not sure if this is correct because the unit test is quite vague on what is being tested, but the tests pass now.
I've managed to make all the unit tests pass, but to be honest, the unit tests were not well written. They seem to only test xarray.DataArray inputs and not xarray.Dataset. Maybe some parametrized tests would be good, but I'm also wary of making this PR too long...
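A parametrized test along those lines might look like the sketch below; the test name and fixture data are hypothetical, not the actual contents of xbatcher's test suite.

```python
import numpy as np
import pytest
import xarray as xr

@pytest.mark.parametrize("as_dataset", [False, True])
def test_both_input_types(as_dataset):
    # Hypothetical test body: build the same data as either an
    # xr.DataArray or an xr.Dataset, then exercise the code under test.
    data = xr.DataArray(np.zeros((4, 8)), dims=("y", "x"), name="foo")
    if as_dataset:
        data = data.to_dataset()
    # .sizes works uniformly on both input types.
    assert dict(data.sizes) == {"y": 4, "x": 8}
```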
```python
X_batch = self.X_generator[idx].torch.to_tensor()
y_batch = self.y_generator[idx].torch.to_tensor()
```
This change works because the unit tests in test_torch_loaders.py are actually testing `xarray.DataArray` inputs only, and not `xarray.Dataset`. Ideally there would be unit tests for both `xr.DataArray` and `xr.Dataset` inputs, but this might expand the scope of the Pull Request a bit too much 😅
Before this PR, the tests used `xarray.DataArray` inputs to the batch generator, but `xarray.Dataset` inputs to the dataloaders since the batches were coerced into datasets. So I think this would be an additional breaking change in expecting `xarray.DataArray` inputs.
For the unit tests, I opened #83 to keep track of improvements for subsequent PRs.
What are your thoughts on backwards compatibility here @weiji14? My impression is that the hardcoded `xarray.Dataset` variable names severely restrict the utility of the data loader. So I think this is a worthwhile change, since we'd eventually be forced to break backwards compatibility anyway to allow flexible variable names, and working from an `xarray.DataArray` implementation is better. But we could add an `if`/`else` block for dataset vs. dataarray if it's necessary to maintain the past behavior.
Yes, I agree that we should try to be backward compatible and support both `xr.DataArray` and `xr.Dataset` inputs. If you want, I can either:

1. Add the if-then block here, but write the comprehensive unit tests covering `xr.Dataset`/`xr.DataArray` cases in a separate PR.
2. Do the if-then block and unit tests in a follow-up PR, in order to keep this PR small.
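For reference, the if-then block under discussion could look something like this minimal sketch; `as_array` is a hypothetical helper name, and numpy stands in for the actual torch tensor conversion.

```python
import numpy as np
import xarray as xr

def as_array(batch):
    """Hypothetical dispatch helper so both batch types are accepted:
    xr.Dataset variables get stacked along a new leading dimension,
    while xr.DataArray data is passed through unchanged."""
    if isinstance(batch, xr.Dataset):
        # Note: to_array() upcasts mixed dtypes to a common one.
        return batch.to_array().data
    return batch.data

da = xr.DataArray(np.ones((2, 3)), dims=("y", "x"))
ds = xr.Dataset({"a": da, "b": da})
print(as_array(da).shape)  # (2, 3)
print(as_array(ds).shape)  # (2, 2, 3)
```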
+1 for option 2
Co-authored-by: Max Jones <[email protected]>
Also testing not just the batch size, but the actual shape of the output x_batch and y_batch tensor.
Thanks for this contribution @weiji14!
Awesome, thanks for the review!
Removes need for using `_as_xarray_dataset` so xr.DataArray inputs are preserved as xr.DataArray objects on the returned output.

Does so by changing `.dims` (which returns a different thing between `xr.Dataset` and `xr.DataArray`) to `.sizes`, which is a property added in `xarray` v0.9.0 (25 January 2017); see also pydata/xarray#921.

Note that this change is backward incompatible, so an xbatcher v0.2.0 release might be needed. Happy to help out with setting up some of the release infrastructure if needed, e.g. if there's time during the SciPy sprint 😄
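The `.dims` vs `.sizes` difference can be checked directly with a small sketch (the variable name `foo` is arbitrary):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 3)), dims=("y", "x"), name="foo")
ds = da.to_dataset()

# DataArray.dims is a tuple of dimension names, while Dataset.dims
# maps dimension name -> size, so code relying on .dims cannot
# treat both types uniformly.
print(da.dims)  # ('y', 'x')

# .sizes is a name -> size mapping on both types:
assert dict(da.sizes) == dict(ds.sizes) == {"y": 2, "x": 3}
```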
Fixes #70