Skip to content

Adding mask to open_rasterio #2618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed

Adding mask to open_rasterio #2618

wants to merge 6 commits into from

Conversation

jsignell
Copy link
Contributor

Not sure if this is the right approach @snowman2

@pep8speaks
Copy link

pep8speaks commented Dec 18, 2018

Hello @jsignell! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 3138:80: E501 line too long (81 > 79 characters)

Comment last updated at 2019-04-26 15:27:42 UTC

@snowman2
Copy link
Contributor

@jsignell, this looks like it would work - I like the implementation. Some things I would recommend to add is checks for when the nodata value varies across bands (maybe skip the mask and raise a warning) and when the nodata value is already NaN. Also, you should probably set the nodatavals to NaN if masked to resemble the behavior of xarray elsewhere.

I think @fmaussion would be a good reviewer for this PR.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should mask=True to be the default? That's how we handle netCDF files.

@@ -298,6 +300,10 @@ def open_rasterio(filename, parse_coordinates=None, chunks=None, cache=None,
result = DataArray(data=data, dims=('band', 'y', 'x'),
coords=coords, attrs=attrs)

if mask:
for nodataval in attrs.get('nodatavals', ()):
result = result.where(result != nodataval)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside of this approach is that unlike how to decode metadata for netCDF files, this won't be lazy.

See CFMaskCoder in xarray/coding/variables.py for who we do this with netCDF files. In particular, it probably makes sense to use lazy_elemwise_func and to move nodatavals from attrs to encoding.

@fmaussion
Copy link
Member

I wonder if nodatavals is the only attribute for masking - in the doc (https://rasterio.readthedocs.io/en/latest/topics/masks.html) they also mention nodata, which in my experience is also used

@snowman2
Copy link
Contributor

snowman2 commented Dec 19, 2018

I think the solution would be to add a masked argument and property to the class RasterioArrayWrapper here.

And pass in the (masked=masked) argument here

That would pass riods.read( ..., masked=self.masked) when reading data from rasterio here

@jsignell
Copy link
Contributor Author

I was wondering if we should just pass the mask option to rasterio like that. I'll try it out.

@jsignell
Copy link
Contributor Author

Hmm. Do we really want a np.masked_array or do we just want an array with np.Nans in it? I prefer nans, but I guess the downside is type conversion for ints?

@jsignell
Copy link
Contributor Author

I did as @snowman2 suggested but filled the mask with nans at the end. Since it is now using rasterio directly this should address @fmaussion's concern about nodata (which I think was ending up in the nodatavals attr anyways). I also moved the nodatavals from attrs to encoding when mask==True, but I didn't set it to true by default since I didn't want to break backwards compatibility.

np.nan if nodataval is None else nodataval
for nodataval in riods.nodatavals)
if mask:
encoding['nodatavals'] = nodatavals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thing setting the encoding should inline with the other if statements and should be something like:

if mask and riods.nodata is not None:
    encoding['_FillValue'] = riods.nodata

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer is this what you intended or did you want to keep it as @jsignell modified it to be?

@fmaussion
Copy link
Member

Do we really want a np.masked_array or do we just want an array with np.Nans in it? I prefer nans, but I guess the downside is type conversion for ints?

Masked arrays come with huge performance drops, I don't think we should use these.

@shoyer
Copy link
Member

shoyer commented Dec 20, 2018 via email

@fmaussion
Copy link
Member

I'm sorry this has stalled now - I think this is a nice addition.

@jsignell : would you have time to finish this up?

@jsignell
Copy link
Contributor Author

Yeah I have some time to finish it up. Looking back at it now it seems like masked would be a better kwarg for open_rasterio. Is there anything else that should be changed?

@snowman2
Copy link
Contributor

I think it would be good to write the dataset to a netcdf, and read it back in with xarray.open_dataset to ensure everything encodes/decides properly.

@jsignell
Copy link
Contributor Author

I think it would be good to write the dataset to a netcdf, and read it back in with xarray.open_dataset to ensure everything encodes/decides properly.

Do you mean write a test to do this, or just check it locally?

@snowman2
Copy link
Contributor

Good question, I was thinking adding tests.

@@ -39,6 +39,7 @@ def __init__(self, manager, lock, vrt_params=None):
if not np.all(np.asarray(dtypes) == dtypes[0]):
raise ValueError('All bands should have the same dtype')
self._dtype = np.dtype(dtypes[0])
Copy link
Contributor

@snowman2 snowman2 Jul 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._dtype = np.dtype(dtypes[0])
self._dtype = np.float64 if self.masked else np.dtype(dtypes[0])

out = riods.read(band_key, window=window)
out = riods.read(band_key, window=window, masked=self.masked)
if self.masked:
out = np.ma.filled(out, np.nan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
out = np.ma.filled(out, np.nan)
out = np.ma.filled(out.astype(np.float64), np.nan)


# this lets you write arrays loaded with rasterio
data = indexing.CopyOnWriteArray(data)
if cache and chunks is None:
data = indexing.MemoryCachedArray(data)

result = DataArray(data=data, dims=('band', 'y', 'x'),
coords=coords, attrs=attrs)
coords=coords, attrs=attrs, encoding=encoding)
Copy link
Contributor

@snowman2 snowman2 Jul 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
coords=coords, attrs=attrs, encoding=encoding)
coords=coords, attrs=attrs)
result.encoding = encoding

@snowman2
Copy link
Contributor

snowman2 commented Nov 6, 2019

@jsignell, the masked kwarg is in rioxarray if you need it: https://corteva.github.io/rioxarray/html/examples/clip_geom.html

@keewis keewis closed this Jun 23, 2021
@keewis keewis deleted the branch pydata:master June 23, 2021 16:14
@keewis
Copy link
Collaborator

keewis commented Jun 24, 2021

this has been unintentionally closed when renaming the branch to merge into to main, but given that we consider deprecating open_rasterio in favor of the rasterio engine provided by rioxarray (which seems to already support the mask kwarg) that's probably not a mistake. If you disagree feel free to reopen this PR.

@jsignell
Copy link
Contributor Author

I'm fine with this being closed. TBH I had totally forgotten about it 😬

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants