Skip to content

Reading and writing a zarr dataset multiple times casts bools to int8 #4826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amatsukawa opened this issue Jan 19, 2021 · 10 comments · Fixed by #7720
Closed

Reading and writing a zarr dataset multiple times casts bools to int8 #4826

amatsukawa opened this issue Jan 19, 2021 · 10 comments · Fixed by #7720
Labels
bug contrib-help-wanted topic-zarr Related to zarr storage library

Comments

@amatsukawa
Copy link
Contributor

amatsukawa commented Jan 19, 2021

What happened:

Reading and writing zarr dataset multiple times into different paths changes bool dtype arrays to int8. I think this issue is related to #2937.

What you expected to happen:

My array's dtype in numpy/dask should not change, even if certain storage backends store dtypes a certain way.

Minimal Complete Verifiable Example:

import xarray as xr
import numpy as np

ds = xr.Dataset({
    "bool_field": xr.DataArray(
        np.random.randn(5) < 0.5, 
        dims=('g'), 
        coords={'g': np.arange(5)}
    )
})
ds.to_zarr('test.zarr', mode="w")

d2 = xr.open_zarr('test.zarr')
print(d2.bool_field.dtype)
print(d2.bool_field.encoding)
d2.to_zarr("test2.zarr", mode="w")

d3 = xr.open_zarr('test2.zarr')
print(d3.bool_field.dtype)

The above snippet prints the following. In d3, the dtype of bool_field is int8, presumably because d3 inherited d2's encoding and it says int8, despite the array having a bool dtype.

bool
{'chunks': (5,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int8')}
int8

Anything else we need to know?:

Currently workaround is to explicitly set encodings. This fixes the problem:

encoding = {k: {"dtype": d2[k].dtype} for k in d2}
d2.to_zarr('test2.zarr', mode="w", encoding=encoding)

Environment:

Output of xr.show_versions()
# I'll update with the the full output of xr.show_versions() soon.
In [4]: xr.__version__                                                                                                                                     
Out[4]: '0.16.2'

In [2]: zarr.__version__                                                                                                                                      
Out[2]: '2.6.1'
@amatsukawa
Copy link
Contributor Author

Apparently my proposed fix broke a bunch of other things, eg. some writing of timedeltas with units and such.

Deleting the "dtype" key in the .encoding of the boolean variable also seems to do the trick. The issue is that bools are not encoded correctly if the .encoding field already has a dtype:

https://github.com/pydata/xarray/blob/v0.16.2/xarray/conventions.py#L119

@amatsukawa
Copy link
Contributor Author

amatsukawa commented Jan 22, 2021

OK here's the other side of the problem. The original dtype (which is i8) is set in the encoding:

https://github.com/pydata/xarray/blob/v0.16.2/xarray/conventions.py#L350

@amatsukawa
Copy link
Contributor Author

Tagging a few maintainers: @dcherian @shoyer.

Sorry to tag you directly, hope that's ok. I think I've found the issue here and would like to provide a PR to fix, but need some input on what you think would be best.

To summarize, the current behavior leading to the bug is:

  1. When a bool dtype is initially written, the maybe_encode_bool function us used to convert it the bool to a i1 with a vars.attr that says it is actually a bool. It would appear from encoding of boolean dtype in zarr #2937 that it ends up an i8 somehow anyway.
    def maybe_encode_bools(var):
  2. When this is loaded the first time, the i8 is correctly identified as actually being a bool using the attributes.
    if "dtype" in attributes and attributes["dtype"] == "bool":
  3. However, 2 lines above that, there is a encoding.setdefault("dtype", original_dtype) so this Variable object now has .encoding["dtype"] which is i8.
  4. When I try to save this again, it tries maybe_encode_bool from step 1 again, but this time the function is bypassed because of step 3 above.
  5. The dataset I write from step 4 now does not have the attribute identifying it as a bool, and so it's an i8 when I load it back.

I can think of a few fixes:

  • Drop the .encoding dict on load for bools. Presumably these .encodings are kept such that datasets attempts to preserve the compressor, chunk size, etc. of its source. However, given that .encoding seems to be dropped when I eg. do a .astype("bool") maybe this is OK for bools.
  • Set var.attrs["dtype"] = np.dtype("bool") on load. This would preserve what we'd get out of maybe_encode_bool. The challenge with this is that .attrs are not always preserved?
  • Change maybe_encode_bool to not ignore when .encoding["dtype"] exists. I suppose there would be a need to check that the compressor is still compatible, etc?

As a local fix while we consider these options, can you confirm that, as the docs state, the .encoding is only used for serializing and not deserializing arrays, and therefore if I drop the .encoding on my DataArrays as a temporary fix I wouldn't break anything (if I am ok with xarray not preserving my compressors, etc).

@slevang
Copy link
Contributor

slevang commented Feb 17, 2021

I ran into this as well with the basic netcdf backends:

import xarray as xr

ds = xr.Dataset(
    data_vars={"foo":(["x"], [False, True, False])},
    coords={"x": [1, 2, 3]},
)
ds.to_netcdf('test.nc')
ds = xr.load_dataset('test.nc')
print(ds.foo.dtype)
ds.to_netcdf('test.nc')
ds = xr.load_dataset('test.nc')
print(ds.foo.dtype)

Gives:

bool
int8

@shoyer
Copy link
Member

shoyer commented Feb 17, 2021

There are at least two issues here:

  • For both netCDF and zarr, we should definitely "round trip" preserve dtypes. There are a few ways this might be done. I would have to look at a PR to say for sure, but from the looks of it "Change maybe_encode_bool to not ignore when .encoding["dtype"] exists" is the right fix.
  • Zarr supports bool data, so we don't need conversion from bool -> int8. We should just save data as bool.

@dcherian dcherian added contrib-help-wanted topic-zarr Related to zarr storage library bug labels Apr 19, 2021
@andersy005
Copy link
Member

andersy005 commented Apr 20, 2021

I think this issue is related to #2937.

This is the same problem reported in #4386. All these issues (#2937, #4826, #4386, #5192) should be closable once there's a fix.

@shaunc
Copy link

shaunc commented Apr 20, 2021

Closed #5192 in favor of this as I think it's a duplicate. Just NB that it can occur with h5netcdf as well as netcdf4. (Thanks, @andersy005 )

@kmuehlbauer
Copy link
Contributor

Please check #7720 if that fixes the conversion problems. Thanks.

@JoerivanEngelen
Copy link

@kmuehlbauer your fix fixes both the problems set up by @slevang and @amatsukawa on my laptop.

I furthermore tested the double roundtrip with load_dataset and to_netcdf, with the three possible engines, they all worked.

@kmuehlbauer
Copy link
Contributor

@JoerivanEngelen Thanks for taking the time. Much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug contrib-help-wanted topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants