Reading and writing a zarr dataset multiple times casts bools to int8 #4826

amatsukawa · 2021-01-19T22:02:15Z

What happened:

Reading and writing zarr dataset multiple times into different paths changes bool dtype arrays to int8. I think this issue is related to #2937.

What you expected to happen:

My array's dtype in numpy/dask should not change, even if certain storage backends store dtypes a certain way.

Minimal Complete Verifiable Example:

import xarray as xr
import numpy as np

ds = xr.Dataset({
    "bool_field": xr.DataArray(
        np.random.randn(5) < 0.5, 
        dims=('g'), 
        coords={'g': np.arange(5)}
    )
})
ds.to_zarr('test.zarr', mode="w")

d2 = xr.open_zarr('test.zarr')
print(d2.bool_field.dtype)
print(d2.bool_field.encoding)
d2.to_zarr("test2.zarr", mode="w")

d3 = xr.open_zarr('test2.zarr')
print(d3.bool_field.dtype)

The above snippet prints the following. In d3, the dtype of bool_field is int8, presumably because d3 inherited d2's encoding and it says int8, despite the array having a bool dtype.

bool
{'chunks': (5,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int8')}
int8

Anything else we need to know?:

Currently workaround is to explicitly set encodings. This fixes the problem:

encoding = {k: {"dtype": d2[k].dtype} for k in d2}
d2.to_zarr('test2.zarr', mode="w", encoding=encoding)

Environment:

Output of xr.show_versions()

# I'll update with the the full output of xr.show_versions() soon.
In [4]: xr.__version__                                                                                                                                     
Out[4]: '0.16.2'

In [2]: zarr.__version__                                                                                                                                      
Out[2]: '2.6.1'

The text was updated successfully, but these errors were encountered:

amatsukawa · 2021-01-22T23:27:57Z

Apparently my proposed fix broke a bunch of other things, eg. some writing of timedeltas with units and such.

Deleting the "dtype" key in the .encoding of the boolean variable also seems to do the trick. The issue is that bools are not encoded correctly if the .encoding field already has a dtype:

https://github.com/pydata/xarray/blob/v0.16.2/xarray/conventions.py#L119

amatsukawa · 2021-01-22T23:37:47Z

OK here's the other side of the problem. The original dtype (which is i8) is set in the encoding:

https://github.com/pydata/xarray/blob/v0.16.2/xarray/conventions.py#L350

amatsukawa · 2021-01-25T17:19:19Z

Tagging a few maintainers: @dcherian @shoyer.

Sorry to tag you directly, hope that's ok. I think I've found the issue here and would like to provide a PR to fix, but need some input on what you think would be best.

To summarize, the current behavior leading to the bug is:

When a bool dtype is initially written, the maybe_encode_bool function us used to convert it the bool to a i1 with a vars.attr that says it is actually a bool. It would appear from encoding of boolean dtype in zarr #2937 that it ends up an i8 somehow anyway.

xarray/xarray/conventions.py

Line 119 in cc53a77

def maybe_encode_bools(var):
When this is loaded the first time, the i8 is correctly identified as actually being a bool using the attributes.

xarray/xarray/conventions.py

Line 352 in cc53a77

if "dtype" in attributes and attributes["dtype"] == "bool":
However, 2 lines above that, there is a encoding.setdefault("dtype", original_dtype) so this Variable object now has .encoding["dtype"] which is i8.
When I try to save this again, it tries maybe_encode_bool from step 1 again, but this time the function is bypassed because of step 3 above.
The dataset I write from step 4 now does not have the attribute identifying it as a bool, and so it's an i8 when I load it back.

I can think of a few fixes:

Drop the .encoding dict on load for bools. Presumably these .encodings are kept such that datasets attempts to preserve the compressor, chunk size, etc. of its source. However, given that .encoding seems to be dropped when I eg. do a .astype("bool") maybe this is OK for bools.
Set var.attrs["dtype"] = np.dtype("bool") on load. This would preserve what we'd get out of maybe_encode_bool. The challenge with this is that .attrs are not always preserved?
Change maybe_encode_bool to not ignore when .encoding["dtype"] exists. I suppose there would be a need to check that the compressor is still compatible, etc?

As a local fix while we consider these options, can you confirm that, as the docs state, the .encoding is only used for serializing and not deserializing arrays, and therefore if I drop the .encoding on my DataArrays as a temporary fix I wouldn't break anything (if I am ok with xarray not preserving my compressors, etc).

slevang · 2021-02-17T22:36:08Z

I ran into this as well with the basic netcdf backends:

import xarray as xr

ds = xr.Dataset(
    data_vars={"foo":(["x"], [False, True, False])},
    coords={"x": [1, 2, 3]},
)
ds.to_netcdf('test.nc')
ds = xr.load_dataset('test.nc')
print(ds.foo.dtype)
ds.to_netcdf('test.nc')
ds = xr.load_dataset('test.nc')
print(ds.foo.dtype)

Gives:

bool
int8

shoyer · 2021-02-17T23:04:03Z

There are at least two issues here:

For both netCDF and zarr, we should definitely "round trip" preserve dtypes. There are a few ways this might be done. I would have to look at a PR to say for sure, but from the looks of it "Change maybe_encode_bool to not ignore when .encoding["dtype"] exists" is the right fix.
Zarr supports bool data, so we don't need conversion from bool -> int8. We should just save data as bool.

andersy005 · 2021-04-20T01:01:46Z

I think this issue is related to #2937.

This is the same problem reported in #4386. All these issues (#2937, #4826, #4386, #5192) should be closable once there's a fix.

shaunc · 2021-04-20T05:21:40Z

Closed #5192 in favor of this as I think it's a duplicate. Just NB that it can occur with h5netcdf as well as netcdf4. (Thanks, @andersy005 )

kmuehlbauer · 2023-04-05T18:56:23Z

Please check #7720 if that fixes the conversion problems. Thanks.

JoerivanEngelen · 2023-04-06T09:57:38Z

@kmuehlbauer your fix fixes both the problems set up by @slevang and @amatsukawa on my laptop.

I furthermore tested the double roundtrip with load_dataset and to_netcdf, with the three possible engines, they all worked.

kmuehlbauer · 2023-04-06T09:59:42Z

@JoerivanEngelen Thanks for taking the time. Much appreciated.

dcherian added contrib-help-wanted topic-zarr Related to zarr storage library bug labels Apr 19, 2021

andersy005 mentioned this issue Apr 20, 2021

Writing np.bool8 data array reads in as int8 #5192

Closed

d70-t mentioned this issue Jun 18, 2021

Nan/ changed values in output when only reading data, saving and reading again #5490

Closed

dougrichardson mentioned this issue Aug 26, 2021

Writing and reopening introduces bad values #5739

Closed

dcherian mentioned this issue Apr 6, 2023

preserve boolean dtype in encoding #7720

Merged

3 tasks

Illviljan closed this as completed in #7720 Apr 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reading and writing a zarr dataset multiple times casts bools to int8 #4826

Reading and writing a zarr dataset multiple times casts bools to int8 #4826

amatsukawa commented Jan 19, 2021 •

edited

Loading

amatsukawa commented Jan 22, 2021

amatsukawa commented Jan 22, 2021 •

edited

Loading

amatsukawa commented Jan 25, 2021

slevang commented Feb 17, 2021

shoyer commented Feb 17, 2021

andersy005 commented Apr 20, 2021 •

edited

Loading

shaunc commented Apr 20, 2021 •

edited

Loading

kmuehlbauer commented Apr 5, 2023

JoerivanEngelen commented Apr 6, 2023

kmuehlbauer commented Apr 6, 2023

Reading and writing a zarr dataset multiple times casts bools to int8 #4826

Reading and writing a zarr dataset multiple times casts bools to int8 #4826

Comments

amatsukawa commented Jan 19, 2021 • edited Loading

amatsukawa commented Jan 22, 2021

amatsukawa commented Jan 22, 2021 • edited Loading

amatsukawa commented Jan 25, 2021

slevang commented Feb 17, 2021

shoyer commented Feb 17, 2021

andersy005 commented Apr 20, 2021 • edited Loading

shaunc commented Apr 20, 2021 • edited Loading

kmuehlbauer commented Apr 5, 2023

JoerivanEngelen commented Apr 6, 2023

kmuehlbauer commented Apr 6, 2023

amatsukawa commented Jan 19, 2021 •

edited

Loading

amatsukawa commented Jan 22, 2021 •

edited

Loading

andersy005 commented Apr 20, 2021 •

edited

Loading

shaunc commented Apr 20, 2021 •

edited

Loading