Skip to content

Zarr chunks would overlap multiple dask chunks error #5286

Closed
@eric-czech

Description

@eric-czech

Would it be possible to get an explanation on how this situation results in a zarr chunk overlapping multiple dask chunks?

This code below is generating an array with 2 chunks, selecting one row from each chunk, and then writing that resulting two row array back to zarr. I don't see how it's possible in this case for one zarr chunk to correspond to different dask chunks. There are clearly two resulting dask chunks, two input zarr chunks, and a correspondence between them that should be 1 to 1 ... what does this error message really mean then?

import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=(('a', 'b'), da.ones(shape=(10, 10), chunks=(5, 10))),
)).assign(a=list(range(10)))
ds
# <xarray.Dataset>
# Dimensions:  (a: 10, b: 10)
# Coordinates:
#   * a        (a) int64 0 1 2 3 4 5 6 7 8 9
# Dimensions without coordinates: b
# Data variables:
#     x        (a, b) float64 dask.array<chunksize=(5, 10), meta=np.ndarray>

# Write the dataset out
!rm -rf /tmp/test.zarr
ds.to_zarr('/tmp/test.zarr')

# Read it back in, subset to 1 record in two different chunks (two rows total), write back out
!rm -rf /tmp/test2.zarr
xr.open_zarr('/tmp/test.zarr').sel(a=[0, 11]).to_zarr('/tmp/test2.zarr')
# NotImplementedError: Specified zarr chunks encoding['chunks']=(5, 10) for variable named 'x' would overlap multiple dask chunks ((1, 1), (10,)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Also what is the difference between "deleting or modifying encoding['chunks']"
and "specify safe_chunks=False"? That wasn't clear to me in #5056.

Lastly and most importantly, can data be corrupted when using parallel zarr writes and just deleting encoding['chunks'] in these situations?

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.18.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.1
conda: None
pytest: 6.2.4
IPython: 7.23.1
sphinx: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions