Zarr chunks would overlap multiple dask chunks error #5286

Closed
eric-czech opened this issue May 10, 2021 · 3 comments


@eric-czech

Would it be possible to get an explanation of how this situation results in a zarr chunk overlapping multiple dask chunks?

The code below generates an array with two chunks, selects one row from each chunk, and then writes the resulting two-row array back to zarr. I don't see how it's possible in this case for one zarr chunk to correspond to different dask chunks. There are clearly two resulting dask chunks, two input zarr chunks, and a correspondence between them that should be one-to-one ... so what does this error message really mean?

import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=(('a', 'b'), da.ones(shape=(10, 10), chunks=(5, 10))),
)).assign(a=list(range(10)))
ds
# <xarray.Dataset>
# Dimensions:  (a: 10, b: 10)
# Coordinates:
#   * a        (a) int64 0 1 2 3 4 5 6 7 8 9
# Dimensions without coordinates: b
# Data variables:
#     x        (a, b) float64 dask.array<chunksize=(5, 10), meta=np.ndarray>

# Write the dataset out
!rm -rf /tmp/test.zarr
ds.to_zarr('/tmp/test.zarr')

# Read it back in, subset to 1 record in two different chunks (two rows total), write back out
!rm -rf /tmp/test2.zarr
xr.open_zarr('/tmp/test.zarr').sel(a=[0, 6]).to_zarr('/tmp/test2.zarr')
# NotImplementedError: Specified zarr chunks encoding['chunks']=(5, 10) for variable named 'x' would overlap multiple dask chunks ((1, 1), (10,)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Also, what is the difference between "deleting or modifying encoding['chunks']" and "specifying safe_chunks=False"? That wasn't clear to me from #5056.

Lastly, and most importantly, can data be corrupted by parallel zarr writes if encoding['chunks'] is simply deleted in these situations?

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.18.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.1
conda: None
pytest: 6.2.4
IPython: 7.23.1
sphinx: None

@shoyer
Member

shoyer commented May 11, 2021

The short answer is that you should delete encoding['chunks'] here for now. This operation is totally safe, but Xarray is mistakenly trying to re-use the chunks from the source dataset when writing the data to disk.

The long answer is that we really should fix this in Xarray. See #5219 for discussion.
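
Concretely, the encoding-deletion workaround would look something like this (a sketch against the example above, not a tested recipe):

import xarray as xr

ds = xr.open_zarr('/tmp/test.zarr').sel(a=[0, 6])

# Drop the chunk layout inherited from the source store so that to_zarr
# derives the on-disk zarr chunks from the current dask chunks instead.
ds['x'].encoding.pop('chunks', None)

# mode='w' overwrites any partial output left behind by the failed attempt.
ds.to_zarr('/tmp/test2.zarr', mode='w')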

@eric-czech
Author

Thanks @shoyer, good to know!

@max-sixty
Collaborator

I'll close this now that it's linked to #5219. Thanks @eric-czech & @shoyer
