Description
Would it be possible to get an explanation on how this situation results in a zarr chunk overlapping multiple dask chunks?
This code below is generating an array with 2 chunks, selecting one row from each chunk, and then writing that resulting two row array back to zarr. I don't see how it's possible in this case for one zarr chunk to correspond to different dask chunks. There are clearly two resulting dask chunks, two input zarr chunks, and a correspondence between them that should be 1 to 1 ... what does this error message really mean then?
import xarray as xr
import dask.array as da
ds = xr.Dataset(dict(
x=(('a', 'b'), da.ones(shape=(10, 10), chunks=(5, 10))),
)).assign(a=list(range(10)))
ds
# <xarray.Dataset>
# Dimensions: (a: 10, b: 10)
# Coordinates:
# * a (a) int64 0 1 2 3 4 5 6 7 8 9
# Dimensions without coordinates: b
# Data variables:
# x (a, b) float64 dask.array<chunksize=(5, 10), meta=np.ndarray>
# Write the dataset out
!rm -rf /tmp/test.zarr
ds.to_zarr('/tmp/test.zarr')
# Read it back in, subset to 1 record in two different chunks (two rows total), write back out
!rm -rf /tmp/test2.zarr
xr.open_zarr('/tmp/test.zarr').sel(a=[0, 11]).to_zarr('/tmp/test2.zarr')
# NotImplementedError: Specified zarr chunks encoding['chunks']=(5, 10) for variable named 'x' would overlap multiple dask chunks ((1, 1), (10,)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
Also what is the difference between "deleting or modifying encoding['chunks']
"
and "specify safe_chunks=False
"? That wasn't clear to me in #5056.
Lastly and most importantly, can data be corrupted when using parallel zarr writes and just deleting encoding['chunks']
in these situations?
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.18.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.1
conda: None
pytest: 6.2.4
IPython: 7.23.1
sphinx: None