Would it be possible to get an explanation of how this situation results in a zarr chunk overlapping multiple dask chunks?

The code below generates an array with 2 chunks, selects one row from each chunk, and then writes the resulting two-row array back to zarr. I don't see how it's possible in this case for one zarr chunk to correspond to different dask chunks: there are clearly two resulting dask chunks, two input zarr chunks, and a correspondence between them that should be 1-to-1. What does this error message really mean, then?
```python
import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=(('a', 'b'), da.ones(shape=(10, 10), chunks=(5, 10))),
)).assign(a=list(range(10)))
ds
# <xarray.Dataset>
# Dimensions:  (a: 10, b: 10)
# Coordinates:
#   * a        (a) int64 0 1 2 3 4 5 6 7 8 9
# Dimensions without coordinates: b
# Data variables:
#     x        (a, b) float64 dask.array<chunksize=(5, 10), meta=np.ndarray>

# Write the dataset out
!rm -rf /tmp/test.zarr
ds.to_zarr('/tmp/test.zarr')

# Read it back in, subset to 1 record in each of the two chunks (two rows total), write back out
!rm -rf /tmp/test2.zarr
xr.open_zarr('/tmp/test.zarr').sel(a=[0, 5]).to_zarr('/tmp/test2.zarr')
# NotImplementedError: Specified zarr chunks encoding['chunks']=(5, 10) for variable named 'x'
# would overlap multiple dask chunks ((1, 1), (10,)). Writing this array in parallel with dask
# could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or
# modifying `encoding['chunks']`, or specify `safe_chunks=False`.
```
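To make the error concrete, here is a minimal sketch in plain Python (the helper names are mine, not xarray's) of the overlap the check is looking for: map each dask chunk along the `a` axis to the set of zarr chunks it would write into.

```python
# Plain-Python sketch (hypothetical helper names, not xarray API) of the
# overlap the error message is describing along the 'a' axis.

def dask_chunk_bounds(chunks):
    """Turn dask chunk sizes, e.g. (1, 1), into half-open index ranges."""
    bounds, start = [], 0
    for size in chunks:
        bounds.append((start, start + size))
        start += size
    return bounds

def zarr_chunks_touched(dask_bounds, zarr_chunk_len):
    """For each dask chunk, the set of zarr chunk indices it writes into."""
    return [set(range(lo // zarr_chunk_len, (hi - 1) // zarr_chunk_len + 1))
            for lo, hi in dask_bounds]

# The selected array has dask chunks (1, 1) along 'a', while the inherited
# encoding still says the zarr chunk length along 'a' is 5.
touched = zarr_chunks_touched(dask_chunk_bounds((1, 1)), 5)
print(touched)  # [{0}, {0}] -> both dask chunks land in zarr chunk 0
```

Both dask chunks map to zarr chunk 0: two dask tasks writing in parallel would each read-modify-write the same zarr chunk, which is the corruption risk the error guards against, even though the chunk counts happen to match 2-to-2 here.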
Also, what is the difference between "deleting or modifying `encoding['chunks']`" and "specify `safe_chunks=False`"? That wasn't clear to me in #5056.

Lastly, and most importantly, can data be corrupted when using parallel zarr writes and simply deleting `encoding['chunks']` in these situations?
The short answer is that you should delete `encoding['chunks']` here for now. This operation is totally safe, but Xarray is mistakenly trying to re-use the chunks from the source dataset when writing the data to disk.

The long answer is that we really should fix this in Xarray. See #5219 for discussion.
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.18.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.1
conda: None
pytest: 6.2.4
IPython: 7.23.1
sphinx: None