Send correct blocks. #53
base: master
Conversation
else:
    arr = self._ds.data
if isinstance(arr, np.ndarray):
    return arr
OK so, to be sure, this was the problem: we sent the whole array. I suppose we could have figured out which chunk to send at this point? But maybe it is indeed better to use Dask's internal logic to do it for us.
There might be a good semi-internal dask function to be used here, to extract a block from an array that is already in memory. I think we might be paying a significant tokenization cost by using dask.array(...) and should avoid that, but not sure.
That's a good point, actually. Dask will assign a token based on the content of the data, which would be slow for a big array, but you can specify the token so that it doesn't do this. I doubt there is a function for quite what is needed here.
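To make the tokenization point concrete, here is a minimal sketch (not code from this PR): by default dask.array.from_array hashes the array contents to build a deterministic task name, and the name argument lets you skip that.

```python
import numpy as np
import dask.array as da

arr = np.random.random((10_000, 10_000))

# Default: dask tokenizes (hashes) the array contents to build a
# deterministic task name, which can be slow for a large in-memory array.
lazy_hashed = da.from_array(arr, chunks=(1_000, 1_000))

# name=False (or an explicit string) skips the content hash; the trade-off
# is that identical arrays no longer share the same graph key.
lazy_fast = da.from_array(arr, chunks=(1_000, 1_000), name=False)
```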
# Make a dask.array so that we can return the appropriate block.
arr = dask.array.from_array(arr, chunks=self._chunks[variable])
# dask array
return arr.blocks[i].compute()
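For reference, a self-contained sketch of the approach this diff takes (the names get_block, block_index and chunks are illustrative, not the actual DataSourceMixin API): wrap a plain numpy array in dask so that dask's own block machinery selects the requested chunk.

```python
import numpy as np
import dask.array

def get_block(arr, block_index, chunks):
    # arr may be a numpy array or a dask array; block_index is a tuple of
    # block coordinates and chunks is the chunking the client expects.
    if isinstance(arr, np.ndarray):
        # Wrap the in-memory array so dask picks out the right slice for us.
        arr = dask.array.from_array(arr, chunks=chunks)
    # dask array: return only the requested block
    return arr.blocks[block_index].compute()

data = np.arange(200 * 200).reshape(200, 200)
block = get_block(data, (0, 1), chunks=(100, 100))
print(block.shape)  # (100, 100)
```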
Even if we have a dask array here, can we be sure that it has the same chunks as the one that the client will have? It looks like encoding a numpy-backed xarray.Dataset with to_zarr(...) prompts zarr to automatically choose a chunking for us. If we serialized a dask-backed xarray.Dataset, will to_zarr(...) respect the existing chunking? If yes, then we have no problem here.
to_zarr uses the current chunking; rechunking is optional.
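One way to check this (a hedged sketch, not from this PR): write a dask-backed dataset with to_zarr and inspect the chunks recorded in the store.

```python
import numpy as np
import xarray as xr
import zarr

# A dask-backed dataset with explicit chunking.
ds = xr.Dataset({"x": (("a", "b"), np.zeros((100, 100)))}).chunk({"a": 25, "b": 50})

# Writing with to_zarr: the zarr chunks follow the existing dask chunks unless
# you rechunk first or override them via encoding={"x": {"chunks": ...}}.
ds.to_zarr("example.zarr", mode="w")

store = zarr.open("example.zarr")
print(store["x"].chunks)  # expected: (25, 50)
```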
I haven't had a chance to investigate the failure.
The subclasses that override …
This fix affects access via the server.

The client side constructs an xarray.Dataset backed by dask arrays with some chunking. When it loads data, it requests partitions specified by a variable name and a block "part", as in ('x', 0, 0, 1).

If, on the server side, the DataSourceMixin subclass is holding a plain numpy array, not a dask array, then it ignores the "part" and always sends the whole array for the requested variable.
On the client side, this manifests as a mismatch between the dask array's shape (the shape of the data it expects) and the shape of the numpy array that it receives, leading to errors where the data that arrives is larger than the data expected.
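To spell out the mismatch (an illustrative sketch with made-up shapes, not the reporting code): the partition tuple names a variable and a block index, and the client expects exactly that block's shape back.

```python
import dask.array as da

# Client-side view of variable 'x': a 3-D dask array with some chunking.
client_x = da.zeros((10, 20, 30), chunks=(10, 10, 15))

# A partition request like ('x', 0, 0, 1) names the variable and a block index.
part = ('x', 0, 0, 1)
variable, block_index = part[0], part[1:]

# The client expects the server to return data of exactly this block's shape...
print(client_x.blocks[block_index].shape)  # (10, 10, 15)

# ...so a server that ignores the index and sends the whole variable returns
# an array of shape (10, 20, 30), which is larger than expected.
print(client_x.shape)  # (10, 20, 30)
```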
I expect it's worth refining this to make it more efficient before merging, and
it needs a test. This is just a request for comments and suggestions.