Send correct blocks. #53

Open

danielballan wants to merge 1 commit into master
Conversation

danielballan (Member) commented Sep 17, 2019

This fix affects access via the server.

The client side constructs an xarray.Dataset backed by dask arrays with some
chunking. When it loads data, it requests partitions specified by a variable
name and a block "part", as in ('x', 0, 0, 1).
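
To illustrate the key layout (a hypothetical sketch; only the tuple shape is taken from the example above, and ds stands in for the client's dataset):

    # The first element names the variable; the rest index the block grid.
    key = ('x', 0, 0, 1)
    variable, part = key[0], key[1:]   # 'x', (0, 0, 1)
    # The server should return only that block of the backing dask array:
    #     ds[variable].data.blocks[part]
    # rather than the whole array for 'x'.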

If, on the server side, the DataSourceMixin subclass is holding a plain
numpy array, not a dask array, then it ignores the "part" and always sends
the whole array for the requested variable.

On the client side, this manifests as a mismatch between the dask array's shape
(the shape of the data it expects) and the shape of the numpy array that it
receives, leading to errors like

    ValueError: replacement data must match the Variable's shape


> /sdcc/u/dallan/venv/test-databroker/lib64/python3.6/site-packages/xarray/core/variable.py(301)data()
    299         if data.shape != self.shape:
    300             raise ValueError(
--> 301                 "replacement data must match the Variable's shape")
    302         self._data = data
    303 

ipdb>  data.shape
(164, 1, 4000, 3840)
ipdb>  self.shape
(41, 1, 1000, 960)

where the data that arrives is the whole (164, 1, 4000, 3840) array rather than the single (41, 1, 1000, 960) block that was expected.

I expect it's worth refining this to make it more efficient before merging, and
it needs a test. This is just a request for comments and suggestions.
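
In outline, the idea is to let dask pick the slice. Here is a minimal sketch under assumed names (get_block, arr, chunks, and part are illustrative, not the PR's exact code):

    import dask.array
    import numpy as np

    def get_block(arr, chunks, part):
        # If the server holds a plain numpy array, wrap it in a dask array
        # using the chunking the client agreed on, so that dask's own block
        # logic selects the requested slice.
        if isinstance(arr, np.ndarray):
            arr = dask.array.from_array(arr, chunks=chunks)
        # arr.blocks[part] is the single block at that grid position.
        return arr.blocks[part].compute()

    # For example, get_block(np.arange(16).reshape(4, 4), (2, 2), (0, 1))
    # returns the 2x2 numpy block in the top-right of the chunk grid.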

    else:
        arr = self._ds.data
        if isinstance(arr, np.ndarray):
            return arr
martindurant (Member):

OK so, to be sure, this was the problem: we sent the whole array. I suppose we could have figured out which chunk to send at this point? But maybe it is indeed better to use dask's internal logic to do it for us.

danielballan (Member Author):

There might be a good semi-internal dask function to use here to extract a block from an array that is already in memory. I think we might be paying a significant tokenization cost by using dask.array.from_array(...) and should avoid that, but I'm not sure.

martindurant (Member):

That's a good point, actually. Dask will assign a token based on the content of the data, which would be slow for a big array, but you can specify the token so that it doesn't do this. I doubt there is a function for quite what is needed here.
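
Concretely (a sketch of that point, not code from this PR; big_numpy_array and chunks are placeholders): from_array hashes the array's contents to build a task name unless one is supplied.

    # Default: dask tokenizes the data, which is slow for large arrays.
    arr = dask.array.from_array(big_numpy_array, chunks=chunks)

    # An explicit name (or name=False for a random one) skips the hashing.
    # The name must uniquely identify the data, or caching can go wrong.
    arr = dask.array.from_array(big_numpy_array, chunks=chunks,
                                name='partition-x-v1')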

    # dask array
    return arr.blocks[i].compute()

    # Make a dask.array so that we can return the appropriate block.
    arr = dask.array.from_array(arr, chunks=self._chunks[variable])
danielballan (Member Author):

Even if we have a dask array here, can we be sure that it has the same chunks as the one that the client will have? It looks like encoding a numpy-backed xarray.Dataset with to_zarr(...) prompts zarr to choose a chunking for us automatically. If we serialize a dask-backed xarray.Dataset, will to_zarr(...) respect the existing chunking? If yes, then we have no problem here.

martindurant (Member):

to_zarr uses the current chunking; rechunking is optional.
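
A quick round-trip check of that behavior (illustrative only, assuming zarr is installed; not part of this PR):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({'x': (('a', 'b'), np.ones((4, 6)))}).chunk({'a': 2, 'b': 3})
    ds.to_zarr('example.zarr')  # zarr chunks follow the existing dask chunks
    roundtrip = xr.open_zarr('example.zarr')
    print(roundtrip['x'].data.chunks)  # ((2, 2), (3, 3))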

martindurant (Member):

I haven't had a chance to investigate the failure.

danielballan (Member Author):

The subclasses that override _get_schema from the base class DataSourceMixin do so without calling super(), so self._chunks is never defined. It looks like there is a fair amount of copy-paste between the base class and its subclasses, so the easiest fix might be to remove that duplication and use super(). I can't get to this today, but can revisit later this week.
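
The shape of that fix would be something like this (a hypothetical sketch; MyFancySource stands in for the real subclasses in the repo):

    class MyFancySource(DataSourceMixin):
        def _get_schema(self):
            # Defer to the base implementation so shared state such as
            # self._chunks is defined, then add subclass-specific fields.
            schema = super()._get_schema()
            # ...subclass-specific additions to the schema go here...
            return schema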
