Upstream uarray integration #241
Sounds like a reasonable request.

What I'd suggest, more or less, is that you define your own multimethods. There are docs on doing that here. What can then be done on the side of the backend providers is to provide a way to register specialized implementations. The reason I don't suggest otherwise is: what if the user does the following in their code?

```python
with ua.set_backend(CupyBackend), ua.set_backend(SKAllelDaskBackend):
    pass
```

It'd almost certainly be bad for performance. The other (in my mind, suboptimal) approach is that we write something like a `determine_backend` check into each function.
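For reference, defining such a multimethod per the uarray docs looks roughly like the sketch below; the `genetics_algo_1` name, its signature, and the `genetics` domain are placeholders rather than an existing API:

```python
import numpy as np
import uarray as ua

def _genetics_algo_1_argreplacer(args, kwargs, dispatchables):
    # Map the (possibly coerced) dispatchables back into the call signature.
    def genetics_algo_1(arr, mask):
        return (dispatchables[0], dispatchables[1]), {}
    return genetics_algo_1(*args, **kwargs)

@ua.create_multimethod(_genetics_algo_1_argreplacer, domain="genetics")
def genetics_algo_1(arr, mask):
    # The argument extractor: mark which arguments backends may dispatch on.
    return (ua.Dispatchable(arr, np.ndarray), ua.Dispatchable(mask, np.ndarray))
```

Backends then supply implementations for these multimethods through the `__ua_function__` protocol.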
It is too early for that, currently. One I know of is […]. Hopefully I'll be able to say more soon, but as of now, the only use is as a […].
If you have any follow-up questions, don't be afraid to ask.
Nice! I'll check it out.
I think I see what you mean there, but I believe I'm looking for something else. For example, I'd like a user to be able to do this:

```python
import genetics_api

ds = genetics_api.read_genetic_data_file(...)  # -> xr.Dataset

# I think this needs to activate our SKAllelDaskBackend backend and the one
# used by unumpy -- it may be a better idea to force use of
# ua.set_global_backend / ua.set_backend directly, but I'm not sure.
genetics_api.set_global_backend('dask')
# or: with genetics_api.set_backend('dask'):

# This function would now do stuff using the unumpy API in a lot of places,
# but it will also need to use the Dask API directly for things like `map_overlap`
# (we don't want to override any of the functionality already in unumpy.DaskBackend).
ds_res = genetics_api.run_analysis(ds)
```

What I hope that makes more clear is that we wouldn't want to add implementations of existing things in the Dask API. We know that when the user wants "Dask mode", all of our methods will need to use things that exist only in the Dask API, alongside many that can easily be dispatched to Dask via unumpy. I'm thinking something like:

```python
# In genetics_api.py:
def run_analysis(ds: xr.Dataset) -> xr.Dataset:
    new_var_1 = genetics_algo_1(ds['data_var'])
    new_var_2 = ...  # some other genetics function
    return xr.Dataset({'var1': new_var_1, ...})

# This would have different backends and work on array arguments.
genetics_algo_1 = ua.generate_multimethod(...)

# Somewhere as part of the SKAllelDaskBackend (downstream from
# xr.DataArray -> da.Array coercion):
def genetics_algo_1(arr: DaskArray, mask: DuckArray) -> DaskArray:
    import dask.array as da
    import unumpy

    # Do stuff you can only do with the Dask API and that is super
    # critical for performance.
    arr = arr.rechunk(...)
    res = da.overlap.map_overlap(arr, ...)

    # Also do stuff that would coerce to Dask and return Dask results,
    # but is covered by the np API.
    # * This could be a Dask * NumPy multiplication because mask is only 1D,
    #   but I definitely want the result to remain a Dask array (so I need
    #   unumpy.DaskBackend on).
    res = unumpy.multiply(res, mask)
    return res
```

Since we would already need to be using the Dask API directly, it seems like it would be a good idea to force use of the unumpy Dask backend there as well, no? In that example above, I suppose it would make sense to also keep using the Dask API for things like the […].

Thanks for the time @hameerabbasi.
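For concreteness, I'd picture `genetics_api.set_global_backend` as a thin wrapper over uarray's documented `ua.set_global_backend` -- a rough sketch, where `GeneticsDaskBackend` is a hypothetical backend of ours defined elsewhere:

```python
import uarray as ua
import unumpy

def set_global_backend(name):
    # Hypothetical wrapper: activate both unumpy's Dask backend (for generic
    # numpy-API calls) and our own domain backend (for genetics multimethods).
    if name == 'dask':
        ua.set_global_backend(unumpy.DaskBackend)
        ua.set_global_backend(GeneticsDaskBackend)  # hypothetical, defined elsewhere
    else:
        raise ValueError(f"Unknown backend: {name!r}")
```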
@eric-czech Yes, perhaps I can explain a bit more concretely what I meant. You would do the following, ideally, in this case (note the `domain` argument in the snippet below):

```python
# Note the "numpy." prefix -- it means any NumPy backend would also
# dispatch to your function.
genetics_algo_1 = ua.generate_multimethod(..., domain="numpy.sk_allel")

def genetics_algo_1_dask(arr, mask):
    import dask.array as da
    import unumpy as np

    # Do stuff you can only do with Dask.
    arr = arr.rechunk(...)
    res = da.overlap.map_overlap(arr, ...)

    # Do generic stuff.
    return np.multiply(res, mask)

# This would say to the Dask backend: "I have my own API, but I'd like to
# specialize a function in it for Dask". Dask would need to provide this,
# but it's easy to do.
DaskBackend.register(genetics_algo_1, genetics_algo_1_dask)
```

Whereas, in the second case that I mentioned, it's a bit different:

```python
def genetics_algo_1(arr, mask):
    import unumpy as np

    # `determine_backend` is an upcoming feature.
    if isinstance(np.determine_backend(arr), DaskBackend):
        return genetics_algo_1_dask(arr, mask)

def genetics_algo_1_dask(arr, mask):
    import dask.array as da
    import unumpy as np

    # Do stuff you can only do with Dask.
    arr = arr.rechunk(...)
    res = da.overlap.map_overlap(arr, ...)

    # Do generic stuff.
    return np.multiply(res, mask)
```
This way, just setting the […].
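Since `register` isn't something uarray backends expose today, here is a minimal sketch of how a backend could support such per-multimethod specializations, using only the documented `__ua_domain__`/`__ua_function__` backend protocol; the `_specializations` registry is an assumption:

```python
class DaskBackend:
    __ua_domain__ = "numpy"
    _specializations = {}  # multimethod -> specialized implementation

    @classmethod
    def register(cls, multimethod, implementation):
        # Record a hand-written specialization for one multimethod.
        cls._specializations[multimethod] = implementation

    @classmethod
    def __ua_function__(cls, method, args, kwargs):
        impl = cls._specializations.get(method)
        if impl is not None:
            return impl(*args, **kwargs)
        # Returning NotImplemented lets uarray fall back to other backends
        # (or to this backend's normal implementations, omitted here).
        return NotImplemented
```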
Ah, I see now, thanks! I'll go that route instead.
I noticed a thread between @hameerabbasi and some Xarray folks in pydata/xarray#1938 (comment) and was curious if you guys would be willing to talk a little bit about the state of uarray integration in other PyData projects. It looks like this was all pretty nascent then, and I was disappointed to see that there aren't any more open issues about working it into Xarray (or so it seems).

I like the idea a lot and I'm trying to understand how best to think about its usage by scikit developers. More specifically, we work in statistical genetics and are coordinating with a few other developers in the space to help think about the next generation of tools like scikit-allel. A big question for me in the context of uarray is how the backends would interoperate between projects if they attempt to address similar problems.

For example, a minority of the functionality we need will be covered directly by the numpy API (e.g. row-wise/column-wise summary statistics), but the majority of it, or at least the harder, more interesting functionality, will involve fairly bespoke algorithms that are specific to the domain and can only dispatch partly through a numpy API, via something like unumpy. What we will need is a way to define how these algorithms work based on the underlying array type, and we know that the implementations will be quite different when using Dask, CuPy, or in-memory arrays. I imagine we will need our own `DaskBackend`, `CuPyBackend`, etc. implementations, and though I noticed several warnings from you guys on not building backends that depend on other backends, this seems like an exception. In other words, our `GeneticsDaskBackend` would need to force use of the unumpy `DaskBackend` (a rough sketch of what I mean follows below). Did you guys envision this working differently, or am I on the right track?

I was also wondering if you knew of projects specific to some scientific domain that build on uarray/unumpy to support dispatching. I suppose it's early for that, but I wanted to ask because it would be great to have a model to follow if one exists, rather than working through some prototypes myself.
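For concreteness, here is the rough sketch referenced above; `GeneticsDaskBackend`, its `genetics` domain, and the `_implementations` table are all hypothetical, and only `ua.set_backend` and the `__ua_domain__`/`__ua_function__` protocol are real uarray features:

```python
import uarray as ua
import unumpy

class GeneticsDaskBackend:
    """Hypothetical domain backend that forces unumpy's DaskBackend."""

    __ua_domain__ = "genetics"
    _implementations = {}  # multimethod -> Dask-specific implementation

    @classmethod
    def __ua_function__(cls, method, args, kwargs):
        impl = cls._implementations.get(method)
        if impl is None:
            return NotImplemented
        # Any generic unumpy call inside `impl` now dispatches to Dask too,
        # so intermediate results stay as Dask arrays.
        with ua.set_backend(unumpy.DaskBackend):
            return impl(*args, **kwargs)
```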
Thanks!