
Support distribution of xarray #1031

Closed
ClaudiaComito opened this issue Sep 28, 2022 · 14 comments

Comments

@ClaudiaComito
Contributor

ClaudiaComito commented Sep 28, 2022

See https://docs.xarray.dev/en/stable/

If I understand correctly, an xarray object is made up of the actual data array (np.ndarray) and 1-D coordinate arrays (dictionaries?) that map data dimensions and indices to meaningful physical quantities.

For example, if the xarray object holds temperature data indexed by date, users will be able to do

mean_temp = xarray['2010_01_01':'2010_12_31'].mean()

Feature functionality

Enable distribution of xarray objects, allow named dimensions, and keep track of the coordinate arrays, one of which will be distributed.

Example:

ht_xarray = ht.array(xarray, split="date")
ht_mean_temp = ht_xarray['2010_01_01':'2010_12_31'].mean() # memory-distributed operation
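
For comparison, a minimal single-node sketch in plain xarray of the selection and reduction this feature would distribute (the data below is made up purely for illustration):

import numpy as np
import pandas as pd
import xarray as xr

# Toy single-node DataArray: daily temperatures indexed by a "date" coordinate.
dates = pd.date_range("2010-01-01", "2010-12-31", freq="D")
temps = xr.DataArray(
    np.random.default_rng(0).normal(15.0, 5.0, size=len(dates)),
    dims=("date",),
    coords={"date": dates},
    name="temperature",
)

# Label-based range selection followed by a reduction; the proposed
# ht.array(..., split="date") object would perform the same operation
# with the "date" axis chunked across processes.
mean_temp = temps.sel(date=slice("2010-01-01", "2010-12-31")).mean()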

Check out PyTorch's named tensors functionality.

Additional context
Initiating collaboration with N. Koldunov @koldunovn at Alfred Wegener Institute (Helmholtz centre for polar and marine research).
Also interesting for @kleinert-f, @ben-bou
Tagging @bhagemeier for help with implementation.

@koldunovn

Coordinates are not necessarily 1-D arrays. For example, for curvilinear grids covering the Earth's surface, one has to describe the positions of grid points (e.g. their centers) as 2-D arrays (one for lon and one for lat), so the coordinate arrays will then be 2-D.
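
For illustration, a minimal xarray sketch of such 2-D coordinates (values made up):

import numpy as np
import xarray as xr

# Toy curvilinear grid: lon and lat both depend on the two grid dimensions (y, x),
# so the coordinate arrays are 2-D.
ny, nx = 3, 4
lon2d = np.linspace(-10.0, 10.0, ny * nx).reshape(ny, nx)
lat2d = np.linspace(40.0, 60.0, ny * nx).reshape(ny, nx)

sst = xr.DataArray(
    np.zeros((ny, nx)),
    dims=("y", "x"),
    coords={"lon": (("y", "x"), lon2d), "lat": (("y", "x"), lat2d)},
    name="sst",
)
# sst.coords["lon"] and sst.coords["lat"] are 2-D (non-dimension) coordinates.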

@ClaudiaComito
Contributor Author

Got it, thanks @koldunovn, edited original post.

@ClaudiaComito
Contributor Author

ClaudiaComito commented Sep 28, 2022

Implementation discussed during the devs meeting on Sept 28. Preliminary scheme:

  • introduce a dndarray.coordinates attribute (default=None)
  • ht.array() to accept xarray input, interpret a string split, populate dndarray.coordinates, and chunk the data AND the split coordinate
  • every operation needs to check whether coordinates are present (if coordinates is None: or similar); quantify and minimize the performance hit (see the sketch below)
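
A minimal sketch of that guard pattern, using a toy stand-in class rather than the real DNDarray (the coordinates attribute and everything below are hypothetical, not existing Heat API):

import numpy as np

# Illustrative only: the "check for coordinates" guard proposed above, using a
# toy stand-in class instead of the real DNDarray. The `coordinates` attribute
# is hypothetical and does not exist in Heat yet.
class ToyArray:
    def __init__(self, data, dims=(), coordinates=None):
        self.data = data                # stand-in for the DNDarray payload
        self.dims = dims                # dimension names, e.g. ("station", "date")
        self.coordinates = coordinates  # dict: dim name -> coordinate array, or None

def mean(x, dim=None):
    axis = None if dim is None else x.dims.index(dim)
    result = x.data.mean(axis=axis)
    if x.coordinates is None:
        # fast path: no coordinate bookkeeping, behaves like today's DNDarray ops
        return ToyArray(result)
    # coordinate-aware path: drop the reduced dimension, keep the others
    kept_dims = tuple(d for d in x.dims if d != dim)
    kept_coords = {d: c for d, c in x.coordinates.items() if d != dim}
    return ToyArray(result, kept_dims, kept_coords)

toy = ToyArray(
    np.arange(6.0).reshape(2, 3),
    dims=("station", "date"),
    coordinates={"station": np.array([101, 102]), "date": np.arange(3)},
)
print(mean(toy, dim="date").coordinates)  # {'station': array([101, 102])}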

@Markus-Goetz it would be good if you chimed in.

@ClaudiaComito added the enhancement and dndarray labels Sep 28, 2022
@ClaudiaComito
Contributor Author

ClaudiaComito commented Oct 18, 2022

tagging @Mystic-Slice as they have shown interest in working on this 👋

@ClaudiaComito
Contributor Author

Interesting discussion over at array-api about single-node parallelism, but xarray is also mentioned in a distributed-execution context.

@TomNicholas you might be interested in this effort as well.

@Mystic-Slice
Collaborator

Mystic-Slice commented Apr 4, 2023

Hi everyone!
I have been thinking about this project for some time now and here are my initial thoughts.

DXarray:

Xarray uses np.ndarray for its data storage, and since NumPy does not support computation on GPUs, xarray objects cannot be directly chunked and stored within the new DXarray object.

There are two solutions:

  1. Using CuPy arrays within xarray (see cupy-xarray).
    • As far as I know, this should be very simple and straightforward to use.
    • This sounds like the best way to get most, if not all, of xarray's functionality without having to re-implement it.
    • My only problem with CuPy is that it is strictly GPU-based. It will limit flexibility if we don't switch between NumPy and CuPy throughout the code. (But maybe cupy-xarray handles that internally and we don't have to worry about it?)
  2. Implement our own version of xarray using torch.Tensors.
    • Except for a few functionalities (like array broadcasting, grouping and data alignment), everything else should be easy enough to implement by just translating the labels to the corresponding indices and redirecting to the existing DNDarray methods.
    • Example:
    >>> import numpy as np
    >>> import pandas as pd
    >>> import xarray as xr
    >>> arr = xr.DataArray(
    ...     data=np.arange(48).reshape(4, 2, 6),
    ...     dims=("u", "v", "time"),
    ...     coords={
    ...         "u": [-3.2, 2.1, 5.3, 6.5],
    ...         "v": [-1, 2.6],
    ...         "time": pd.date_range("2009-01-05", periods=6, freq="M"),
    ...     },
    ... )
    
    # Single select
    >>> arr.sel(u=5.3, time="2009-04-30") # array([27, 33])
        # Translated to
    >>> arr[2, :, 3] # array([27, 33])
    
    # Multi select
    >>> arr.sel(u=[-3.2, 6.5], time=slice("2009-02-28", "2009-03-31")) # array([[[ 1,  2], [ 7,  8]], [[37, 38], [43, 44]]])
        # Translated to
    >>> arr[[0, 3], :, slice(1, 2 + 1)] # array([[[ 1,  2], [ 7,  8]], [[37, 38], [43, 44]]])
    • But this means we will have to implement any new feature/functionality that xarray releases in the future. (A sketch of the label-to-index translation idea follows below.)
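
A minimal sketch of the label-to-index translation idea (purely illustrative; the helper and the toy coordinates are made up, and nothing here is existing Heat or xarray API):

import numpy as np

# Purely illustrative: translate label-based selections (like .sel) into
# positional indices that can be passed on to existing array indexing.
def labels_to_indices(dims, coords, selection):
    """dims: tuple of dim names; coords: dict dim -> list of labels;
    selection: dict dim -> label, list of labels, or slice of labels."""
    index = [slice(None)] * len(dims)
    for dim, sel in selection.items():
        labels = list(coords[dim])
        axis = dims.index(dim)
        if isinstance(sel, slice):
            start = labels.index(sel.start) if sel.start is not None else None
            stop = labels.index(sel.stop) + 1 if sel.stop is not None else None
            index[axis] = slice(start, stop)  # label slices are inclusive
        elif isinstance(sel, (list, tuple, np.ndarray)):
            index[axis] = [labels.index(s) for s in sel]
        else:
            index[axis] = labels.index(sel)
    return tuple(index)

dims = ("u", "v", "time")
coords = {"u": [-3.2, 2.1, 5.3, 6.5], "v": [-1, 2.6], "time": list(range(6))}
data = np.arange(48).reshape(4, 2, 6)

# equivalent of arr.sel(u=5.3, time=3) on the toy coordinates above
print(data[labels_to_indices(dims, coords, {"u": 5.3, "time": 3})])  # [27 33]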

I would love to know what you all think. Any suggestion is highly appreciated!

@TomNicholas

Hi everyone, thanks for tagging me here!

I'm a bit unclear what you would like to achieve here - are you talking about:

(a) making DNDArray wrap xarray objects?
(b) wrapping DNDArray inside xarray objects? (related Q: does DNDArray obey the python array API standard?)
(c) wrapping DNDArray inside xarray objects but also with an understanding of chunks? (similar to dask - this is what the issue you linked to is discussing @ClaudiaComito)
(d) just creating DNDArray objects from xarray objects?

Some miscellaneous comments:

It will limit flexibility if we don't switch between numpy and cupy throughout the code.

This is the point of the get_array_module pattern - you should then not have to have separate code paths for working with GPU vs CPU data.
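
For illustration, a minimal sketch of that pattern (cupy.get_array_module is real CuPy API; the normalize function is made up):

import numpy as np

try:
    import cupy as cp
except ImportError:  # CuPy not installed: CPU-only fallback
    cp = None

def normalize(x):
    # Pick the module (numpy or cupy) matching where `x` lives, so the same
    # code path handles CPU and GPU arrays.
    xp = cp.get_array_module(x) if cp is not None else np
    return (x - xp.mean(x)) / xp.std(x)

print(normalize(np.arange(5.0)))  # runs on the CPU
# with CuPy installed, normalize(cp.arange(5.0)) runs on the GPU unchanged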

Implement our own version of xarray using torch.Tensors

There have been some discussions of wrapping pytorch tensors in xarray. We also have an ongoing project to publicly expose xarray.Variable, a lower-level abstraction better-suited for use as a building block of efforts to make named tensors like torch.Tensor. In a better-coordinated world this would have already happened, and pytorch would now have an optional dependency on xarray. 🤷‍♂️

@Mystic-Slice
Collaborator

Mystic-Slice commented Apr 7, 2023

(Quoting @TomNicholas's comment above in full.)

Hi @TomNicholas!
We want to create a distributed Xarray: an object that can be chunked and distributed across different processes. Most users would also prefer to use GPU-based computation.
So I guess we are trying to achieve option (a), or create a new distributed data structure of our own that mimics Xarray (sounds like a lot of work 😅).

This should allow users to manipulate huge amounts of data while still working with the more intuitive Xarray API.

@ClaudiaComito should be able to shed more light on this.
Thanks for taking the time to speak with us!

@koldunovn

Would the experience of xarray with Dask help with creating the data structure you want? There are also implementations with GPU support: https://xarray.dev/blog/xarray-kvikio

@mrfh92
Collaborator

mrfh92 commented Jul 6, 2023

Meanwhile I have implemented a basic version of DXarray in the corresponding branch. In the near future I plan to continue with:

  • DXarray.to_xarray(): convert a distributed DXarray to an xarray.DataArray (similar to DNDarray.to_numpy())
  • resplit for DXarray
  • from_xarray(): convert a non-distributed xarray.DataArray to a distributed DXarray
  • parallel I/O for DXarray (depending on how this can be done for an xarray-like structure)
  • arithmetic and statistical operations that act on the value array only (such as mean, etc.)

Things that might get a bit more complicated:

  • pandas DataFrames as coordinates
  • masked value arrays
  • operations that involve both values and coordinates and thus cannot be copied directly from the DNDarray routines applied to the value array
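
As a rough illustration of what the planned to_xarray conversion might boil down to, here is a minimal sketch assuming a work-in-progress DXarray with values, dims, and coords attributes (hypothetical names), undoing the split via ht.resplit and then copying to NumPy; none of this is final API:

import heat as ht
import xarray as xr

# Hypothetical sketch of DXarray.to_xarray(); dx.values, dx.dims and dx.coords
# are assumed attributes of the work-in-progress DXarray class, not existing Heat API.
def to_xarray(dx):
    # undo the distribution, then copy the gathered data to NumPy
    values = ht.resplit(dx.values, None).numpy()
    coords = {dim: ht.resplit(c, None).numpy() for dim, c in dx.coords.items()}
    return xr.DataArray(values, dims=dx.dims, coords=coords)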

@TomNicholas

@ClaudiaComito in #1154 it seems you started wrapping heat objects inside xarray, which is awesome! I recently improved the documentation on wrapping numpy-like arrays with xarray objects (pydata/xarray#7911 and pydata/xarray#7951).

Those extra pages in the docs aren't released yet, but for now you can view them here (on wrapping numpy-like arrays) and here (on wrapping distributed numpy-like arrays).
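
For context, that wrapping approach boils down to handing an array-like object to the xr.DataArray constructor. The minimal sketch below uses a plain NumPy array as the stand-in duck array, since whether a split DNDarray satisfies xarray's duck-array requirements is exactly what #1154 explores:

import numpy as np
import xarray as xr

# Stand-in duck array: any object exposing the NumPy-like interface xarray
# expects (shape, dtype, __array_function__/__array_ufunc__, indexing, ...).
duck = np.arange(12.0).reshape(3, 4)

da = xr.DataArray(
    duck,
    dims=("y", "x"),
    coords={"y": [10, 20, 30], "x": [0, 1, 2, 3]},
    name="example",
)
# xarray keeps the wrapped array in da.data and dispatches operations to it;
# the goal of #1154 is for that object to be a (possibly split) Heat DNDarray.
print(type(da.data), da.mean().item())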

@mrfh92
Collaborator

mrfh92 commented Jul 10, 2023

In the branch 1031-support-distribution-of-xarray the following is now available:

  • a class DXarray implementing something similar to xarray but using DNDarrays as the underlying objects
  • balance and resplit for DXarray
  • printing, and conversion to and from xarray.DataArrays
    (still missing: unit tests...)

I will stop here until we have discussed the "wrapping approach" proposed by @TomNicholas within the team, because such an approach would be much easier to implement (if applicable to Heat).

@mrfh92 linked a pull request Jul 26, 2023 that will close this issue

This issue is stale because it has been open for 60 days with no activity.

@github-actions bot added the stale label Mar 18, 2024

This issue was closed because it has been inactive for 60 days since being marked as stale.

@github-actions bot closed this as not planned (stale) May 20, 2024