Vectorized lazy indexing #1899
Conversation
I noticed that lazy vectorized indexing can sometimes be optimized by decomposing the vectorized indexers into successive outer and vectorized indexers, so that the size of the array to be loaded into memory is minimized.
I figured out how to consolidate two vectorized indexers, as long as they don't include any slice objects:

import numpy as np

def index_vectorized_indexer(old_indexer, applied_indexer):
    return tuple(o[applied_indexer] for o in np.broadcast_arrays(*old_indexer))

for x, old, applied in [
    (np.arange(10), (np.arange(2, 7),), (np.array([3, 2, 1]),)),
    (np.arange(10), (np.arange(6).reshape(2, 3),), (np.arange(2), np.arange(1, 3))),
    (-np.arange(1, 21).reshape(4, 5), (np.arange(3)[:, None], np.arange(4)[None, :]), (np.arange(3), np.arange(3))),
]:
    new_key = index_vectorized_indexer(old, applied)
    np.testing.assert_array_equal(x[old][applied], x[new_key])

We could probably make this work with slice objects, too.
Thanks, @shoyer. I am wondering what computation cost we want to avoid by the lazy indexing.
If reason 2 is the common case, I think it is not a good idea to consolidate all the lazy indexing. And I am also wondering whether, as pointed out in #1725, what I am doing now was already implemented in dask.
Reason 2 is the primary one. We want to load the minimum amount of data possible into memory, mostly because pulling data from disk is slow.
I think the design choice here really comes down to whether we want to enable VectorizedIndexing on arbitrary data on disk or not. Is it better to:

I think I slightly prefer option (2), but I can see the merits of either decision.
I am inclined toward option 1, as there is some benefit even for backends without vectorized-indexing support. A drawback is that it is difficult for users to predict how much memory will be necessary.
OK, this is pretty clever. There are some obvious fail cases, e.g., if they want to pull out indices scattered across the full extent of each dimension. Also, we would want to avoid separating basic/vectorized for backends that support efficient vectorized indexing (scipy and zarr).
If the backend supports orthogonal indexing (not only basic indexing), the decomposition can be made more efficient. But if we want a full diagonal, we need a full slice anyway...
OK. Agreed. We may need a flag that can be accessed from the array wrapper.
Based on the suggestion, I implemented lazy vectorized indexing with index consolidation. Now every backend is virtually compatible with all indexer types, i.e. basic, outer, and vectorized indexers. It sometimes consumes a large amount of memory if the indexer cannot be decomposed efficiently, but it is always better than loading the full slice.
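For illustration, here is a minimal, self-contained sketch of the decomposition idea described above (the function name and test values are made up; this is not the actual xarray implementation, which also has to handle slices and the various backend modes):

import numpy as np

def decompose_vectorized(indexer):
    # Split a tuple of broadcastable integer arrays so that
    # arr[indexer] == arr[np.ix_(*outer)][vect]: the outer part is what
    # a backend would load, the vectorized part is applied in memory.
    outer, vect = [], []
    for k in np.broadcast_arrays(*indexer):
        # The backend only needs each axis's unique coordinates;
        # `inverse` remaps the original request into the loaded block.
        uniq, inverse = np.unique(k, return_inverse=True)
        outer.append(uniq)
        vect.append(inverse.reshape(k.shape))
    return tuple(outer), tuple(vect)

arr = np.arange(100).reshape(10, 10)
key = (np.array([[3, 2], [1, 3]]), np.array([[0, 0], [5, 5]]))
outer, vect = decompose_vectorized(key)
np.testing.assert_array_equal(arr[key], arr[np.ix_(*outer)][vect])

Note the worst case discussed earlier: for a full diagonal, key = (np.arange(n), np.arange(n)), the outer part covers every row and every column, so the whole array gets loaded anyway.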
@fujiisoup - this is impressive. Thanks for putting this together.
The failing test here seems to be dask-related, and I don't think you need to worry about it right now.
My comments are mostly cosmetic. I wish I had reviewed your prior indexing PRs so I had better context for what you are doing, but I generally like the approach you've taken. The memory issue is potentially a problem, but my thought is that consistency between backends and in-memory datasets is more important right now. Am I correct in my understanding that certain backends could/do support vectorized indexing directly, and that in those cases the memory issue wouldn't be as prevalent (depending on how the specific backend did its implementation)?
xarray/core/indexing.py
Outdated
return type(self)(self.array, key)

def __setitem__(self, key, value):
    raise NotImplementedError
let's add a more informative error message here. Even NotImplementedError('__setitem__ has not yet been implemented for LazilyVectorizedIndexedArray') would be okay.
xarray/core/indexing.py
Outdated
shape : tuple
    Shape of the array subject to the indexing.

Returns
-------
tuple
    Tuple suitable for use to index a NumPy array with vectorized indexing.
    Each element is an integer or array: broadcasting them together gives
    the shape of the result.
    Each element is array: broadcasting them together gives the shape of
Each element is an array
xarray/core/indexing.py
Outdated
def _decompose_vectorized_indexer(indexer):
    """ Decompose vectorized indexer to outer and vectorized indexers,
    array[indexer] == array[oindex][vindex]
    such that array[oindex].shape becomes smallest.
Becomes smallest what? Just trying to make sure I understand the intent here. Is this algorithm trying to optimize (minimize) the size of array[oindex]?
Now, every backend uses the most efficient indexing method it supports for any given indexer.
xarray/core/indexing.py
Outdated
""" | ||
Decompose vectorized indexer to the successive two indexers, where the | ||
first indexer will be used to index backend arrays, while the second one | ||
is used to index loaded on-memory np.ndarray. |
A detailed description of decompose_indexer is given here.
xarray/backends/rasterio_.py
Outdated
@@ -41,8 +41,10 @@ def shape(self):
    return self._shape

def __getitem__(self, key):
    key = indexing.unwrap_explicit_indexer(
        key, self, allow=(indexing.BasicIndexer, indexing.OuterIndexer))
    # TODO support indexing.decompose_indexer
I think I need someone's help to update the indexing scheme for the rasterio backend.
rasterio is strictly a slicing approach for fetching data, or pulling out individual data points. In this case, you want to avoid converting the slices into numpy arrays of indexes. Maybe it would make sense to have a function that could re-compose numpy arrays of indexes into groups of slices? So, if we were indexing [1, 3, 4, 5, 2, 6], we would get back a slice(1, 7)? Maybe even make it smart enough to take the chunk hints, so that indexes like [1, 4, 5] are kept as slice(1, 6) rather than (slice(1, 2), slice(4, 6))?
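A function along those lines is easy to sketch; indices_to_slices and its gap parameter are hypothetical names, and real chunk hints would need more care than a single merge threshold:

import numpy as np

def indices_to_slices(idx, gap=1):
    # Group sorted indices into contiguous slices; neighbors within
    # `gap` are merged into one slice, reading a few extra elements
    # in exchange for fewer reads.
    idx = np.sort(np.asarray(idx))
    slices = []
    start = prev = idx[0]
    for i in idx[1:]:
        if i - prev > gap:
            slices.append(slice(start, prev + 1))
            start = i
        prev = i
    slices.append(slice(start, prev + 1))
    return slices

print(indices_to_slices([1, 3, 4, 5, 2, 6]))  # [slice(1, 7, None)]
print(indices_to_slices([1, 4, 5], gap=3))    # [slice(1, 6, None)]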
@WeatherGod Thanks for the review.

Maybe it would make sense to have a function that could re-compose numpy arrays of indexes into groups of slices?

Agreed, it would be an improvement. But I am wondering whether the number of combinations will explode as the dimensionality increases. Some heuristics would also be necessary for deciding how many chunks to use. I do not yet have a clear solution; I am open to any ideas for improvement.
EDIT: the chunking would also improve efficiency if we want to get a full diagonal from a large 2D array. In my current implementation we need to load a full slice, but that could be reduced by a factor of 2 if we used 2x2 chunks.
Yeah, we certainly have time to think about it. This feature doesn't seem to change anything for indexing that worked before, so I don't think we have to worry about regressions. Improvements to indexing can be thought of holistically as a separate task.
For our immediate needs with rasterio, though, we should probably just create a fallback indexer that loads up the entire dimension and indexes it as given whenever the original indexer is anything but scalar integers or slices.
@fmaussion can you help me with this?
What I want to do here is a two-staged indexing:
- index the backend array by the 1st indexer.
- index the loaded np.ndarray by the 2nd (vectorized) indexer.
These two indexers (for the backend and for the np.ndarray) can be computed by

key, np_inds = indexing.decompose_indexer(
    key, self.shape, indexing.IndexingSupport.OUTER)

I am not familiar with the rasterio data structure (I do not yet understand what is going on below...), so I would appreciate it if you could instruct me.
Rasterio (as wrapped by xarray) supports a strange form of indexing:
- Outer indexing for the first indexer element only.
- Basic indexing for remaining two indexer elements (every rasterio array is 3-dimensional).
I think the simplest fix would be to extend our rasterio wrapper to support arbitrary outer indexing. This would entail changing the final clause below, where rasterio currently raises an error:
xarray/backends/rasterio_.py, lines 72 to 78 in e0621c7:

else:
    k = np.asarray(k)
    start = k[0]
    stop = k[-1] + 1
    ids = np.arange(start, stop)
    if not ((k.shape == ids.shape) and np.all(k == ids)):
        raise IndexError(_ERROR_MSG)
Instead, we would want to create a slice object based on the minimum/maximum elements, and then add another indexer to np_inds to pull out the listed elements (after subtracting the start from the slice).
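A minimal sketch of that suggestion (the helper name is made up; in the real fix this logic would live inside the rasterio wrapper's __getitem__):

import numpy as np

def outer_array_to_slice(k):
    # Read a bounding slice from the backend, then pick out the
    # requested elements from the loaded block, shifted by the start.
    k = np.asarray(k)
    backend_key = slice(k.min(), k.max() + 1)  # what rasterio would read
    np_key = k - k.min()                       # applied to the loaded block
    return backend_key, np_key

x = np.arange(10)
backend_key, np_key = outer_array_to_slice(np.array([7, 2, 5]))
np.testing.assert_array_equal(x[[7, 2, 5]], x[backend_key][np_key])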
Oh, wow... this worked like a charm for the netcdf4 backend! I have a ~13GB (uncompressed) 4-D netcdf4 variable that was giving me trouble when slicing a 2D surface out of it. Here is a snippet where I am grabbing data at random indices in the last dimension, first for a specific latitude, then for the entire domain.
All previous attempts at this would result in having to load the entire 13GB array into memory just to get 93.5 MB out. Or I would try to fetch each individual point, which took way too long. This worked faster than loading the entire thing into memory, and it used less memory, too (I think I maxed out at about 1.2GB of total usage, which is totally acceptable for my use case). I will try out similar things with the pynio and rasterio backends and get back to you. Thanks for this work!
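For reference, the (elided) snippet presumably looked something like the following; the shapes and dimension names here are made up, and in the real case the variable would come from xr.open_dataset on the 13GB file rather than from random data:

import numpy as np
import xarray as xr

# Small synthetic stand-in for a large (time, lat, lon, level) variable.
var = xr.DataArray(np.random.rand(4, 6, 8, 50),
                   dims=('time', 'lat', 'lon', 'level'))
# One random index along the last dimension for every (lat, lon) point:
rand = xr.DataArray(
    np.random.randint(0, var.sizes['level'],
                      (var.sizes['lat'], var.sizes['lon'])),
    dims=('lat', 'lon'))
surface = var.isel(time=0, level=rand)  # with a lazy backend array, only
values = surface.values                 # the needed data is read here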
Cool beans! Going to continue testing this stuff out. Would be nice to get rasterio support going, but this is certainly an improvement over the pre-existing situation.
return np.broadcast(*self.key.tuple).shape

def __array__(self, dtype=None):
    return np.asarray(self.array[self.key], dtype=None)
Shouldn't this be np.asanyarray()? Could this interfere with dask arrays? Pint support?
This should be okay; neither xarray nor dask supports using masked arrays under the hood. See #1194.
xarray/core/indexing.py
Outdated
mode: str, one of ['basic' | 'outer' | 'outer_1vector' | 'vectorized']
    basic: for backends that support onle basic indexer
    outer: for backends that support basic / outer indexer
    outer_1vector: for backends that support outer indexer inluding at most
"inluding" --> "including"
xarray/core/indexing.py
Outdated
backend_indexer = []
np_indexer = []
# posify
"posify"???
I'll take the blame for this one! I think I used this made-up word elsewhere :).
xarray/core/indexing.py
Outdated
for k in indexer:
    if isinstance(k, slice):
        backend_indexer.append(k)
        np_indexer.append(slice(None))
Might help future readers to add a bit of prose to each of these. Like: "If it is a slice, then we will slice it as-is (k) in the backend, and then use all of it (slice(None)) for the in-memory portion."
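Extending that prose style to the other branches might look like the following simplified sketch (illustrative only, not the actual xarray code; the example indexer is made up):

import numpy as np

indexer = (slice(2, 8), 3, np.array([4, 4, 1]))  # example outer indexer
backend_indexer, np_indexer = [], []
for k in indexer:
    if isinstance(k, slice):
        # Slice as-is (k) in the backend, then take everything
        # (slice(None)) from the loaded result.
        backend_indexer.append(k)
        np_indexer.append(slice(None))
    elif isinstance(k, (int, np.integer)):
        # Integers go straight to the backend; they drop the axis,
        # so the in-memory indexer gets nothing for this dimension.
        backend_indexer.append(k)
    else:
        # Arrays: load only the unique values from the backend, then
        # rearrange/duplicate them in memory.
        uniq, inverse = np.unique(np.asarray(k), return_inverse=True)
        backend_indexer.append(uniq)
        np_indexer.append(inverse)
# backend_indexer -> [slice(2, 8), 3, array([1, 4])]
# np_indexer      -> [slice(None), array([1, 1, 0])]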
Good job with the prose above, but the other spots are going to need similar prose as well.
xarray/core/indexing.py
Outdated
""" | ||
Decompose ouer indexer to the successive two indexers, where the | ||
first indexer will be used to index backend arrays, while the second one | ||
is ised to index the loaded on-memory np.ndarray. |
"ised" --> "used"
xarray/core/indexing.py
Outdated
def _decompose_outer_indexer(indexer, shape, mode):
    """
    Decompose ouer indexer to the successive two indexers, where the
"ouer" --> "outer"
Also, the phrasing is odd. I would go with "Decompose outer indexer into two indexers" instead, as it is more concise.
>>> array = array[backend_indexer] # load subslice of the array | ||
>>> np_indexer = OuterIndexer([0, 2, 1], [0, 1, 0]) | ||
>>> array[np_indexer] # outer indexing for on-memory np.ndarray. | ||
""" |
The modes aren't documented. Maybe a reference to a list of them?
Hmm, came across a bug with the pynio backend. Working on making a reproducible example, but just for your own inspection, here is some logging output: the first bit is the repr of my Dataset, and the last line is the output in question. If I revert back to v0.10.0, then the shape is (1059, 1799), just as expected.
I can also confirm that the shape comes out correctly using master, so this is definitely isolated to this PR.
Ah, interesting... so, this dataset was created by doing an isel() on the original. The original data has a 50x1059x1799 grid, and the new indexer isn't properly composing the indexer so that it fetches [7, slice(None), slice(None)] when I grab its values.
xarray/backends/pynio_.py
Outdated
key = indexing.unwrap_explicit_indexer(
    key, target=self, allow=indexing.BasicIndexer)
key, np_inds = indexing.decompose_indexer(key, self.shape,
                                          mode='outer')
PyNIO actually has a very robust indexing interface. We might be over-simplifying here, and this could be leading to bugs. I am not exactly sure we even need to do all of this indirection. PyNIO allows for the creation of a "selection object" that can be passed to the pynio array objects. Perhaps we should be creating that object instead of forcing the pynio array object into a numpy index adapter?
Thanks for the suggestion.
I am not familiar with pynio, but it looks like their selection object is very specific to pynio...
I want to leave this improvement to another PR and focus on the common features in this PR.
I'm not sure how selection objects would really help us here. I guess they could be used to do a form of outer indexing, but putting integer indices into a string seems very inefficient. I agree that we should save it for later.
Are we actually sure that PyNIO even supports "outer" indexing? I had it marked down as only supporting basic indexing before. "Outer" indexing would include indexing with 1D arrays like array[[0, 1, 2], :], but I don't see that support documented anywhere. (Maybe it does work, but just isn't documented.)
xarray/core/indexing.py
Outdated
----------
indexer: VectorizedIndexer
mode: str, one of ['basic' | 'outer' | 'outer_1vector' | 'vectorized']
    basic: for backends that support onle basic indexer
"onle" --> "only"?
Just did some more debugging, putting in some debug statements within the indexing code. Here is the test script (data not included), and here is the relevant output:

So, the ...
xarray/backends/pynio_.py
Outdated
for ind in np_inds:
    array = indexing.NumpyIndexingAdapter(array)[ind]

return array
Ah-hah! I see the bug! key never gets used!
So, this should probably be something like:

array = array[key.tuple]
for ind in np_inds:
    array = indexing.NumpyIndexingAdapter(array)[ind]
return array
Thanks for debugging this!!
We certainly need test coverage for this.
xarray/backends/h5netcdf_.py
Outdated
with self.datastore.ensure_open(autoclose=True):
    return self.get_array()[key]
    array = np.asarray(self.get_array()[key], dtype=self.dtype)
What's the reasoning behind doing asarray here, but not for other backends?
xarray/core/indexing.py
Outdated
@@ -543,6 +543,10 @@ def _updated_key(self, new_key):
    return _combine_indexers(self.key, self.shape, new_key)

def __getitem__(self, indexer):
    # load if the indexed array becomes a scalar
Is this actually a good idea? I think it's a little surprising to load data immediately just because the result is a scalar.
It looks like some backends do not support negative-step slices.
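For what it's worth, a negative-step slice can always be decomposed into a positive-step read plus an in-memory reversal; a sketch (the helper name is made up):

import numpy as np

def split_negative_step(k, size):
    # Return (backend_slice, np_slice) with backend_slice.step > 0,
    # so that x[k] == x[backend_slice][np_slice].
    start, stop, step = k.indices(size)
    if step > 0:
        return k, slice(None)
    n = len(range(start, stop, step))
    if n == 0:
        return slice(0, 0), slice(None)
    # Read forward from the smallest selected index, then flip in memory.
    backend = slice(start + (n - 1) * step, start + 1, -step)
    return backend, slice(None, None, -1)

x = np.arange(10)
bk, nk = split_negative_step(slice(8, 2, -2), x.size)
np.testing.assert_array_equal(x[8:2:-2], x[bk][nk])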
I did some more investigation into the memory usage problem I was having. I had assumed that the vectorized-indexed result of a lazily indexed data array would be an in-memory array, so when I then started to use the result, it was doing a read of all the data at once, resulting in a near-complete load of the data into memory. I have adjusted my code to chunk out the indexing in order to keep the memory usage under control, at a reasonable performance penalty. I haven't looked into identifying the ideal chunking scheme for an arbitrary data array and indexing operation; perhaps we can make that a task for another day. At this point, I am satisfied with the features (negative step sizes aside, of course).
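The chunked workaround described above could look something like this (names and the default chunk size are illustrative; as noted, picking an ideal chunking scheme is an open question):

import numpy as np
import xarray as xr

def isel_in_chunks(var, dim, idx, chunk=100_000):
    # Apply a large 1-D pointwise indexer piecewise to bound peak memory.
    pieces = [var.isel({dim: xr.DataArray(idx[i:i + chunk], dims='points')})
              for i in range(0, len(idx), chunk)]
    return xr.concat(pieces, dim='points')

var = xr.DataArray(np.arange(100), dims='x')
idx = np.random.randint(0, 100, size=1000)
out = isel_in_chunks(var, 'x', idx, chunk=128)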
@fujiisoup - is this ready for a final review? I see you have all the tests passing 💯!
I think it's ready :)
# Conflicts:
#   asv_bench/benchmarks/dataset_io.py
@@ -483,13 +466,6 @@ def __init__(self, array, key=None):
    self.key = key
Maybe we should rename the original LazilyIndexedArray to LazilyOuterIndexedArray? Also, should we add a transpose method?
xarray/core/indexing.py
Outdated
shape : tuple
    Shape of the array subject to the indexing.

Returns
-------
tuple
    VectorizedInexer
VectorizedIndexer
xarray/core/indexing.py
Outdated
return (OuterIndexer(tuple(backend_indexer)),
        OuterIndexer(tuple(np_indexer)))

# basic
Let's add an explicit error check or at least an assertion here, e.g., assert indexing_support == IndexingSupport.BASIC.
        OuterIndexer(tuple(np_indexer)))


def _arrayize_vectorized_indexer(indexer, shape):
Oops -- I wrote a basically identical helper function, xarray.backends.zarr._replace_slices_with_arrays, along with some unit tests.
Can you confirm that we can remove the zarr helper function, and that all its unit tests work with this function instead? I don't have a preference on which implementation to use, assuming that all tests pass with both.
xarray/backends/netCDF4_.py
Outdated
@@ -277,7 +279,7 @@ def open(cls, filename, mode='r', format='NETCDF4', group=None,
def open_store_variable(self, name, var):
    with self.ensure_open(autoclose=False):
        dimensions = var.dimensions
        data = indexing.LazilyIndexedArray(NetCDF4ArrayWrapper(name, self))
        data = indexing.LazilyOuterIndexedArray(NetCDF4ArrayWrapper(name, self))
E501 line too long (84 > 79 characters)
xarray/core/indexing.py
Outdated
return _combine_indexers(self.key, self.shape, new_key)

def __getitem__(self, indexer):
    # If the indexed array becomes a scalar, return LazilyOuterIndexedArray.
E501 line too long (80 > 79 characters)
xarray/tests/test_dataset.py
Outdated
@@ -77,7 +77,7 @@ def get_variables(self):
def lazy_inaccessible(k, v):
    if k in self._indexvars:
        return v
    data = indexing.LazilyIndexedArray(InaccessibleArray(v.values))
    data = indexing.LazilyOuterIndexedArray(InaccessibleArray(v.values))
E501 line too long (80 > 79 characters)
xarray/tests/test_variable.py
Outdated
@@ -16,7 +16,7 @@
from xarray.core import indexing
from xarray.core.common import full_like, ones_like, zeros_like
from xarray.core.indexing import (
    BasicIndexer, CopyOnWriteArray, DaskIndexingAdapter, LazilyIndexedArray,
    BasicIndexer, CopyOnWriteArray, DaskIndexingAdapter, LazilyOuterIndexedArray,
E501 line too long (81 > 79 characters)
xarray/tests/test_variable.py
Outdated
@@ -1798,7 +1798,7 @@ def test_rolling_window(self):

class TestAsCompatibleData(TestCase):
    def test_unchanged_types(self):
        types = (np.asarray, PandasIndexAdapter, indexing.LazilyIndexedArray)
        types = (np.asarray, PandasIndexAdapter, indexing.LazilyOuterIndexedArray)
E501 line too long (82 > 79 characters)
xarray/tests/test_variable.py
Outdated
# doubly wrapping
v = Variable(dims=('x', 'y'),
             data=LazilyIndexedArray(LazilyIndexedArray(self.d)))
             data=LazilyOuterIndexedArray(LazilyOuterIndexedArray(self.d)))
E501 line too long (83 > 79 characters)
xarray/tests/test_variable.py
Outdated
self.check_orthogonal_indexing(v)
# hierarchical wrapping
v = Variable(dims=('x', 'y'),
             data=LazilyIndexedArray(NumpyIndexingAdapter(self.d)))
             data=LazilyOuterIndexedArray(NumpyIndexingAdapter(self.d)))
E501 line too long (80 > 79 characters)
All done :)
I think this is finally ready! Lazy transpose is a nice bonus :).
Let's wait a day or two before merging in case anyone has other comments to add.
OK, in it goes. Thanks, @fujiisoup!
Thanks, @WeatherGod, for your feedback.
🎉 |
So did this remove/rename LazilyIndexedArray? I don't mind updating, but I wanted to make sure this was intentional.
Yes, LazilyIndexedArray was renamed to LazilyOuterIndexedArray.
* Added PNC backend to xarray. PNC is used for GEOS-Chem, CAMx, CMAQ and other atmospheric data formats that have their own file formats and meta-data conventions. It can provide a CF-compliant netCDF-like interface.
* Added whats-new documentation
* Updating pnc_ to remove DunderArrayMixin dependency
* Adding basic tests for pnc. Right now, pnc is simply being tested as a reader for NetCDF3 files
* Updating for flake8 compliance
* flake does not like unused e
* Updating pnc to PseudoNetCDF
* Remove outer except
* Updating pnc to PseudoNetCDF
* Added open and updated init, based on shoyer review
* Updated indexing and test fix. Indexing supports #1899
* Added PseudoNetCDF to doc/io.rst
* Changing test subtype
* Changing test subtype; removing pdb
* pnc test case requires netcdf3only. For now, pnc is only supporting the classic data model
* Adding backend_kwargs default as dict. This ensures **mapping is possible.
* Upgrading tests to CFEncodedDataTest. Some tests are bypassed; PseudoNetCDF string treatment is not currently compatible with xarray. This will be addressed soon.
* Not currently supporting autoclose. I do not fully understand the use case, so I have not implemented these tests.
* Minor updates for flake8
* Explicit skipping: using pytest.mark.skip to skip unsupported tests
* Removing trailing whitespace from pytest skip
* Adding pip support
* Addressing comments
* Bypassing pickle, mask/scale, and object. These tests cause errors that do not affect desired backend performance.
* Added uamiv test. PseudoNetCDF reads other formats; this adds a test of uamiv to the standard test for a backend and skips mask/scale, object, and boolean tests
* Adding support for autoclose. ensure open must be called before accessing variable data
* Adding backend_kwargs to all backends. Most backends currently take no keywords, so an empty dictionary is appropriate.
* Small tweaks to PNC backend
* Remove warning and update whats-new
* Separating install and io pnc doc and updating whats-new
* Fixing line length in test
* Tests now use non-netcdf files
* Removing unknown meta-data netcdf support.
* flake8 cleanup
* Using python 2 and 3 compat testing
* Disabling mask_and_scale by default: prevents inadvertent double scaling in PNC formats
* Consistent with 3.0.0. Updates in 3.0.1 will fix close in uamiv.
* Updating readers and line length
* Updating readers and line length
* Updating readers and line length
* Adding open_mfdataset test. Testing by opening the same file twice and stacking it.
* Using conda version of PseudoNetCDF
* Removing xfail for netcdf. Mask and scale with PseudoNetCDF and NetCDF4 is not supported, but not prevented.
* Moving pseudonetcdf to v0.15
* Updating what's new
* Fixing open_dataarray CF options. mask_and_scale is None (diagnosed by open_dataset) and decode_cf should be True
whats-new.rst for all changes and api.rst for new API.

I tried to support lazy vectorised indexing, inspired by #1897.
More tests would be necessary, but I want to decide whether it is worth continuing.
My current implementation is:
The implementation was simpler than I thought, but it has a clear limitation:
it requires loading the array before the vectorised indexing (I mean, at evaluation time).
If we do vectorised indexing on a large array, performance drops significantly, and this is not noticeable until evaluation time.
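To illustrate the limitation (file and variable names are hypothetical): building the lazy vectorized indexer is cheap, and the cost only shows up when the result is evaluated:

import numpy as np
import xarray as xr

ds = xr.open_dataset('huge.nc')              # hypothetical large file
pts = xr.DataArray(np.arange(0, 1000, 7), dims='points')
lazy = ds['var'].isel(x=pts, y=pts)          # fast: only builds a lazy indexer
values = lazy.values                         # slow: load + indexing happen here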
I appreciate any suggestions.