Vectorized lazy indexing #1899
Conversation
I noticed that lazy vectorized indexing can sometimes be optimized by decomposing the vectorized indexers into successive outer and vectorized indexers, so that the size of the array to be loaded into memory is minimized.
I figured out how to consolidate two vectorized indexers, as long as they don't include any slice objects:

import numpy as np

def index_vectorized_indexer(old_indexer, applied_indexer):
    return tuple(o[applied_indexer] for o in np.broadcast_arrays(*old_indexer))

for x, old, applied in [
    (np.arange(10), (np.arange(2, 7),), (np.array([3, 2, 1]),)),
    (np.arange(10), (np.arange(6).reshape(2, 3),), (np.arange(2), np.arange(1, 3))),
    (-np.arange(1, 21).reshape(4, 5), (np.arange(3)[:, None], np.arange(4)[None, :]), (np.arange(3), np.arange(3))),
]:
    new_key = index_vectorized_indexer(old, applied)
    np.testing.assert_array_equal(x[old][applied], x[new_key])

We could probably make this work with slice objects, too.
Thanks, @shoyer. I am wondering what computation cost we want to avoid by the lazy indexing.
If reason 2 is the common case, I think it is not a good idea to consolidate all the lazy indexing. And I am also wondering whether, as pointed out in #1725, what I am doing now was already implemented in dask.
Reason 2 is the primary one. We want to load the minimum amount of data possible into memory, mostly because pulling data from disk is slow.
I think the design choice here really comes down to whether we want to enable VectorizedIndexing on arbitrary data on disk or not. Is it better to:

I think I slightly prefer option (2), but I can see the merits of either decision.
I am inclined toward option 1, as there is some benefit even for backends without vectorized-indexing support. A drawback is that it is difficult for users to predict how much memory will be necessary.
OK, this is pretty clever. There are some obvious fail cases, e.g., if they want to pull out indices scattered across the full extent of each dimension. Also, we would want to avoid separating basic/vectorized for backends that support efficient vectorized indexing (scipy and zarr).
If the backend supports orthogonal indexing (not only basic indexing), the decomposition can be made more efficient. But if we want a full diagonal, we need a full slice anyway...
OK. Agreed. We may need a flag that can be accessed from the array wrapper.
Based on the suggestion, I implemented lazy vectorized indexing with index consolidation. Now every backend is virtually compatible with all indexer types, i.e. basic, outer, and vectorized indexers. It sometimes consumes a large amount of memory if the indexer cannot be decomposed efficiently, but it is always better than loading the full slice.
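For illustration, here is a minimal, self-contained sketch of the decomposition idea described above (the function name and test values are made up; this is not the actual xarray implementation, which also has to handle slices and the various backend modes):

import numpy as np

def decompose_vectorized(indexer):
    # Split a tuple of broadcastable integer arrays so that
    # arr[indexer] == arr[np.ix_(*outer)][vect]: the outer part is what
    # a backend would load, the vectorized part is applied in memory.
    outer, vect = [], []
    for k in np.broadcast_arrays(*indexer):
        # The backend only needs each axis's unique coordinates;
        # `inverse` remaps the original request into the loaded block.
        uniq, inverse = np.unique(k, return_inverse=True)
        outer.append(uniq)
        vect.append(inverse.reshape(k.shape))
    return tuple(outer), tuple(vect)

arr = np.arange(100).reshape(10, 10)
key = (np.array([[3, 2], [1, 3]]), np.array([[0, 0], [5, 5]]))
outer, vect = decompose_vectorized(key)
np.testing.assert_array_equal(arr[key], arr[np.ix_(*outer)][vect])

Note the worst case discussed earlier: for a full diagonal, key = (np.arange(n), np.arange(n)), the outer part covers every row and every column, so the whole array gets loaded anyway.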
@fujiisoup - this is impressive. Thanks for putting this together.
The failing test here seems to be dask-related, and I don't think you need to worry about it right now.
My comments are mostly cosmetic. I wish I had reviewed your prior indexing PRs so I had better context for what you are doing, but I generally like the approach you've taken. The memory issue is potentially a problem, but my thought is that consistency between backends and in-memory datasets is more important right now. Am I correct in my understanding that certain backends could/do support vectorized indexing directly, and that in those cases the memory issue wouldn't be as prevalent (depending on how the specific backend did its implementation)?
xarray/core/indexing.py
Outdated
return type(self)(self.array, key)

def __setitem__(self, key, value):
    raise NotImplementedError
let's add a more informative error message here. Even NotImplementedError('__setitem__ has not yet been implemented for LazilyVectorizedIndexedArray') would be okay.
xarray/core/indexing.py
Outdated
shape : tuple
    Shape of the array subject to the indexing.

Returns
-------
tuple
    Tuple suitable for use to index a NumPy array with vectorized indexing.
    Each element is an integer or array: broadcasting them together gives
    the shape of the result.
    Each element is array: broadcasting them together gives the shape of
Each element is an array
xarray/core/indexing.py
Outdated
def _decompose_vectorized_indexer(indexer):
    """ Decompose vectorized indexer to outer and vectorized indexers,
    array[indexer] == array[oindex][vindex]
    such that array[oindex].shape becomes smallest.
Becomes smallest what? Just trying to make sure I understand the intent here. Is this algorithm trying to optimize (minimize) the size of array[oindex]?
Now, every backend uses the most efficient indexing method it supports for any given indexer.
xarray/core/indexing.py
Outdated
""" | ||
Decompose vectorized indexer to the successive two indexers, where the | ||
first indexer will be used to index backend arrays, while the second one | ||
is used to index loaded on-memory np.ndarray. |
A detailed description of decompose_indexer is given here.
xarray/backends/rasterio_.py
Outdated
@@ -41,8 +41,10 @@ def shape(self):
    return self._shape

def __getitem__(self, key):
    key = indexing.unwrap_explicit_indexer(
        key, self, allow=(indexing.BasicIndexer, indexing.OuterIndexer))
    # TODO support indexing.decompose_indexer
I think I need someone's help to update the indexing scheme for the rasterio backend.
rasterio is strictly a slicing approach for fetching data, or pulling out individual data points. In this case, you want to avoid converting the slices into numpy arrays of indexes. Maybe it would make sense to have a function that could re-compose numpy arrays of indexes into groups of slices? So, if we were indexing [1, 3, 4, 5, 2, 6], we would get back a slice(1, 7)? Maybe even make it smart enough to take the chunk hints, so that indexes like [1, 4, 5] are kept as slice(1, 6) rather than (slice(1, 2), slice(4, 6))?
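A function along those lines is easy to sketch; indices_to_slices and its gap parameter are hypothetical names, and real chunk hints would need more care than a single merge threshold:

import numpy as np

def indices_to_slices(idx, gap=1):
    # Group sorted indices into contiguous slices; neighbors within
    # `gap` are merged into one slice, reading a few extra elements
    # in exchange for fewer reads.
    idx = np.sort(np.asarray(idx))
    slices = []
    start = prev = idx[0]
    for i in idx[1:]:
        if i - prev > gap:
            slices.append(slice(start, prev + 1))
            start = i
        prev = i
    slices.append(slice(start, prev + 1))
    return slices

print(indices_to_slices([1, 3, 4, 5, 2, 6]))  # [slice(1, 7, None)]
print(indices_to_slices([1, 4, 5], gap=3))    # [slice(1, 6, None)]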
@WeatherGod Thanks for the review.

Maybe it would make sense to have a function that could re-compose numpy arrays of indexes into groups of slices?

Agreed, it would be an improvement. But I am wondering whether the number of combinations will explode as the dimensionality increases. Some heuristics would also be necessary for deciding how many chunks to use. I do not yet have a clear solution; I am open to any ideas for improvement.
EDIT: the chunking would also improve efficiency if we want to get a full diagonal from a large 2D array. In my current implementation we need to load a full slice, but that could be reduced by a factor of 2 if we used 2x2 chunks.
Yeah, we certainly have time to think about it. This feature doesn't seem to change anything for indexing that worked before, so I don't think we have to worry about regressions. Improvements to indexing can be thought of holistically as a separate task.
For our immediate needs with rasterio, though, we should probably just create a fallback indexer that loads up the entire dimension and indexes it as given whenever the original indexer is anything but scalar integers or slices.
@fmaussion can you help me with this?
What I want to do here is a two-staged indexing:
- index the backend array by the 1st indexer.
- index the loaded np.ndarray by the 2nd (vectorized) indexer.
These two indexers (for the backend and for the np.ndarray) can be computed by

key, np_inds = indexing.decompose_indexer(
    key, self.shape, indexing.IndexingSupport.OUTER)

I am not familiar with the rasterio data structure (I do not yet understand what is going on below...), so I would appreciate it if you could instruct me.
Rasterio (as wrapped by xarray) supports a strange form of indexing:
- Outer indexing for the first indexer element only.
- Basic indexing for remaining two indexer elements (every rasterio array is 3-dimensional).
I think the simplest fix would be to extend our rasterio wrapper to support arbitrary outer indexing. This would entail changing the final clause below, where rasterio currently raises an error:
xarray/backends/rasterio_.py, lines 72 to 78 in e0621c7:

else:
    k = np.asarray(k)
    start = k[0]
    stop = k[-1] + 1
    ids = np.arange(start, stop)
    if not ((k.shape == ids.shape) and np.all(k == ids)):
        raise IndexError(_ERROR_MSG)
Instead, we would want to create a slice object based on the minimum/maximum elements, and then add another indexer to np_inds to pull out the listed elements (after subtracting the start from the slice).
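A minimal sketch of that suggestion (the helper name is made up; in the real fix this logic would live inside the rasterio wrapper's __getitem__):

import numpy as np

def outer_array_to_slice(k):
    # Read a bounding slice from the backend, then pick out the
    # requested elements from the loaded block, shifted by the start.
    k = np.asarray(k)
    backend_key = slice(k.min(), k.max() + 1)  # what rasterio would read
    np_key = k - k.min()                       # applied to the loaded block
    return backend_key, np_key

x = np.arange(10)
backend_key, np_key = outer_array_to_slice(np.array([7, 2, 5]))
np.testing.assert_array_equal(x[[7, 2, 5]], x[backend_key][np_key])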
Oh, wow... this worked like a charm for the netcdf4 backend! I have a ~13GB (uncompressed) 4-D netcdf4 variable that was giving me trouble when slicing a 2D surface out of it. Here is a snippet where I am grabbing data at random indices in the last dimension, first for a specific latitude, then for the entire domain.
All previous attempts at this would result in having to load the entire 13GB array into memory just to get 93.5 MB out. Or I would try to fetch each individual point, which took way too long. This worked faster than loading the entire thing into memory, and it used less memory, too (I think I maxed out at about 1.2GB of total usage, which is totally acceptable for my use case). I will try out similar things with the pynio and rasterio backends and get back to you. Thanks for this work!
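For reference, the (elided) snippet presumably looked something like the following; the shapes and dimension names here are made up, and in the real case the variable would come from xr.open_dataset on the 13GB file rather than from random data:

import numpy as np
import xarray as xr

# Small synthetic stand-in for a large (time, lat, lon, level) variable.
var = xr.DataArray(np.random.rand(4, 6, 8, 50),
                   dims=('time', 'lat', 'lon', 'level'))
# One random index along the last dimension for every (lat, lon) point:
rand = xr.DataArray(
    np.random.randint(0, var.sizes['level'],
                      (var.sizes['lat'], var.sizes['lon'])),
    dims=('lat', 'lon'))
surface = var.isel(time=0, level=rand)  # with a lazy backend array, only
values = surface.values                 # the needed data is read here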
Cool beans! Going to continue testing this stuff out. Would be nice to get rasterio support going, but this is certainly an improvement over the pre-existing situation.
return np.broadcast(*self.key.tuple).shape

def __array__(self, dtype=None):
    return np.asarray(self.array[self.key], dtype=None)
Shouldn't this be np.asanyarray()? Could this interfere with dask arrays? Pint support?
This should be okay; neither xarray nor dask supports using masked arrays under the hood. See #1194.
xarray/core/indexing.py
Outdated
mode: str, one of ['basic' | 'outer' | 'outer_1vector' | 'vectorized']
    basic: for backends that support onle basic indexer
    outer: for backends that support basic / outer indexer
    outer_1vector: for backends that support outer indexer inluding at most
"inluding" --> "including"
xarray/core/indexing.py
Outdated
backend_indexer = []
np_indexer = []
# posify
"posify"???
I'll take the blame for this one! I think I used this made-up word elsewhere :).
xarray/core/indexing.py
Outdated
for k in indexer:
    if isinstance(k, slice):
        backend_indexer.append(k)
        np_indexer.append(slice(None))
Might help future readers to add a bit of prose to each of these. Like: "If it is a slice, then we will slice it as-is (k) in the backend, and then use all of it (slice(None)) for the in-memory portion."
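Extending that prose style to the other branches might look like the following simplified sketch (illustrative only, not the actual xarray code; the example indexer is made up):

import numpy as np

indexer = (slice(2, 8), 3, np.array([4, 4, 1]))  # example outer indexer
backend_indexer, np_indexer = [], []
for k in indexer:
    if isinstance(k, slice):
        # Slice as-is (k) in the backend, then take everything
        # (slice(None)) from the loaded result.
        backend_indexer.append(k)
        np_indexer.append(slice(None))
    elif isinstance(k, (int, np.integer)):
        # Integers go straight to the backend; they drop the axis,
        # so the in-memory indexer gets nothing for this dimension.
        backend_indexer.append(k)
    else:
        # Arrays: load only the unique values from the backend, then
        # rearrange/duplicate them in memory.
        uniq, inverse = np.unique(np.asarray(k), return_inverse=True)
        backend_indexer.append(uniq)
        np_indexer.append(inverse)
# backend_indexer -> [slice(2, 8), 3, array([1, 4])]
# np_indexer      -> [slice(None), array([1, 1, 0])]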
Good job with the prose above, but the other spots are going to need similar prose as well.
xarray/core/indexing.py
Outdated
""" | ||
Decompose ouer indexer to the successive two indexers, where the | ||
first indexer will be used to index backend arrays, while the second one | ||
is ised to index the loaded on-memory np.ndarray. |
"ised" --> "used"
xarray/core/indexing.py
Outdated
def _decompose_outer_indexer(indexer, shape, mode):
    """
    Decompose ouer indexer to the successive two indexers, where the
"ouer" --> "outer"
Also, the phrasing is odd. I would go with "Decompose outer indexer into two indexers" instead, as it is more concise.
>>> array = array[backend_indexer] # load subslice of the array | ||
>>> np_indexer = OuterIndexer([0, 2, 1], [0, 1, 0]) | ||
>>> array[np_indexer] # outer indexing for on-memory np.ndarray. | ||
""" |
The modes aren't documented. Maybe a reference to a list of them?
Hmm, came across a bug with the pynio backend. Working on making a reproducible example, but just for your own inspection, here is some logging output: the first bit is the repr of my Dataset, and the last line is the output in question. If I revert back to v0.10.0, then the shape is (1059, 1799), just as expected.
I can also confirm that the shape comes out correctly using master, so this is definitely isolated to this PR.
Ah, interesting... so, this dataset was created by doing an isel() on the original. The original data has a 50x1059x1799 grid, and the new indexer isn't properly composing the indexer so that it fetches [7, slice(None), slice(None)] when I grab its values.
xarray/backends/pynio_.py
Outdated
key = indexing.unwrap_explicit_indexer(
    key, target=self, allow=indexing.BasicIndexer)
key, np_inds = indexing.decompose_indexer(key, self.shape,
                                          mode='outer')
PyNIO actually has a very robust indexing interface. We might be over-simplifying here, and this could be leading to bugs. I am not exactly sure we even need to do all of this indirection. PyNIO allows for the creation of a "selection object" that can be passed to the pynio array objects. Perhaps we should be creating that object instead of forcing the pynio array object into a numpy index adapter?
Thanks for the suggestion.
I am not familiar with pynio, but it looks like their selection object is very specific to pynio...
I want to leave this improvement to another PR and focus on the common features in this PR.
I'm not sure how selection objects would really help us here. I guess they could be used to do a form of outer indexing, but putting integer indices into a string seems very inefficient. I agree that we should save it for later.
Are we actually sure that PyNIO even supports "outer" indexing? I had it marked down as only supporting basic indexing before. "Outer" indexing would include indexing with 1D arrays like array[[0, 1, 2], :], but I don't see that support documented anywhere. (Maybe it does work, but just isn't documented.)
xarray/core/indexing.py
Outdated
----------
indexer: VectorizedIndexer
mode: str, one of ['basic' | 'outer' | 'outer_1vector' | 'vectorized']
    basic: for backends that support onle basic indexer
"onle" --> "only"?
Just did some more debugging, putting in some debug statements within the indexing code. Here is the test script (data not included), and here is the relevant output:

So, the ...
xarray/backends/pynio_.py
Outdated
for ind in np_inds:
    array = indexing.NumpyIndexingAdapter(array)[ind]

return array
Ah-hah! I see the bug! key never gets used!
So, this should probably be something like:

array = array[key.tuple]
for ind in np_inds:
    array = indexing.NumpyIndexingAdapter(array)[ind]
return array
Thanks for debugging this!!
We certainly need test coverage for this.
xarray/backends/h5netcdf_.py
Outdated
with self.datastore.ensure_open(autoclose=True):
    return self.get_array()[key]
    array = np.asarray(self.get_array()[key], dtype=self.dtype)
What's the reasoning behind doing asarray here, but not for other backends?
xarray/core/indexing.py
Outdated
@@ -543,6 +543,10 @@ def _updated_key(self, new_key):
    return _combine_indexers(self.key, self.shape, new_key)

def __getitem__(self, indexer):
    # load if the indexed array becomes a scalar
Is this actually a good idea? I think it's a little surprising to load data immediately just because the result is a scalar.
It looks like some backends do not support negative-step slices.
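For what it's worth, a negative-step slice can always be decomposed into a positive-step read plus an in-memory reversal; a sketch (the helper name is made up):

import numpy as np

def split_negative_step(k, size):
    # Return (backend_slice, np_slice) with backend_slice.step > 0,
    # so that x[k] == x[backend_slice][np_slice].
    start, stop, step = k.indices(size)
    if step > 0:
        return k, slice(None)
    n = len(range(start, stop, step))
    if n == 0:
        return slice(0, 0), slice(None)
    # Read forward from the smallest selected index, then flip in memory.
    backend = slice(start + (n - 1) * step, start + 1, -step)
    return backend, slice(None, None, -1)

x = np.arange(10)
bk, nk = split_negative_step(slice(8, 2, -2), x.size)
np.testing.assert_array_equal(x[8:2:-2], x[bk][nk])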
I did some more investigation into the memory usage problem I was having. I had assumed that the vectorized-indexed result of a lazily indexed data array would be an in-memory array, so when I then started to use the result, it was doing a read of all the data at once, resulting in a near-complete load of the data into memory. I have adjusted my code to chunk out the indexing in order to keep the memory usage under control, at a reasonable performance penalty. I haven't looked into identifying the ideal chunking scheme for an arbitrary data array and indexing operation; perhaps we can make that a task for another day. At this point, I am satisfied with the features (negative step sizes aside, of course).
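The chunked workaround described above could look something like this (names and the default chunk size are illustrative; as noted, picking an ideal chunking scheme is an open question):

import numpy as np
import xarray as xr

def isel_in_chunks(var, dim, idx, chunk=100_000):
    # Apply a large 1-D pointwise indexer piecewise to bound peak memory.
    pieces = [var.isel({dim: xr.DataArray(idx[i:i + chunk], dims='points')})
              for i in range(0, len(idx), chunk)]
    return xr.concat(pieces, dim='points')

var = xr.DataArray(np.arange(100), dims='x')
idx = np.random.randint(0, 100, size=1000)
out = isel_in_chunks(var, 'x', idx, chunk=128)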
@fujiisoup - is this ready for a final review? I see you have all the tests passing 💯!
I think it's ready :)
# Conflicts:
#   asv_bench/benchmarks/dataset_io.py
@@ -483,13 +466,6 @@ def __init__(self, array, key=None):
    self.key = key
Maybe we should rename the original LazilyIndexedArray to LazilyOuterIndexedArray? Also, should we add a transpose method?
xarray/core/indexing.py
Outdated
shape : tuple
    Shape of the array subject to the indexing.

Returns
-------
tuple
    VectorizedInexer
VectorizedIndexer
xarray/core/indexing.py
Outdated
return (OuterIndexer(tuple(backend_indexer)),
        OuterIndexer(tuple(np_indexer)))

# basic
Let's add an explicit error check or at least an assertion here, e.g., assert indexing_support == IndexingSupport.BASIC.
        OuterIndexer(tuple(np_indexer)))


def _arrayize_vectorized_indexer(indexer, shape):
Oops -- I wrote a basically identical helper function, xarray.backends.zarr._replace_slices_with_arrays, along with some unit tests.
Can you confirm that we can remove the zarr helper function, and that all its unit tests work with this function instead? I don't have a preference on which implementation to use, assuming that all tests pass with both.
xarray/backends/netCDF4_.py
Outdated
@@ -277,7 +279,7 @@ def open(cls, filename, mode='r', format='NETCDF4', group=None,
def open_store_variable(self, name, var):
    with self.ensure_open(autoclose=False):
        dimensions = var.dimensions
        data = indexing.LazilyIndexedArray(NetCDF4ArrayWrapper(name, self))
        data = indexing.LazilyOuterIndexedArray(NetCDF4ArrayWrapper(name, self))
E501 line too long (84 > 79 characters)
xarray/core/indexing.py
Outdated
return _combine_indexers(self.key, self.shape, new_key)

def __getitem__(self, indexer):
    # If the indexed array becomes a scalar, return LazilyOuterIndexedArray.
E501 line too long (80 > 79 characters)
xarray/tests/test_dataset.py
Outdated
@@ -77,7 +77,7 @@ def get_variables(self):
def lazy_inaccessible(k, v):
    if k in self._indexvars:
        return v
    data = indexing.LazilyIndexedArray(InaccessibleArray(v.values))
    data = indexing.LazilyOuterIndexedArray(InaccessibleArray(v.values))
E501 line too long (80 > 79 characters)
xarray/tests/test_variable.py
Outdated
@@ -16,7 +16,7 @@
from xarray.core import indexing
from xarray.core.common import full_like, ones_like, zeros_like
from xarray.core.indexing import (
    BasicIndexer, CopyOnWriteArray, DaskIndexingAdapter, LazilyIndexedArray,
    BasicIndexer, CopyOnWriteArray, DaskIndexingAdapter, LazilyOuterIndexedArray,
E501 line too long (81 > 79 characters)
xarray/tests/test_variable.py
Outdated
@@ -1798,7 +1798,7 @@ def test_rolling_window(self):

class TestAsCompatibleData(TestCase):
    def test_unchanged_types(self):
        types = (np.asarray, PandasIndexAdapter, indexing.LazilyIndexedArray)
        types = (np.asarray, PandasIndexAdapter, indexing.LazilyOuterIndexedArray)
E501 line too long (82 > 79 characters)
xarray/tests/test_variable.py
Outdated
# doubly wrapping
v = Variable(dims=('x', 'y'),
             data=LazilyIndexedArray(LazilyIndexedArray(self.d)))
             data=LazilyOuterIndexedArray(LazilyOuterIndexedArray(self.d)))
E501 line too long (83 > 79 characters)
xarray/tests/test_variable.py
Outdated
self.check_orthogonal_indexing(v)
# hierarchical wrapping
v = Variable(dims=('x', 'y'),
             data=LazilyIndexedArray(NumpyIndexingAdapter(self.d)))
             data=LazilyOuterIndexedArray(NumpyIndexingAdapter(self.d)))
E501 line too long (80 > 79 characters)
All done :)
I think this is finally ready! Lazy transpose is a nice bonus :).
Let's wait a day or two before merging in case anyone has other comments to add.
OK, in it goes. Thanks, @fujiisoup!
Thanks, @WeatherGod, for your feedback.
🎉 |
So did this remove/rename LazilyIndexedArray? I don't mind updating, but I wanted to make sure this was intentional.
Yes, LazilyIndexedArray was renamed to LazilyOuterIndexedArray.
* Added PNC backend to xarray. PNC is used for GEOS-Chem, CAMx, CMAQ and other atmospheric data formats that have their own file formats and meta-data conventions. It can provide a CF-compliant netCDF-like interface.
* Added whats-new documentation
* Updating pnc_ to remove DunderArrayMixin dependency
* Adding basic tests for pnc. Right now, pnc is simply being tested as a reader for NetCDF3 files
* Updating for flake8 compliance
* flake does not like unused e
* Updating pnc to PseudoNetCDF
* Remove outer except
* Updating pnc to PseudoNetCDF
* Added open and updated init, based on shoyer review
* Updated indexing and test fix. Indexing supports #1899
* Added PseudoNetCDF to doc/io.rst
* Changing test subtype
* Changing test subtype; removing pdb
* pnc test case requires netcdf3only. For now, pnc is only supporting the classic data model
* Adding backend_kwargs default as dict. This ensures **mapping is possible.
* Upgrading tests to CFEncodedDataTest. Some tests are bypassed; PseudoNetCDF string treatment is not currently compatible with xarray. This will be addressed soon.
* Not currently supporting autoclose. I do not fully understand the use case, so I have not implemented these tests.
* Minor updates for flake8
* Explicit skipping: using pytest.mark.skip to skip unsupported tests
* Removing trailing whitespace from pytest skip
* Adding pip support
* Addressing comments
* Bypassing pickle, mask/scale, and object. These tests cause errors that do not affect desired backend performance.
* Added uamiv test. PseudoNetCDF reads other formats; this adds a test of uamiv to the standard test for a backend and skips mask/scale, object, and boolean tests
* Adding support for autoclose. ensure open must be called before accessing variable data
* Adding backend_kwargs to all backends. Most backends currently take no keywords, so an empty dictionary is appropriate.
* Small tweaks to PNC backend
* Remove warning and update whats-new
* Separating install and io pnc doc and updating whats-new
* Fixing line length in test
* Tests now use non-netcdf files
* Removing unknown meta-data netcdf support.
* flake8 cleanup
* Using python 2 and 3 compat testing
* Disabling mask_and_scale by default: prevents inadvertent double scaling in PNC formats
* Consistent with 3.0.0. Updates in 3.0.1 will fix close in uamiv.
* Updating readers and line length
* Updating readers and line length
* Updating readers and line length
* Adding open_mfdataset test. Testing by opening the same file twice and stacking it.
* Using conda version of PseudoNetCDF
* Removing xfail for netcdf. Mask and scale with PseudoNetCDF and NetCDF4 is not supported, but not prevented.
* Moving pseudonetcdf to v0.15
* Updating what's new
* Fixing open_dataarray CF options. mask_and_scale is None (diagnosed by open_dataset) and decode_cf should be True
whats-new.rst for all changes and api.rst for new API.

I tried to support lazy vectorised indexing, inspired by #1897.
More tests would be necessary, but I want to decide whether it is worth continuing.
My current implementation is:
The implementation was simpler than I thought, but it has a clear limitation:
it requires loading the array before the vectorised indexing (I mean, at evaluation time).
If we do vectorised indexing on a large array, performance drops significantly, and this is not noticeable until evaluation time.
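To illustrate the limitation (file and variable names are hypothetical): building the lazy vectorized indexer is cheap, and the cost only shows up when the result is evaluated:

import numpy as np
import xarray as xr

ds = xr.open_dataset('huge.nc')              # hypothetical large file
pts = xr.DataArray(np.arange(0, 1000, 7), dims='points')
lazy = ds['var'].isel(x=pts, y=pts)          # fast: only builds a lazy indexer
values = lazy.values                         # slow: load + indexing happen here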
I appreciate any suggestions.