
Opening dataset without loading any indexes? #6633

Open

TomNicholas opened this issue May 24, 2022 · 10 comments · May be fixed by #8051

Comments

@TomNicholas
Member

TomNicholas commented May 24, 2022

Is your feature request related to a problem?

Within pangeo-forge's internals we would like to call open_dataset, then to_dict(), and end up with a schema-like representation of the contents of the dataset. This works, but it also has the side-effect of loading all indexes into memory, even if we are loading the data values "lazily".
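
For concreteness, a minimal sketch of that workflow, assuming a placeholder file name (the indexes are still loaded eagerly at open time, which is the problem described here):

import xarray as xr

ds = xr.open_dataset("example.nc", chunks={})  # data values stay lazy (dask-backed)
schema = ds.to_dict(data=False)                # structure only: dims, coords, attrs -- no array values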

Describe the solution you'd like

@benbovy do you think it would be possible to (perhaps optionally) also avoid loading indexes upon opening a dataset, so that we actually don't load anything? The end result would act a bit like ncdump does.

Describe alternatives you've considered

Otherwise we might have to try using something like xarray-schema, but the suggestion here would be much neater and more flexible.

xref: pangeo-forge/pangeo-forge-recipes#256

cc @rabernat @jhamman @cisaacstern

@shoyer
Member

shoyer commented May 25, 2022

Early versions of Xarray lazily loaded data for indexes, but we removed this for the sake of simplicity. In principle we could restore lazy indexes, but another option (post explicit-indexes refactor) would be to support opening a dataset without creating indexes for 1D coordinates along dimensions.

Another way to solve this sort of challenge might be to load index data in parallel when using Dask. Right now, I believe, the data corresponding to indexes is always loaded eagerly, without using Dask.

All that said -- do you have a specific example where this has been problematic? In my experience it has been pretty reasonable to use xarray.Dataset objects for schema-like templates, even with index data needing to be loaded eagerly. Possibly a different Zarr chunking scheme for your index data could be more efficient?

@benbovy
Member

benbovy commented May 25, 2022

> another option (post explicit-indexes refactor) would be to support opening a dataset without creating indexes for 1D coordinates along dimensions.

It might indeed be worth considering this case too in #6392. Maybe indexes=None (the default) to create default indexes for 1D coordinates, and indexes={} (an empty dictionary) to explicitly skip creating indexes?
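
For illustration, a sketch of how the proposed option might be called (hypothetical -- this indexes keyword to open_dataset did not exist in xarray at the time of this thread):

import xarray as xr

ds = xr.open_dataset("store.zarr", engine="zarr")              # indexes=None (default): build default indexes
ds = xr.open_dataset("store.zarr", engine="zarr", indexes={})  # proposed: skip creating indexes entirely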

@TomNicholas
Member Author

Thanks for replying both.

> All that said -- do you have a specific example where this has been problematic?

I'll have to defer to the others I tagged for the gory details. Perhaps one of them can cross-link to the specific issue they were having?

> indexes={} (an empty dictionary) to explicitly skip creating indexes?

I would probably do indexes=False just to avoid using a mutable default, but an option like this sounds good to me.

@shoyer
Member

shoyer commented May 25, 2022

> > another option (post explicit-indexes refactor) would be to support opening a dataset without creating indexes for 1D coordinates along dimensions.
>
> It might indeed be worth considering this case too in #6392. Maybe indexes=None (the default) to create default indexes for 1D coordinates, and indexes={} (an empty dictionary) to explicitly skip creating indexes?

+1 this syntax makes sense to me!

@rabernat
Contributor

rabernat commented May 25, 2022

Here is an example that really highlights the performance cost of always loading dimension coordinates:

import zarr
import xarray as xr

store = zarr.storage.FSStore("s3://mur-sst/zarr/", anon=True)
%time list(zarr.open_consolidated(store))         # -> Wall time: 86.4 ms
%time ds = xr.open_dataset(store, engine="zarr")  # -> Wall time: 17.1 s

%prun confirms that Xarray is spending most of its time just loading data for the time axis, which you can reproduce at the zarr level as:

zgroup = zarr.open_consolidated(store)
%time _ = zgroup['time'][:] # -> Wall time: 14.7 s

Obviously this example is pretty extreme. There are things that could be done to optimize it, etc. But it really highlights the costs of eagerly loading dimension coordinates. If I don't care about label-based indexing for this dataset, I would rather have my 17s back!

👍 to "indexes={} (empty dictionary) to explicitly skip creating indexes".

@shoyer
Member

shoyer commented May 25, 2022

Looking at this mur-sst dataset in particular, it stores time in chunks of size 5. That means fetching the 6443 time values requires 1289 separate HTTP requests -- no wonder it's so slow! If the time axis were instead stored in a single chunk of about 51 KB, Xarray would only need three small HTTP requests to load the lat, lon, and time indexes, which would probably complete in a fraction of a second.
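
For reference, a small sketch to confirm the chunking at the zarr level, reusing the store from the example above (shape and chunk size as reported in this thread):

import zarr

store = zarr.storage.FSStore("s3://mur-sst/zarr/", anon=True)
zgroup = zarr.open_consolidated(store)
print(zgroup["time"].shape)   # (6443,)
print(zgroup["time"].chunks)  # (5,) -> ~1289 chunk reads to materialize the time index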

That said, I agree that this would be nice to have in general.

@rabernat
Contributor

Yes, it is definitely a pathological example. 💣 But the fact remains that there are many cases where we just want to discover a dataset's contents as quickly as possible, avoiding the cost of loading coordinates and creating indexes.

@dcherian
Contributor

This would also fix #2233

@dcherian
Contributor

dcherian commented Jul 16, 2023

Here's one from @lsetiawan that can't be opened because it has a 75 GB time dimension coordinate:

import zarr

# Listing the group with zarr alone works; opening it with xr.open_dataset
# would try to eagerly load the ~75 GB time coordinate to build the index.
zarr.open_group(
    "s3://ooi-data/RS03ECAL-MJ03E-06-BOTPTA302-streamed-botpt_nano_sample",
    mode="r",
    storage_options=dict(anon=True),
)

@itcarroll
Contributor

itcarroll commented Feb 23, 2024

Stumbled into this issue while experimenting with page_buf_size on the h5netcdf backend, looking for ways to get Xarray closer to the speed of h5py for loading variables when the coordinates are too much baggage. As an alternative to #8051, I would like to submit for consideration an "open_variable" method as a fast path from a store to an xarray.Variable (or a Mapping, if given either a list of variables in the store or None for all variables).
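
A hypothetical sketch of what such an open_variable API might look like (this function does not exist in xarray; the name, signature, and file name are illustrative only):

import xarray as xr

# Hypothetical: return a bare xarray.Variable, skipping coordinate and index creation.
var = xr.open_variable("example.nc", "analysed_sst", engine="h5netcdf")

# Hypothetical: pass a list of names (or None for all) to get a mapping of name -> Variable.
variables = xr.open_variable("example.nc", ["analysed_sst", "time"], engine="h5netcdf")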
