Tips on opening/selecting data from over 10000 files. #4944
Replies: 3 comments
-
You could do an ETL pass on each of the files to make them faster to work with next time around. Unfortunately this means making another copy of your data, but it might speed things up in the future.
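One possible form of such an ETL, sketched under the assumption that the files are local netCDF and that converting them to Zarr is acceptable (the paths below are placeholders):

```python
import glob
import os

import xarray as xr

# Placeholder input layout; adjust the glob pattern to your files.
for path in sorted(glob.glob("/data/climate/*.nc")):
    out = os.path.splitext(path)[0] + ".zarr"
    with xr.open_dataset(path) as ds:
        # A one-off conversion to a chunk-friendly format makes later
        # reads much faster, at the cost of keeping a second copy.
        ds.to_zarr(out, mode="w")
```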
-
Can you reduce the number of paths you need to open? Otherwise, you can make your netCDF collection look like a Zarr store: https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935 and https://github.com/intake/fsspec-reference-maker. This will save the time currently spent opening every file.
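A minimal sketch of reading such a reference-based store, assuming a combined reference file (here called combined.json, a placeholder name) has already been generated with fsspec-reference-maker:

```python
import fsspec
import xarray as xr

# "combined.json" is a placeholder for a reference file built beforehand
# with fsspec-reference-maker; it maps a Zarr layout onto the existing
# netCDF/HDF5 chunks without copying any data.
fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
```

Opened this way, the collection needs only a single metadata read instead of touching every netCDF file up front.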
-
My (possibly hacky) solution is to do my subselecting with a "preprocess" function instead of with .sel, where I do the selecting with integer indexes because decode_cf=False is faster. I've found that, in general, lazy loading with open_mfdataset does not work as I expect it to. @dcherian, any problems with my approach, and what do you think about the fix?
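A minimal sketch of this kind of per-file subselection via preprocess (not necessarily the exact code referenced above; the glob pattern and timestamps are placeholders, and it uses label-based selection rather than decode_cf=False with integer indexes):

```python
import numpy as np
import xarray as xr

# Placeholder timestamps of interest.
wanted = np.array(["2000-01-01T00:00", "2000-01-02T00:00"], dtype="datetime64[ns]")

def keep_wanted_times(ds):
    # Runs on each file before concatenation, so every file is cut down
    # to the requested timestamps as early as possible.
    return ds.sel(time=ds["time"].isin(wanted))

ds = xr.open_mfdataset(
    "/data/climate/*.nc",
    preprocess=keep_wanted_times,
    combine="by_coords",
    parallel=True,
)
```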
-
Hello everyone,
I've been using xarray for a few months and it has served me well for processing data from netCDF files.
I normally process aggregations of climate data involving under 50 files, but I now need to process aggregations that sometimes involve over 11200 files.
Basically what I do is: open a set of files into a dataset, select the data that matches an array of timestamps, and use that data for calculations (means, medians, etc.).
Currently I'm using the same process as for the smaller reads, which is what I found in the docs:
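Roughly, it looks like this (a sketch only; the glob pattern, timestamps, and variable name are placeholders, not the exact values used):

```python
import numpy as np
import xarray as xr

# Placeholders throughout: the glob pattern, timestamps, and variable
# name are illustrative only.
timestamps = np.array(["2000-01-01T12:00"], dtype="datetime64[ns]")

ds = xr.open_mfdataset("/data/climate/*.nc", combine="by_coords")
subset = ds.sel(time=timestamps)
result = subset["tas"].mean("time").compute()
```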
This takes about 4.6 hours to run.
Does anyone have any suggestions on how to speed up this process, whether by using other configurations for open_mfdataset or a whole different method of opening/processing the files?
If you need any other info, feel free to ask.
Thanks.