Tips on opening/selecting data from over 10000 files. #4944
Replies: 3 comments
-
You could do an ETL pass on each of the files to make them faster to work with next time around. Unfortunately this means making another copy of your data, but it might speed things up in the future.
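One possible form of such an ETL, sketched under the assumption that the files are local netCDF and that converting them to Zarr is acceptable (the paths below are placeholders):

```python
import glob
import os

import xarray as xr

# Placeholder input layout; adjust the glob pattern to your files.
for path in sorted(glob.glob("/data/climate/*.nc")):
    out = os.path.splitext(path)[0] + ".zarr"
    with xr.open_dataset(path) as ds:
        # A one-off conversion to a chunk-friendly format makes later
        # reads much faster, at the cost of keeping a second copy.
        ds.to_zarr(out, mode="w")
```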
-
Can you reduce the number of paths you need to open? Otherwise, you can make your netCDF collection look like a Zarr store: https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935 and https://github.com/intake/fsspec-reference-maker. This will save the time currently spent opening every file.
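A minimal sketch of reading such a reference-based store, assuming a combined reference file (here called combined.json, a placeholder name) has already been generated with fsspec-reference-maker:

```python
import fsspec
import xarray as xr

# "combined.json" is a placeholder for a reference file built beforehand
# with fsspec-reference-maker; it maps a Zarr layout onto the existing
# netCDF/HDF5 chunks without copying any data.
fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
```

Opened this way, the collection needs only a single metadata read instead of touching every netCDF file up front.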
-
My (possibly hacky) solution is to do my subselecting with a "preprocess" function instead of with .sel, where I do the selecting with integer indexes because decode_cf=False is faster. I've found that, in general, lazy loading with open_mfdataset does not work as I expect it to. @dcherian, any problems with my approach, and what do you think about the fix?
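A minimal sketch of this kind of per-file subselection via preprocess (not necessarily the exact code referenced above; the glob pattern and timestamps are placeholders, and it uses label-based selection rather than decode_cf=False with integer indexes):

```python
import numpy as np
import xarray as xr

# Placeholder timestamps of interest.
wanted = np.array(["2000-01-01T00:00", "2000-01-02T00:00"], dtype="datetime64[ns]")

def keep_wanted_times(ds):
    # Runs on each file before concatenation, so every file is cut down
    # to the requested timestamps as early as possible.
    return ds.sel(time=ds["time"].isin(wanted))

ds = xr.open_mfdataset(
    "/data/climate/*.nc",
    preprocess=keep_wanted_times,
    combine="by_coords",
    parallel=True,
)
```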
-
Hello everyone,
I've been using xarray for a few months and it has served me well for processing data from netCDF files.
I normally process aggregations of climate data involving under 50 files, but I now need to process aggregations that sometimes involve over 11200 files.
Basically what I do is: open a set of files into a dataset, select the data that matches an array of timestamps, and use that data for calculations (means, medians, etc.).
Currently I'm using the same process as for the smaller reads, which is what I found in the docs:
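Roughly, it looks like this (a sketch only; the glob pattern, timestamps, and variable name are placeholders, not the exact values used):

```python
import numpy as np
import xarray as xr

# Placeholders throughout: the glob pattern, timestamps, and variable
# name are illustrative only.
timestamps = np.array(["2000-01-01T12:00"], dtype="datetime64[ns]")

ds = xr.open_mfdataset("/data/climate/*.nc", combine="by_coords")
subset = ds.sel(time=timestamps)
result = subset["tas"].mean("time").compute()
```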
This takes about 4.6 hours to run.
Does anyone have any suggestions on how to speed up this process, whether by using other configurations for open_mfdataset or a whole different method of opening/processing the files?
If you need any other info, feel free to ask.
Thanks.