Note about using xarray open_mfdataset #88

Open
agstephens opened this issue Nov 16, 2020 · 1 comment

agstephens commented Nov 16, 2020

DKRZ are loading CMIP6 into Zarr. Here are some of their experiences with xarray.open_mfdataset:

One problem arises with the following call:

    ds = xarray.open_mfdataset(catvar.df["path"].to_list(), use_cftime=True, combine="by_coords")

xarray does not interpret the bounds attribute, so the corresponding lat and lon bounds are listed as data variables. That alone might not cause any problems, but on top of that, xarray adds a time dimension to those variables:

    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(1826, 192, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(1826, 384, 2), meta=np.ndarray>
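
The effect can be reproduced without any files. The following is a minimal sketch, assuming that `xr.concat` on in-memory datasets is a fair stand-in for the combining step inside `open_mfdataset`, and that `data_vars="all"` corresponds to the combine setting in effect for the call above. The helper `make_part` and the variable `tas` are hypothetical and only serve the illustration.

```python
import numpy as np
import xarray as xr


def make_part(first_day, ntime=3, nlat=4):
    """Hypothetical helper: a tiny dataset mimicking one file with lat bounds."""
    lat = np.linspace(-60.0, 60.0, nlat)
    half = (lat[1] - lat[0]) / 2
    return xr.Dataset(
        data_vars={
            "tas": (("time", "lat"), np.zeros((ntime, nlat))),
            # The bounds variable has no time dimension in the individual file.
            "lat_bnds": (("lat", "bnds"), np.stack([lat - half, lat + half], axis=1)),
        },
        coords={
            # Plain integer time axis, to keep the sketch small.
            "time": np.arange(first_day, first_day + ntime),
            "lat": ("lat", lat, {"bounds": "lat_bnds"}),
        },
    )


parts = [make_part(0), make_part(3)]

# data_vars="all" concatenates every data variable, so lat_bnds gains "time".
all_vars = xr.concat(
    parts, dim="time", data_vars="all", coords="minimal", compat="equals"
)

# data_vars="minimal" only concatenates variables that already have "time";
# compat="override" takes lat_bnds from the first dataset without comparing.
minimal = xr.concat(
    parts, dim="time", data_vars="minimal", coords="minimal", compat="override"
)

print(all_vars["lat_bnds"].dims)
print(minimal["lat_bnds"].dims)
```

Printing the two `dims` tuples shows whether the bounds variable picked up the extra dimension under each setting.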

DKRZ used:

    xarray.open_mfdataset(catvar.df["path"].to_list(),
                          decode_cf=True,
                          concat_dim="time",
                          data_vars='minimal',
                          coords='minimal',
                          compat='override')

These settings are taken from the xarray tutorial; with them, the bnds variables no longer get a time dimension. They did not include use_cftime, which might cause other problems, as I found when converting the result back to netCDF.
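
For reference, the two calls can be combined into one. This is only a sketch under some assumptions: the wrapper name `open_cmip6_files` is hypothetical, and `combine="nested"` is added because recent xarray versions reject `concat_dim` when combining by coordinates. Whether `use_cftime` is still accepted as a plain keyword depends on the installed xarray version.

```python
import xarray as xr


def open_cmip6_files(paths, use_cftime=True):
    """Hypothetical wrapper: open several files without adding a time
    dimension to variables (such as lat_bnds) that do not already have one.

    `paths` must already be sorted in time order, because combine="nested"
    concatenates the files in the order given.
    """
    return xr.open_mfdataset(
        paths,
        combine="nested",      # assumption: needed alongside concat_dim in recent xarray
        concat_dim="time",
        data_vars="minimal",   # only concatenate variables that already have "time"
        coords="minimal",
        compat="override",     # take non-concatenated variables from the first file
        use_cftime=use_cftime,
    )
```

A call would look like `ds = open_cmip6_files(sorted(catvar.df["path"].to_list()))`, with `catvar` being the catalogue object from the snippets above.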

sol1105 commented Mar 4, 2022

The problem with the added time dimension for bounds variables can be avoided by using the parameter decode_coords="all":

    ds = xarray.open_mfdataset("/path/to/files/*.nc", decode_coords="all")
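
A small in-memory sketch of why this helps: with `decode_coords="all"`, variables named in a `bounds` attribute are promoted to coordinates, and coordinates that are identical across the inputs are not concatenated. The example assumes that `xr.decode_cf` followed by `xr.concat` approximates what `open_mfdataset` does per file and when combining, and that `coords="different"` matches the combine setting in the xarray versions discussed here. `make_part` and `tas` are hypothetical names.

```python
import numpy as np
import xarray as xr


def make_part(first_day, ntime=3, nlat=4):
    """Hypothetical helper: a tiny dataset mimicking one undecoded file."""
    lat = np.linspace(-60.0, 60.0, nlat)
    half = (lat[1] - lat[0]) / 2
    return xr.Dataset(
        data_vars={
            "tas": (("time", "lat"), np.zeros((ntime, nlat))),
            "lat_bnds": (("lat", "bnds"), np.stack([lat - half, lat + half], axis=1)),
        },
        coords={
            "time": np.arange(first_day, first_day + ntime),
            # The CF "bounds" attribute links lat to lat_bnds.
            "lat": ("lat", lat, {"bounds": "lat_bnds"}),
        },
    )


raw = [make_part(0), make_part(3)]

# Decoding with decode_coords="all" turns lat_bnds into a coordinate.
decoded = [xr.decode_cf(part, decode_coords="all") for part in raw]

# Even with data_vars="all", lat_bnds is now a coordinate; it is equal in
# both parts, so coords="different" leaves it without a time dimension.
combined = xr.concat(
    decoded, dim="time", data_vars="all", coords="different", compat="equals"
)

print("lat_bnds" in decoded[0].coords)
print(combined["lat_bnds"].dims)
```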

However, there is another problem related to xarray.open_mfdataset:
The encoding dictionary gets lost somewhere during the merging operation of the datasets of the respective files (pydata/xarray#2436).

This causes problems, for example, with cf-xarray when it tries to detect coordinates or bounds, and apparently also with the time axis encoding (as seen in the linked issue). I managed to avoid at least the cf-xarray problems with bounds and coordinate detection by applying xarray's decode functionality only after the datasets have been read in (this, however, leaves the unnecessary time dimension in place):

    ds = xarray.open_mfdataset("/path/to/files/*.nc")
    ds = xarray.decode_cf(ds, decode_coords="all")
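
For the lost encoding itself, one possible workaround is to copy the encoding back from a single file that was opened on its own. The sketch below is not taken from the linked issue; `restore_encoding` is a hypothetical helper, and the small datasets at the end merely stand in for a combined dataset and a reference file.

```python
import numpy as np
import xarray as xr


def restore_encoding(combined, reference):
    """Hypothetical helper: copy encoding entries from `reference` onto
    `combined` wherever `combined` does not define them itself."""
    out = combined.copy()
    for name, ref_var in reference.variables.items():
        if name in out.variables:
            # Keys already present on the combined variable take precedence.
            merged = {**ref_var.encoding, **out.variables[name].encoding}
            out.variables[name].encoding = merged
    if not out.encoding:
        out.encoding = dict(reference.encoding)
    return out


# Stand-in for one file opened on its own, with its time encoding intact.
reference = xr.Dataset({"tas": ("time", np.zeros(2))}, coords={"time": [0, 1]})
reference.variables["time"].encoding = {
    "units": "days since 2000-01-01",
    "calendar": "noleap",
}

# Stand-in for the combined dataset whose encoding was lost.
combined = xr.Dataset({"tas": ("time", np.zeros(4))}, coords={"time": [0, 1, 2, 3]})

fixed = restore_encoding(combined, reference)
print(fixed.variables["time"].encoding)
```

In practice `reference` might come from `xarray.open_dataset` on the first file of the list, assuming all files share the same encoding.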
