
testing NetCDFtoZarrSequentialRecipe on a few CMIP6 datasets #47

Closed · naomi-henderson opened this issue Jan 23, 2021 · 5 comments
Labels: recipe enhancement (Solving this requires us to enhance the recipe classes)

@naomi-henderson (Contributor)

@rabernat, I have begun to test how far we can get with your basic NetCDFtoZarrSequentialRecipe. The tutorial is a great start for learning how to use such a recipe! I was initially confused by two silly points, which might be worth a comment in the tutorial. First, one must also specify how many time slices are in each file (1 in your example). The other was this:

> 👀 Inspect the Xarray HTML repr carefully by clicking on the buttons to expand the different sections.
>
> ✅ Is the shape of the variable what we expect?
> ✅ Is time going in the right order?
> ✅ Do the variable attributes make sense?

because I initially thought the 'buttons' were the green check marks, and I wondered what sort of odd notebook extension you were using!? Then I realized that 'Xarray HTML repr' just meant the output printed by `ds_chunk` in the prior cell!

Anyway, I could only find one CMIP6 dataset that would work all the way through (except that I couldn't pre-process to move the *_bnds variables to coordinates). You can see my attempts and their difficulties in this notebook.

But altogether very promising!
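
For reference, the first point above looks roughly like this in the recipe setup. This is a minimal sketch based on the tutorial I was following; the import path, the example URLs, and the exact parameter names (other than `xarray_open_kwargs`) are from memory and may have changed:

```python
# Minimal sketch of a tutorial-style recipe for one of the CMIP6 datasets.
# The import path and parameter names follow the pangeo-forge version I was
# testing, from memory -- they may differ in newer releases.
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe

# Placeholder URLs standing in for the real CMIP6 netCDF files, in time order.
input_urls = [
    "https://data.example.org/ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc",
    "https://data.example.org/ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_190101-195012.nc",
]

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,
    sequence_dim="time",    # the files are concatenated along time
    inputs_per_chunk=1,     # one input file per Zarr chunk
    nitems_per_input=612,   # time slices per file (12 months x 51 years): the thing one must know up front
    xarray_open_kwargs={"drop_variables": ["height"]},  # how I drop the height coordinate
)
```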

@rabernat (Contributor)

Naomi, this is incredibly helpful. I'm just going to paste your main bullet points here for the record:

- There are very few datasets which have netCDF files with a constant number of time slices in each file.
- Almost all datasets need a working cftime environment (the same notebook works with our usual `open_mfdataset` call); see the decoding sketch at the end of this comment.
- Pre-processing is not available in `open_dataset`.
- Most individual netCDF files are already too big for one chunk and need to be split up.

- tests[0]:
  - This is the only test case which works with the basic recipe.
  - I can drop the `height` coordinate using `xarray_open_kwargs`.
  - I can't do pre-processing to make the `*_bnds` data variables into coordinates, since `open_dataset` does not take a pre-processing argument.
- tests[1]:
  - The last netCDF file has fewer time slices.
  - Each netCDF file is 1.5 GB, so chunking by whole netCDF files is not great.
  - The single chunk reports that it is the size of the whole dataset (`ds_chunk.nbytes` = 13449.395536 MB)?
  - cftime is not working properly: issue with 'hours since 1850-01-16 12:00:00.000000' with "calendar 'noleap'".
- tests[2]:
  - The last netCDF file has fewer time slices (does it work if we drop the last file?).
  - Each netCDF file is about 600 MB, which is not great.
  - Unable to decode time units.
- tests[3]:
  - Each netCDF file is about 2.5 GB, which is not great.
  - Unable to decode time units 'hours since 0001-01-16 12:00:00.000000' with "calendar 'noleap'".

There are about 5 distinct issues surfaced by your tests. I'll work on converting them into specific issues in this repo. Everything you raise seems solvable, except for this one:

> There are very few datasets which have netCDF files with a constant number of time slices in each file.

I can see this coming up a lot, and it will introduce significant new complexity into the structure of recipes. It means we first have to scan through all the input files, actually opening them to see what's inside. Then we have to propagate this information through the pipeline in order to prepare the target and figure out each chunk's target region.

But we can do it! 💪
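
As an aside, for the time-decoding failures listed above, the usual workaround at the plain-xarray level is to force cftime-based decoding. A quick sketch, assuming xarray and the cftime package are installed and using a placeholder path:

```python
# Sketch: decoding a non-standard calendar ('noleap') outside the recipe.
# Assumes xarray and cftime are installed; the path is a placeholder.
import cftime
import xarray as xr

path = "ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc"

# Ask xarray to decode times into cftime objects, which handle units like
# 'hours since 1850-01-16 12:00:00.000000' with calendar 'noleap'.
ds = xr.open_dataset(path, use_cftime=True)

# Equivalently, decode by hand when automatic decoding is switched off:
raw = xr.open_dataset(path, decode_times=False)
times = cftime.num2date(
    raw["time"].values,
    units=raw["time"].attrs["units"],
    calendar=raw["time"].attrs.get("calendar", "standard"),
)
```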

@rabernat added the `recipe enhancement` label on Jan 24, 2021
@rabernat (Contributor)

> except couldn't do pre-process to move *_bnds to coordinates

You should check out pydata/xarray#2844. It's going to do the job for you!
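
In the meantime, the transformation itself is straightforward in plain xarray once there is a place to hook in a preprocess function. A sketch of what such a function could look like (the function name is just illustrative):

```python
# Sketch of a preprocess function that promotes the *_bnds data variables
# (lat_bnds, lon_bnds, time_bnds, ...) to coordinates.
import xarray as xr


def bnds_to_coords(ds: xr.Dataset) -> xr.Dataset:
    bnds_vars = [name for name in ds.data_vars if name.endswith("_bnds")]
    return ds.set_coords(bnds_vars)


# Works today with xarray's own preprocess hook, e.g.:
# ds = xr.open_mfdataset(input_urls, preprocess=bnds_to_coords, use_cftime=True)
```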

@rabernat (Contributor)

> The last netcdf file has fewer time slices (does it work if we drop the last file?)

Can you clarify this comment? What happens when you call store_chunk on the final chunk? What specific error do you get?

@naomi-henderson (Contributor, Author)

> > The last netcdf file has fewer time slices (does it work if we drop the last file?)
>
> Can you clarify this comment? What happens when you call store_chunk on the final chunk? What specific error do you get?

For example, take the following 4 netCDF source files (note that the last file is only 14 years long, while the others are 51 years long):

```
['ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc',
 'ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_190101-195012.nc',
 'ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_195101-200012.nc',
 'ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_200101-201412.nc']
```

So, when I run `recipe.store_chunk(all_chunks[0])` followed by `recipe.store_chunk(all_chunks[-1])`, I get:

```
/site-packages/xarray/backends/api.py in _validate_append_dim_and_encoding(ds_to_append, store, append_dim, region, encoding, **open_kwargs)
   1386         if existing_sizes != new_sizes:
   1387             raise ValueError(
-> 1388                 f"variable {var_name!r} already exists with different "
   1389                 f"dimension sizes: {existing_sizes} != {new_sizes}. "
   1390                 f"to_zarr() only supports changing dimension sizes when "

ValueError: variable 'time' already exists with different dimension sizes: {'time': 612} != {'time': 168}. to_zarr() only supports changing dimension sizes when explicitly appending, but append_dim=None.
```

So perhaps the solution lies in setting append_dim?
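
For what it's worth, here is how I understand the two options at the plain-xarray level, using tiny synthetic datasets in place of the real chunks (a sketch assuming xarray, dask, and zarr are installed; this is not the recipe's internal code):

```python
# Two ways xarray's to_zarr can handle a final chunk with a different time
# length.  make_chunk builds toy stand-ins for one netCDF file's worth of data.
import numpy as np
import xarray as xr


def make_chunk(t0, nt):
    return xr.Dataset(
        {"ua": (("time", "lat"), np.zeros((nt, 4)))},
        coords={"time": np.arange(t0, t0 + nt)},
    )


first_chunk = make_chunk(0, 612)    # a "51-year" file (12 * 51 time slices)
last_chunk = make_chunk(612, 168)   # the short "14-year" file

# Option 1: append along time.  Chunk lengths may differ, but the chunks
# must be written strictly in order.
first_chunk.to_zarr("append.zarr", mode="w")
last_chunk.to_zarr("append.zarr", mode="a", append_dim="time")

# Option 2: write full-size metadata first, then fill in each chunk's region.
# Chunks can then be written in any order, but the total time length must be
# known up front (which is why the inputs would need an initial scan).
template = xr.concat([first_chunk, last_chunk], dim="time").chunk({"time": 612})
template.to_zarr("region.zarr", mode="w", compute=False)
first_chunk.to_zarr("region.zarr", mode="a", region={"time": slice(0, 612)})
last_chunk.to_zarr("region.zarr", mode="a", region={"time": slice(612, 780)})
```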

@naomi-henderson (Contributor, Author)

I have now re-run the four tests with the latest version of pangeo-forge and am happy to report that these issues have been resolved. See #51. We can now handle variable-length netCDF files, large netCDF files can be chunked, and the cftime issues have disappeared. Progress!
