
testing NetCDFtoZarrSequentialRecipe on a few CMIP6 datasets #47

Closed · naomi-henderson opened this issue Jan 23, 2021 · 5 comments
Labels: recipe enhancement (Solving this requires us to enhance the recipe classes)

@naomi-henderson (Contributor)

@rabernat, I have begun to test how far we can get with your basic NetCDFtoZarrSequentialRecipe. The tutorial is a great start for learning how to use such a recipe! I was initially confused by two silly points, which might be worth a comment in the tutorial. First, one must also specify how many time slices are in each file (1 in your example). The other was this:

> 👀 Inspect the Xarray HTML repr carefully by clicking on the buttons to expand the different sections.
>
> ✅ Is the shape of the variable what we expect?
> ✅ Is time going in the right order?
> ✅ Do the variable attributes make sense?

because I initially thought the 'buttons' were the green check marks, and I wondered what sort of odd notebook extension you were using!? Then I realized that 'Xarray HTML repr' just meant the output printed by `ds_chunk` in the prior cell!

Anyway, I could only find one CMIP6 dataset that would work all the way through (except that I couldn't pre-process to move the *_bnds variables to coordinates). You can see my attempts and their difficulties in this notebook.

But altogether very promising!
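
For reference, the first point above looks roughly like this in the recipe setup. This is a minimal sketch based on the tutorial I was following; the import path, the example URLs, and the exact parameter names (other than `xarray_open_kwargs`) are from memory and may have changed:

```python
# Minimal sketch of a tutorial-style recipe for one of the CMIP6 datasets.
# The import path and parameter names follow the pangeo-forge version I was
# testing, from memory -- they may differ in newer releases.
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe

# Placeholder URLs standing in for the real CMIP6 netCDF files, in time order.
input_urls = [
    "https://data.example.org/ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc",
    "https://data.example.org/ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_190101-195012.nc",
]

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,
    sequence_dim="time",    # the files are concatenated along time
    inputs_per_chunk=1,     # one input file per Zarr chunk
    nitems_per_input=612,   # time slices per file (12 months x 51 years): the thing one must know up front
    xarray_open_kwargs={"drop_variables": ["height"]},  # how I drop the height coordinate
)
```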

@rabernat (Contributor)

Naomi, this is incredibly helpful. I'm just going to paste your main bullet points here for the record:

- There are very few datasets which have netCDF files with a constant number of time slices in each file.
- Almost all datasets need a working cftime environment (the same notebook works with our usual `open_mfdataset` call); see the decoding sketch at the end of this comment.
- Pre-processing is not available in `open_dataset`.
- Most individual netCDF files are already too big for one chunk and need to be split up.

- tests[0]:
  - This is the only test case which works with the basic recipe.
  - I can drop the `height` coordinate using `xarray_open_kwargs`.
  - I can't do pre-processing to make the `*_bnds` data variables into coordinates, since `open_dataset` does not take a pre-processing argument.
- tests[1]:
  - The last netCDF file has fewer time slices.
  - Each netCDF file is 1.5 GB, so chunking by whole netCDF files is not great.
  - The single chunk reports that it is the size of the whole dataset (`ds_chunk.nbytes` = 13449.395536 MB)?
  - cftime is not working properly: issue with 'hours since 1850-01-16 12:00:00.000000' with "calendar 'noleap'".
- tests[2]:
  - The last netCDF file has fewer time slices (does it work if we drop the last file?).
  - Each netCDF file is about 600 MB, which is not great.
  - Unable to decode time units.
- tests[3]:
  - Each netCDF file is about 2.5 GB, which is not great.
  - Unable to decode time units 'hours since 0001-01-16 12:00:00.000000' with "calendar 'noleap'".

There are about 5 distinct issues surfaced by your tests. I'll work on converting them into specific issues in this repo. Everything you raise seems solvable, except for this one:

> There are very few datasets which have netCDF files with a constant number of time slices in each file.

I can see this coming up a lot, and it will introduce significant new complexity into the structure of recipes. It means we first have to scan through all the input files, actually opening them to see what's inside. Then we have to propagate this information through the pipeline in order to prepare the target and figure out each chunk's target region.

But we can do it! 💪
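
As an aside, for the time-decoding failures listed above, the usual workaround at the plain-xarray level is to force cftime-based decoding. A quick sketch, assuming xarray and the cftime package are installed and using a placeholder path:

```python
# Sketch: decoding a non-standard calendar ('noleap') outside the recipe.
# Assumes xarray and cftime are installed; the path is a placeholder.
import cftime
import xarray as xr

path = "ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc"

# Ask xarray to decode times into cftime objects, which handle units like
# 'hours since 1850-01-16 12:00:00.000000' with calendar 'noleap'.
ds = xr.open_dataset(path, use_cftime=True)

# Equivalently, decode by hand when automatic decoding is switched off:
raw = xr.open_dataset(path, decode_times=False)
times = cftime.num2date(
    raw["time"].values,
    units=raw["time"].attrs["units"],
    calendar=raw["time"].attrs.get("calendar", "standard"),
)
```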

@rabernat added the `recipe enhancement` label on Jan 24, 2021
@rabernat (Contributor)

> except couldn't do pre-process to move *_bnds to coordinates

You should check out pydata/xarray#2844. It's going to do the job for you!
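
In the meantime, the transformation itself is straightforward in plain xarray once there is a place to hook in a preprocess function. A sketch of what such a function could look like (the function name is just illustrative):

```python
# Sketch of a preprocess function that promotes the *_bnds data variables
# (lat_bnds, lon_bnds, time_bnds, ...) to coordinates.
import xarray as xr


def bnds_to_coords(ds: xr.Dataset) -> xr.Dataset:
    bnds_vars = [name for name in ds.data_vars if name.endswith("_bnds")]
    return ds.set_coords(bnds_vars)


# Works today with xarray's own preprocess hook, e.g.:
# ds = xr.open_mfdataset(input_urls, preprocess=bnds_to_coords, use_cftime=True)
```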

@rabernat (Contributor)

> The last netcdf file has fewer time slices (does it work if we drop the last file?)

Can you clarify this comment? What happens when you call store_chunk on the final chunk? What specific error do you get?

@naomi-henderson (Contributor, Author)

> > The last netcdf file has fewer time slices (does it work if we drop the last file?)
>
> Can you clarify this comment? What happens when you call store_chunk on the final chunk? What specific error do you get?

For example, take the following 4 netCDF source files (note that the last file is only 14 years long, while the others are 51 years long):

```
['ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc',
 'ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_190101-195012.nc',
 'ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_195101-200012.nc',
 'ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_200101-201412.nc']
```

So, when I run `recipe.store_chunk(all_chunks[0])` followed by `recipe.store_chunk(all_chunks[-1])`, I get:

```
/site-packages/xarray/backends/api.py in _validate_append_dim_and_encoding(ds_to_append, store, append_dim, region, encoding, **open_kwargs)
   1386         if existing_sizes != new_sizes:
   1387             raise ValueError(
-> 1388                 f"variable {var_name!r} already exists with different "
   1389                 f"dimension sizes: {existing_sizes} != {new_sizes}. "
   1390                 f"to_zarr() only supports changing dimension sizes when "

ValueError: variable 'time' already exists with different dimension sizes: {'time': 612} != {'time': 168}. to_zarr() only supports changing dimension sizes when explicitly appending, but append_dim=None.
```

So perhaps the solution lies in setting append_dim?
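
For what it's worth, here is how I understand the two options at the plain-xarray level, using tiny synthetic datasets in place of the real chunks (a sketch assuming xarray, dask, and zarr are installed; this is not the recipe's internal code):

```python
# Two ways xarray's to_zarr can handle a final chunk with a different time
# length.  make_chunk builds toy stand-ins for one netCDF file's worth of data.
import numpy as np
import xarray as xr


def make_chunk(t0, nt):
    return xr.Dataset(
        {"ua": (("time", "lat"), np.zeros((nt, 4)))},
        coords={"time": np.arange(t0, t0 + nt)},
    )


first_chunk = make_chunk(0, 612)    # a "51-year" file (12 * 51 time slices)
last_chunk = make_chunk(612, 168)   # the short "14-year" file

# Option 1: append along time.  Chunk lengths may differ, but the chunks
# must be written strictly in order.
first_chunk.to_zarr("append.zarr", mode="w")
last_chunk.to_zarr("append.zarr", mode="a", append_dim="time")

# Option 2: write full-size metadata first, then fill in each chunk's region.
# Chunks can then be written in any order, but the total time length must be
# known up front (which is why the inputs would need an initial scan).
template = xr.concat([first_chunk, last_chunk], dim="time").chunk({"time": 612})
template.to_zarr("region.zarr", mode="w", compute=False)
first_chunk.to_zarr("region.zarr", mode="a", region={"time": slice(0, 612)})
last_chunk.to_zarr("region.zarr", mode="a", region={"time": slice(612, 780)})
```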

@naomi-henderson (Contributor, Author)

I have now re-run the four tests with the latest version of pangeo-forge and am happy to report that these issues have been resolved. See #51. We can now handle variable-length netCDF files, large netCDF files can be chunked, and the cftime issues have disappeared. Progress!
