Derived CMIP6 data recipe builder (WIP) #252

Closed
wants to merge 2 commits into from


jbusecke
Contributor

@jbusecke jbusecke commented Dec 3, 2021

@cisaacstern and I had a really productive hack today and I think we made some good progress towards using pangeo-forge to derive datasets from existing ARCO data.

Our Goal

This application of pangeo-forge deviates somewhat from its core, initial mission of migrating legacy datasets into the cloud, but I believe it could really boost the adoption of "cloud-first" workflows in many science contexts. Other attempts have been made to achieve this functionality (#176, #205), but this represents our most successful effort to date.

We set out to produce a derived dataset from arbitrary CMIP6 data and decided to start with a weighted mean of a surface variable. This keeps the computation short and avoids most trouble with dask chunking (averaging over the lateral dimensions within time chunks parallelizes nicely), while still producing something that is actually useful in a science context.
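For readers who want the computation spelled out, here is a minimal NumPy sketch of an area-weighted mean over the lateral dimensions. It is not the code from the notebook: the function name `area_weighted_mean` and the toy arrays are made up for illustration. In xarray the same reduction is available as `da.weighted(weights).mean(dims)`.

```python
import numpy as np


def area_weighted_mean(field, area):
    """Area-weighted mean over the two trailing (lateral) dimensions.

    field: array of shape (time, y, x); NaN marks land or missing cells.
    area:  array of shape (y, x) with cell areas (the role areacello plays).
    Hypothetical helper for illustration only.
    """
    field = np.asarray(field, dtype=float)
    area = np.asarray(area, dtype=float)
    valid = np.isfinite(field)
    # Drop the weight of every missing cell, separately for each time step.
    weights = np.where(valid, area, 0.0)
    numerator = (np.where(valid, field, 0.0) * weights).sum(axis=(-2, -1))
    total = weights.sum(axis=(-2, -1))
    # A time step with no valid cells yields NaN instead of raising.
    with np.errstate(invalid="ignore", divide="ignore"):
        return numerator / total


# Toy data: two time steps on a 2x2 grid, with one missing cell in the first step.
area = np.array([[1.0, 3.0], [2.0, 2.0]])
field = np.array(
    [
        [[10.0, 20.0], [30.0, np.nan]],
        [[1.0, 1.0], [1.0, 1.0]],
    ]
)
result = area_weighted_mean(field, area)
```

Each time step is reduced independently of the others, which is why this kind of average splits cleanly across time chunks.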

Successes

We added a new builder directory containing a module that could be extended with several generalizable functions. These functions build recipes using logic that is specific to CMIP6 and are inspired by the workflow developed in cmip6_preprocessing (which gave us an easy transition from interactive work to a recipe).

We were able to programmatically build a set of recipes for different variables (sea surface temperature and salinity) and two example models, execute these locally, and plot the resulting data.

The builder function relies on 'facets', which are used to query an intake-esm catalog. The same catalog is also queried for the available weights (in this case the surface cell area, areacello) to find the best match to the data (see here for details).
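To make the matching idea concrete, here is a plain-Python sketch of picking the best weight dataset for a given set of facets. It is an assumption-laden illustration, not the builder's actual logic: the function `find_best_metric`, the split into required and preferred facets, and the placeholder model names are all hypothetical. Only the facet keys (`source_id`, `grid_label`, `experiment_id`, `member_id`, `variable_id`) follow the usual CMIP6 naming.

```python
from typing import Dict, List, Optional

Facets = Dict[str, str]

# Assumed rule: weights must come from the same model and grid ...
REQUIRED = ("source_id", "grid_label")
# ... and should preferably also share experiment and member.
PREFERRED = ("experiment_id", "member_id")


def find_best_metric(
    data: Facets, candidates: List[Facets], variable_id: str = "areacello"
) -> Optional[Facets]:
    """Return the candidate weight entry that best matches `data`, or None."""
    eligible = [
        c
        for c in candidates
        if c.get("variable_id") == variable_id
        and all(c.get(key) == data[key] for key in REQUIRED)
    ]
    if not eligible:
        return None
    # Rank eligible entries by how many preferred facets they share with the data.
    return max(eligible, key=lambda c: sum(c.get(k) == data.get(k) for k in PREFERRED))


# Placeholder facets; "ModelA" and "ModelB" are not real source_ids.
data = {
    "source_id": "ModelA",
    "grid_label": "gn",
    "experiment_id": "historical",
    "member_id": "r1i1p1f1",
    "variable_id": "tos",
}
candidates = [
    {**data, "experiment_id": "piControl", "variable_id": "areacello"},
    {**data, "variable_id": "areacello"},
    {**data, "source_id": "ModelB", "variable_id": "areacello"},
]
best = find_best_metric(data, candidates)
```

Here the second candidate wins because it shares both the experiment and the member with the data, while the third is rejected outright for coming from a different model.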

We made a little demo notebook (Big Shoutout to @yuvipanda for the amazing notebooksharing.space 🔥)

@cisaacstern
Member

cisaacstern commented Dec 3, 2021

A few notes on implementation.

Proposed changes in #242 (see also #242 (comment)) may make more complex derivations possible, but we were pleased to see that for the weighted mean example demonstrated in the above-linked notebook, that refactor was actually not strictly necessary. The two issues we did encounter with existing code were:

  1. XarrayZarrRecipe assumes inputs are openable with:

    with xr.open_dataset(f, **kw) as ds:

    where f is typically an fsspec OpenFile, but can also be a string. AFAICT, the only way to get f to be passed as a string here is by setting is_opendap=True on the FilePattern instance. We ended up doing this (and commenting out some is_opendap related conditionals), because it seems that xr.open_dataset doesn't want to open Zarr stores from OpenFiles (even when passing kw=dict(engine="zarr")). Maybe I'm missing something here, but passing f as a string is our temporary workaround.

  2. We are not caching the input for this recipe because, as previously mentioned, it already exists as a Zarr store. Currently, there is no standalone stage for caching metadata (it happens along with cache_input). This is fine for our situation if we are not coarsening: we can just open the input dataset and pass nitems_per_file=len(ds.time) to the FilePattern. With coarsening, however, this becomes trickier. For this reason, we pulled out some of the internal metadata caching code and put it in our notebook. See Skip re-computing metadata cache. #243 (comment) and @rabernat's response below that for further conversation on this issue.

@yuvipanda
Contributor

So happy to see it be used in the wild, @jbusecke!

@duncanwp

Hi @jbusecke! This recipe looks like exactly what I need for pangeo-forge/staged-recipes#134, but it seems to have stalled. Is it likely to be merged, or is there another approach I should follow?

@cisaacstern
Member

👋 @duncanwp, thanks for checking in here. This issue was an early attempt by @jbusecke and me to work on the problem of generalizing recipes derived from ESGF holdings. We have since moved on, and simply forgot to close this issue. The current state of these efforts is well-summarized by Julius in pangeo-data/pangeo-cmip6-cloud#31 (comment). We don't have an end-to-end solution deployed for this today, but have made a lot of headway since this issue, and welcome your collaboration on this. I'll close this issue now, as it's gone stale, but please do follow up on the issue I've linked here, or any of the other issues that it links to. Look forward to working together on this!

@duncanwp

Perfect, thanks @cisaacstern! I'll dive into that and see if / how I can help.

@andersy005 andersy005 deleted the cmip-builder branch October 21, 2022 00:22
4 participants