ClimateBench dataset #43

duncanwp · 2023-08-03T00:50:14Z

Dataset Name

ClimateBench

Dataset URL

Description

I propose creating a pipeline so that more climate models and variables, at a higher temporal resolution, can be ingested into ClimateBench transparently and efficiently. The ClimateBench data consist of post-processed and harmonized CMIP6 input and output files split across experiments/scenarios. This would allow others to expand upon the ClimateBench protocol and apply climate model emulation more generally.

Size

Files of roughly 10 GB each, totaling around 1 TB (depending on storage availability)

License

Unknown

Data Format

NetCDF

Data Format (other)

No response

Access protocol

HTTP(S)

Source File Organization

The CMIP6 data is organized in ESGF into time-sharded files per experiment, per model, per variable. I'm not sure how it is stored in Pangeo, which might be a more natural source.

Example URLs

No response

Authorization

No; data are fully public

Transformation / Processing

The data needs to be put onto a common grid, the piControl subtracted, and units harmonized. Some common statistics may be used for some variables (e.g. 99th percentile of precipitation).
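As a rough illustration of the steps above, here is a minimal NumPy sketch on synthetic arrays. It assumes the fields are already on a common grid (regridding is left out), that the piControl baseline is a simple time mean, and that precipitation arrives as a flux in kg m-2 s-1. The function names, array shapes, and the 365-day year are assumptions made for this example, not the actual ClimateBench code.

```python
import numpy as np


def to_anomaly(scenario, pi_control):
    """Subtract the time-mean piControl field from a scenario field.

    Both arrays are (time, lat, lon) and assumed to share the same grid.
    """
    baseline = pi_control.mean(axis=0)
    return scenario - baseline


def pr_to_mm_per_day(pr_flux):
    """Convert a precipitation flux in kg m-2 s-1 to mm/day."""
    return pr_flux * 86400.0


def annual_percentile(daily, q=99.0, days_per_year=365):
    """Per-year, per-gridcell percentile of a daily (time, lat, lon) field.

    Assumes a fixed-length calendar; trailing incomplete years are dropped.
    """
    n_years = daily.shape[0] // days_per_year
    trimmed = daily[: n_years * days_per_year]
    by_year = trimmed.reshape(n_years, days_per_year, *daily.shape[1:])
    return np.percentile(by_year, q, axis=1)


# Synthetic stand-ins for real model output (values are illustrative only).
pi_control = np.full((50, 4, 8), 287.0)   # 50 years of annual-mean tas, in K
scenario = np.full((10, 4, 8), 289.0)     # 10 years of a warmer scenario
anomaly = to_anomaly(scenario, pi_control)

pr_mm_day = pr_to_mm_per_day(np.array([1e-5]))

# Two identical "years" of daily values 0..364 at every grid cell.
daily = np.tile(np.arange(365.0)[:, None, None], (2, 4, 8))
pr99 = annual_percentile(daily, q=99.0)
```

In a real pipeline the same three functions would more naturally be expressed with xarray so that coordinates and calendars are handled properly; the NumPy version only shows the arithmetic.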

Target Format

Zarr

Comments

Original Pangeo-Forge recipe: pangeo-forge/staged-recipes#134
Pull request regarding processing daily data with pangeo-forge-esgf: jbusecke/pangeo-forge-esgf#9

It might make sense to create this recipe as just a 'pointer' to the underlying CMIP data rather than storing it, but it depends on the compute/storage costs I guess.

Thanks!

duncanwp · 2023-12-12T22:11:56Z

Hey, I'm standing with @cisaacstern and am keen to figure out how to make this work 😊

jbusecke · 2023-12-14T16:56:46Z

Fantastic. Thanks for pinging this issue. I propose that we formalize this effort into a working group.

I would like to include @SammyAgrawal in this group, since I think this is a perfect way to learn more about Apache Beam and fits right into Sammy's project work of building reproducible ML/climate science pipelines. Sammy, I hope this works well for you. I think this is a really great ✨synergy🤗.

Proposed next steps:
- Schedule a Meeting (if possible early January?)
- Pre Meeting TODO:
  - Make a repository (https://github.com/leap-stc/science_orchestration) to keep our efforts neat and organized
  - Add a minimal example of the processing for a single ClimateBench dataset to the repo (Duncan might have provided this at some point, but I think I lost track)
- Milestones:
  - Build an Apache Beam Pipeline that can achieve the minimal example provided.
  - Expand the pipeline to all datasets available in the public bucket
- What I hope we will learn:
  - How to build a reproducible and efficient pipeline that executes xarray commands.
    - Are we going to use https://github.com/google/xarray-beam for this?
  - How to best parameterize this over different inputs (in this case different models, experiments, etc.)
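
One possible way to approach the parameterization question is to describe each input as a small immutable record and generate the full set as a product of models, experiments, and variables. The sketch below is plain Python; `DatasetSpec`, `build_specs`, and `process` are hypothetical names made up for this example, and the model/experiment/variable values are only illustrative. A list like this could later be handed to a Beam pipeline (for example via `beam.Create`), but that part is not shown.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DatasetSpec:
    """One unit of work: a single model/experiment/variable combination."""
    model: str
    experiment: str
    variable: str

    @property
    def key(self):
        # Stable identifier that could name the output store.
        return f"{self.model}.{self.experiment}.{self.variable}"


def build_specs(models, experiments, variables):
    """Expand the parameter lists into every combination."""
    return [DatasetSpec(m, e, v) for m, e, v in product(models, experiments, variables)]


def process(spec):
    """Placeholder for the real per-dataset processing step."""
    return {"key": spec.key, "status": "pending"}


specs = build_specs(
    models=["NorESM2-LM"],
    experiments=["historical", "ssp245"],
    variables=["tas", "pr"],
)
results = [process(spec) for spec in specs]
```

Keeping the spec hashable (`frozen=True`) means it can double as a dictionary key or a pipeline element key without extra conversion.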

duncanwp · 2023-12-19T01:18:25Z

Perfect, thanks @jbusecke! Please send a doodle for January.

I have a single-model version pulling from ESGF here: https://github.com/duncanwp/ClimateBench/blob/main/prepare_data.py

jbusecke · 2023-12-19T18:14:15Z

When2Meet Poll

jbusecke · 2024-01-09T22:33:54Z

Just wanted to ping this thread. @cisaacstern could you fill out your availability? @SammyAgrawal could you double check the dates next week? It would be fantastic if we could make Tue (16th) or Wed (17th) work.

Also a quick check back to @duncanwp: I think the code you provided up top is only to download from ESGF. I was under the assumption we want to avoid this step and load directly from the cloud zarr stores? What code is used to process the final output? Is that this notebook: https://github.com/duncanwp/ClimateBench/blob/main/prep_input_data.ipynb

Excited to push this ahead!

jbusecke · 2024-01-10T19:17:38Z

I sent an invite for Tue (Jan 16) 3-4PM EST!

duncanwp · 2024-01-10T19:36:12Z

Hi all,

Apologies but that time slot is now filled for me and I can’t move it… was there another that worked?

Cheers,
Duncan

jbusecke · 2024-01-11T15:41:03Z

The ranges that worked according to the poll were:

Tue 16th Jan 2-4pm EST
Wed 17th Jan 3-5PM EST

Do you have availability in those ranges?

duncanwp · 2024-01-11T16:59:26Z

Yes, any of the times on Wednesday would still work!

jbusecke · 2024-01-15T23:46:31Z

I moved it to Wed 3pm! Looking forward to it.

duncanwp added the dataset label Aug 3, 2023

duncanwp mentioned this issue Aug 3, 2023

Add support for daily output frequency jbusecke/pangeo-forge-esgf#9

Closed
