ClimateBench dataset #43

Open
duncanwp opened this issue Aug 3, 2023 · 10 comments
@duncanwp

duncanwp commented Aug 3, 2023

Dataset Name

ClimateBench

Dataset URL

https://zenodo.org/record/7064308

Description

I propose to create a pipeline so that more climate models and variables, at a higher temporal resolution, can be ingested into ClimateBench transparently and efficiently. The inputs consist of post-processed and harmonized CMIP6 input and output files, split across experiments/scenarios. This would allow others to expand upon the ClimateBench protocol and apply climate model emulation more generally.

Size

Files of roughly 10 GB each, totaling around 1 TB (depending on storage availability)

License

Unknown

Data Format

NetCDF

Data Format (other)

No response

Access protocol

HTTP(S)

Source File Organization

The CMIP6 data is organized in ESGF into time-sharded files per experiment, per model, and per variable. I'm not sure how they're stored in Pangeo, which might be a more natural source.
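
To make the layout concrete, here is a minimal sketch of grouping time-sharded files by (experiment, model, variable). It assumes the standard CMIP6 filename template `<variable>_<table>_<model>_<experiment>_<member>_<grid>_<time range>.nc`; the filenames and the helper `group_time_shards` are illustrative only, and a real pipeline would probably also key on table and member.

```python
from collections import defaultdict

# Illustrative filenames only (not a listing of real ESGF holdings).
files = [
    "tas_Amon_NorESM2-LM_ssp245_r1i1p1f1_gn_202101-203012.nc",
    "tas_Amon_NorESM2-LM_ssp245_r1i1p1f1_gn_201501-202012.nc",
    "pr_day_NorESM2-LM_ssp245_r1i1p1f1_gn_20150101-20201231.nc",
    "tas_Amon_NorESM2-LM_piControl_r1i1p1f1_gn_160001-160912.nc",
]


def group_time_shards(filenames):
    """Group shard filenames by (experiment, model, variable), sorted by time range."""
    groups = defaultdict(list)
    for name in filenames:
        stem = name[:-3] if name.endswith(".nc") else name
        # Assumes exactly seven underscore-separated fields, as in the template above.
        variable, table, model, experiment, member, grid, time_range = stem.split("_")
        groups[(experiment, model, variable)].append((time_range, name))
    # Sorting on the time-range string puts the shards in chronological order.
    return {key: [n for _, n in sorted(shards)] for key, shards in groups.items()}


grouped = group_time_shards(files)
for key, shards in grouped.items():
    print(key, len(shards))
```

Each group is then one logical dataset that could be concatenated along time.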

Example URLs

No response

Authorization

No; data are fully public

Transformation / Processing

The data needs to be regridded onto a common grid, the piControl baseline subtracted, and the units harmonized. Common statistics may be used for some variables (e.g. the 99th percentile of precipitation).
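
As a rough illustration of the arithmetic involved (regridding is left out, since it needs a dedicated library), here is a dependency-free sketch. It assumes the piControl subtraction means removing the time-mean of the control run and that precipitation arrives as a flux in kg m-2 s-1; both are assumptions, and the function names are hypothetical. In practice this would be done with xarray on gridded data rather than on plain lists.

```python
import math

SECONDS_PER_DAY = 86_400


def to_mm_per_day(pr_kg_m2_s):
    """Convert precipitation flux (kg m-2 s-1) to mm/day.

    1 kg of water spread over 1 m2 is a layer 1 mm deep.
    """
    return [v * SECONDS_PER_DAY for v in pr_kg_m2_s]


def subtract_control(scenario, control):
    """Return scenario anomalies relative to the control-run time mean (assumed baseline)."""
    baseline = sum(control) / len(control)
    return [v - baseline for v in scenario]


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Toy values, purely illustrative.
tas_anomaly = subtract_control([288.0, 289.0], [287.0, 287.5, 286.5])
pr_mm_day = to_mm_per_day([1e-5, 5e-5])
pr_p99 = percentile(list(range(101)), 99)
```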

Target Format

Zarr

Comments

Original Pangeo-Forge recipe: pangeo-forge/staged-recipes#134
Pull request regarding processing daily data with pangeo-forge-esgf: jbusecke/pangeo-forge-esgf#9

It might make sense to create this recipe as just a 'pointer' to the underlying CMIP data rather than storing it, but it depends on the compute/storage costs I guess.

Thanks!

@duncanwp
Author

Hey, I'm standing with @cisaacstern and keen to figure out how to make this work 😊

@jbusecke
Contributor

Fantastic. Thanks for pinging this issue. I propose that we formalize this effort into a working group.

I would like to include @SammyAgrawal in this group, since I think this is a perfect way to learn more about Apache Beam, and it fits right into Sammy's project work of building reproducible ML/climate science pipelines. Sammy, I hope this works well for you; I think this is a really great ✨synergy🤗.

  • Proposed next steps:
    • Schedule a Meeting (if possible early January?)
    • Pre Meeting TODO:
      • Make a repository (https://github.com/leap-stc/science_orchestration) to keep our efforts neat and organized
      • Add a minimal example of the processing for a single ClimateBench dataset to the repo (Duncan might have provided this at some point, but I think I lost track)
    • Milestones:
      • Build an Apache Beam Pipeline that can achieve the minimal example provided.
      • Expand the pipeline to all datasets available in the public bucket
    • What I hope we will learn:
      • How to build a reproducible and efficient pipeline that executes xarray commands.
      • How to best parameterize this over different inputs (in this case different models, experiments, etc.)
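
One possible way to think about the parameterization, sketched in plain Python under my own assumptions: `DatasetSpec`, `build_specs`, and `process` are hypothetical names, and `process` is only a placeholder for the actual xarray work. In an Apache Beam pipeline the list of specs would be the initial collection and `process` the mapped step; here a list comprehension stands in for that.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DatasetSpec:
    """One unit of work: a single model/experiment/variable combination."""
    model: str
    experiment: str
    variable: str


def build_specs(models, experiments, variables):
    """Expand the parameter lists into every combination, in a stable order."""
    return [DatasetSpec(m, e, v) for m, e, v in product(models, experiments, variables)]


def process(spec):
    """Placeholder for the real processing; returns a hypothetical output key."""
    return f"{spec.model}/{spec.experiment}/{spec.variable}"


specs = build_specs(["NorESM2-LM"], ["historical", "ssp245"], ["tas", "pr"])
outputs = [process(s) for s in specs]
print(outputs)
```

Keeping each spec small and immutable means the same pipeline definition can be rerun with a different list of models or experiments without touching the processing code.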

@duncanwp
Author

Perfect, thanks @jbusecke! Please send a doodle for January.

I have a single-model version pulling from ESGF here: https://github.com/duncanwp/ClimateBench/blob/main/prepare_data.py

@jbusecke
Contributor

When2Meet Poll

@jbusecke
Contributor

jbusecke commented Jan 9, 2024

Just wanted to ping this thread. @cisaacstern could you fill out your availability? @SammyAgrawal could you double check the dates next week? It would be fantastic if we could make Tue (16th) or Wed (17th) work.

Also a quick check back to @duncanwp: I think the code you provided above only downloads from ESGF. I was under the assumption that we want to avoid this step and load directly from the cloud Zarr stores? What code is used to process the final output? Is it this notebook: https://github.com/duncanwp/ClimateBench/blob/main/prep_input_data.ipynb

Excited to push this ahead!

@jbusecke
Contributor

I sent an invite for Tue (Jan 16) 3-4PM EST!

@duncanwp
Author

duncanwp commented Jan 10, 2024 via email

@jbusecke
Contributor

The ranges that worked according to the poll were:

  • Tue 16th Jan 2-4 PM EST
  • Wed 17th Jan 3-5 PM EST

Do you have availability in those ranges?

@duncanwp
Author

Yes, any of the times on Wednesday would still work!

@jbusecke
Contributor

I moved it to Wed 3 PM! Looking forward to it.
