
Example pipeline for GFS Archive #50

Open
raybellwaves opened this issue Jun 17, 2021 · 10 comments

Comments

@raybellwaves

raybellwaves commented Jun 17, 2021

Source Dataset

  • Link to the website: https://rda.ucar.edu/datasets/ds084.1/ (full archive, ~2015 to present). Another source is s3://noaa-gfs-bdp-pds/gfs.*, although it only covers ~2021-02-26 onwards.
  • The file format: opendap / grib
  • How are the source files organized? One file per forecast hour.
  • How are the source files accessed? Via pydap, or by downloading the grib files.
  • Example access code:
import xarray as xr

url = "https://rda.ucar.edu/thredds/dodsC/files/g/ds084.1/2020/20200201/gfs.0p25.2020020100.f000.grib2"
ds = xr.open_dataset(url)

or

import requests
import cfgrib

# log in to RDA to obtain session cookies
login_url = "https://rda.ucar.edu/cgi-bin/login"
ret = requests.post(
    login_url,
    data={"email": EMAIL, "passwd": PASSWD, "action": "login"},
)

# download one grib2 file and open it with cfgrib
file = "https://rda.ucar.edu/data/ds084.1/2020/20200201/gfs.0p25.2020020100.f000.grib2"
req = requests.get(file, cookies=ret.cookies, allow_redirects=True)
with open("gfs.0p25.2020020100.f000.grib2", "wb") as f:
    f.write(req.content)
dss = cfgrib.open_datasets("gfs.0p25.2020020100.f000.grib2")

or

import cfgrib
import s3fs

# the NOAA Open Data bucket is public, so anonymous access should work
fs = s3fs.S3FileSystem(anon=True)
fs.get("s3://noaa-gfs-bdp-pds/gfs.20210914/12/atmos/gfs.t12z.pgrb2.0p25.f000", "gfs.0p25.2021091412.f000.grib2")
dss = cfgrib.open_datasets("gfs.0p25.2021091412.f000.grib2")

Transformation / Alignment / Merging

Concat along reftime (init time) and time
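
For example, a rough sketch of the alignment step for a single cycle; the THREDDS URL pattern and forecast-hour subset below are just for illustration:

import xarray as xr

cycle = "2020020100"
fhrs = range(0, 13, 3)  # f000 ... f012, a small example subset
urls = [
    "https://rda.ucar.edu/thredds/dodsC/files/g/ds084.1/2020/20200201/"
    f"gfs.0p25.{cycle}.f{fh:03d}.grib2"
    for fh in fhrs
]
# concatenate the forecast hours of one cycle along the (valid) time dimension
ds = xr.open_mfdataset(urls, combine="nested", concat_dim="time")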

Output Dataset

zarr store.

I imagine one giant zarr store would be impractical, so the data could instead be stored per init time, with each store holding all forecast times for that cycle. Ideally init time would be an explicit (expanded) dimension so the stores can be concatenated later.
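
Concretely, something like the following (a minimal, self-contained sketch with a dummy variable standing in for the real GFS fields; names and paths are illustrative):

import numpy as np
import pandas as pd
import xarray as xr

init = pd.Timestamp("2020-02-01T00")
times = init + pd.to_timedelta(np.arange(0, 13, 3), unit="h")  # valid times for this cycle
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.zeros((len(times), 3, 3)))},
    coords={"time": times, "lat": [0.0, 0.25, 0.5], "lon": [0.0, 0.25, 0.5]},
)

# make init_time an explicit length-1 dimension so per-cycle stores can be
# concatenated along it later, e.g. xr.concat([...], dim="init_time")
ds = ds.expand_dims(init_time=[init])
ds.to_zarr(f"gfs.0p25.{init:%Y%m%d%H}.zarr", mode="w", consolidated=True)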

@cisaacstern
Member

@raybellwaves thanks for opening this issue, and apologies for the delay in responding. Just tagging a few others who may be more familiar with grib-specific considerations.

Do any of @rabernat, @TomAugspurger, or @martindurant know if we can handle .grib2 inputs at this time?

@martindurant

martindurant commented Jun 25, 2021 via email

@rabernat
Contributor

Noting the similarity to #17 and #18.

We should have no problem with grib, as long as xarray can open the files (which the example code above already illustrates). For this to work, you will need to set copy_input_to_local_file=True in XarrayZarrRecipe.

@cisaacstern
Member

cisaacstern commented Jun 25, 2021

@raybellwaves, looks like we're good to go. 😄

Are you interested in learning how to develop recipes yourself? If so, I'd be delighted to guide you through the process. As with conda-forge, the strength of this initiative will ultimately come from the community of recipe developers who've learned to use these tools, and it would be great to have you on board.

The first step would be for you to open a PR to this repo that adds a new draft recipe at recipes/NCEP_GFS/ncep_gfs_recipe.py. That module will instantiate a pangeo_forge_recipes.recipes.XarrayZarrRecipe object, and it can be your best guess at an approach based on the docs here. Once you've pushed a first commit, I can jump in and start making suggestions and/or commits to your PR.
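
For orientation only, here is a very rough sketch of the kind of thing that module might contain, assuming the FilePattern / XarrayZarrRecipe API from the docs; the date range and URL builder are purely illustrative, and RDA authentication is not handled here:

import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# forecast cycles to include; this short range is just for illustration
dates = pd.date_range("2021-01-01", "2021-01-02", freq="6H")

def make_url(time):
    # hypothetical URL builder for the RDA ds084.1 archive (analysis hour only)
    return (
        "https://rda.ucar.edu/data/ds084.1/"
        f"{time:%Y}/{time:%Y%m%d}/gfs.0p25.{time:%Y%m%d%H}.f000.grib2"
    )

pattern = FilePattern(make_url, ConcatDim("time", list(dates)))

recipe = XarrayZarrRecipe(
    pattern,
    copy_input_to_local_file=True,  # grib inputs need to be local files for cfgrib
    xarray_open_kwargs={"engine": "cfgrib"},
)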

@adair-kovac

Is writing a separate zarr store for every init time a good idea? I've been struggling a lot with how to build time series from the hrrrzarr data, which was written that way. Opening thousands of hours in a loop can take hours, and it isn't nearly as simple or efficient to parallelize as it would be if time were just a dimension in the dataset from the get-go.

That said, that dataset isn't optimized for xarray: because of the way it's written as a zarr hierarchy, the .zmetadata isn't visible to xarray, so I can't use the consolidated option. But I noticed that both #17 and #18 are making init time a dimension rather than creating separate stores (IIUC). How would you decide between the two approaches?

@rabernat
Contributor

Is writing a separate zarr store for every init time a good idea?

I don't think so. I think we want init_time and lead_time both as dimensions. In order for this to work, we need to resolve pangeo-forge/pangeo-forge-recipes#140.

@adair-kovac

@rabernat Is there any concern with xarray's handling of the time dimension for continuously updating datasets? I assume the GFS (like the HRRR and GEFS) produces new model runs frequently. Some of my colleagues have been avoiding creating a time dimension in these situations because of cases where it's been painful, but it's not clear to me whether any of those cases apply here. Does the .zmetadata get updated efficiently when you just append data?

Also, do we actually need #140 for this one? Shouldn't you be able to just do it in stages: look at a single init_time and concat over lead_time, then concat the results over init_time? Or do recipes have to be a single stage?

@raybellwaves
Author

Good question. I imagine there are open questions about one giant zarr store versus smaller zarr stores that can be concatenated; it may be use-case driven. There are probably lessons to be learned from tabular (parquet) workflows, where data can likewise be stored as separate files or appended (row-wise, or as a new row group / partition, e.g. partitioned on reftime). A step beyond what people do with tabular data would be streaming data. Once the data is out of grib and into a zarr store of some kind, it should be much quicker to iterate on these questions.

@martindurant

Quick note that the "reference" views I have been working with could provide both, without having to copy or reorganise the data. They can be used to present a single logical zarr over many zarr datasets.
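
For illustration, a minimal sketch of how such a reference view is consumed once built (the reference JSON filename and bucket options below are assumptions, not an existing artifact):

import fsspec
import xarray as xr

# open the reference view as a filesystem pointing at the original remote data
fs = fsspec.filesystem(
    "reference",
    fo="combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
# the mapper behaves like a single zarr store spanning all the referenced files
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)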

@adair-kovac

@martindurant Where would I get started if I wanted to try that out?
