
Example pipeline for GFS Archive #50

Open
raybellwaves opened this issue Jun 17, 2021 · 10 comments

Comments

@raybellwaves

raybellwaves commented Jun 17, 2021

Source Dataset

  • Link to the website: https://rda.ucar.edu/datasets/ds084.1/ (full archive, ~2015 to present). Another source is s3://noaa-gfs-bdp-pds/gfs.*, although it only covers ~2021-02-26 onwards.
  • The file format: opendap / grib
  • How are the source files organized? One file per forecast hour.
  • How are the source files accessed? Via pydap, or by downloading the grib files.
  • Example access code:
import xarray as xr

url = "https://rda.ucar.edu/thredds/dodsC/files/g/ds084.1/2020/20200201/gfs.0p25.2020020100.f000.grib2"
ds = xr.open_dataset(url)

or

import requests
import cfgrib

# log in to RDA to obtain session cookies
login_url = "https://rda.ucar.edu/cgi-bin/login"
ret = requests.post(
    login_url,
    data={"email": EMAIL, "passwd": PASSWD, "action": "login"},
)

# download one grib2 file and open it with cfgrib
file = "https://rda.ucar.edu/data/ds084.1/2020/20200201/gfs.0p25.2020020100.f000.grib2"
req = requests.get(file, cookies=ret.cookies, allow_redirects=True)
with open("gfs.0p25.2020020100.f000.grib2", "wb") as f:
    f.write(req.content)
dss = cfgrib.open_datasets("gfs.0p25.2020020100.f000.grib2")

or

import cfgrib
import s3fs

# the NOAA Open Data bucket is public, so anonymous access should work
fs = s3fs.S3FileSystem(anon=True)
fs.get("s3://noaa-gfs-bdp-pds/gfs.20210914/12/atmos/gfs.t12z.pgrb2.0p25.f000", "gfs.0p25.2021091412.f000.grib2")
dss = cfgrib.open_datasets("gfs.0p25.2021091412.f000.grib2")

Transformation / Alignment / Merging

Concat along reftime (init time) and time
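
For example, a rough sketch of the alignment step for a single cycle; the THREDDS URL pattern and forecast-hour subset below are just for illustration:

import xarray as xr

cycle = "2020020100"
fhrs = range(0, 13, 3)  # f000 ... f012, a small example subset
urls = [
    "https://rda.ucar.edu/thredds/dodsC/files/g/ds084.1/2020/20200201/"
    f"gfs.0p25.{cycle}.f{fh:03d}.grib2"
    for fh in fhrs
]
# concatenate the forecast hours of one cycle along the (valid) time dimension
ds = xr.open_mfdataset(urls, combine="nested", concat_dim="time")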

Output Dataset

zarr store.

I imagine one giant zarr store would be impractical, so the data could instead be stored per init time, with each store holding all forecast times for that cycle. Ideally init time would be an explicit (expanded) dimension so the stores can be concatenated later.
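
Concretely, something like the following (a minimal, self-contained sketch with a dummy variable standing in for the real GFS fields; names and paths are illustrative):

import numpy as np
import pandas as pd
import xarray as xr

init = pd.Timestamp("2020-02-01T00")
times = init + pd.to_timedelta(np.arange(0, 13, 3), unit="h")  # valid times for this cycle
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.zeros((len(times), 3, 3)))},
    coords={"time": times, "lat": [0.0, 0.25, 0.5], "lon": [0.0, 0.25, 0.5]},
)

# make init_time an explicit length-1 dimension so per-cycle stores can be
# concatenated along it later, e.g. xr.concat([...], dim="init_time")
ds = ds.expand_dims(init_time=[init])
ds.to_zarr(f"gfs.0p25.{init:%Y%m%d%H}.zarr", mode="w", consolidated=True)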

@cisaacstern
Member

@raybellwaves thanks for opening this issue, and apologies for the delay in responding. Just tagging a few others who may be more familiar with grib-specific considerations.

Do any of @rabernat, @TomAugspurger, or @martindurant know if we can handle .grib2 inputs at this time?

@martindurant

martindurant commented Jun 25, 2021 via email

@rabernat
Contributor

Noting the similarity to #17 and #18.

We should have no problem with grib, as long as xarray can open the files (which the example code above already illustrates). For this to work, you will need to set copy_input_to_local_file=True in XarrayZarrRecipe.

@cisaacstern
Member

cisaacstern commented Jun 25, 2021

@raybellwaves, looks like we're good to go. 😄

Are you interested in learning how to develop recipes yourself? If so, I'd be delighted to guide you through the process. As with conda-forge, the strength of this initiative will ultimately come from the community of recipe developers who've learned to use these tools, and it would be great to have you on board.

The first step would be for you to open a PR to this repo that adds a new draft recipe at recipes/NCEP_GFS/ncep_gfs_recipe.py. That module will instantiate a pangeo_forge_recipes.recipes.XarrayZarrRecipe object, and it can be your best guess at an approach based on the docs here. Once you've pushed a first commit, I can jump in and start making suggestions and/or commits to your PR.
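
For orientation only, here is a very rough sketch of the kind of thing that module might contain, assuming the FilePattern / XarrayZarrRecipe API from the docs; the date range and URL builder are purely illustrative, and RDA authentication is not handled here:

import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# forecast cycles to include; this short range is just for illustration
dates = pd.date_range("2021-01-01", "2021-01-02", freq="6H")

def make_url(time):
    # hypothetical URL builder for the RDA ds084.1 archive (analysis hour only)
    return (
        "https://rda.ucar.edu/data/ds084.1/"
        f"{time:%Y}/{time:%Y%m%d}/gfs.0p25.{time:%Y%m%d%H}.f000.grib2"
    )

pattern = FilePattern(make_url, ConcatDim("time", list(dates)))

recipe = XarrayZarrRecipe(
    pattern,
    copy_input_to_local_file=True,  # grib inputs need to be local files for cfgrib
    xarray_open_kwargs={"engine": "cfgrib"},
)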

@adair-kovac

Is writing a separate zarr store for every init time a good idea? I've been struggling a lot with how to build time series from the hrrrzarr data, which was written that way. Opening thousands of hours in a loop can take hours, and it isn't nearly as simple or efficient to parallelize as it would be if time were just a dimension in the dataset from the get-go.

That said, that dataset isn't optimized for xarray: because of the way it's written as a zarr hierarchy, the .zmetadata isn't visible to xarray, so I can't use the consolidated option. But I noticed that both #17 and #18 are making init time a dimension rather than creating separate stores (IIUC). How would you decide between the two approaches?

@rabernat
Contributor

Is writing a separate zarr store for every init time a good idea?

I don't think so. I think we want init_time and lead_time both as dimensions. In order for this to work, we need to resolve pangeo-forge/pangeo-forge-recipes#140.

@adair-kovac

@rabernat Is there any concern with xarray's handling of the time dimension for continuously updating datasets? I assume the GFS (like the HRRR and GEFS) produces new model runs frequently. Some of my colleagues have been avoiding creating a time dimension in these situations because of cases where it's been painful, but it's not clear to me whether any of those cases apply here. Does the .zmetadata get updated efficiently when you just append data?

Also, do we actually need #140 for this one? Shouldn't you be able to just do it in stages: look at a single init_time and concat over lead_time, then concat the results over init_time? Or do recipes have to be a single stage?

@raybellwaves
Author

Good question. I imagine there are open questions about one giant zarr store versus smaller zarr stores that can be concatenated; it may be use-case driven. There are probably lessons to be learned from tabular (parquet) workflows, where data can likewise be stored as separate files or appended (row-wise, or as a new row group / partition, e.g. partitioned on reftime). A step beyond what people do with tabular data would be streaming data. Once the data is out of grib and into a zarr store of some kind, it should be much quicker to iterate on these questions.

@martindurant

Quick note that the "reference" views I have been working with could provide both, without having to copy or reorganise the data. They can be used to present a single logical zarr over many zarr datasets.
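
For illustration, a minimal sketch of how such a reference view is consumed once built (the reference JSON filename and bucket options below are assumptions, not an existing artifact):

import fsspec
import xarray as xr

# open the reference view as a filesystem pointing at the original remote data
fs = fsspec.filesystem(
    "reference",
    fo="combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
# the mapper behaves like a single zarr store spanning all the referenced files
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)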

@adair-kovac

@martindurant Where would I get started if I wanted to try that out?
