Create a test notebook comparing alternate methods for querying the NWM data #16
@rabernat @martindurant -- you are tagged in this issue because of your background. We welcome your input, but aren't asking for it yet, just trying to provide attribution.
@CoreyKrewson-NOAA @karnesh @arpita0911patel @AndersNilssonNoaa -- thanks for the clarifying correction, which I think I've incorporated.
I am happy to advise on parquet, zarr and kerchunk. However, I should begin by saying that benchmarking is hard, especially across formats as different as these. Some datasets suit a columnar format like parquet, and others are naturally multi-dimensional. Furthermore, performance depends on the specific encodings relative to the content of the data, and on the expected access patterns relative to the chosen chunking structure. You might very well find that the choice of options available within any particular format makes a bigger difference than the choice of format.
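To make the chunking-versus-access-pattern point concrete, here is a back-of-the-envelope sketch; the array shape, chunk sizes, and feature count are illustrative and are not the actual NWM chunking:

```python
def chunks_touched(query_slice, chunk_sizes):
    """Count how many chunks a hyperslab query must read.

    query_slice: list of (start, stop) per dimension
    chunk_sizes: chunk length per dimension
    """
    n = 1
    for (start, stop), c in zip(query_slice, chunk_sizes):
        first = start // c
        last = (stop - 1) // c
        n *= last - first + 1
    return n

# Hypothetical NWM-like array: (time=8760 hours, feature=2_700_000 reaches).
# Query: a full-year time series at a single feature.
ts_query = [(0, 8760), (1000, 1001)]

# Chunked for map-style access (few times, many features):
map_chunks = (24, 100_000)
# Chunked for time-series access (many times, few features):
series_chunks = (8760, 1000)

print(chunks_touched(ts_query, map_chunks))     # 365 chunk reads
print(chunks_touched(ts_query, series_chunks))  # 1 chunk read
```

The same store can thus differ by orders of magnitude on the same query depending only on chunk shape, which is why format-versus-format comparisons need to hold chunking choices explicit.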
I'm unfamiliar with Azavea's original plans for parquet. Those original plans may have changed. I'd also add that the proposed queries seem to skew toward time series at one (or a few) channel features. This makes sense because it is the most common use-case (or rather, the most commonly requested capability). However, it seems to leave the question of scalability unanswered. In my experience, the most common queries have been:
Evaluations have tended to involve a comparison of (a) simulated vs observed, (b) forecast vs observed, or (c) forecast vs simulated. As @jameshalgren mentioned, there are multiple dimensions to the data, including reference time (issue time), valid time, feature ID, model configuration (short range, medium range), and variable (streamflow, velocity, etc).
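Those dimensions suggest the shape of a record key. A minimal sketch, in which the field names and values are illustrative rather than any project's actual schema; the one structural fact it encodes is that valid time is derived from reference time plus lead:

```python
from datetime import datetime, timedelta
from typing import NamedTuple

class ForecastKey(NamedTuple):
    """One coordinate in the forecast hypercube (names are illustrative)."""
    reference_time: datetime   # issue time of the forecast cycle
    lead_hours: int            # the f001, f002, ... step
    feature_id: int            # NWM channel reach identifier
    configuration: str         # e.g. "short_range", "medium_range"
    variable: str              # e.g. "streamflow", "velocity"

    @property
    def valid_time(self) -> datetime:
        # Valid time is derived, not independent: reference time plus lead.
        return self.reference_time + timedelta(hours=self.lead_hours)

k = ForecastKey(datetime(2022, 8, 19, 0), 18, 101, "short_range", "streamflow")
print(k.valid_time)  # 2022-08-19 18:00:00
```

Any storage layout has to pick which of these axes are cheap to slice, which is exactly the evaluation-pattern question raised above.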
Some (very) useful snippets, based on kerchunk, posted in the PR comments here:
Closing this issue and moving to AlabamaWaterInstitute/data_access_examples#7 |
@danames -- here are a few thoughts on the queries that a NWM datastore might need to support. |
Context
Dealing with NWM data
To understand the directions we are exploring to better serve the National Water Model data out to users, it is helpful to give some context. Importantly, this is an exploration and is experimental, so we are probably making mistakes. If you notice one, please email us: [email protected]
The Xarray/Zarr/Kerchunk options
In 2017, Ryan Abernathey (Columbia University; professor of data science and advocate of open, reproducible science) started this thread seeking a
I don't have all the pieces of the intervening development, but in 2022 the conversation was still clearly underway, and significant progress had been made. Critically, the xarray library (with many contributions from Dr. Abernathey) had matured to include significant improvements to multi-file access and remote-data access. Also, cloud-friendly formats developed in other domains have matured and expanded. While the underlying question seems to be the same, there may be a light (or perhaps several lights) at the end of the tunnel, as can be seen in this discussion stemming from a question about the best (subjective) method for hosting cloud-optimized data. Two excerpts from that discussion seem to capture most of the important gist:
From @martindurant on that thread
And also from @rabernat
So that seems to suggest options Zarr and Kerchunk.
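As a concrete illustration of what the kerchunk option produces, here is a hand-built, minimal reference set in kerchunk's version-1 JSON layout: zarr metadata is inlined as strings, and each chunk key maps to `[url, byte_offset, byte_length]` inside an existing netCDF file. The bucket path, byte ranges, and array shape below are invented for the sketch:

```python
import json

refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "streamflow/.zarray": json.dumps({
            "zarr_format": 2,
            "shape": [1, 2_700_000],     # illustrative, not the real reach count
            "chunks": [1, 2_700_000],
            "dtype": "<i4",
            "compressor": {"id": "zlib", "level": 2},
            "fill_value": -999900,
            "order": "C",
            "filters": None,
        }),
        # Chunk (0, 0) lives at a byte range of the original netCDF file;
        # the path and offsets here are made up for illustration.
        "streamflow/0.0": [
            "s3://example-bucket/nwm.t00z.short_range.channel_rt.f001.conus.nc",
            30000, 12000,
        ],
    },
}

# fsspec's "reference://" filesystem consumes JSON like this and presents
# it to zarr/xarray as if it were a native zarr store.
print(len(refs["refs"]))  # 3 keys: group metadata, array metadata, one chunk
```

The key point is that no data is copied: the reference file is a small index over the original netCDF archives.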
TODO: find a better recipe-based link for Zarr + NWM... @jmunroe or @jmccreight might have a suggestion; consider recipes already posted on the AWS bucket site for the NWM retrospective.
But what about Parquet?
This conversation and this discourse thread explore that option and seem to indicate that for lower-dimensional data, Parquet's columnar format performs very well. On the other hand, as we explore this, one of the challenges specific to the NWM is the additional cardinality introduced by the forecast cycle -- the data need to preserve not only a time-stamp from the valid time, but also some sense of forecast time or issue time. That may be challenging with a purely columnar format.
TODO: link to the Azavea/NWC work with Parquet+NWM. @aaraney or @jarq6c may be able to point here.
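The cardinality issue can be seen in miniature with a plain long-format (Parquet-style) table; the schema and values below are illustrative:

```python
from datetime import datetime, timedelta

# In a long/columnar layout, every forecast value carries BOTH a
# reference (issue) time and a valid time, so the same valid time
# appears once per overlapping forecast cycle -- the extra cardinality
# described above.
rows = []
for cycle in range(24):                         # 24 short_range cycles/day
    ref = datetime(2022, 8, 19) + timedelta(hours=cycle)
    for lead in range(1, 19):                   # f001..f018
        rows.append({
            "feature_id": 101,
            "reference_time": ref,
            "valid_time": ref + timedelta(hours=lead),
            "streamflow": 0.0,                  # placeholder value
        })

# How many rows describe 19:00Z on the same day at this one feature?
target = datetime(2022, 8, 19, 19)
dupes = [r for r in rows if r["valid_time"] == target]
print(len(dupes))  # 18 -- one per cycle whose window covers 19:00Z
```

A table indexed on valid time alone would silently mix these 18 forecasts; queries must always carry the reference-time predicate as well.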
Re-index and re-publish the netcdfs
Basically, the entire challenge highlighted by the Azavea group (see below) is addressed for the operational output by a script, run at NCO as a post-processing step, which produces the re-indexed, concatenated version of the full NWM output here. The python scripts running in the background of water.noaa.gov/map directly and interactively access these reordered files with relatively high performance.
And Database methods?
There are at least two database-backed methods that we have seen have some success. One such database, coupled with PostGIS, drives the backend for the data services shared via https://maps.water.noaa.gov/server/rest/services/nwm (see also the ArcGIS Online portal page for those same services here). The other was associated with the evaluation method used to adjudicate the recent USBR/CEATI streamflow rodeo. (Some comments from a participant are here; see also the news release from the judges and the topcoder leaderboard.)
The rodeo-backend method was essentially a datacube concept, and there are several threads to explore there, perhaps through the teaser here regarding the work by @cgentemann, or through the open data cube concept, which also seems to head specifically in this direction, especially for the gridded data.
So what is the problem?
Very specifically aligned with (but not affiliated with) the purposes of CIROH, the Azavea group did a project exploring methods for faster access to the NWM datasets in the cloud. A presentation of their work made to the national water modeling group is linked here (last accessed 2022-08-19).
Basically, they laid things out like this:
National Water Model Reanalysis Dataset
National Water Model Queries
Real-time output
While the Azavea group was specifically exploring the retrospective dataset, similar statements are true of the real-time outputs.
Proposal
Goal:
Ultimately, the goal is to create a notebook or collection of notebooks to compare Zarr, Parquet, Kerchunk, Postgres (NWC method), and Postgres (opendatacube method) for accessing multiple hydrographs from the NWM output.
For each of those, there is a preprocessing step of some kind to prepare the data, then a query step. The query step could take a lot of forms, and the nature of the query (scope, frequency, etc.) might influence what is the best method or details of the method.
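For the query step, a common harness can wrap each backend so that only the query, not the preprocessing, is timed. A minimal sketch, assuming each candidate store (Zarr, Parquet, kerchunk, Postgres, ...) is wrapped in a zero-argument callable; `fake_query` below is a stand-in, not a real store:

```python
import statistics
import time

def benchmark(query_fn, repeats=5):
    """Time one query method; return (median, best) wall-clock seconds.

    Taking the median over several repeats smooths caching and network
    jitter; the best time hints at the warm-cache floor.
    """
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        query_fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples), min(samples)

# Stand-in for a real hydrograph query against one of the stores.
def fake_query():
    sum(range(10_000))

med, best = benchmark(fake_query)
print(med >= best)  # True
```

Reporting both numbers matters here because cold-start behavior (opening a store, reading metadata) differs sharply between the candidate methods.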
Queries to test:
Like before, we need to be able to compare to observed flows, where there are gages – so the NWIS data series is still relevant.
Pick the analysis and assimilation time series (need to ask Corey which hour; start with tm00) for an arbitrary time-window.
Pick a single current forecast trace (e.g., short_range, t00z, f001-f018; or medium_range, t12z, f001-f208) and plot it.
Pick a single historic forecast trace (see above) and plot it against an observation time series.
Pick a particular historic time-window and, for all valid forecasts in the time window, plot the envelope of forecasts against the analysis_assim or against observed (if that is available).
A horizon-based skill metric requires slicing at a particular forecast range across multiple forecasts – like for hurricanes: http://www.hurricanescience.org/science/forecast/models/modelskill/
Assemble the AnA_No_DA time series for a window on a segment and compare to the regular AnA. (I believe the No_DA is only run once per day… The comparison to any other value on that day is still valid, it’s just updated and posted less frequently.)
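The horizon-based skill query above amounts to grouping forecasts by lead time and scoring each horizon separately. A minimal sketch; the records are synthetic and `rmse_by_lead` is a hypothetical helper, not part of any existing tool:

```python
import math
from collections import defaultdict

def rmse_by_lead(records):
    """Group (lead_hours, forecast, observed) triples by lead time and
    return RMSE per forecast horizon. Skill typically degrades as lead
    time grows, which is what a horizon-based metric exposes."""
    grouped = defaultdict(list)
    for lead, fcst, obs in records:
        grouped[lead].append((fcst - obs) ** 2)
    return {lead: math.sqrt(sum(sq) / len(sq))
            for lead, sq in sorted(grouped.items())}

# Synthetic example: two forecasts at a short lead, two at a long lead.
records = [
    (1, 10.1, 10.0), (1, 9.9, 10.0),    # short lead: small errors
    (18, 12.0, 10.0), (18, 8.5, 10.0),  # long lead: larger errors
]
skill = rmse_by_lead(records)
print(skill[1] < skill[18])  # True
```

Note the access pattern this implies: slicing at a fixed lead hour across many reference times, which is close to orthogonal to the single-trace queries above and stresses storage layouts differently.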