
Create a test notebook comparing alternate methods for querying the NWM data #16

Closed
jameshalgren opened this issue Aug 24, 2022 · 7 comments

Comments

@jameshalgren
Member

jameshalgren commented Aug 24, 2022

Context

Dealing with NWM data

To understand the directions we are exploring to better serve National Water Model data to users, it is helpful to give some context. Importantly, this is exploratory and experimental work, so we are probably making mistakes. If you notice one, please email us: [email protected]

The Xarray/Zarr/Kerchunk options

In 2017, Ryan Abernathey (Columbia University professor and advocate of open, reproducible science) started this thread seeking a

path forward to a truly scalable [cloud] data store for xarray

I don't have all the pieces of the intervening development, but in 2022 the conversation was still clearly underway, and significant progress had been made. Critically, the xarray library (with many contributions from Dr. Abernathey) had matured to include significant improvements to multi-file and remote-data access. Developments with cloud-friendly formats from other domains had also matured and expanded. While the underlying question seems to be the same, there may be a light (or perhaps several lights) at the end of the tunnel, as can be seen in this discussion stemming from a question about the best (subjective) method for hosting cloud-optimized data.

Two excerpts from that discussion seem to capture most of the important gist:

From @martindurant on that thread

In brief: converting to zarr gets you the goodness of cloud-friendly access, and lets you choose the compression and chunking, but duplicates the data, with the associated cost. kerchunk is designed to get you cloud-optimised access to the original data (no duplication, just a small index file), but you are limited to the original chunking/compression. kerchunking will not be possible for all data sets.

And also from @rabernat

I’m leading strongly toward the following recommendation: the primary data store on the cloud should be the original netCDF files + a kerchunk index. From this, it should be really easy to create copies / views / transformations of the data that are optimized for other use cases. For example, you could easily use rechunker to transform [a subset of] the data to support a different access pattern (e.g. timeseries analysis). Or you could put xpublish on top of the dataset to provide on-the-fly computation of derived quantities.

So that seems to suggest two options: Zarr and Kerchunk.
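As a rough sketch of the Kerchunk path (the reference JSON, bucket, and feature_id below are placeholders, and the reference file itself would be built separately with kerchunk's SingleHdf5ToZarr and MultiZarrToZarr tools), the original NetCDF files can then be read lazily through xarray's zarr engine:

```python
import fsspec
import xarray as xr

# Hypothetical, pre-built kerchunk reference file for a set of NWM channel_rt
# outputs sitting in S3; the JSON itself would be produced separately with
# kerchunk.hdf.SingleHdf5ToZarr and kerchunk.combine.MultiZarrToZarr.
REFERENCE_JSON = "nwm_short_range_channel_rt.json"

fs = fsspec.filesystem(
    "reference",
    fo=REFERENCE_JSON,
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)

# Pull one reach's hydrograph without downloading the full files;
# the feature_id value is a placeholder.
hydrograph = ds["streamflow"].sel(feature_id=101)
print(hydrograph.to_series().head())
```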

TODO: find a better recipe-based link for Zarr + NWM... @jmunroe or @jmccreight might have a suggestion; consider recipes already posted on the AWS bucket site for the NWM retrospective.

But what about Parquet?

This conversation and this Discourse thread explore that option and seem to indicate that, for lower-dimensional data, Parquet's columnar format will perform very well. On the other hand, one of the challenges specific to the NWM is the additional cardinality introduced by the forecast cycle -- the data need to preserve not only a time stamp for the valid time, but also some sense of forecast time or issue time. That may well be challenging with a purely columnar format.
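To make the cardinality point concrete, here is a minimal sketch of one way a long-format Parquet table could carry both the issue time and the valid time; the column names and partitioning are assumptions, not an established schema:

```python
import pandas as pd

# Illustrative only: a long-format table where each row is one forecast value.
# Keeping both reference_time (issue time) and valid_time as columns is one
# way to handle the forecast-cycle cardinality in a columnar store.
df = pd.DataFrame(
    {
        "feature_id": [101, 101, 101],
        "reference_time": pd.to_datetime(["2022-08-24 00:00"] * 3),
        "valid_time": pd.to_datetime(
            ["2022-08-24 01:00", "2022-08-24 02:00", "2022-08-24 03:00"]
        ),
        "streamflow": [1.2, 1.4, 1.3],  # m^3 s^-1
    }
)

# Partitioning by issue time keeps each forecast cycle in its own files, so a
# "latest forecast for one reach" query touches only a small slice of the data.
df.to_parquet(
    "nwm_short_range_parquet",
    partition_cols=["reference_time"],
    index=False,
)
```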

TODO: link to the Azavea/NWC work with Parquet + NWM; @aaraney or @jarq6c may be able to point here.

Re-index and re-publish the NetCDFs

Basically, the entire challenge highlighted by the Azavea group (see below) is already addressed for the operational output by a script, run at NCO as a post-processing step on the NWM output, which produces the re-indexed, concatenated version of the full NWM output here. The Python scripts running behind water.noaa.gov/map directly and interactively access these reordered files with relatively high performance.
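A very rough sketch of the same re-indexing idea (not the NCO post-processing script itself; file names and the feature_id are placeholders):

```python
import glob

import xarray as xr

# Concatenate one forecast cycle's hourly channel_rt files along time and
# republish them as a single file.
paths = sorted(glob.glob("nwm.t00z.short_range.channel_rt.f*.conus.nc"))
ds = xr.open_mfdataset(paths, combine="nested", concat_dim="time")

# In practice the rewrite would also set the on-disk chunking (e.g. the
# netCDF4 "chunksizes" encoding) so per-reach time series are contiguous;
# that detail is omitted here.
ds.to_netcdf("nwm.t00z.short_range.channel_rt.reindexed.conus.nc")

# The reordered file can then be queried per reach relatively cheaply.
reindexed = xr.open_dataset("nwm.t00z.short_range.channel_rt.reindexed.conus.nc")
print(reindexed["streamflow"].sel(feature_id=101).values)
```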

And Database methods?

There are at least two database-backed methods that we have seen succeed to some degree. One, coupled with PostGIS, drives the backend for the data services shared via https://maps.water.noaa.gov/server/rest/services/nwm (see also the ArcGIS Online portal page for those same services here). The other was associated with the evaluation method used to adjudicate the recent USBR/CEATI streamflow rodeo (see some comments from a participant here, the news release from the judges, and the Topcoder leaderboard).

The rodeo-backend method was essentially a datacube concept, and there are several threads to explore there, perhaps starting from the teaser here regarding the work by @cgentemann, or from the Open Data Cube concept, which also seems to head in exactly this direction, especially for the gridded data.

So what is the problem?

Closely aligned with (though not affiliated with) the purposes of CIROH, the Azavea group did a project exploring methods for faster access to the NWM datasets in the cloud. A presentation of their work made to the national water modeling group is linked here (last accessed 2022-08-19).

Basically, they laid things out like this:

National Water Model Reanalysis Dataset

  • A 26 year (1993 - 2018) retrospective simulation that provides historical context to current real-time conditions
  • Stored as NetCDF files, 4 products for each day, 2 for every hour, 2 every four hours
  • Each file represents the state of the entire country at that time slice

Each NWM output file represents the state of the entire country at each model output time.

National Water Model Queries

  • Most real-world queries are concerned with particular regions (e.g. HUCs) that span a range of time
  • Currently, this requires downloading a very large amount of data (over the range of time), discarding most of it (all except the particular region), and transforming it, which is slow, inefficient, and expensive!
  • Ideally users should be able to specify the attributes, space, and time ranges, and get just that data, in a readily usable format

NWM output queries tend to be for a particular location across the range of valid times.
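A naive illustration of why that access pattern is expensive against the per-timestep files (paths and the feature_id are placeholders): building one reach's hydrograph means touching every full-CONUS file in the window just to keep a single value from each.

```python
import glob

import xarray as xr

# Every file in the time window must be opened (and, for remote data,
# largely downloaded) even though only one value per file is kept.
paths = sorted(glob.glob("nwm.*.analysis_assim.channel_rt.tm00.conus.nc"))

values = []
for path in paths:
    with xr.open_dataset(path) as ds:
        values.append(float(ds["streamflow"].sel(feature_id=101)))

print(f"{len(values)} files read, one value kept from each")
```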

Real-time output

While the Azavea group was specifically exploring the retrospective dataset, similar statements are true of the real-time outputs.

Proposal

Goal:

Ultimately, the goal is to create a notebook or collection of notebooks comparing Zarr, Parquet, Kerchunk, Postgres (NWC method), and Postgres (Open Data Cube method) for accessing multiple hydrographs from the NWM output.

For each of those, there is a preprocessing step of some kind to prepare the data, then a query step. The query step could take many forms, and the nature of the query (scope, frequency, etc.) might influence which method is best, or the details of the method.
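A minimal sketch of how the notebook might separate and time those two steps (the backend interface and names here are hypothetical, not an existing API):

```python
import time

# Each backend would expose a prepare() step (build the Zarr store, kerchunk
# index, Parquet dataset, or database tables) and a query() step (e.g.
# "hydrograph for one feature_id over a time window"); only query() is timed.
def benchmark(backends, repeats=3):
    results = {}
    for name, backend in backends.items():
        backend.prepare()  # one-time preprocessing, deliberately not timed
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            backend.query()
            times.append(time.perf_counter() - start)
        results[name] = min(times)  # best-of-N to reduce noise
    return results
```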

Queries to test:

Like before, we need to be able to compare to observed flows where there are gages – so the NWIS data series is still relevant.

  • Pick the analysis-and-assimilation time series (need to ask Corey which hour; start with tm00) for an arbitrary time window.
  • Pick a single current forecast trace (e.g., short_range, t00z, f001-f018; or medium_range, t12z, f001-f208) and plot it.
  • Pick a single historic forecast trace (see above) and plot it against an observation time series.
  • Pick a particular historic time window and, for all forecasts valid in that window, plot the envelope of forecasts against the analysis_assim or against observations (if available); see the sketch after this list.
  • Compute a horizon-based skill metric, which requires slicing at a particular forecast range across multiple forecasts – as is done for hurricanes: http://www.hurricanescience.org/science/forecast/models/modelskill/
  • Assemble the AnA_No_DA time series for a window on a segment and compare to the regular AnA. (I believe the No_DA is only run once per day… The comparison to any other value on that day is still valid; it's just updated and posted less frequently.)
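As a sketch of the forecast-envelope query above, assuming the hydrographs have already been pulled into a long-format DataFrame with reference_time, valid_time, and streamflow columns (those column names are assumptions carried over from the earlier sketches, not an NWM convention):

```python
import pandas as pd

def forecast_envelope(forecasts: pd.DataFrame) -> pd.DataFrame:
    """Min/max streamflow across all forecasts valid at each time."""
    return (
        forecasts.groupby("valid_time")["streamflow"]
        .agg(["min", "max"])
        .rename(columns={"min": "flow_low", "max": "flow_high"})
    )

# Plotted against the analysis_assim trace, e.g.:
#   ax = forecast_envelope(short_range_df).plot()
#   analysis_assim_df.plot(x="valid_time", y="streamflow", ax=ax)
```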


@jameshalgren
Member Author

@rabernat @martindurant -- you are tagged in this issue because of your background. We welcome your input, but aren't asking for it yet, just trying to provide attribution.

@jameshalgren
Member Author

jameshalgren commented Aug 24, 2022

@CoreyKrewson-NOAA @karnesh @arpita0911patel
I put this together to frame the discussion.

@AndersNilssonNoaa -- thanks for the clarifying correction, which I think I've incorporated.

@martindurant

I am happy to advise on parquet, zarr and kerchunk. However, I should begin by saying that benchmarking is hard, especially across formats as different as these. There will be datasets that suit a columnar format like parquet, and others that are naturally multi-dimensional. Furthermore, performance depends on the specific encodings versus the content of the data, and on the expected access patterns versus the chosen chunking structure. You might very well find that the choice of options available within any particular format makes a bigger difference than the choice of format.
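For instance (a toy sketch, with arbitrary array sizes and store names), the same data can be written to Zarr with chunk layouts that favor very different access patterns:

```python
import numpy as np
import xarray as xr

# A per-reach time-series read touches every chunk in the first layout but
# only a handful in the second.
ds = xr.Dataset(
    {"streamflow": (("time", "feature_id"), np.random.rand(1000, 500))}
)

# Chunked like the source files: one chunk per time step.
ds.chunk({"time": 1, "feature_id": 500}).to_zarr("per_timestep.zarr", mode="w")

# Rechunked for time-series access: all times together, features split up.
ds.chunk({"time": 1000, "feature_id": 50}).to_zarr("per_reach.zarr", mode="w")
```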

@jameshalgren jameshalgren changed the title Create a test notebook comparing alternate methods for downloading the NWM data Create a test notebook comparing alternate methods for querying the NWM data Aug 25, 2022
@jarq6c

jarq6c commented Aug 26, 2022

I'm unfamiliar with Azavea's original plans for parquet. Those original plans may have changed.

I'd also add that the proposed queries seem to skew toward time series at one (or a few) channel features. This makes sense because it is the most common use-case (or rather, the most commonly requested capability). However, it seems to leave the question of scalability unanswered. In my experience, the most common queries have been:

  1. Long time series of simulated streamflow (ANA_NoDA) for an arbitrary channel reach. Reaches are not necessarily limited to USGS gage locations, since some municipalities and academic groups manage their own gauge networks but know which NWM reaches correspond to their gauges.
  2. Multiple forecast streamflow time series (18 hours to 10 days) over a specified period for an arbitrary channel reach.
  3. Multiple simulated streamflow time series from a particular spatially contiguous region affected by a significant flooding event. Spatial scales can vary from a small town or US county to a third of the continent following wide scale tropical events.
  4. Same as 3, but forecast streamflow across multiple issue times.

Evaluations have tended to involve a comparison of (a) simulated vs observed, (b) forecast vs observed, or (c) forecast vs simulated.

As @jameshalgren mentioned, there are multiple dimensions to the data including reference time (issue time), valid time, feature ID, model configuration (short range, medium range), and variable (streamflow, velocity, etc).

@jameshalgren
Member Author

Some (very) useful snippets, based on kerchunk, posted in the PR comments here:
jmunroe/national-water-model#1

@jameshalgren
Member Author

Closing this issue and moving to AlabamaWaterInstitute/data_access_examples#7

@jameshalgren
Member Author

@danames -- here are a few thoughts on the queries that a NWM datastore might need to support.
