
Cached FIM - Part 1a - Initial VPP Workflow Implementation #604

Merged · 21 commits · Dec 22, 2023

Conversation

TylerSchrag-NOAA (Contributor) commented Dec 20, 2023

This is the second of several PRs to implement the major Cached FIM Workflow enhancement. This PR includes most of the application changes, but is not fully tested with all active pipelines. Subsequent PRs will be submitted based on testing this with all pipelines on TI (some bugs expected), as well as at least several more minor components of the Cached FIM Workflow (notably the full implementation of special FIM configurations like AEP FIM, CatFIM, etc.)

Broadly speaking, the Cached FIM Enhancement utilizes a new AWS Redshift data warehouse DB (set up in Part 0 of this PR series) to store every HAND synthetic rating curve (hydrotable) step that our pipelines process, along with the extent geometry of the upper 1-ft stage value (note: this means that all produced FIM is now rounded up to the nearest stage foot). On subsequent FIM runs, the Redshift HAND cache is queried before HAND processing takes place, and cached extent geometries are used when streamflow falls within the range of a cached hydrotable step (just like the Ras2FIM steps Corey implemented previously, although those steps have also been overhauled/generalized as part of this new process).
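
To make the cache-hit logic concrete, here is a minimal, hypothetical sketch of the decision described above: a cached hydrotable step covers a discharge range, and its (upper-bound, 1-ft) stage geometry is reused when the current streamflow falls inside that range. Function and column names here are illustrative, not the actual implementation (the real lookup happens in SQL against Redshift).

```python
def find_cached_stage(streamflow_cfs, cached_steps):
    """Return the rc_stage_ft of the cached step covering streamflow, or None.

    cached_steps: list of dicts with rc_stage_ft, rc_previous_discharge_cfs,
    and rc_discharge_cfs, sorted by stage. Because the step's upper-bound
    geometry is used, FIM is effectively rounded up to the nearest stage foot.
    """
    for step in cached_steps:
        if step["rc_previous_discharge_cfs"] < streamflow_cfs <= step["rc_discharge_cfs"]:
            return step["rc_stage_ft"]
    return None  # cache miss -> fall through to HAND processing


steps = [
    {"rc_stage_ft": 1, "rc_previous_discharge_cfs": 0, "rc_discharge_cfs": 500},
    {"rc_stage_ft": 2, "rc_previous_discharge_cfs": 500, "rc_discharge_cfs": 1200},
]
print(find_cached_stage(750, steps))   # falls in the 1-2 ft step -> 2
print(find_cached_stage(5000, steps))  # beyond the cached curve -> None
```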

Initial tests show promising optimizations to both run times and lambda costs, with ~60% reductions in HAND processing times and ~90%+ reductions in hand_processing lambda costs.

General New VPP FIM Workflow
There are some technical limitations on how data is moved back and forth between RDS and Redshift databases, so this workflow is a little messier than ideal.

  1. Create the following tables and views, if they don't already exist (a-d on both RDS and Redshift; e-f on RDS only). These tables replicate the schema of the HAND cache on Redshift, and are truncated and re-populated as part of each FIM run:
    a. ingest.{fim_config}_flows - this is a version of max_flows, with FIM crosswalk columns added, as well as filtering for the high water threshold
    b. ingest.{fim_config} - this is the fim table, but without geometry
    c. ingest.{fim_config}_geo - this is the geometries for the fim table (one-to-many, since we're subdividing to keep geometries small for Redshift)
    d. ingest.{fim_config}_zero_stage - this table holds all of the fim features (hydro_table, feature_id, huc8, branch combinations) that have zero or NaN stage at the current discharge value
    e. ingest.{fim_config}_geo_view (RDS only) - this view subdivides the newly created polygons in the inundation_geo table (because Redshift has a limit on the size of geometries)
    f. publish.{fim_config} (RDS only) - This is the finished publish table that gets copied to the EGIS service
  2. Populate the FIM flows table on RDS (from max_flows with some joins), then copy it to Redshift
  3. Query the HAND cache on Redshift, joining to the just-populated flows table, to populate the inundation, inundation_geo, and inundation_zero_stage tables on Redshift
  4. Populate the inundation tables on RDS
    a. Prioritize Ras2FIM by querying the Ras2FIM cache on RDS first #TODO
    b. Copy the FIM tables on Redshift (which were just populated from the HAND cache in step 3) into the inundation tables on RDS (skipping any records that were already added from Ras2FIM)
    c. HAND processing for any FIM features remaining in the inundation flows table that have not been added to the inundation table from Ras2FIM or the HAND cache (not done here, but administered by the fim_data_prep lambda function)
  5. Generate publish.inundation table on RDS, and copy it to the EGIS (done via the update_egis_data function)
    a. We can use a template to do this generically for most inland inundation configurations (e.g. NWM)
  6. Add any newly generated HAND features in this run into the Redshift HAND cache ( #TODO: it would be good to figure out how to do this in parallel outside of the fim_config map, so that this doesn't hold things up).
    a. Insert records from the RDS inundation, inundation_geo, and inundation_zero_stage tables/view into the Redshift HAND cache tables, only taking records generated by HAND processing whose primary key (hydro_id, feature_id, huc8, branch, rc_stage_ft) does not already exist
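
Step 6a above is essentially an anti-join insert. The following sketch demonstrates the pattern using SQLite for illustration (the real workflow runs against Redshift, and table/column names like prc_method are illustrative guesses; the primary key columns are from the PR description):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cols = "hydro_id, feature_id, huc8, branch, rc_stage_ft"
cur.execute(f"CREATE TABLE hand_cache ({cols})")
cur.execute(f"CREATE TABLE inundation ({cols}, prc_method TEXT)")

# One record already in the cache; three freshly processed records.
cur.execute("INSERT INTO hand_cache VALUES (1, 101, '12090301', 0, 2)")
cur.executemany(
    "INSERT INTO inundation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 101, "12090301", 0, 2, "HAND_Processing"),  # duplicate key -> skip
        (2, 102, "12090301", 0, 3, "HAND_Processing"),  # new key -> insert
        (3, 103, "12090301", 0, 1, "Ras2FIM"),          # not HAND -> skip
    ],
)

# Anti-join insert: only HAND-generated rows whose primary key is new.
cur.execute(f"""
    INSERT INTO hand_cache ({cols})
    SELECT i.hydro_id, i.feature_id, i.huc8, i.branch, i.rc_stage_ft
    FROM inundation i
    LEFT JOIN hand_cache c
      ON  i.hydro_id = c.hydro_id AND i.feature_id = c.feature_id
      AND i.huc8 = c.huc8 AND i.branch = c.branch
      AND i.rc_stage_ft = c.rc_stage_ft
    WHERE i.prc_method = 'HAND_Processing' AND c.hydro_id IS NULL
""")
print(cur.execute("SELECT COUNT(*) FROM hand_cache").fetchone()[0])  # 2
```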

Changes to Specific Components
This PR contains significant updates to the FIM Config steps of the VPP:

  • viz_fim_data_prep lambda function - Refactored/simplified to just look up features for HAND processing and write HUC processing group CSVs to S3 (which are used by the FIM processing step function to delegate HAND processing jobs). Logic to get flows and the Ras2FIM caching template SQL have been moved to the generalized postprocess_sql lambda function.
  • viz_postprocess_sql lambda function
    • Now contains a fim_caching_templates folder with the SQL for the various FIM workflow steps mentioned above.
    • A fim_flows folder for getting flows for special fim configurations has also been added (this used to be in the data_sql folder of fim_data_prep lambda function).
    • Lambda function logic has been tweaked to allow for a list of sql statements to be executed in a single lambda invocation.
    • Lambda function logic has been tweaked to allow for a new sql_templates_to_run parameter specified in the step function definition, which is used similarly to / in combination with the step parameter (more could be done here to optimize/abstract).
    • Also added a new optional check_dependencies parameter that can be specified by the step function; when set to false, the check_required_tables_updated function is skipped (this is needed on several of the FIM steps due to intentionally empty tables).
    • Abstraction of named discharge columns in max_flows tables
    • Dependent changes to product / summary sql files for all various changes listed above.
  • viz_fim_hand_processing lambda function
    • Minor changes to track/upload new columns required for cached FIM (e.g. rc_previous_discharge_ft).
    • The function now tracks zero-stage reaches / reaches for which a valid stage lookup can't be performed, and uploads those to the fim_zero_stage table (these were previously just skipped altogether).
  • viz_initialize_pipeline lambda function - Updates to product configs to support new FIM workflows.
  • viz pipeline step function - New FIM configs workflow changes
  • Other minor bug fixes and enhancements (notably some improvements to db connection handling in several spots).
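
The sql_templates_to_run and check_dependencies parameters described above could be wired together roughly as follows. This is a hypothetical sketch of the handler logic, not the actual viz_postprocess_sql implementation; the event shape and the sql_replace substitution dict are illustrative assumptions.

```python
def lambda_handler(event, db_execute, dependencies_ok=lambda: True):
    """Execute a list of SQL templates in a single lambda invocation.

    event["sql_templates_to_run"]: list of SQL template strings (assumed shape).
    event["check_dependencies"]: if False, skip the required-tables check
    (needed for steps that legitimately read intentionally empty tables).
    db_execute: callable that runs one SQL statement against the database.
    """
    if event.get("check_dependencies", True) and not dependencies_ok():
        raise RuntimeError("Required upstream tables not yet updated")

    executed = []
    for template in event.get("sql_templates_to_run", []):
        # Substitute placeholders like {fim_config} before executing.
        sql = template.format(**event.get("sql_replace", {}))
        db_execute(sql)
        executed.append(sql)
    return executed


ran = []
event = {
    "sql_templates_to_run": ["TRUNCATE ingest.{fim_config}_flows"],
    "sql_replace": {"fim_config": "ana_inundation"},
    "check_dependencies": False,
}
print(lambda_handler(event, ran.append))  # ['TRUNCATE ingest.ana_inundation_flows']
```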

TODOs / Roadmap
I plan to take the following steps after this is deployed to TI, likely through 2-3 subsequent PRs as part of this series:

  1. Fully implement special FIM configurations (AEP FIM, CatFIM, etc.)
  2. Test all pipeline configurations / fix any bugs
  3. Evaluate / document potential future optimizations
  4. Plan for / document deployment strategy related to FIM hand cache (I need to create the cache tables on deployment, or within the lambda function... and we need to wipe the cache whenever we update HAND FIM versions, which can be done by truncating the cache tables on Redshift)
  5. Plan for / implement historic data request functionality.
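
The cache-wipe step mentioned in item 4 amounts to truncating the Redshift cache tables whenever the HAND FIM version changes. A hedged sketch, with illustrative table names (the actual cache table names are not given in this PR):

```python
# Hypothetical cache table names; substitute the real Redshift cache tables.
HAND_CACHE_TABLES = [
    "fim_cache.hand_cache",
    "fim_cache.hand_cache_geo",
    "fim_cache.hand_cache_zero_stage",
]

def build_cache_wipe_sql(tables=HAND_CACHE_TABLES):
    """Build the TRUNCATE statements that reset the HAND cache on a
    HAND FIM version update (to be executed against Redshift)."""
    return [f"TRUNCATE TABLE {t}" for t in tables]

for stmt in build_cache_wipe_sql():
    print(stmt)
```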

shawncrawley (Collaborator) left a comment:

Leaves me speechless....

@nickchadwick-noaa nickchadwick-noaa merged commit de4f3ac into ti Dec 22, 2023
1 check passed
@nickchadwick-noaa nickchadwick-noaa deleted the cached_fim_part1 branch December 22, 2023 16:26
TylerSchrag-NOAA (Contributor, Author) commented:

This is wrong.

TylerSchrag-NOAA (Contributor, Author) commented:

Missed domain variable for recurrence flows

TylerSchrag-NOAA (Contributor, Author) commented:

Update WHERE clause to use prc_status and remove joins

Labels: enhancement