Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Fixing several issues, including 966 (bash octal issue); add new winter weather verification test with staged data #997

Merged
merged 22 commits into from
Jan 11, 2024

Conversation

mkavulich
Copy link
Collaborator

@mkavulich mkavulich commented Dec 21, 2023

DESCRIPTION OF CHANGES:

This started as a fairly simple change (don't they always) to fix Issue #966 and add a new winter weather verification test in support of the RRFS agile framework project and has since snowballed into solving a number of outstanding and newly discovered issues.

New test

  • The new test MET_ensemble_verification_winter_wx is added. This test will exercise a number of yet-untested capabilities in the workflow, including a 10-member ensemble, snowfall verification with staged data (so can be run on all platforms, not just Jet and Hera), and several SPP settings.
  • As part of this new test, snowfall observations will now be staged on all tier-1 platforms, as well as netCDF GFS data and other observation types, all for the date 2022020300

Resolved issues

Other fixes

  • Some old files for unsupported/removed CCPP suites are removed
  • Add some missing task dependencies for retrieving verification obs

General improvements

  • Many improvements to verification obs-pulling task
    • NDAS observations are now retrieved for forecast hour zero, and a better obs file is retrieved for major obs times (00z, 06z, 12z, 18z) per EMC guidance
    • Better in-line comments/documentation
    • Standardize order and messaging for file-on-disk checks across all observation types
  • Added explanatory comments for reflectivity field in diag_table files
  • Update diag_table.FV3_GFS_v17_p8 so that all output files will be hourly
  • Simplify task dependencies that rely on staged verification observations; these "get_obs" tasks should always be run (they check that the data exists before trying to retrieve it), so no need to make the dependency conditional
  • Add a check in monitor_jobs.py to ensure the yaml file does not contain duplicate experiment directories
  • Make sure the key in the experiment dictionary used by is unique by appending the current date/time to the exptdir name; additionally, set this key as the WORKFLOW_ID variable (so that it could be used in the workflow if necessary).

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

Ran fundamental tests on several platforms. Ran some manual tests for the above-mentioned WE2E and monitor_jobs features, including catting together several WE2E yaml files in a way that previously failed to ensure they now succeed.

Ran all verification tests (including new test) on Hera, Orion, and Derecho to ensure nothing changed unexpectedly. I also compared existing tests to develop on Hera, and confirmed that the additional NDAS obs are now being pulled and used for verification tasks, which resulted in expected differences in verification results.

  • hera.intel
    • Fundamental tests
    • All verification tests
  • orion.intel
    • Fundamental tests
    • All verification tests (except MET_verification_winter_wx and MET_ensemble_verification_only_vx_time_lag, which only work on Jet and Hera)
  • hercules.intel
  • cheyenne.intel
  • cheyenne.gnu
  • derecho.intel
    • Fundamental tests
    • All verification tests (except MET_verification_winter_wx and MET_ensemble_verification_only_vx_time_lag, which only work on Jet and Hera)
  • gaea.intel
  • gaeac5.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests

DEPENDENCIES:

None

DOCUMENTATION:

Working on updates to documentation (mostly for the new test); will merge when ready

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

CONTRIBUTORS:

Thanks to @EdwardSnyder-NOAA for help staging the new test data

…BCS are now "netcdf", and PREEXISTING_DIR_METHOD is now "quit"
…better way of handling output frequency in the app
…the accumulation period for diagnostic variables in the weather model, and it was set to 6 for GFS_v17_p8
…ity). This avoids having to do weird things with MET verification in case those fields don't exist
… to a separate PR, since that will require changing a LOT of test files
… on, so no need for conditional test. Also, add "get_obs_nohrsc" dependency for ASNOW tasks.
 - Update documentation for more clarity about obs types, and why some
things are done a certain way
 - Start forecast hour loop at zero, but skip retrieving obs at hour
zero for CCPA and NOHRSC (since these are accumulation vars)
 - Standardize order and messaging for file-on-disk checks across all ob
types
 - For NDAS, re-organize logic so that the later, more complete obs
files are pulled from HPSS, as well as getting the 0h obs files
@@ -154,6 +154,8 @@
"gfs_phys", "BC_AOD_550", "bc_aod550", "fv3_history2d", "all", .false., "none", 2
"gfs_phys", "OC_AOD_550", "oc_aod550", "fv3_history2d", "all", .false., "none", 2
"gfs_phys", "SS_AOD_550", "ss_aod550", "fv3_history2d", "all", .false., "none", 2
# Reflectivity from Thompson microphysics
Copy link
Collaborator

@JeffBeck-NOAA JeffBeck-NOAA Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be 'GFDL microphysics'? This is a GFS SDF.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GFSv17 P8 is testing with Thompson!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct @michelleharrold. I'm not sure if it was different previously, but from what I can see currently the FV3_GFS_v17_p8 suite uses Thompson microphysics: https://github.com/NOAA-EMC/fv3atm/blob/bba399053d3939241938f19ee598895eea54fd65/ccpp/suites/suite_FV3_GFS_v17_p8.xml#L76

checking snowfall verification statistics, and a test for custom domains
with staged netCDF data.
checking snowfall verification statistics using observations retrieved from
HPSS, and a test for custom domains with RAP data retrieved from HPSS
user:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could HRRR ICs be used for this domain too? Just curious why RAP is mentioned by itself.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was simply making the description more verbose: this test specifically retrieves RAP data, but GFS and HRRR (and GEFS/GDAS for that matter) can also be retrieved.

Now that you mention it though I'm pretty sure this custom domain might extend a bit north of the HRRR domain, so those ICs might not be possible for this specific domain.

@@ -36,7 +36,7 @@ def get_crontab_contents(called_from_cron, machine, debug):
# call the system version of crontab at /usr/bin/crontab.
#
crontab_cmd = "crontab"
if machine == "CHEYENNE" or machine == "DERECHO":
if machine == "CHEYENNE":
if called_from_cron:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no Derecho support for the App at the moment?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The App currently runs on Derecho. Given the decommissioning of Cheyenne, I think that DERECHO should be kept, while CHEYENNE should be removed here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derecho does not suffer from the same issue that this logic accounts for, see my PR description for why this fix was included (and on a related note, #998). Cheyenne functionality should be removed in its entirety in a separate PR, I feel that's beyond the scope of this PR.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich -

Thanks for removing the last vestiges of the FV3_GFS_2017_gfdlmp/_regional CCPP suites from the repository! I was able to successfully test that the change in ush/get_crontab_contents.py works while using run_WE2E_tests.py on Derecho and using the --launch cron option. I was also able to successfully run the new MET_ensemble_verification_winter_wx test on Gaea C5:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_winter_wx_20240104094734                 COMPLETE             208.52
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             208.52

Approving this PR now.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jan 4, 2024
@MichaelLueken
Copy link
Collaborator

The coverage WE2E tests were manually ran on Derecho and they all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km_20240104115217                      COMPLETE              24.66
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              45.55
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024010411521  COMPLETE              45.90
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240104  COMPLETE              29.86
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              18.61
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024010411522  COMPLETE              41.38
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              24.19
pregen_grid_orog_sfc_climo_20240104115227                          COMPLETE              22.15
specify_template_filenames_20240104115229                          COMPLETE              22.77
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             275.07

@MichaelLueken
Copy link
Collaborator

@mkavulich -

The MET_verification_only_vx WE2E test on Hercules failed in run_MET_PcpCombine_obs_APCP01h with the following message:

/work/noaa/epic/role-epic/jenkins/workspace/fs-srweather-app_pipeline_PR-997/hercules/scripts/exregional_run_met_pcpcombine.sh: line 140: 10#: invalid integer constant (error token is "10#")
End exregional_run_met_pcpcombine.sh at Thu Jan  4 17:28:02 UTC 2024 with error code 1 (time elapsed: 00:00:03)

@mkavulich
Copy link
Collaborator Author

@MichaelLueken Thanks for letting me know, can you point me to the location of Jenkins tests on Hercules? I'm not sure where to find that information, is it in the Wiki somewhere?

@MichaelLueken
Copy link
Collaborator

@mkavulich - The Jenkins workspace for your test on Hercules is:

/work/noaa/epic/role-epic/jenkins/workspace/fs-srweather-app_pipeline_PR-997/hercules/expt_dirs/MET_verification_only_vx

At the bottom of the Contributor's Guide, there are paths to the Jenkins workspaces on the various machines.

@mkavulich
Copy link
Collaborator Author

@MichaelLueken For the record I am getting "Permission denied" errors when I attempt to view the contents of /work/noaa/epic/role-epic/jenkins/workspace/fs-srweather-app_pipeline_PR-997/hercules/expt_dirs/MET_verification_only_vx, can that be opened up in the future? I was able to replicate the error on my own so it's not hugely urgent

@MichaelLueken
Copy link
Collaborator

@MichaelLueken For the record I am getting "Permission denied" errors when I attempt to view the contents of /work/noaa/epic/role-epic/jenkins/workspace/fs-srweather-app_pipeline_PR-997/hercules/expt_dirs/MET_verification_only_vx, can that be opened up in the future? I was able to replicate the error on my own so it's not hugely urgent

@mkavulich - Thanks for bringing this to my attention. I have opened an issue with the EPIC Platform team to see if they can make the Jenkins workspace on Hercules readable for developers outside of the epic account.

…his bug has existed for a while but only causes a failure after my Octal bug fix...and for some reason was only failing on Hercules?? (Derecho and Hera tests both passed)
@mkavulich
Copy link
Collaborator Author

@MichaelLueken This PR is ready for a re-test on Hercules.

For anyone curious about the details: This was a very strange confluence of circumstances revealing an old bug where some logic in exregional_run_met_pcpcombine.sh that was only used for forecast jobs (not for observations) needed to be moved inside an if-block to avoid utilizing undefined variables (specifically, the variable ENSMEM_INDX doesn't make sense for observation tasks since they are independent of the number of ensemble members). This didn't cause an error until I introduced my Octal bug fix...and even then, for some reason this is only failing on Hercules: tests on Derecho and Hera still pass. So something specific about the bash version (or other environment) on Hercules is revealing more problems than other machines.

@MichaelLueken
Copy link
Collaborator

@mkavulich - The WE2E tests successfully passed on Hercules:

---------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE               8.54
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              10.44
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              30.58
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              18.92
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024011108  COMPLETE              26.41
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240111083  COMPLETE              54.92
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              14.94
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240111083520  COMPLETE              67.05
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202401110835  COMPLETE              28.96
MET_verification_only_vx_20240111083521                            COMPLETE               0.22
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240111083523               COMPLETE               8.57
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             269.55

Merging this work now.

@MichaelLueken MichaelLueken merged commit 7df8b2d into ufs-community:develop Jan 11, 2024
2 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should "EXPT_SUBDIR" be a mandatory variable? Incorrect octal notation causing ensemble vx to fail
4 participants