Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Add job cards for wrappers for individual machines and tasks #924

Merged
merged 38 commits into from
Dec 5, 2023

Conversation

RatkoVasic-NOAA
Copy link
Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA commented Oct 4, 2023

DESCRIPTION OF CHANGES:

Added job cards for individual tasks, similar to wrapper scripts, but tailored for each machine.

  • Hera
  • Jet
  • Gaea
  • Gaea-C5
  • Orion
  • Cheyenne
  • Derecho

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • gaea-c5.intel
  • jet.intel
  • hercules.intel
  • fundamental test suite

DOCUMENTATION:

@gspetro-NOAA we can add few lines to document this.

ISSUE:

#909

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • enhancement
  • run_ci
  • run_we2e_comprehensive_tests

Contributors:

@EdwardSnyder-NOAA
@natalie-perlin

@MichaelLueken MichaelLueken changed the title Add job cards for wrappers for individual machines and tasks [develop] Add job cards for wrappers for individual machines and tasks Oct 5, 2023
@MichaelLueken MichaelLueken added the enhancement New feature or request label Oct 5, 2023
@mkavulich
Copy link
Collaborator

For the sake of further discussion I'll echo my points from today's code management meeting as to why job cards should not be included in the repository:

  1. In most cases, re-running a task is as easy as using the rocotorewind or rocotoboot command.
  2. For cases where the job card is explicitly needed (for example, sending to sysadmins to demonstrate a bug), these job cards can already be found in the current workflow! Job cards are automatically generated by rocoto, and if a task fails and a developer would like to debug by submitting the job manually, those job cards can be retrieved from the workflow launch log file log.launch_FV3LAM_wflow. For the Workflow End-to-end tests, if users are using the python monitoring script, the job cards are not saved to log by default, but can be saved to the test log with the "--debug" option.
  3. These job cards will break almost immediately upon being committed to the repository. The moment any change to the variables required by the underlying job or exscripts is made, these job cards will be out-of-date, and probably do more harm than good.
  4. These job cards will not help with general debugging: they can only be run for a very specific test for a specific date, domain, and settings. If you are already making modifications to these job cards to work for a specific test that you are debugging, why can you not do the same with a job card generated by rocoto from some other experiment?
  5. If you are on a machine without rocoto, these job cards won't exist for that machine anyway!

@RatkoVasic-NOAA I want to be clear that I understand the frustrations of debugging the workflow when a task fails. I do not want to stand in the way of development: I want to make debugging as easy as possible! I agree it would be more convenient if we had a tool that could just spit out a job card without needing to run rocoto as an intermediary. But for the reasons above I argue that putting these job cards in the repository will do more harm than good, especially in the long run!

I can think of some potential solutions/mitigations for the difficulties you are having in lieu of this PR:

  1. An option in launch_FV3LAM_wflow.sh to save job cards for each as a text file when calling rocotorun, to avoid having to manually search the logs for the correct job card. This should be relatively easy to implement.
  2. Better documentation in job cards for which environment variables are required for running that script. This is something that is needed regardless, so should be prioritized.

I hope others can chime in with other helpful features and potential solutions. And of course I welcome further discussion and rebuttals, I am by no means the final authority on this.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RatkoVasic-NOAA -

Thank you very much! Once @mkavulich gives his approval, I will be able to submit the Jenkins tests.

Copy link
Collaborator

@mkavulich mkavulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @RatkoVasic-NOAA for making the suggested changes. I had a few more suggestions that are minor (since the same comment appears in all files it shows up as a lot of suggested changes). Not mandatory changes, but I thought this would clarify the text a little.

ush/wrappers/job_cards/README Outdated Show resolved Hide resolved
ush/wrappers/job_cards/qsub/get_ics.qsub Outdated Show resolved Hide resolved
ush/wrappers/job_cards/qsub/get_lbcs.qsub Outdated Show resolved Hide resolved
ush/wrappers/job_cards/qsub/make_grid.qsub Outdated Show resolved Hide resolved
ush/wrappers/job_cards/qsub/make_ics.qsub Outdated Show resolved Hide resolved
ush/wrappers/job_cards/sbatch/make_lbcs.sbatch Outdated Show resolved Hide resolved
ush/wrappers/job_cards/sbatch/make_orog.sbatch Outdated Show resolved Hide resolved
ush/wrappers/job_cards/sbatch/make_sfc_climo.sbatch Outdated Show resolved Hide resolved
ush/wrappers/job_cards/sbatch/run_fcst.sbatch Outdated Show resolved Hide resolved
ush/wrappers/job_cards/sbatch/run_post.sbatch Outdated Show resolved Hide resolved
RatkoVasic-NOAA and others added 19 commits December 4, 2023 11:33
@RatkoVasic-NOAA
Copy link
Collaborator Author

@mkavulich thanks for corrections!

@MichaelLueken
Copy link
Collaborator

Thanks @mkavulich and @RatkoVasic-NOAA! I will go ahead and launch the Jenkins tests now.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Dec 4, 2023
@MichaelLueken
Copy link
Collaborator

The Functional Workflow Task Tests were manually submitted on Derecho and all successfully passed:

# Try derecho with the first few simple SRW tasks ...
run_make_grid: COMPLETE
run_get_ics: COMPLETE
run_get_lbcs: COMPLETE
run_make_orog: COMPLETE
run_make_sfc_climo: COMPLETE
run_make_ics: COMPLETE
run_make_lbcs: COMPLETE
run_fcst: COMPLETE
run_post: COMPLETE

The WE2E Coverage tests were manually submitted on Derecho and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              23.32
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              37.81
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              44.75
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              29.07
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              17.89
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              40.80
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              23.76
pregen_grid_orog_sfc_climo                                         COMPLETE              16.36
specify_template_filenames                                         COMPLETE              15.19
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             248.95

Copy link
Collaborator

@mkavulich mkavulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding my official approval for the record

@MichaelLueken
Copy link
Collaborator

The Functional Workflow Task Tests successfully passed on all machines. The WE2E coverage tests passed for all platforms, except for Hera Intel, where the usual suspects failed (custom_ESGgrid_Central_Asia_3km, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot, and pregen_grid_orog_sfc_climo).

Since this PR doesn't affect the code in the SRW App, Hera being down for maintenance today (12/05/2023), and with PR #977 already in place to correct these issues, I will go ahead and merge this work now.

@MichaelLueken MichaelLueken merged commit 6b14175 into ufs-community:develop Dec 5, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrapper scripts need improvement (does not work properly on every machine)
6 participants