Incorrect job status returned from squeue in rt.sh #2379

Open
DusanJovic-NOAA opened this issue Jul 26, 2024 · 16 comments

@DusanJovic-NOAA
Collaborator

I am running a regression test on Gaea and I noticed that some test jobs fail due to a wall clock timeout error, but the status of those jobs is incorrectly interpreted by the rt scripts. This is part of the log file:

++ squeue -u Dusan.Jovic -j 135087666
+ job_info='JOBID       CLUSTER  PARTITION  QOS         USER                STATE    TIME_LEFT   NODES  NAME
135087666   c5       batch      normal      Dusan.Jovic         COMPLETININVALID     3      run_cpld_debug_pdlib'
+ grep -q 135087666
+ job_running=true
++ grep 135087666
+ status='135087666   c5       batch      normal      Dusan.Jovic         COMPLETININVALID     3      run_cpld_debug_pdlib'
++ awk '{print $5}'
+ status=Dusan.Jovic
+ case ${status} in
+ status_label=Unknown
+ echo 'rt_utils.sh: *** WARNING ***: Job status unsupported: Dusan.Jovic'
rt_utils.sh: *** WARNING ***: Job status unsupported: Dusan.Jovic
+ echo 'rt_utils.sh: *** WARNING ***: Status might be non-terminating, please manually stop if needed'
rt_utils.sh: *** WARNING ***: Status might be non-terminating, please manually stop if needed
+ echo '34 min. Slurm Job 135087666 Status: Unknown (Dusan.Jovic)'
34 min. Slurm Job 135087666 Status: Unknown (Dusan.Jovic)

In this specific case the job status is 'COMPLETININVALID', but the script extracts 'Dusan.Jovic' as the status, which is obviously wrong. This happens because the default output format of the squeue command is not the same across systems running the Slurm scheduler:

On Hera:

$ squeue -u Dusan.Jovic -j 63947991
    JOBID PARTITION  NAME                     USER             STATE        TIME TIME_LIMIT NODES NODELIST(REASON)
 63947991 hera       compile_atm_dyn32_intel  Dusan.Jovic      RUNNING      0:36      30:00     1 h20c01

On Gaea:

$ squeue -u Dusan.Jovic -j 135087764
JOBID       CLUSTER  PARTITION  QOS         USER                STATE    TIME_LEFT   NODES  NAME
135087764   c5       batch      normal      Dusan.Jovic         RUNNING  17:23       3      run_cpld_debug_pdlib

On Hera the job state is in column 5, while on Gaea it is in column 6.

Instead of just running squeue with the default format, we must explicitly set the desired format in order to get the same output on all platforms. For example, I suggest:

squeue -u "${USER}" -j "${jobid}" -o '%i %T'

In this case the first column will always be JOBID and the second column will always be STATE.
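
A minimal sketch of how the status check in rt_utils.sh could use that explicit format (following the grep/awk pattern from the log above; the exact surrounding logic may differ):

job_info=$(squeue -u "${USER}" -j "${jobid}" -o '%i %T')
if grep -q "${jobid}" <<< "${job_info}"; then
  job_running=true
  # with '-o %i %T' the state is always the second column, on every Slurm system
  status=$(grep "${jobid}" <<< "${job_info}" | awk '{print $2}')
else
  job_running=false
fi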

@DusanJovic-NOAA
Collaborator Author

As I'm testing the updates that fix the above issue, I'm finding a few other problems with the current scripts.

Consider this test configuration:

COMPILE | atm_dyn32 | intel | -DAPP=ATM -DCCPP_SUITES=FV3_GFS_v17_p8 -D32BIT=ON | | fv3 |
RUN | control_p8            |   | baseline |
RUN | control_restart_p8    |   |          | control_p8

If the control_p8 test fails for whatever reason (timeout or model crash), it is not rerun a second time, despite the fact that we have ECF_TRIES=2 set in rt.sh. The second issue is that the dependent task (control_restart_p8) is still submitted, which is wrong because the task it depends on (control_p8) failed.

This happens because the exit code of a job is never checked. Once the job finishes, we must check its exit code. Instead, the script (run_test.sh) proceeds by comparing the output files, finds that they are missing, labels the test as failed, and exits successfully. So the workflow manager, ecflow in this case, never reruns the failed job; instead it thinks the job finished successfully and submits the dependent task (control_restart_p8), which it should not, since the job it depends on failed. See the control_p8 test log file:

+ echo '1 min. Slurm Job 64031780 Status: Job running (RUNNING)'
1 min. Slurm Job 64031780 Status: Job running (RUNNING)
+ ((  n=n+1  ))
+ wait 1528283
+ sleep 60
+ [[ true == \f\a\l\s\e ]]
+ case ${SCHEDULER} in
++ squeue -u Dusan.Jovic -j 64031780
+ job_info='     JOBID PARTITION  NAME                     USER             STATE        TIME TIME_LIMIT NODES NODELIST(REASON)'
+ grep -q 64031780
+ job_running=false
+ continue
+ [[ false == \f\a\l\s\e ]]
+ skip_check_results=false
+ results_okay=YES
+ [[ false = false ]]
+ check_results
+ echo 'rt_utils.sh: Checking results of the regression test: control_p8_intel'
rt_utils.sh: Checking results of the regression test: control_p8_intel
+ ROCOTO=false
+ ECFLOW=true
+ local test_status=PASS
+ echo
+ echo 'baseline dir = /scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/develop-20240724/control_p8_intel'
+ echo 'working dir  = /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_2799342/control_p8_intel'
+ echo 'Checking test control_p8_intel results ....'
+ echo

+ echo 'baseline dir = /scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/develop-20240724/control_p8_intel'
baseline dir = /scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/develop-20240724/control_p8_intel
+ echo 'working dir  = /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_2799342/control_p8_intel'
working dir  = /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_2799342/control_p8_intel
+ echo 'Checking test control_p8_intel results ....'
Checking test control_p8_intel results ....
+ [[ false = false ]]
+ for i in ${LIST_FILES}
+ printf %s ' Comparing sfcf000.nc .....'
+ printf %s ' Comparing sfcf000.nc .....'
 Comparing sfcf000.nc .....+ [[ ! -f /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_2799342/control_p8_intel/sfcf000.nc ]]
+ echo '.......MISSING file'
+ echo '.......MISSING file'

This used to work before. In previous versions of rt_utils.sh we used the qstat and sacct commands to look at the exit code of finished jobs, for example here:

exit_status=$( qstat ${jobid} -x -f | grep Exit_status | awk '{print $3}')

and here:

status_label=$( sacct -n -j ${slurm_id} --format=JobID,state%20,Jobname%20 | grep "^${slurm_id}" | grep ${JBNME} | awk '{print $2}' )

These checks have been removed for some reason. I think we should add them back.
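
For Slurm, a rough sketch of what restoring that check could look like (using the sacct ExitCode field; the exact variable names, and the PBS branch using qstat, would follow the old snippets above):

# after the job has left the queue, ask the accounting database for its exit code
exit_code=$(sacct -n -j "${slurm_id}" --format=JobID,ExitCode | grep "^${slurm_id} " | awk '{print $2}' | cut -d: -f1)
if [[ ${exit_code} -ne 0 ]]; then
  echo "rt_utils.sh: job ${slurm_id} failed with exit code ${exit_code}"
  exit 1
fi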

@BrianCurtis-NOAA
Collaborator

Good catch, it seems the slurm/pbs default output format changed and we need to be more specific.

For the second issue, am I correct in my understanding that the run test script, which is what ecflow executes, returns exit 0 even when a test fails?

I'm not trying to say we shouldn't check the job's exit status in the batch system (this may end up being the easiest approach), but wouldn't it be best to try to catch the non-zero exit status in the run_test script and ensure it gets propagated such that ecflow will properly handle the job?

@DusanJovic-NOAA
Collaborator Author

> Good catch, it seems the slurm/pbs default output format changed and we need to be more specific.
>
> For the second issue, am I correct in my understanding that the run test script, which is what ecflow executes, returns exit 0 even when a test fails?

Currently, the run test script (run_test.sh) returns exit code 0 regardless of whether the job failed or not. By job I mean the script (job_card) that is executed by PBS or Slurm.

> I'm not trying to say we shouldn't check the job's exit status in the batch system (this may end up being the easiest approach), but wouldn't it be best to try to catch the non-zero exit status in the run_test script and ensure it gets propagated such that ecflow will properly handle the job?

Yes, and 'catch the non-zero exit status in the run_test script' is exactly what these two commands (qstat -x and sacct) were doing before.

@BrianCurtis-NOAA
Collaborator

> Yes, and 'catch the non-zero exit status in the run_test script' is exactly what these two commands (qstat -x and sacct) were doing before.

OK. Let's bring them back in, then.

I think we'd have to wrap these commands in a set +e block because, if I recall, they would cause an error even on success.
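
Something along these lines (a sketch around the qstat example from above, not the exact old code):

set +e
exit_status=$(qstat "${jobid}" -x -f | grep Exit_status | awk '{print $3}')
set -e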

@DusanJovic-NOAA
Collaborator Author

I also noticed that the ecflow_client commands that set the task labels were removed:

if [[ ${ECFLOW:-false} == true ]]; then
  ecflow_client --label=job_id "${jobid}"
  ecflow_client --label=job_status "submitted"
fi

The task labels are very useful when using the ecflow UI to monitor test execution during development/debugging, for example:

[screenshot of the ecflow UI showing job_id and job_status labels for the test tasks]

I can see here that the control_p8_intel test failed, and there is a job_id, so I can check the reason for the failure (whether the job itself failed due to a timeout or network/MPI error, or the test script failed due to, for example, a disk quota error) and then easily restart the failed test. I also see that control_restart_p8_intel was never submitted (due to the failed dependency), which means I should resubmit it too. All this without resubmitting rt.sh and, most importantly, without recompiling the executable.

I find this very useful. I suggest we add those ecflow_client commands back.
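
If they come back, the job_status label could also be refreshed at later state changes, roughly like this (status_label as set in the polling loop shown in the first log; where exactly to hook this in rt_utils.sh would need to be checked):

if [[ ${ECFLOW:-false} == true ]]; then
  ecflow_client --label=job_status "${status_label}"
fi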

@BrianCurtis-NOAA
Collaborator

You may be the only person I know who uses the ecflow GUI.

@DusanJovic-NOAA
Collaborator Author

I found one more minor issue. If, for whatever reason, any command in the check_results function in rt_utils.sh fails, the error will not be detected. This is the log I see in one of my tests:

....OK
+ for i in ${LIST_FILES}
+ printf %s ' Moving 20210323.060000.out_grd.ww3 .....'
 Moving 20210323.060000.out_grd.ww3 .....+ printf %s ' Moving 20210323.060000.out_grd.ww3 .....'
+ [[ -f /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_4120962/cpld_2threads_p8_intel/20210323.060000.out_grd.ww3 ]]
++ dirname 20210323.060000.out_grd.ww3
+ mkdir -p /scratch1/NCEPDEV/stmp4/Dusan.Jovic/FV3_RT/REGRESSION_TEST/cpld_control_p8_intel/.
mkdir: cannot create directory '/scratch1/NCEPDEV/stmp4/Dusan.Jovic/FV3_RT/REGRESSION_TEST/cpld_control_p8_intel': Permission denied
+ cp /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_4120962/cpld_2threads_p8_intel/20210323.060000.out_grd.ww3 /scratch1/NCEPDEV/stmp4/Dusan.Jovic/FV3_RT/REGRESSION_TEST/cpld_control_p8_intel/20210323.060000.out_grd.ww3
cp: failed to access '/scratch1/NCEPDEV/stmp4/Dusan.Jovic/FV3_RT/REGRESSION_TEST/cpld_control_p8_intel/20210323.060000.out_grd.ww3': Permission denied
+ echo ....OK
+ echo ....OK
....OK
+ echo
+ grep 'The total amount of wall time' /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_4120962/cpld_2threads_p8_intel/out
+ grep 'The maximum resident set size' /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_4120962/cpld_2threads_p8_intel/out
+ echo
+ TRIES=
+ [[ false == true ]]
+ echo 'Test cpld_2threads_p8_intel PASS'
+ echo
+ echo 'Test cpld_2threads_p8_intel PASS'
Test cpld_2threads_p8_intel PASS
+ echo

+ [[ PASS = \F\A\I\L ]]
+ return 0
+ [[ slurm != \n\o\n\e ]]
+ cat /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_4120962/cpld_2threads_p8_intel/job_timestamp.txt
+ [[ YES == YES ]]
+ remove_fail_test
+ echo 'Removing test failure flag file for cpld_2threads_p8_intel'
Removing test failure flag file for cpld_2threads_p8_intel
+ rm -f /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/pr2278/ufs-weather-model/tests-dev/fail_test_cpld_2threads_p8_intel
++ date +%s
+ date_s=1722549369
+ echo ' 1722549369, 6'
+ delete_rundir=false
+ [[ false = true ]]
+ elapsed=374
+ echo 'run_test.sh: Test cpld_2threads_p8_intel Completed.'
run_test.sh: Test cpld_2threads_p8_intel Completed.
+ echo 'run_test.sh: Test cpld_2threads_p8_intel Elapsed time 374 seconds.'
run_test.sh: Test cpld_2threads_p8_intel Elapsed time 374 seconds.
+ '[' 0 -eq 0 ']'
_______________________________________________________________
Start Epilog on node h23c07 for job 64322161 :: Thu Aug  1 21:56:09 UTC 2024
Job 64322161 finished for user Dusan.Jovic in partition hera with exit code 0:0
_______________________________________________________________
End Epilogue Thu Aug  1 21:56:09 UTC 2024

Copying test outputs to the baseline holding area failed, but the test script continued and the test was labeled as PASS, even though we have '-o errexit' set; the function continues execution, and the calling script (run_test.sh) reports successful test execution. This happens due to a recent change in how the check_results function is invoked from the calling script. Bash suppresses errexit for commands executed inside an 'if' condition, so in order to catch errors from a command (mkdir and cp in this case) run by the function, the function must not be called inside an if test.

diff --git a/tests/run_test.sh b/tests/run_test.sh
index ace4fd0c..9adf9f1d 100755
--- a/tests/run_test.sh
+++ b/tests/run_test.sh
@@ -398,7 +398,8 @@ fi
 skip_check_results=${skip_check_results:-false}
 results_okay=YES
 if [[ ${skip_check_results} = false ]]; then
-  if ( ! check_results ) ; then
+  check_results
+  if [[ $? -ne 0 ]]; then
     results_okay=NO
   fi
 else
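
For reference, a minimal standalone example (hypothetical function, not from the repo) showing why the old form hides the failure:

#!/usr/bin/env bash
set -e

do_work() {
  false                      # stands in for the failing mkdir/cp
  echo "still running after the failure"
  return 0
}

# old form: the call is part of an 'if' condition, so errexit is suppressed
# inside do_work; 'false' is ignored, the function returns 0, and the failure
# branch is never taken
if ( ! do_work ); then
  echo "failure detected"    # not reached
fi

# new form: a plain call, so errexit applies inside do_work and the script
# aborts at 'false' instead of silently swallowing the failure
do_work
echo "not reached"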

@BrianCurtis-NOAA
Collaborator

That last catch is a good one. I've seen many of these that I've tried to address in previous updates. Does the linter like the format if [[ $? -ne 0 ]]; then? I think it may have had an issue with $?.

@DusanJovic-NOAA
Collaborator Author

> That last catch is a good one. I've seen many of these that I've tried to address in previous updates. Does the linter like the format if [[ $? -ne 0 ]]; then? I think it may have had an issue with $?.

I am not sure about the linter. Do we have it available on some machine, or is it only in the GitHub workflow?

@BrianCurtis-NOAA
Collaborator

> That last catch is a good one. I've seen many of these that I've tried to address in previous updates. Does the linter like the format if [[ $? -ne 0 ]]; then? I think it may have had an issue with $?.
>
> I am not sure about the linter. Do we have it available on some machine, or is it only in the GitHub workflow?

https://www.shellcheck.net/

@DusanJovic-NOAA
Collaborator Author

> That last catch is a good one. I've seen many of these that I've tried to address in previous updates. Does the linter like the format if [[ $? -ne 0 ]]; then? I think it may have had an issue with $?.
>
> I am not sure about the linter. Do we have it available on some machine, or is it only in the GitHub workflow?
>
> https://www.shellcheck.net/

Thanks. When I check this test script, which is similar to what's happening in run_test.sh:

#!/usr/bin/bash
set -eux
set -o pipefail

trap '[ "$?" -eq 0 ] || write_fail_test' EXIT

write_fail_test() {
  echo "failed in run_test"
  exit 1
}

check_results() {

    local test_status='PASS'

    cp missing_file misssing_directory/

    if [[ ${test_status} == 'FAIL' ]]; then
       return 1
    else
       return 0
    fi
}

results_okay=YES
check_results
if [[ $? -ne 0 ]]; then
    results_okay=NO
fi

echo "results_okay: ${results_okay}" 

I get:

$ shellcheck myscript
 
Line 27:
if [[ $? -ne 0 ]]; then
      ^-- SC2181 (style): Check exit code directly with e.g. 'if ! mycmd;', not indirectly with $?.

$

Looks like this message is a 'style' suggestion, not an error or a warning.
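
If that style warning ever needs to be silenced, a directive comment directly above the test should do it (just a sketch; note that switching to 'if ! check_results; then' instead would quietly re-introduce the errexit problem, since the call would again be in a conditional context):

check_results
# shellcheck disable=SC2181  # $? is checked on purpose so errexit stays active inside check_results
if [[ $? -ne 0 ]]; then
    results_okay=NO
fi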

@DusanJovic-NOAA
Collaborator Author

Since check_results is called from only one place in run_test.sh, would it make sense to simply inline the code of check_results in run_test.sh? That would simplify error handling. What do you think?

@BrianCurtis-NOAA
Collaborator

> Since check_results is called from only one place in run_test.sh, would it make sense to simply inline the code of check_results in run_test.sh? That would simplify error handling. What do you think?

That's where it should reside, then. Agreed.

@DusanJovic-NOAA
Collaborator Author

> Since check_results is called from only one place in run_test.sh, would it make sense to simply inline the code of check_results in run_test.sh? That would simplify error handling. What do you think?
>
> That's where it should reside, then. Agreed.

Done. See DusanJovic-NOAA@9690557

@DusanJovic-NOAA
Collaborator Author

> Since check_results is called from only one place in run_test.sh, would it make sense to simply inline the code of check_results in run_test.sh? That would simplify error handling. What do you think?
>
> That's where it should reside, then. Agreed.
>
> Done. See DusanJovic-NOAA@9690557

I ran ./rt.sh -l error-test.conf -e to test error handling, and I got:

PASS -- COMPILE 'atm_dyn32_intel' [13:12, 11:56] ( 1 warnings 1 remarks )
PASS -- TEST 'control_c48.v2.sfc_intel' [08:23, 06:24](874 MB)
FAILED: RUN DID NOT COMPLETE -- TEST 'fail_to_copy_intel' [, ]( MB)
FAILED: RUN DID NOT COMPLETE -- TEST 'fail_to_run_intel' [, ]( MB)

PASS -- COMPILE 'atm_dyn64_intel' [13:12, 11:55] ( 1 warnings 1 remarks )
FAILED: UNABLE TO COMPLETE COMPARISON -- TEST 'control_c48_intel' [, ]( MB)

FAILED: UNABLE TO FINISH COMPILE -- COMPILE 'fail_to_compile_intel' [, ]
SKIPPED: ASSOCIATED COMPILE FAILED -- TEST 'dependency_unmet_intel' [, ]( MB)

Jobs that are expected to pass passed, and jobs that are expected to fail failed.
