add gpu health check in prolog_epilog #223

NinaCai · 2024-10-15T19:20:04Z

Add DCGM and Xid tests in the prolog_epilog folder. These will be used in the blueprint.

tpdownes · 2024-10-21T16:18:52Z

tools/prologs-epilogs/gpu-health

+        [ $NVLINK_ERRORS -gt 0 ] && REASON+="NVLink errors detected ($NVLINK_ERRORS errors), "
+        REASON+="see /tmp/dcgm.out and /tmp/ecc_errors.out"
+
+        scontrol update nodename=$HOSTNAME state=drain reason="$REASON"


I believe this line is unnecessary as it is already the default behavior:

https://slurm.schedmd.com/prolog_epilog.html#failure_handling

It is also explicitly recommended that scontrol commands not be present in prologs/epilogs (see link above):

Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). Long running scripts can cause scheduling problems when jobs take a long time to start or finish. Slurm commands in these scripts can potentially lead to performance issues and should not be used.

cboneti · 2024-10-30

If she doesn't do this, is there any other way to set the reason for the drain state?
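One possible way to keep scontrol out of the prolog, sketched under assumptions: if Slurm already drains the node when the prolog exits non-zero (per the failure-handling page linked above), the script could record the detailed reason in the system log and simply fail. The helper names `build_reason` and `report_and_fail` and the `gpu-health` log tag are hypothetical. This does not change the reason string Slurm itself attaches to the drain, so it only partly answers the question.

```shell
#!/bin/bash
# Hypothetical helper: assemble a human-readable reason string.
# $1 = NVLink error count (assumed to be a non-negative integer).
build_reason() {
    local nvlink_errors="$1" reason=""
    if [ "$nvlink_errors" -gt 0 ]; then
        reason="NVLink errors detected (${nvlink_errors} errors), "
    fi
    printf '%s' "${reason}see /tmp/dcgm.out and /tmp/ecc_errors.out"
}

# Hypothetical helper: write the reason to syslog (if logger is available)
# and to stderr, then return non-zero so the caller can fail the prolog.
report_and_fail() {
    local reason="$1"
    if command -v logger >/dev/null 2>&1; then
        logger -t gpu-health -- "$reason" || true
    fi
    printf '%s\n' "$reason" >&2
    return 1
}

# Usage inside the prolog (commented out so the sketch has no side effects):
#   report_and_fail "$(build_reason "$NVLINK_ERRORS")" || exit 1
```

The trade-off is that an admin has to look in the node's log for the details rather than in the drain reason.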

tpdownes · 2024-10-21T16:24:35Z

tools/prologs-epilogs/gpu-health

+
+set -e
+
+#


I suggest you fail gracefully if nvidia-smi or dcgmi do not exist. Examples:

    if ! type -P nvidia-smi 1>/dev/null; then
        # this script requires nvidia-smi to function
        exit 0
    fi
    if ! type -P dcgmi 1>/dev/null; then
        # this script requires dcgmi to function
        exit 0
    fi

I'm not sure if it's good or bad to print messages to stdout or stderr. @samskillman WDYT?

cboneti · 2024-10-29

I agree with testing for dcgmi and nvidia-smi, and I think we should probably also test that the machine type is valid.
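The two tool checks suggested above could be folded into one helper. This is a minimal sketch, not the PR's code: the function name `have_all_tools` is hypothetical, and it uses `command -v` rather than `type -P`, which also matches shell functions and aliases but is not specific to bash. The machine-type check is left out because the thread does not say how it would be determined.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every named command can be found.
have_all_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            return 1
        fi
    done
    return 0
}

# Usage near the top of the script (commented out so the sketch is inert):
#   have_all_tools nvidia-smi dcgmi || exit 0
```

Exiting 0 here follows the suggestion above: a node without the tools is skipped rather than treated as unhealthy.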

tpdownes · 2024-10-21T16:28:39Z

tools/prologs-epilogs/gpu-health

+#
+if [ $NUMGPUS -gt 0 ]; then
+    echo "Execute DCGM health check and ECC error check for GPUs"
+    GPULIST=`nvidia-smi --query-gpu=index --format=csv,noheader | tr '\n' ',' | sed 's/,$//'`


Backticks are the legacy form of command substitution. Use $(...) instead:

https://www.shellcheck.net/wiki/SC2006
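As a sketch of what the `$(...)` form might look like here: the helper name `join_with_commas` is hypothetical, `fake_gpu_indices` stands in for the nvidia-smi query so the example runs without a GPU, and `paste -sd,` is swapped in for the original `tr | sed` pair as one way to join lines without a trailing comma.

```shell
#!/bin/bash
# Hypothetical helper: join lines from stdin into one comma-separated string.
join_with_commas() {
    paste -sd, -
}

# Stand-in for the nvidia-smi query so the sketch runs without a GPU.
fake_gpu_indices() {
    printf '%s\n' 0 1 2
}

# $(...) replaces the backticks; it nests and quotes more predictably.
GPULIST=$(fake_gpu_indices | join_with_commas)
echo "$GPULIST"   # prints 0,1,2
```

In the real script the stand-in would be replaced by the `nvidia-smi --query-gpu=index --format=csv,noheader` call from the diff.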

tpdownes · 2024-10-21T16:29:56Z

tools/prologs-epilogs/gpu-health

+if [ $NUMGPUS -gt 0 ]; then
+    echo "Execute DCGM health check and ECC error check for GPUs"
+    GPULIST=`nvidia-smi --query-gpu=index --format=csv,noheader | tr '\n' ',' | sed 's/,$//'`
+    rm /tmp/dcgm.out 2> /dev/null


Suggest rm -f as safer. Plain rm will prompt if rm is aliased to rm -i (which is common for the root user).
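A further reason for `rm -f`, assuming the `set -e` at the top of the script is still in effect at this point: plain `rm` returns non-zero when the file is missing, which would abort the script even with stderr redirected, whereas `rm -f` succeeds. A minimal sketch, with the helper name `clear_output_file` being hypothetical and a temporary file standing in for /tmp/dcgm.out:

```shell
#!/bin/bash
set -e

# Hypothetical helper: remove a stale output file, succeeding even if absent.
clear_output_file() {
    rm -f -- "$1"
}

scratch=$(mktemp)               # temporary file standing in for /tmp/dcgm.out
clear_output_file "$scratch"    # removes the existing file
clear_output_file "$scratch"    # file is already gone; still returns 0
echo "still running"
```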

add gpu health check in prolog_epilog

53140d8

tpdownes reviewed Oct 21, 2024

tpdownes requested changes Oct 21, 2024
