Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add gpu health check in prolog_epilog #223

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

NinaCai
Copy link

@NinaCai NinaCai commented Oct 15, 2024

Add dcgm and xid tests in prolog_epilog folder. This will be used in blueprint

[ $NVLINK_ERRORS -gt 0 ] && REASON+="NVLink errors detected ($NVLINK_ERRORS errors), "
REASON+="see /tmp/dcgm.out and /tmp/ecc_errors.out"

scontrol update nodename=$HOSTNAME state=drain reason="$REASON"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this line is unnecessary as it is already the default behavior:

https://slurm.schedmd.com/prolog_epilog.html#failure_handling

It is also explicitly recommend that scontrol commands not be present in prologs/epilogs (see link above):

Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). Long running scripts can cause scheduling problems when jobs take a long time to start or finish. Slurm commands in these scripts can potentially lead to performance issues and should not be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if she doesn't do this, is there any other way to set the reason for the drain state?


set -e

#
Copy link
Member

@tpdownes tpdownes Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you fail gracefully if nvidia-smi or dcgmi do not exist. Examples:

if ! type -P nvidia-smi 1>/dev/null; then
        # this script requires nvidia-smi to function
        exit 0
fi

if ! type -P dcgmi 1>/dev/null; then
        # this script requires dcgmi to function
        exit 0
fi

I'm not sure if it's good or bad to print messages to stdout or stderr. @samskillman WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with testing for dcgmi and nvidia-smi, I also think we should probably also test that the machine type is valid.

#
if [ $NUMGPUS -gt 0 ]; then
echo "Execute DCGM health check and ECC error check for GPUs"
GPULIST=`nvidia-smi --query-gpu=index --format=csv,noheader | tr '\n' ',' | sed 's/,$//'`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backtick is an older way of doing this. Use $(....) instead

https://www.shellcheck.net/wiki/SC2006

if [ $NUMGPUS -gt 0 ]; then
echo "Execute DCGM health check and ECC error check for GPUs"
GPULIST=`nvidia-smi --query-gpu=index --format=csv,noheader | tr '\n' ',' | sed 's/,$//'`
rm /tmp/dcgm.out 2> /dev/null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest rm -f as safer. This will prompt if the user has rm aliases to rm -i (which is common for the root user)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants