-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add gpu health check in prolog_epilog #223
base: master
Are you sure you want to change the base?
Conversation
[ $NVLINK_ERRORS -gt 0 ] && REASON+="NVLink errors detected ($NVLINK_ERRORS errors), " | ||
REASON+="see /tmp/dcgm.out and /tmp/ecc_errors.out" | ||
|
||
scontrol update nodename=$HOSTNAME state=drain reason="$REASON" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this line is unnecessary as it is already the default behavior:
https://slurm.schedmd.com/prolog_epilog.html#failure_handling
It is also explicitly recommend that scontrol
commands not be present in prologs/epilogs (see link above):
Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). Long running scripts can cause scheduling problems when jobs take a long time to start or finish. Slurm commands in these scripts can potentially lead to performance issues and should not be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if she doesn't do this, is there any other way to set the reason for the drain state?
|
||
set -e | ||
|
||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest you fail gracefully if nvidia-smi
or dcgmi
do not exist. Examples:
if ! type -P nvidia-smi 1>/dev/null; then
# this script requires nvidia-smi to function
exit 0
fi
if ! type -P dcgmi 1>/dev/null; then
# this script requires dcgmi to function
exit 0
fi
I'm not sure if it's good or bad to print messages to stdout or stderr. @samskillman WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with testing for dcgmi and nvidia-smi, I also think we should probably also test that the machine type is valid.
# | ||
if [ $NUMGPUS -gt 0 ]; then | ||
echo "Execute DCGM health check and ECC error check for GPUs" | ||
GPULIST=`nvidia-smi --query-gpu=index --format=csv,noheader | tr '\n' ',' | sed 's/,$//'` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backtick is an older way of doing this. Use $(....)
instead
if [ $NUMGPUS -gt 0 ]; then | ||
echo "Execute DCGM health check and ECC error check for GPUs" | ||
GPULIST=`nvidia-smi --query-gpu=index --format=csv,noheader | tr '\n' ',' | sed 's/,$//'` | ||
rm /tmp/dcgm.out 2> /dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest rm -f
as safer. This will prompt if the user has rm
aliases to rm -i
(which is common for the root
user)
Add dcgm and xid tests in prolog_epilog folder. This will be used in blueprint