slurmd nodes fail health check, enter drain state #22

Open
bnordgren opened this issue Aug 21, 2024 · 0 comments
Labels
needs triage (Needs further investigation to determine cause and/or work required to implement fix/feature)

Comments

@bnordgren

Bug Description

Nodes come up temporarily, but are quickly set to the drain state.

ubuntu@hpc-login:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmd       up   infinite      2  drain node-b,node-d
ubuntu@hpc-login:~$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
NHC: check_ps_servic root      2024-08-21T13:10:00 node-b,node-d
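
To see what NHC is (or is not) finding, it can help to inspect sshd directly on a drained node. A sketch, assuming the compute nodes are slurmd units (slurmd/0 is a placeholder; substitute the unit backing node-b, and note the same juju exec form used for the slurmctld logs below):

juju exec -u slurmd/0 -- systemctl status ssh     # on Ubuntu the OpenSSH unit is named "ssh"
juju exec -u slurmd/0 -- pgrep -a -u root sshd    # the root-owned sshd process NHC checks for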

To Reproduce

  1. scontrol update nodename=node-b state=resume
  2. Wait, usually no more than 5 minutes.
  3. sinfo (the watch one-liner below makes the transition easy to observe)
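
A compact way to resume both nodes and watch them flip back to drain (node names taken from the report above; the 30-second interval is arbitrary):

scontrol update nodename=node-b,node-d state=resume
watch -n 30 'sinfo; sinfo -R'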

Environment

Base OS is Ubuntu 22.04, deployed by MAAS. The charms are deployed from the latest/edge channel.

Relevant log output

juju exec -u slurmctld/0 -- sudo journalctl -u slurmctld -x
...
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b state set to IDLE
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d state set to IDLE
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: Node node-d now responding
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: Node node-b now responding
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: sched/backfill: _start_job: Started JobId=3 in slurmd on node-b,node-d
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=3 WTERMSIG 53
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=3 done
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _slurm_rpc_submit_batch_job: JobId=4 InitPrio=4294901756 usec=909
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: sched: Allocate JobId=4 NodeList=node-b,node-d #CPUs=4 Partition=slurmd
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=4 WEXITSTATUS 0
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=4 done
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d reason set to: NHC: check_ps_service:  Service sshd (process sshd) owned by root not running; start in progress
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d state set to DRAINED
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b reason set to: NHC: check_ps_service:  Service sshd (process sshd) owned by root not running; start in progress
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b state set to DRAINED
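
For reference, the failing check is NHC's check_ps_service. The sample nhc.conf shipped by upstream LBNL NHC contains a line of this shape; it is an assumption that the charm installs a similar default, and the actual path and contents may differ:

# /etc/nhc/nhc.conf (sketch of the upstream default; the charm's file may differ)
# Fail the node if no root-owned "sshd" process is found, and attempt to
# start the service (-S), which matches the "start in progress" wording above.
* || check_ps_service -u root -S sshd

If sshd is in fact running on node-b and node-d, the process match itself (owner and process name) is the first thing to verify, e.g. with the pgrep sketch above.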

Additional context

No response

@NucciTheBoss added the needs triage label on Aug 21, 2024