Hi, it seems that autoscaling no longer works with CentOS 8.

Tested with:
Cyclecloud Version: 8.1.0-1275
Cyclecloud-Slurm 2.4.2

Results:
CentOS 8 + Slurm 20.11.0-1 = no autoscaling
CentOS 7 + Slurm 20.11.0-1 = autoscaling works
[root@ip-0A781804 slurmctld]# cat slurmctld.log
[2020-12-18T00:15:27.016] debug: Log file re-opened
[2020-12-18T00:15:27.020] debug: creating clustername file: /var/spool/slurmd/clustername
[2020-12-18T00:15:27.021] error: Configured MailProg is invalid
[2020-12-18T00:15:27.021] slurmctld version 20.11.0 started on cluster asdasd
[2020-12-18T00:15:27.021] cred/munge: init: Munge credential signature plugin loaded
[2020-12-18T00:15:27.021] debug: auth/munge: init: Munge authentication plugin loaded
[2020-12-18T00:15:27.021] select/cons_res: common_init: select/cons_res loaded
[2020-12-18T00:15:27.021] select/cons_tres: common_init: select/cons_tres loaded
[2020-12-18T00:15:27.021] select/cray_aries: init: Cray/Aries node selection plugin loaded
[2020-12-18T00:15:27.021] select/linear: init: Linear node selection plugin loaded with argument 20
[2020-12-18T00:15:27.021] preempt/none: init: preempt/none loaded
[2020-12-18T00:15:27.021] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2020-12-18T00:15:27.022] debug: jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
[2020-12-18T00:15:27.022] ext_sensors/none: init: ExtSensors NONE plugin loaded
[2020-12-18T00:15:27.022] debug: switch/none: init: switch NONE plugin loaded
[2020-12-18T00:15:27.022] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:15:27.022] accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
[2020-12-18T00:15:27.022] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/assoc_usage`, No such file or directory
[2020-12-18T00:15:27.022] debug: Reading slurm.conf file: /etc/slurm/slurm.conf
[2020-12-18T00:15:27.023] debug: NodeNames=hpc-pg0-[1-4] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug: NodeNames=htc-[1-5] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2020-12-18T00:15:27.023] topology/tree: init: topology tree plugin loaded
[2020-12-18T00:15:27.023] debug: No DownNodes
[2020-12-18T00:15:27.023] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/last_config_lite`, No such file or directory
[2020-12-18T00:15:27.140] debug: Log file re-opened
[2020-12-18T00:15:27.141] sched: Backfill scheduler plugin loaded
[2020-12-18T00:15:27.141] debug: topology/tree: _read_topo_file: Reading the topology.conf file
[2020-12-18T00:15:27.141] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2020-12-18T00:15:27.141] debug: topology/tree: _log_switches: Switch level:0 name:hpc-Standard_HB60rs-pg0 nodes:hpc-pg0-[1-4] switches:(null)
[2020-12-18T00:15:27.141] debug: topology/tree: _log_switches: Switch level:0 name:htc nodes:htc-[1-5] switches:(null)
[2020-12-18T00:15:27.141] route/default: init: route default plugin loaded
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open node state file /var/spool/slurmd/node_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Information may be lost!
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state.old`, No such file or directory
[2020-12-18T00:15:27.141] No node state file (/var/spool/slurmd/node_state.old) to recover
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:15:27.142] No job state file (/var/spool/slurmd/job_state.old) to recover
[2020-12-18T00:15:27.142] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: Updating partition uid access list
[2020-12-18T00:15:27.142] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open reservation state file /var/spool/slurmd/resv_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Reservations may be lost
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No reservation state file (/var/spool/slurmd/resv_state.old) to recover
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open trigger state file /var/spool/slurmd/trigger_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No trigger state file (/var/spool/slurmd/trigger_state.old) to recover
[2020-12-18T00:15:27.143] read_slurm_conf: backup_controller not specified
[2020-12-18T00:15:27.143] Reinitializing job accounting state
[2020-12-18T00:15:27.143] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-18T00:15:27.143] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.143] Running as primary controller
[2020-12-18T00:15:27.143] debug: No backup controllers, not launching heartbeat.
[2020-12-18T00:15:27.143] debug: priority/basic: init: Priority BASIC plugin loaded
[2020-12-18T00:15:27.143] No parameter for mcs plugin, default values set
[2020-12-18T00:15:27.143] mcs: MCSParameters = (null). ondemand set.
[2020-12-18T00:15:27.143] debug: mcs/none: init: mcs none plugin loaded
[2020-12-18T00:15:57.143] debug: sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:15:57.143] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:16:27.212] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-18T00:16:27.212] debug: sched: Running job scheduler
[2020-12-18T00:17:27.284] debug: sched: Running job scheduler
[2020-12-18T00:17:27.285] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:18:27.362] debug: sched: Running job scheduler
[2020-12-18T00:19:27.438] debug: sched: Running job scheduler
[2020-12-18T00:19:27.438] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:20:27.512] debug: sched: Running job scheduler
[2020-12-18T00:20:27.513] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:20:27.513] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:20:27.513] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:20:27.513] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:20:27.513] No job state file (/var/spool/slurmd/job_state.old) found
[2020-12-18T00:21:27.676] debug: sched: Running job scheduler
[2020-12-18T00:21:27.676] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:22:27.749] debug: sched: Running job scheduler
[2020-12-18T00:23:27.821] debug: sched: Running job scheduler
[2020-12-18T00:23:27.821] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:24:27.892] debug: sched: Running job scheduler
[2020-12-18T00:25:27.966] debug: Updating partition uid access list
[2020-12-18T00:25:27.966] debug: sched: Running job scheduler
[2020-12-18T00:25:28.067] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:26:27.139] debug: sched: Running job scheduler
[2020-12-18T00:27:27.212] debug: sched: Running job scheduler
[2020-12-18T00:27:28.214] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:28:27.286] debug: sched: Running job scheduler
[2020-12-18T00:29:27.361] debug: sched: Running job scheduler
[2020-12-18T00:29:28.362] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:30:27.437] debug: sched: Running job scheduler
[2020-12-18T00:30:55.711] req_switch=-2 network='(null)'
[2020-12-18T00:30:55.711] Setting reqswitch to 1.
[2020-12-18T00:30:55.711] returning.
[2020-12-18T00:30:55.712] sched: _slurm_rpc_allocate_resources JobId=2 NodeList=htc-1 usec=1268
[2020-12-18T00:30:56.261] debug: sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:30:56.261] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:30:57.263] error: power_save: program exit status of 1
[2020-12-18T00:31:27.588] debug: sched: Running job scheduler
[2020-12-18T00:31:28.589] debug: shutting down backup controllers (my index: 0)
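The relevant line in this log is the error right after JobId=2 was allocated to htc-1: "error: power_save: program exit status of 1". slurmctld invoked the ResumeProgram for the powered-down cloud node, the script exited non-zero, so the node never starts and the job sits in CF. A minimal way to reproduce this outside slurmctld, assuming the CycleCloud script accepts the node-list argument that Slurm normally passes to ResumeProgram, would be:

# Hypothetical manual check, run on the scheduler node as the Slurm user;
# a non-zero exit here should match the "power_save: program exit status of 1" error above.
/opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh htc-1
echo "exit status: $?"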
[root@ip-0A781804 slurmctld]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 htc hostname andreim CF 1:37 1 htc-1
[root@ip-0A781804 slurmctld]# sinfo -V
slurm 20.11.0
[root@ip-0A781804 slurmctld]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: active (running) since Fri 2020-12-18 00:15:26 UTC; 17min ago
Main PID: 3980 (slurmctld)
Tasks: 8
Memory: 5.5M
CGroup: /system.slice/slurmctld.service
└─3980 /usr/sbin/slurmctld -D
Dec 18 00:15:26 ip-0A781804 systemd[1]: Started Slurm controller daemon.
[root@ip-0A781804 slurm]# cat topology.conf
SwitchName=hpc-Standard_HB60rs-pg0 Nodes=hpc-pg0-[1-4]
SwitchName=htc Nodes=htc-[1-5]
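For reference, the "TOPOLOGY: warning -- no switch can reach all nodes through its descendants" message in the log comes from these two disjoint top-level switches. If the warning is unwanted, a sketch of one possible fix is to add a root switch that aggregates them (the name "root" is hypothetical):

SwitchName=root Switches=hpc-Standard_HB60rs-pg0,htc

This only silences the topology warning and appears unrelated to the power_save failure.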
[root@ip-0A781804 slurm]# cat slurm.conf
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
PropagateResourceLimits=ALL
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser="slurm"
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
GresTypes=gpu
SelectTypeParameters=CR_Core_Memory
ClusterName="ASDASD"
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmctldParameters=idle_on_node_suspend
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd/slurmd.log
TopologyPlugin=topology/tree
JobSubmitPlugins=job_submit/cyclecloud
PrivateData=cloud
TreeWidth=65533
ResumeTimeout=1800
SuspendTimeout=600
SuspendTime=300
ResumeProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh
ResumeFailProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_fail_program.sh
SuspendProgram=/opt/cycle/jetpack/system/bootstrap/slurm/suspend_program.sh
SchedulerParameters=max_switch_wait=24:00:00
AccountingStorageType=accounting_storage/none
Include cyclecloud.conf
SlurmctldHost=ip-0A781804
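The autoscaling path is driven by the power-save settings above: idle cloud nodes are suspended after SuspendTime=300 seconds via SuspendProgram, and powered-down nodes assigned to a job are started via ResumeProgram. A quick sanity check, assuming scontrol can reach the controller from this host, is to confirm the running slurmctld actually picked up these values:

# Show the power-save related settings as seen by the running controller.
scontrol show config | grep -iE 'ResumeProgram|SuspendProgram|SuspendTime|ResumeTimeout'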
[root@ip-0A781804 slurm]# cat cyclecloud.conf
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=hpc Nodes=hpc-pg0-[1-4] Default=YES DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=hpc-pg0-[1-4] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=htc Nodes=htc-[1-5] Default=NO DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=htc-[1-5] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
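The RealMemory=217440 values follow the dampening rule described in the comments: CycleCloud reports 228884 MB and subtracts max(1 GB, 5%). A minimal sketch of that arithmetic, assuming both the reported value and the 1 GB floor are expressed in MB:

# Re-derive the dampened RealMemory value from the comments above.
reported_mb=228884
five_pct=$(( reported_mb * 5 / 100 ))               # 11444
overhead=$(( five_pct > 1024 ? five_pct : 1024 ))   # max(1 GB, 5%)
echo $(( reported_mb - overhead ))                  # prints 217440, matching RealMemory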