Slurm nodes installation under Ubuntu 16.04 #2

Open: wants to merge 9 commits into base: master
Showing changes from 1 commit
12 changes: 6 additions & 6 deletions README.md
@@ -3,9 +3,9 @@

Salt formula provisioning a Slurm cluster

Available states:
* Munge
* Screen
* Slurm
* Slurm Database
* SSH
To install Slurm nodes, you need to copy the following files (on the Slurm master node); a sketch of how a state can then distribute them follows the list:
Review comment (Member): Maybe move that lower in an "Instructions" subsection instead of replacing what the formula contains?


- munge.key from /etc/munge/munge.key to /srv/salt/munge.key
- slurm.cert from /etc/slurm-llnl/slurm.cert to /srv/salt/slurm.cert
- slurm.conf from files/etc/slurm-llnl/slurm.conf to /srv/salt/slurm.conf
- create empty cgroup.conf and gres.conf in /srv/salt/
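
The files go under /srv/salt/ because that is Salt's default fileserver root, so the formula's states can then push them to the nodes via salt:// URLs. Below is a minimal sketch of such a state for the munge key; the state ID, ownership, and mode are assumptions, not necessarily what this formula's states actually do.

# Hypothetical state distributing the copied munge key to a node
munge-key:
  file.managed:
    - name: /etc/munge/munge.key
    - source: salt://munge.key
    - user: munge
    - group: munge
    - mode: '0400'

The same pattern would cover slurm.cert, slurm.conf, cgroup.conf, and gres.conf.
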
5 changes: 0 additions & 5 deletions files/etc/apt/preferences.d/disable-utopic-policy

This file was deleted.

5 changes: 0 additions & 5 deletions files/etc/apt/preferences.d/slurm-utopic-policy

This file was deleted.

1 change: 0 additions & 1 deletion files/etc/apt/sources.list.d/utopic.list

This file was deleted.

1 change: 1 addition & 0 deletions files/etc/apt/sources.list.d/xenial.list
@@ -0,0 +1 @@
deb http://archive.ubuntu.com/ubuntu xenial main restricted universe multiverse
6 changes: 0 additions & 6 deletions files/etc/default/munge

This file was deleted.

2 changes: 1 addition & 1 deletion files/etc/security/access.conf
@@ -9,4 +9,4 @@
# who wants to be able to SSH in as root via public-key on Biomedia servers.
# disable SSH for anybody but root
+:root:ALL
-:ALL EXCEPT (csg) dr jpassera bglocker:ALL
-:ALL EXCEPT (csg) (biomedia) dr jpassera bglocker jgao:ALL
Review comment (Member): regular users shouldn't have SSH access to the cluster nodes, hence the previous config.

2 changes: 0 additions & 2 deletions files/etc/slurm-llnl/bardolph/gres.conf

This file was deleted.

9 changes: 5 additions & 4 deletions files/etc/slurm-llnl/cgroup.conf
@@ -16,7 +16,8 @@ CgroupReleaseAgentDir=/var/spool/slurm-llnl/cgroup
ConstrainCores=yes
TaskAffinity=yes
#ConstrainRAMSpace=no
### not used yet
#ConstrainDevices=no
#AllowedDevicesFile=/etc/slurm-llnl/cgroup_allowed_devices_file.conf

ConstrainSwapSpace=yes
AllowedSwapSpace=10.0
# Not well supported until Slurm v14.11.4 https://groups.google.com/d/msg/slurm-devel/oKAUed7AETs/Eb6thh9Lc0YJ
#ConstrainDevices=yes
Review comment (Member): Should that be enabled and not commented out then?

#AllowedDevicesFile=/etc/slurm-llnl/cgroup_allowed_devices_file.conf
8 changes: 0 additions & 8 deletions files/etc/slurm-llnl/monal01/gres.conf

This file was deleted.

6 changes: 0 additions & 6 deletions files/etc/slurm-llnl/monal02/gres.conf

This file was deleted.

96 changes: 70 additions & 26 deletions files/etc/slurm-llnl/slurm.conf
@@ -3,7 +3,7 @@
# See the slurm.conf man page for more information.
#
# Workaround because Slurm does not recognize full hostname...
ControlMachine={{ pillar['slurm']['controller'] }}
ControlMachine=biomedia03
Review comment (Member): Pillar (i.e. take this value from Pillar instead of hardcoding it).

#ControlAddr=
#BackupController=
#BackupAddr=
@@ -31,6 +31,7 @@ JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
#Licenses=foo*4,bar
MailProg=/usr/bin/mail
MaxJobCount=25000
MaxArraySize=32000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
@@ -119,9 +120,9 @@ PreemptMode=OFF
#
# LOGGING AND ACCOUNTING
DefaultStorageType=slurmdbd
DefaultStorageUser={{ pillar['slurm']['db']['user'] }}
DefaultStorageUser=slurm
#DefaultStorageLoc=/var/log/slurm-llnl/job_completions.log
DefaultStorageHost={{ pillar['slurm']['controller'] }}
DefaultStorageHost=biomedia03
DefaultStoragePort=6819
AccountingStorageEnforce=associations,limits
#AccountingStorageHost=
@@ -132,11 +133,11 @@ AccountingStorageEnforce=associations,limits
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=biomediacluster
#DebugFlags=
#JobCompHost={{ pillar['slurm']['controller'] }}
DebugFlags=Gres
#JobCompHost=biomedia03
#JobCompLoc=
#JobCompUser={{ pillar['slurm']['db']['user'] }}
#JobCompPass={{ pillar['slurm']['db']['password'] }}
#JobCompUser=slurm
#JobCompPass=1BUy4eVv7X
#JobCompPort=
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
@@ -164,26 +165,69 @@ SlurmSchedLogFile=/var/log/slurm-llnl/sched.log
#
#
# GRes configuration
GresTypes={{ ','.join(pillar['slurm']['gres']) }}
GresTypes=gpu
# COMPUTE NODES
{% for node, values in pillar['slurm']['nodes']['batch']['cpus'].items() %}
NodeName={{ node }} RealMemory={{ values.mem }} CPUs={{ values.cores }} State=UNKNOWN
{% endfor %}
{% for node, values in pillar['slurm']['nodes']['interactive']['cpus'].items() %}
NodeName={{ node }} RealMemory={{ values.mem }} CPUs={{ values.cores }} State=UNKNOWN
{% endfor %}
{% for node, values in pillar['slurm']['nodes']['batch']['gpus'].items() %}
NodeName={{ node }} RealMemory={{ values.mem }} CPUs={{ values.cores }} Gres={{ (values.gres.items()|join(',')).replace('(','').replace(')','').replace("', ",':').replace("'",'') }} State=UNKNOWN
{% endfor %}
# Partitions
PartitionName=long Nodes={{ ','.join(pillar['slurm']['nodes']['batch']['cpus']) }} Default=YES MaxTime=43200
PartitionName=short Nodes={{ ','.join(pillar['slurm']['nodes']['batch']['cpus']) }} Default=NO MaxTime=60 Priority=5000
PartitionName=gpus Nodes={{ ','.join(pillar['slurm']['nodes']['batch']['gpus']) }} Default=NO MaxTime=10080
PartitionName=interactive Nodes={{ ','.join(pillar['slurm']['nodes']['interactive']['cpus']) }} Default=NO MaxTime=4320 Priority=7000 PreemptMode=OFF

{% set rocsList = [] %}
{% for node, values in pillar['slurm']['nodes']['batch']['cpus'].items() %} {% if node.startswith('roc') %} {% set rocsListTrash = rocsList.append(node) %} {% endif %} {% endfor %}
NodeName=biomedia01 RealMemory=64000 CPUs=24 State=UNKNOWN

Review comment (Member): this should be part of the Jinja template.
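
For context, the Jinja loops removed above pull all of this from Pillar (the rocsList block a few lines up reads the same data: it collects the node names starting with "roc" from the batch CPU nodes, using a throwaway variable to discard append()'s return value). A hypothetical Pillar layout matching those loops is sketched below; the key names come from the template, the example values from the hardcoded lines in this diff, and the exact production layout is an assumption:

slurm:
  controller: biomedia03
  gres:
    - gpu
  nodes:
    batch:
      cpus:
        biomedia01: {mem: 64000, cores: 24}
        roc01: {mem: 257869, cores: 32}
      gpus:
        # the template flattens this mapping into "Gres=gpu:8"
        monal01: {mem: 80000, cores: 12, gres: {gpu: 8}}
    interactive:
      cpus:
        biomedia11: {mem: 252000, cores: 32}

With data in this shape, the NodeName lines below and the partition node lists can be rendered by the template instead of being hardcoded, which is what the review comments ask for.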

NodeName=biomedia02 RealMemory=63800 CPUs=24 State=UNKNOWN
NodeName=biomedia05 RealMemory=64414 CPUs=24 State=UNKNOWN
NodeName=biomedia06 RealMemory=128850 CPUs=64 State=UNKNOWN
NodeName=biomedia07 RealMemory=128850 CPUs=64 State=UNKNOWN
NodeName=biomedia08 RealMemory=128850 CPUs=64 State=UNKNOWN
NodeName=biomedia09 RealMemory=128850 CPUs=64 State=UNKNOWN
NodeName=biomedia10 RealMemory=128851 CPUs=24 State=UNKNOWN
NodeName=biomedia11 RealMemory=252000 CPUs=32 State=UNKNOWN
NodeName=roc01 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc02 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc03 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc04 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc05 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc06 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc07 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc08 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc09 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc10 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc11 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc12 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc13 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc14 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc15 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=roc16 RealMemory=257869 CPUs=32 State=UNKNOWN
NodeName=monal01 RealMemory=80000 CPUs=12 Gres=gpu:8 State=UNKNOWN

# Partitions
PartitionName=long Nodes=biomedia01,biomedia02,biomedia05,biomedia06,biomedia07,biomedia08,biomedia09,biomedia10,roc01,roc02,roc03 Default=YES MaxTime=43200
Review comment (Member): all roc machines should be in long as well; it's fine to have two partitions overlapping.

PartitionName=rocsLong Nodes=roc04,roc05,roc06,roc07,roc08,roc09,roc10,roc11,roc12,roc13,roc14,roc15,roc16 Default=NO MaxTime=43200

PartitionName=rocsLong Nodes={{ ','.join(rocsList) }} Default=NO MaxTime=43200
PartitionName=rocsShort Nodes={{ ','.join(rocsList) }} Default=NO MaxTime=60 Priority=5000
#PartitionName=long Nodes=biomedia01,biomedia02,biomedia03,biomedia05 Default=YES MaxTime=43200
#PartitionName=short Nodes=biomedia01,biomedia03,biomedia05 Default=NO MaxTime=60 Priority=5000
PartitionName=gpus Nodes=monal01 Default=NO MaxTime=10080 MaxCPUsPerNode=4 MaxMemPerNode=30720
Review comment (@jopasserat, Member, May 1, 2018): You can remove the MaxCPUsPerNode=4 MaxMemPerNode=30720 settings. Just legacy code here.

PartitionName=interactive Nodes=biomedia11 Default=NO MaxTime=4320 Priority=7000 PreemptMode=OFF

#PartitionName=rocsLong Nodes= Default=NO MaxTime=43200
#PartitionName=rocsShort Nodes= Default=NO MaxTime=60 Priority=5000
10 changes: 5 additions & 5 deletions files/etc/slurm-llnl/slurmdbd.conf
@@ -6,7 +6,7 @@ ArchiveSuspend=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive
#AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost={{ pillar['slurm']['controller'] }}
DbdHost=biomedia03
Review comment (Member): no hardcoded values please => use Pillar.

DbdPort=6819
DebugLevel=info
PurgeEventAfter=1month
@@ -16,10 +16,10 @@ PurgeSuspendAfter=1month
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
#StorageHost={{ pillar['slurm']['controller'] }}
#StorageHost=biomedia03
StorageHost=localhost
StorageType=accounting_storage/mysql
StoragePort=3306
StorageLoc={{ pillar['slurm']['db']['name'] }}
StorageUser={{ pillar['slurm']['db']['user'] }}
StoragePass={{ pillar['slurm']['db']['password'] }}
StorageLoc=slurmdb
StorageUser=slurm
StoragePass=1BUy4eVv7X
Review comment (Member): password in cleartext in the commit history...
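
A common way to address this (a sketch under assumptions; the actual Pillar layout and secret handling for this cluster are not shown in the diff) is to keep the credentials in Pillar under the keys the removed template already referenced, and to keep that Pillar file out of the repository or encrypt it, for example with Salt's GPG pillar renderer:

# Hypothetical pillar data; the password is a placeholder, not the value leaked above
slurm:
  db:
    name: slurmdb
    user: slurm
    password: CHANGE_ME

slurmdbd.conf then keeps StorageLoc={{ pillar['slurm']['db']['name'] }}, StorageUser={{ pillar['slurm']['db']['user'] }}, and StoragePass={{ pillar['slurm']['db']['password'] }} as before, so no secret lands in the commit history; the password that has already been committed should also be rotated.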
