Slurm nodes installation under Ubuntu16.04 #2
base: master
Commits: a48702a, 6632b26, f2c7ef1, b5893f8, 0e8c1f2, 8e54231, 0b8945e, 3510b4e, 7e84382
This file was deleted.
@@ -0,0 +1 @@
+deb http://archive.ubuntu.com/ubuntu xenial main restricted universe multiverse
@@ -9,4 +9,4 @@
 # who wants to be able to SSH in as root via public-key on Biomedia servers.
 # disable SSH for anybody but root
 +:root:ALL
--:ALL EXCEPT (csg) dr jpassera bglocker:ALL
+-:ALL EXCEPT (csg) (biomedia) dr jpassera bglocker jgao:ALL
Review comment: regular users shouldn't have SSH access to the cluster nodes, hence the previous config
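For context, the file in this hunk follows the pam_access access.conf format: each rule is `permission : users/groups : origins`, evaluated top to bottom with the first matching rule winning. A minimal sketch of the root-only SSH policy the reviewer describes (illustrative, not the exact file contents):

```
# /etc/security/access.conf (sketch) — first matching rule wins
# permission : users/groups : origins
+ : root : ALL
- : ALL : ALL
```

Adding a group such as `(biomedia)` to the EXCEPT list, as this commit does, widens access; the reviewer's point is that the stricter previous rule was intentional.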
@@ -16,7 +16,8 @@ CgroupReleaseAgentDir=/var/spool/slurm-llnl/cgroup
 ConstrainCores=yes
 TaskAffinity=yes
 #ConstrainRAMSpace=no
-### not used yet
-#ConstrainDevices=no
-#AllowedDevicesFile=/etc/slurm-llnl/cgroup_allowed_devices_file.conf
+ConstrainSwapSpace=yes
+AllowedSwapSpace=10.0
+# Not well supported until Slurm v14.11.4 https://groups.google.com/d/msg/slurm-devel/oKAUed7AETs/Eb6thh9Lc0YJ
+#ConstrainDevices=yes
Review comment: Should that be enabled and not commented out then?
+#AllowedDevicesFile=/etc/slurm-llnl/cgroup_allowed_devices_file.conf
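If the reviewer's follow-up is acted on, enabling device constraints would look roughly like the following sketch (per the comment in the diff, this is only well supported from Slurm v14.11.4 onward):

```
# cgroup.conf (sketch) — enable device constraints instead of leaving them commented
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm-llnl/cgroup_allowed_devices_file.conf
```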
@@ -3,7 +3,7 @@
 # See the slurm.conf man page for more information.
 #
 # Workaround because Slurm does not recognize full hostname...
-ControlMachine={{ pillar['slurm']['controller'] }}
+ControlMachine=biomedia03
Review comment: Pillar (i.e. take this value from Pillar rather than hardcoding it)
 #ControlAddr=
 #BackupController=
 #BackupAddr=
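The recurring review objection is that values like `ControlMachine` should come from Pillar, as in the removed Jinja lines. A rough Python sketch of the substitution Salt performs when it renders the template (the pillar dict below is illustrative; in Salt it comes from pillar .sls files):

```python
# Minimal sketch of the Pillar-driven substitution the formula relies on.
# The pillar dict is illustrative stand-in data, not the site's real Pillar.
pillar = {
    "slurm": {
        "controller": "biomedia03",
        "db": {"name": "slurmdb", "user": "slurm"},
    }
}

# Equivalent of the Jinja lines
#   ControlMachine={{ pillar['slurm']['controller'] }}
#   DefaultStorageUser={{ pillar['slurm']['db']['user'] }}
rendered = "\n".join([
    "ControlMachine=%s" % pillar["slurm"]["controller"],
    "DefaultStorageUser=%s" % pillar["slurm"]["db"]["user"],
])
print(rendered)
```

Keeping the lookup in the template means the same formula renders correctly on any site whose Pillar defines `slurm:controller`, which is exactly what hardcoding `biomedia03` breaks.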
@@ -31,6 +31,7 @@ JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
 #Licenses=foo*4,bar
 MailProg=/usr/bin/mail
 MaxJobCount=25000
+MaxArraySize=32000
 #MaxStepCount=40000
 #MaxTasksPerNode=128
 MpiDefault=none
@@ -119,9 +120,9 @@ PreemptMode=OFF
 #
 # LOGGING AND ACCOUNTING
 DefaultStorageType=slurmdbd
-DefaultStorageUser={{ pillar['slurm']['db']['user'] }}
+DefaultStorageUser=slurm
 #DefaultStorageLoc=/var/log/slurm-llnl/job_completions.log
-DefaultStorageHost={{ pillar['slurm']['controller'] }}
+DefaultStorageHost=biomedia03
 DefaultStoragePort=6819
 AccountingStorageEnforce=associations,limits
 #AccountingStorageHost=
@@ -132,11 +133,11 @@ AccountingStorageEnforce=associations,limits
 #AccountingStorageUser=
 AccountingStoreJobComment=YES
 ClusterName=biomediacluster
-#DebugFlags=
-#JobCompHost={{ pillar['slurm']['controller'] }}
+DebugFlags=Gres
+#JobCompHost=biomedia03
 #JobCompLoc=
-#JobCompUser={{ pillar['slurm']['db']['user'] }}
-#JobCompPass={{ pillar['slurm']['db']['password'] }}
+#JobCompUser=slurm
+#JobCompPass=1BUy4eVv7X
 #JobCompPort=
 JobCompType=jobcomp/none
 JobAcctGatherFrequency=30
@@ -164,26 +165,69 @@ SlurmSchedLogFile=/var/log/slurm-llnl/sched.log
 #
 #
 # GRes configuration
-GresTypes={{ ','.join(pillar['slurm']['gres']) }}
+GresTypes=gpu
 # COMPUTE NODES
-{% for node, values in pillar['slurm']['nodes']['batch']['cpus'].items() %}
-NodeName={{ node }} RealMemory={{ values.mem }} CPUs={{ values.cores }} State=UNKNOWN
-{% endfor %}
-{% for node, values in pillar['slurm']['nodes']['interactive']['cpus'].items() %}
-NodeName={{ node }} RealMemory={{ values.mem }} CPUs={{ values.cores }} State=UNKNOWN
-{% endfor %}
-{% for node, values in pillar['slurm']['nodes']['batch']['gpus'].items() %}
-NodeName={{ node }} RealMemory={{ values.mem }} CPUs={{ values.cores }} Gres={{ (values.gres.items()|join(',')).replace('(','').replace(')','').replace("', ",':').replace("'",'') }} State=UNKNOWN
-{% endfor %}
-# Partitions
-PartitionName=long Nodes={{ ','.join(pillar['slurm']['nodes']['batch']['cpus']) }} Default=YES MaxTime=43200
-PartitionName=short Nodes={{ ','.join(pillar['slurm']['nodes']['batch']['cpus']) }} Default=NO MaxTime=60 Priority=5000
-PartitionName=gpus Nodes={{ ','.join(pillar['slurm']['nodes']['batch']['gpus']) }} Default=NO MaxTime=10080
-PartitionName=interactive Nodes={{ ','.join(pillar['slurm']['nodes']['interactive']['cpus']) }} Default=NO MaxTime=4320 Priority=7000 PreemptMode=OFF
-
-{% set rocsList = [] %}
-{% for node, values in pillar['slurm']['nodes']['batch']['cpus'].items() %} {% if node.startswith('roc') %} {% set rocsListTrash = rocsList.append(node) %} {% endif %} {% endfor %}
+NodeName=biomedia01 RealMemory=64000 CPUs=24 State=UNKNOWN
Review comment: this should be part of the Jinja template
+NodeName=biomedia02 RealMemory=63800 CPUs=24 State=UNKNOWN
+NodeName=biomedia05 RealMemory=64414 CPUs=24 State=UNKNOWN
+NodeName=biomedia06 RealMemory=128850 CPUs=64 State=UNKNOWN
+NodeName=biomedia07 RealMemory=128850 CPUs=64 State=UNKNOWN
+NodeName=biomedia08 RealMemory=128850 CPUs=64 State=UNKNOWN
+NodeName=biomedia09 RealMemory=128850 CPUs=64 State=UNKNOWN
+NodeName=biomedia10 RealMemory=128851 CPUs=24 State=UNKNOWN
+NodeName=biomedia11 RealMemory=252000 CPUs=32 State=UNKNOWN
+NodeName=roc01 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc02 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc03 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc04 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc05 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc06 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc07 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc08 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc09 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc10 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc11 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc12 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc13 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc14 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc15 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=roc16 RealMemory=257869 CPUs=32 State=UNKNOWN
+NodeName=monal01 RealMemory=80000 CPUs=12 Gres=gpu:8 State=UNKNOWN
+# Partitions
+PartitionName=long Nodes=biomedia01,biomedia02,biomedia05,biomedia06,biomedia07,biomedia08,biomedia09,biomedia10,roc01,roc02,roc03 Default=YES MaxTime=43200
Review comment: all roc machines should be in long as well. It's fine to have two partitions overlapping.
+PartitionName=rocsLong Nodes=roc04,roc05,roc06,roc07,roc08,roc09,roc10,roc11,roc12,roc13,roc14,roc15,roc16 Default=NO MaxTime=43200
-PartitionName=rocsLong Nodes={{ ','.join(rocsList) }} Default=NO MaxTime=43200
-PartitionName=rocsShort Nodes={{ ','.join(rocsList) }} Default=NO MaxTime=60 Priority=5000
+#PartitionName=long Nodes=biomedia01,biomedia02,biomedia03,biomedia05 Default=YES MaxTime=43200
+#PartitionName=short Nodes=biomedia01,biomedia03,biomedia05 Default=NO MaxTime=60 Priority=5000
+PartitionName=gpus Nodes=monal01 Default=NO MaxTime=10080 MaxCPUsPerNode=4 MaxMemPerNode=30720
Review comment: You can remove the
+PartitionName=interactive Nodes=biomedia11 Default=NO MaxTime=4320 Priority=7000 PreemptMode=OFF
+#PartitionName=rocsLong Nodes= Default=NO MaxTime=43200
+#PartitionName=rocsShort Nodes= Default=NO MaxTime=60 Priority=5000
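Two of the removed Jinja expressions are easy to misread: the `Gres=` one strips the parentheses and quotes that Jinja's `join` leaves around dict items, and the `rocsList` loop collects the `roc*` node names for the rocs partitions. A plain-Python sketch of what they compute (the input dicts and lists are illustrative stand-ins for the Pillar data):

```python
# Sketch: Python equivalents of two Jinja expressions removed by this commit.

def render_gres(gres):
    # Mirrors (values.gres.items()|join(','))
    #   .replace('(','').replace(')','').replace("', ",':').replace("'",'')
    # Jinja stringifies each (key, value) tuple, so the replaces turn
    # "('gpu', 8)" into Slurm's Gres syntax "gpu:8".
    joined = ",".join(str(item) for item in gres.items())
    return (joined.replace("(", "").replace(")", "")
                  .replace("', ", ":").replace("'", ""))

def rocs_list(batch_cpu_nodes):
    # Mirrors the {% set rocsList = [] %} loop collecting roc* node names.
    return [node for node in batch_cpu_nodes if node.startswith("roc")]

print(render_gres({"gpu": 8}))                                 # → gpu:8
print(",".join(rocs_list(["biomedia01", "roc01", "roc02"])))   # → roc01,roc02
```

This is why the reviewer asks for the new roc nodes to go through the template: once they are in Pillar, `rocsList` and the partition node lists update themselves.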
@@ -6,7 +6,7 @@ ArchiveSuspend=no
 #ArchiveScript=/usr/sbin/slurm.dbd.archive
 #AuthInfo=/var/run/munge/munge.socket.2
 AuthType=auth/munge
-DbdHost={{ pillar['slurm']['controller'] }}
+DbdHost=biomedia03
Review comment: no hardcoded values please => use Pillar
 DbdPort=6819
 DebugLevel=info
 PurgeEventAfter=1month
@@ -16,10 +16,10 @@ PurgeSuspendAfter=1month
 LogFile=/var/log/slurm-llnl/slurmdbd.log
 PidFile=/var/run/slurm-llnl/slurmdbd.pid
 SlurmUser=slurm
-#StorageHost={{ pillar['slurm']['controller'] }}
+#StorageHost=biomedia03
 StorageHost=localhost
 StorageType=accounting_storage/mysql
 StoragePort=3306
-StorageLoc={{ pillar['slurm']['db']['name'] }}
-StorageUser={{ pillar['slurm']['db']['user'] }}
-StoragePass={{ pillar['slurm']['db']['password'] }}
+StorageLoc=slurmdb
+StorageUser=slurm
+StoragePass=1BUy4eVv7X
Review comment: password in cleartext in the commit history...
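On the cleartext-password comment: in a Salt setup the credential would normally live in a Pillar file that stays out of the repository (or is encrypted, e.g. with Salt's GPG renderer), and the template would keep referencing `pillar['slurm']['db']['password']` as the formula originally did. A sketch of the shape such a pillar file might take (all values illustrative):

```
# pillar/slurm.sls (sketch) — keep out of version control, or encrypt
slurm:
  controller: biomedia03
  db:
    name: slurmdb
    user: slurm
    password: CHANGE-ME   # never commit the real value
```

Since the real password has already appeared in the commit history, it should be treated as compromised and rotated regardless of how the files are restructured.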
Review comment: Maybe move that lower in an "Instructions" subsection instead of replacing what the formula contains?