Merge pull request NVIDIA#1157 from ajdecon/slurm-nvml
Enable Slurm MIG support with `AutoDetect=NVML`
ajdecon authored Sep 30, 2022
2 parents 2cf44a8 + 23e6e01 commit 30f4dee
Showing 9 changed files with 260 additions and 1 deletion.
4 changes: 4 additions & 0 deletions config.example/group_vars/slurm-cluster.yml
@@ -17,6 +17,10 @@ slurm_db_password: AlsoReplaceWithASecurePasswordInTheVault
#slurm_max_job_timelimit: INFINITE
#slurm_default_job_timelimit:

# Auto-detect GPUs using NVML
# See docs/slurm-cluster/nvml.md for details.
slurm_autodetect_nvml: true

# Ensure hosts file generation only runs across slurm cluster
hosts_add_ansible_managed_hosts_groups: ["slurm-cluster"]

2 changes: 2 additions & 0 deletions docs/slurm-cluster/README.md
@@ -57,6 +57,8 @@ Instructions for deploying a GPU cluster with Slurm
> `slurm_enable_ha: true` in `config/group_vars/slurm-cluster.yml`. For more information about HA Slurm deployments,
> see: https://slurm.schedmd.com/quickstart_admin.html#HA
4. If running on a cluster where you intend to configure [Multi-Instance GPU](https://www.nvidia.com/en-us/technologies/multi-instance-gpu/), consult the [Slurm NVML documentation](./nvml.md).

4. Verify the configuration.

```bash
226 changes: 226 additions & 0 deletions docs/slurm-cluster/nvml.md
@@ -0,0 +1,226 @@
GPU auto-detection with Slurm and NVML
======================================

By default, DeepOps auto-detects GPUs on each node using a [custom Ansible facts script](https://github.com/NVIDIA/deepops/blob/72fe3a187ceb36c76febb64c0bab484cbae6a451/roles/facts/files/gpus.fact),
and uses this to generate Slurm configuration files such as [`slurm.conf`](https://slurm.schedmd.com/slurm.conf.html) and [`gres.conf`](https://slurm.schedmd.com/gres.conf.html).
This mechanism detects GPUs using only `lspci`, allowing us to generate configuration even if a GPU driver is not present.

The disadvantage of this method is that it generates static configuration files.
These files cannot account for future changes in GPU hardware,
or for dynamic GPU configurations such as the use of [NVIDIA Multi-Instance GPU](https://www.nvidia.com/en-us/technologies/multi-instance-gpu/).

Instead of generating the configuration based on `lspci` output, Slurm provides the option of [GPU auto-detection](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect)
using the [NVIDIA Management Library](https://developer.nvidia.com/nvidia-management-library-nvml) (NVML).
This method enables Slurm to automatically detect the presence of NVIDIA GPUs, and set up local CPU and network affinity correctly.
However, using this feature requires that the NVIDIA driver and CUDA be installed before building Slurm.
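
In DeepOps, enabling this option pulls in CUDA via the `nvidia_cuda` role and switches the Slurm build to an NVML-aware configure command that points at the CUDA install (abridged here from `roles/slurm/defaults/main.yml`; the full command keeps all of the existing configure flags and adds `--with-nvml`):

```
# Abridged sketch of the NVML-enabled build settings in roles/slurm/defaults/main.yml
slurm_cuda_prefix: /usr/local/cuda
slurm_configure_nvml: './configure ... --with-nvml={{ slurm_cuda_prefix }}'
```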


## Turning on NVML auto-detection

To enable NVML auto-detection in DeepOps, set the following variable in your DeepOps configuration:

```
slurm_autodetect_nvml: true
```
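
With this variable set, the generated `gres.conf` contains a single auto-detection directive instead of an explicit per-device list (see the template in `roles/slurm/templates/etc/slurm/gres.conf`):

```
# gres.conf rendered by DeepOps when slurm_autodetect_nvml is true
AutoDetect=nvml
```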

If you are building a new cluster, simply follow the [Slurm deployment guide](./README.md).

If you are enabling NVML on an existing cluster, force a Slurm rebuild by running:

```
ansible-playbook -l slurm-cluster -e '{"slurm_force_rebuild": true}' playbooks/slurm-cluster/slurm.yml
```


## Configuring slurm.conf for heterogeneous configurations

The auto-generated `slurm.conf` provided by DeepOps assumes a uniform GPU configuration across your cluster.
It does not account for multiple types of GPU hardware in the same cluster,
or for multiple different types of MIG instance.

This results in node configuration lines in `slurm.conf` such as the following:

```
NodeName=node01 Gres=gpu:2 CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 Procs=2 RealMemory=15208 State=UNKNOWN
```

If you have a non-uniform GPU configuration, and especially if you have multiple types of MIG instance, you may wish to configure Slurm so that jobs can be scheduled based on GPU type.
To distinguish GPU types, Slurm requires that you specify the number of GPUs of each type expected on each node.
This includes specifying the expected set of MIG instances on each node.

For example, if you have a node with two A100 GPUs, each configured with one `2g.20gb` and one `1g.10gb` MIG instance, then the `slurm.conf` line for this node might be:

```
NodeName=node01 Gres=gpu:1g.10gb:2,gpu:2g.20gb:2 CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 Procs=2 RealMemory=15208 State=UNKNOWN
```
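
With typed GRES declared this way, jobs can request a specific instance type directly, using the same syntax shown in the testing section at the end of this document:

```
srun -N1 --gres=gpu:2g.20gb:1 nvidia-smi -L
```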

To configure a custom `slurm.conf` file, instead of using the auto-generated file provided by DeepOps, see the documentation
on [configuring DeepOps](../deepops/configuration.md) and on [using static Slurm configuration](./large-deployments.md#manually-generate-static-files-for-cluster-wide-configuration).
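
For example, you can point DeepOps at a local copy of the file, as is done in the MIG walkthrough below:

```
slurm_conf_template: "../../config/files/slurm.conf"
```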


## Configuring MIG

NVIDIA Multi-Instance GPU (MIG) is supported on specific NVIDIA GPUs such as the A100 and the A30.
This feature enables the GPU to be split into multiple distinct GPU instances, which are presented to the user as if they were separate physical GPUs.
This is especially helpful when scheduling applications that do not need a full physical GPU to achieve good performance.

MIG is managed using the NVIDIA MIG Manager tool.
To install and configure this tool, you can use the [nvidia-mig.yml](../../playbooks/nvidia-software/nvidia-mig.yml) playbook in DeepOps.
For example:

```
ansible-playbook -l slurm-node -e mig_manager_profile="all-1g.10gb" playbooks/nvidia-software/nvidia-mig.yml
```

Here, `mig_manager_profile` selects a configuration profile for the NVIDIA `mig-parted` tool.

For more information on configuring MIG, see the documentation for [NVIDIA mig-parted](https://github.com/NVIDIA/mig-parted).
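
As a rough sketch of what such a profile looks like (field names follow the upstream `mig-parted` examples; consult that repository for the exact schema and the profiles shipped with your version):

```
# Sketch of a mig-parted config entry, based on the upstream examples
version: v1
mig-configs:
  all-1g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7
```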

### MIG configuration example

#### Deploy the cluster

For this example, we use a small cluster consisting of the following two nodes:

- A generic white-box server running Ubuntu 20.04 as the Slurm controller
- A DGX A100 running DGX OS 5.2.0 as the Slurm compute node

We use the following inventory file:

```
[all]
login01
dgx01
######
# SLURM
######
[slurm-master]
login01
[slurm-nfs]
login01
[slurm-node]
dgx01
[slurm-cache:children]
slurm-master
[slurm-nfs-client:children]
slurm-node
[slurm-metric:children]
slurm-master
[slurm-login:children]
slurm-master
# Single group for the whole cluster
[slurm-cluster:children]
slurm-master
slurm-node
slurm-cache
slurm-nfs
slurm-metric
slurm-login
```

We use the default configuration in `config/group_vars/slurm-cluster.yml`, except for the following settings:

```
slurm_autodetect_nvml: true
mig_manager_profile: all-balanced-a100-80
```

We configure MIG on the DGX A100 by running:

```
ansible-playbook -b -l slurm-node playbooks/nvidia-software/nvidia-mig.yml
```

We can then verify that the MIG devices are configured:

```
user@dgx01$ nvidia-smi -L | head -n10
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-8e100991-ce97-6694-b8eb-b8b1ff5053af)
MIG 3g.40gb Device 0: (UUID: MIG-dcf43e8e-5d19-5fa8-8b1d-8aaca1e467f0)
MIG 2g.20gb Device 1: (UUID: MIG-6b84cf06-0978-530f-bea3-dd6108f98ebf)
MIG 1g.10gb Device 2: (UUID: MIG-3fff3a97-95a9-5046-9053-4c0da3d1add7)
MIG 1g.10gb Device 3: (UUID: MIG-7a99e994-6607-5d8a-9ab3-ace43cf9bd96)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-938c4705-98bf-1735-e98b-c2a5872a8022)
MIG 3g.40gb Device 0: (UUID: MIG-51872ee5-5410-564e-abd2-27382756d2d4)
MIG 2g.20gb Device 1: (UUID: MIG-9af7553b-306e-55b4-a8c7-4cdef4066af4)
MIG 1g.10gb Device 2: (UUID: MIG-3407f5e5-83f9-55c0-abe5-0e2259ac50b1)
MIG 1g.10gb Device 3: (UUID: MIG-91e26891-6f5b-5ff3-9c66-310e47d6b059)
```
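
The same `nvidia-smi` query that the DeepOps MIG role uses to detect MIG-capable devices also works as a quick check that MIG mode is enabled; each GPU should report `Enabled`:

```
user@dgx01$ nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
```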

We then deploy a Slurm cluster across the two nodes by running:

```
ansible-playbook -b -l slurm-cluster playbooks/slurm-cluster.yml
```

#### Custom slurm.conf and nhc.conf

Because we are using MIG, we need a custom Slurm configuration that specifies the number and type of MIG instances on the node. To do this, we first copy the `slurm.conf` and `nhc.conf` files from the DGX node into our DeepOps repo:

```
$ scp user@dgx01:/etc/slurm/slurm.conf config/files/slurm.conf
$ scp user@dgx01:/etc/nhc/nhc.conf config/files/nhc.conf
```

We then edit `slurm.conf` to specify the GPUs on the node according to the configured MIG instances. Because the node has 8 physical GPUs, each with one 3g.40gb, one 2g.20gb, and two 1g.10gb instances, this gives 8x 3g.40gb instances, 8x 2g.20gb instances, and 16x 1g.10gb instances in total:

```
< NodeName=dgx01 Gres=gpu:8 CPUs=256 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 Procs=128 RealMemory=1960850 State=UNKNOWN
---
> NodeName=dgx01 Gres=gpu:3g.40gb:8,gpu:2g.20gb:8,gpu:1g.10gb:16 CPUs=256 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 Procs=128 RealMemory=1960850 State=UNKNOWN
```

NHC does not yet support MIG, so we disable the GPU count check in `nhc.conf`:

```
< dgx01 || check_nv_gpu_count 8
---
> # dgx01 || check_nv_gpu_count 8
```

We edit `config/group_vars/slurm-cluster.yml` to specify our custom files:

```
slurm_conf_template: "../../config/files/slurm.conf"
nhc_config_template: "../../config/files/nhc.conf"
```

We then run Ansible to push this configuration back out to the cluster:

```
ansible-playbook -b -l slurm-cluster playbooks/slurm-cluster/slurm.yml
ansible-playbook -b -l slurm-cluster playbooks/slurm-cluster/nhc.yml
```

#### Testing the resulting config

Verify that Slurm sees the list of expected GPUs:

```
$ scontrol show node dgx01 | grep Gres
Gres=gpu:3g.40gb:8(S:0-1),gpu:2g.20gb:8(S:0-1),gpu:1g.10gb:16(S:0-1)
```
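
As an additional sanity check, `sinfo` can list the GRES that Slurm associates with each node (`%N` prints the node name and `%G` its GRES):

```
sinfo -N -o "%N %G"
```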

Run a job on a 3g.40gb instance:

```
$ srun -N1 --gres=gpu:3g.40gb:1 nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-8e100991-ce97-6694-b8eb-b8b1ff5053af)
MIG 3g.40gb Device 0: (UUID: MIG-dcf43e8e-5d19-5fa8-8b1d-8aaca1e467f0)
```

Run a job on a 1g.10gb instance:

```
$ srun -N1 --gres=gpu:1g.10gb:1 nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-8e100991-ce97-6694-b8eb-b8b1ff5053af)
MIG 1g.10gb Device 0: (UUID: MIG-3fff3a97-95a9-5046-9053-4c0da3d1add7)
```
2 changes: 2 additions & 0 deletions roles/nvidia-mig-manager/tasks/main.yml
@@ -4,6 +4,8 @@
# Check node state
- name: check for MIG capable devices
shell: set -o pipefail && nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | grep -v 'N/A'
args:
executable: "/bin/bash"
register: has_mig
failed_when: false
changed_when: false
3 changes: 3 additions & 0 deletions roles/slurm/defaults/main.yml
@@ -15,7 +15,9 @@ slurm_config_dir: /etc/slurm
slurmctl_config_dir: /sw/.slurm
slurm_sysconf_dir: /etc/sysconfig
slurm_install_prefix: /usr/local
slurm_cuda_prefix: /usr/local/cuda
slurm_configure: './configure --prefix={{ slurm_install_prefix }} --disable-dependency-tracking --disable-debug --disable-x11 --enable-really-no-cray --enable-salloc-kill-cmd --with-hdf5=no --sysconfdir={{ slurm_config_dir }} --enable-pam --with-pam_dir={{ slurm_pam_lib_dir }} --with-shared-libslurm --without-rpath --with-pmix={{ pmix_install_prefix }} --with-hwloc={{ hwloc_install_prefix }}'
slurm_configure_nvml: './configure --prefix={{ slurm_install_prefix }} --disable-dependency-tracking --disable-debug --disable-x11 --enable-really-no-cray --enable-salloc-kill-cmd --with-hdf5=no --sysconfdir={{ slurm_config_dir }} --enable-pam --with-pam_dir={{ slurm_pam_lib_dir }} --with-shared-libslurm --without-rpath --with-pmix={{ pmix_install_prefix }} --with-hwloc={{ hwloc_install_prefix }} --with-nvml={{ slurm_cuda_prefix }}'
slurm_force_rebuild: no
slurm_contain_ssh: yes

@@ -45,6 +47,7 @@ slurm_resume_timeout: 900
# Sets: GresTypes=gpu
# Sets: Gres=gpu:{{ gpu_topology|count }}
slurm_manage_gpus: true
slurm_autodetect_nvml: true

# Configuration file templates to use for configuring Slurm.
# The default values are relative paths that will point to files within the
7 changes: 7 additions & 0 deletions roles/slurm/molecule/default/molecule.yml
@@ -56,8 +56,15 @@ platforms:
# groups:
# - slurm-master
# - slurm-node

# Note: Molecule tests do not use NVML because the GitHub runner can't handle
# the CUDA install as well as the Slurm build; it times out.
provisioner:
name: ansible
inventory:
group_vars:
all:
slurm_autodetect_nvml: false
ansible_args:
- -vv
verifier:
8 changes: 7 additions & 1 deletion roles/slurm/tasks/build.yml
@@ -170,7 +170,13 @@
command: "{{ slurm_configure }}"
args:
chdir: "{{ slurm_build_dir }}"
when: slurm_build
when: slurm_build and (not slurm_autodetect_nvml)

- name: configure
command: "{{ slurm_configure_nvml }}"
args:
chdir: "{{ slurm_build_dir }}"
when: slurm_build and slurm_autodetect_nvml

- name: build
shell: "make -j$(nproc) > build.log 2>&1"
5 changes: 5 additions & 0 deletions roles/slurm/tasks/main.yml
@@ -1,4 +1,9 @@
---
- name: Install CUDA when we require NVML support for autodetection
include_role:
name: nvidia_cuda
when: slurm_autodetect_nvml|default(false)

- include: setup-role.yml

- include: build.yml
Expand Down
4 changes: 4 additions & 0 deletions roles/slurm/templates/etc/slurm/gres.conf
@@ -1,5 +1,9 @@
{% if slurm_autodetect_nvml -%}
AutoDetect=nvml
{% else -%}
{% set cpu_topology = ansible_local["topology"]["cpu_topology"] -%}
{% set gpu_topology = ansible_local["topology"]["gpu_topology"] -%}
{% for affinity in gpu_topology %}
Name=gpu File=/dev/nvidia{{ loop.index0 }} Cores={{ affinity }}
{% endfor %}
{% endif -%}
