Commit 97db18d

Load nvidia-uvm at boot time and install nvidia-persistenced as system service on GPU instances
When running on a GPU instance, this commit loads the Nvidia unified virtual memory (UVM) kernel module by default and installs the Nvidia persistence daemon as a system service. Unified virtual memory makes it easier to use memory on both the CPU and the GPU. The persistence daemon keeps the GPU initialized, which shortens application startup latency.

References:
Nvidia-uvm: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
Nvidia persistence daemon: https://docs.nvidia.com/deploy/driver-persistence/index.html

Signed-off-by: Hanwen <[email protected]>
1 parent 637319d commit 97db18d

5 files changed, +67 -0 lines changed

CHANGELOG.md

+2
@@ -24,6 +24,8 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 - Add log rotation support for ParallelCluster managed logs.
 - Track head node memory and root volume disk utilization using the `mem_used_percent` and `disk_used_percent` metrics collected through the CloudWatch Agent.
 - Enforce the DCV Authenticator Server to use at least `TLS-1.2` protocol when creating the SSL Socket.
+- Load kernel module [nvidia-uvm](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) by default.
+- Install [Nvidia persistence daemon](https://docs.nvidia.com/deploy/driver-persistence/index.html) as a system service.
 
 **CHANGES**
 - Upgrade Slurm to version 23.02.1.
@@ -0,0 +1 @@
+nvidia-uvm
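
The one-line file added above is later installed by the recipe as `/etc/modules-load.d/nvidia.conf`. As a rough illustration of how such a file is read, the sketch below assumes the usual `modules-load.d` conventions: one module name per line, with blank lines and lines starting with `#` or `;` ignored. The helper name `modules_to_load` and the sample comment line are hypothetical and not part of the commit.

```ruby
# Hypothetical helper (not part of the commit): list the module names that a
# modules-load.d style file asks the system to load at boot.
# Assumes one module name per line; blank lines and lines starting with
# '#' or ';' are treated as comments and skipped.
def modules_to_load(conf_text)
  conf_text.each_line
           .map(&:strip)
           .reject { |line| line.empty? || line.start_with?('#', ';') }
end

# Sample content: the comment line is illustrative, the module line matches
# the file added by this commit.
sample = <<~CONF
  # Loaded at boot on GPU instances
  nvidia-uvm
CONF

puts modules_to_load(sample).inspect
```

On a real instance the same helper could be fed `File.read('/etc/modules-load.d/nvidia.conf')` instead of the sample string.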

cookbooks/aws-parallelcluster-config/recipes/base.rb

+3
@@ -39,6 +39,9 @@
   action :configure
 end
 
+# Configure Nvidia driver
+include_recipe "aws-parallelcluster-config::nvidia"
+
 # EFA runtime configuration
 efa 'Configure system for EFA' do
   action :configure
@@ -0,0 +1,41 @@
+# frozen_string_literal: true
+
+#
+# Cookbook:: aws-parallelcluster
+# Recipe:: nvidia
+#
+# Copyright:: 2013-2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the
+# License. A copy of the License is located at
+#
+# http://aws.amazon.com/apache2.0/
+#
+# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
+# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
+# limitations under the License.
+
+if graphic_instance? && nvidia_installed?
+  # Load kernel module Nvidia-uvm
+  kernel_module 'nvidia-uvm' do
+    action :load
+  end
+  # Make sure kernel module Nvidia-uvm is loaded at instance boot time
+  cookbook_file 'nvidia.conf' do
+    source 'nvidia/nvidia.conf'
+    path '/etc/modules-load.d/nvidia.conf'
+    owner 'root'
+    group 'root'
+    mode '0644'
+  end
+  # Make sure nvidia_persistenced is installed as a system service
+  bash 'nvidia.run advanced' do
+    cwd '/usr/share/doc/NVIDIA_GLX-1.0/samples'
+    user 'root'
+    group 'root'
+    code <<-NVIDIA
+      tar -xf nvidia-persistenced-init.tar.bz2
+      ./nvidia-persistenced-init/install.sh
+    NVIDIA
+  end
+end

test/recipes/controls/aws_parallelcluster_config/nvidia_spec.rb

+20
@@ -39,6 +39,26 @@
   end
 end
 
+control 'tag:config_nvidia_uvm_and_persistenced_on_graphic_instances' do
+  only_if do
+    !(os_properties.centos7? && os_properties.arm?) &&
+      !instance.custom_ami? && instance.graphic?
+  end
+
+  describe kernel_module('nvidia_uvm') do
+    it { should be_loaded }
+  end
+
+  describe file('/etc/modules-load.d/nvidia.conf') do
+    its('content') { should include("uvm") }
+  end
+
+  describe service('nvidia-persistenced') do
+    it { should be_enabled }
+    it { should be_running }
+  end
+end
+
 control 'tag:config_gdrcopy_disabled_on_non_graphic_instances' do
   only_if do
     !(os_properties.centos7? && os_properties.arm?) &&
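
The recipe names the module `nvidia-uvm` while the test above checks `nvidia_uvm`. Both spellings refer to the same module: module tooling treats `-` and `_` interchangeably, and loaded modules are listed in `/proc/modules` with underscores. The plain-Ruby sketch below illustrates that equivalence outside of InSpec. The helper names and the sample `/proc/modules` lines (sizes, addresses, flags) are assumptions made for illustration, not output captured from a real instance.

```ruby
# Hypothetical helpers (not part of the commit or of InSpec).

# Loaded modules are listed with underscores, so map '-' to '_' before
# comparing names.
def normalize_module_name(name)
  name.tr('-', '_')
end

# Returns true when the first field of any /proc/modules line equals the
# normalized module name.
def module_loaded?(name, proc_modules_text)
  wanted = normalize_module_name(name)
  proc_modules_text.each_line.any? { |line| line.split.first == wanted }
end

# Illustrative sample only: the numbers and flags below are made up.
sample = <<~MODULES
  nvidia_uvm 1200128 0 - Live 0x0000000000000000 (OE)
  nvidia 56000512 1 nvidia_uvm, Live 0x0000000000000000 (POE)
MODULES

puts module_loaded?('nvidia-uvm', sample)
puts module_loaded?('nvidia_drm', sample)
```

On a GPU instance the second argument could be `File.read('/proc/modules')` to run the same check against the live kernel.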
