A daemon that automatically manages the performance states of NVIDIA GPUs.
```mermaid
flowchart TD
  A(Start) --> B

  subgraph For each GPU
    B("Check temperature[1]") -->|Below threshold| D
    B -->|Above threshold| C

    C("Enter low PState[2]") --> M

    D(Check utilization) -->|Is 0%| E
    D -->|Is not 0%| J

    E(Check current PState) -->|High| F
    E -->|Low| I

    F("Iterations counter exceeded threshold[3]") -->|Yes| G
    F -->|No| H

    G("Enter low PState[2]") --> H

    H(Increment iterations counter) --> M

    I(Do nothing) --> M

    J(Check current PState) -->|High| K
    J -->|Low| L

    K(Reset iterations counter) --> M

    L("Enter high PState[4]") --> M
  end

  M(End) --> N

  N("Sleep[5]") --> A
```
1 - Threshold is controlled by the `--temperature-threshold` option (default: `80` degrees C)

2 - Value is controlled by the `--performance-state-low` option (default: `8`)

3 - Threshold is controlled by the `--iterations-before-switch` option (default: `30` iterations)

4 - Value is controlled by the `--performance-state-high` option (default: `16`)

5 - Value is controlled by the `--sleep-interval` option (default: `100` milliseconds)
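To make the flowchart concrete, here is a minimal shell sketch of one management loop for a single GPU, assuming `nvidia-smi` is available. It is illustrative only: it omits the current-PState checks the daemon uses to avoid redundant switches, and `enter_pstate` is a hypothetical placeholder, since the daemon sets performance states through NVAPI, which plain `nvidia-smi` cannot do.

```sh
#!/bin/sh
# Illustrative sketch of the decision loop above for a single GPU.
# enter_pstate is a hypothetical stand-in: plain nvidia-smi cannot
# set a performance state; the real daemon does this via NVAPI.
enter_pstate() { echo "GPU $GPU -> P$1"; }

GPU=0
ITERATIONS=0

while true; do
  temp=$(nvidia-smi -i "$GPU" --query-gpu=temperature.gpu --format=csv,noheader,nounits)
  util=$(nvidia-smi -i "$GPU" --query-gpu=utilization.gpu --format=csv,noheader,nounits)

  if [ "$temp" -gt 80 ]; then            # --temperature-threshold
    enter_pstate 8                       # --performance-state-low
  elif [ "$util" -eq 0 ]; then
    if [ "$ITERATIONS" -gt 30 ]; then    # --iterations-before-switch
      enter_pstate 8
    fi
    ITERATIONS=$((ITERATIONS + 1))
  else
    ITERATIONS=0                         # busy: reset the idle counter
    enter_pstate 16                      # --performance-state-high
  fi

  sleep 0.1                              # --sleep-interval (100 ms)
done
```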
Make sure the proprietary NVIDIA driver is installed. You will need the following libraries:
- `libnvidia-api.so.1`
- `libnvidia-ml.so.1`

Packages that provide these libraries:
- ArchLinux: `nvidia-utils`
- Debian: `libnvidia-api1` or `libnvidia-tesla-api1` (depending on the GPU and driver installed)

On Debian derivatives, you can use `apt search libnvidia-api.so.1` and `apt search libnvidia-ml.so.1` to find the package you need.
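To check whether both libraries are already known to the dynamic linker, you can grep the `ldconfig` cache:

```sh
ldconfig -p | grep -E 'libnvidia-(api|ml)\.so\.1'
```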
Note that you MUST run this daemon at the host level, i.e. where the CUDA Driver is available. You can NOT run this daemon in a container.
Make sure the NVIDIA driver is installed. Download the latest version of the executable for your OS from the releases page.
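For example (the repository path and asset name below are placeholders; take the real ones from the releases page):

```sh
# Placeholder URL: substitute the actual repository and asset name.
curl -LO https://github.com/<owner>/nvidia-pstated/releases/latest/download/nvidia-pstated
chmod +x nvidia-pstated
```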
To build from source, you will need:
- CMake
- CUDA toolkit
```sh
# Configure
cmake -B build

# Build
cmake --build build
```
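Assuming the binary ends up at `build/nvidia-pstated` (an assumption about the CMake layout; adjust if your build tree differs), you can install it for the systemd setup below:

```sh
sudo install -m 755 build/nvidia-pstated /usr/local/bin/nvidia-pstated
```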
You can use the `-i`/`--ids` option to manage only specific GPUs. Suppose you have 8 GPUs and you want to manage only the first 4 (using the same indices as `nvidia-smi`):

```sh
./nvidia-pstated -i 0,1,2,3
```
Install `nvidia-pstated` in `/usr/local/bin`. Then save the following as `/etc/systemd/system/nvidia-pstated.service`:
```ini
[Unit]
Description=A daemon that automatically manages the performance states of NVIDIA GPUs
StartLimitInterval=0

[Service]
DynamicUser=yes
ExecStart=/usr/local/bin/nvidia-pstated
Restart=on-failure
RestartSec=1s

[Install]
WantedBy=multi-user.target
```
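After saving the unit file, reload systemd and start the service:

```sh
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-pstated.service
```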
If you are using a hypervisor (KVM) with a vGPU manager, you cannot run `nvidia-pstated` in virtual machines. Instead, you can run it at the hypervisor level. To do this, you need to:

- Extract `libnvidia-api.so.1` from your guest driver (in my case `Guest_Drivers/nvidia-linux-grid-535_535.183.06_amd64.deb/data.tar.xz/usr/lib/x86_64-linux-gnu/libnvidia-api.so.1`) to some directory.
- Download `nvidia-pstated` to the same directory.
- Try running `nvidia-pstated`: `LD_LIBRARY_PATH=. ./nvidia-pstated`. You should get the following:

  ```
  $ LD_LIBRARY_PATH=. ./nvidia-pstated
  NvAPI_Initialize(): NVAPI_ERROR
  ```

  Check `dmesg`; you should get the following message:

  ```
  NVRM: API mismatch: the client has the version 535.183.06, but
  NVRM: this kernel module has the version 535.183.04.  Please
  NVRM: make sure that this kernel module and all NVIDIA driver
  NVRM: components have the same version.
  ```

- Use `sed -i 's/535.183.06/535.183.04/g' libnvidia-api.so.1` (replace the versions with what you got in `dmesg`) to replace the client version in `libnvidia-api.so.1`.
- Run `nvidia-pstated` again: `LD_LIBRARY_PATH=. ./nvidia-pstated`. Enjoy.
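For convenience, here is the same procedure condensed into one script sketch. The file names, versions, and the path inside the `.deb` are taken from the example above and are assumptions; substitute your own driver's values and whatever versions `dmesg` reports on your host.

```sh
# Versions, file names, and the in-archive path are from the example above;
# replace them with your driver's values and with what dmesg reports.
mkdir nvidia-pstated-host && cd nvidia-pstated-host
# (download the nvidia-pstated binary into this directory as described above)

# Extract libnvidia-api.so.1 from the guest driver package.
dpkg-deb --fsys-tarfile nvidia-linux-grid-535_535.183.06_amd64.deb \
  | tar -xO ./usr/lib/x86_64-linux-gnu/libnvidia-api.so.1 > libnvidia-api.so.1

# Patch the embedded client version to match the host kernel module.
sed -i 's/535.183.06/535.183.04/g' libnvidia-api.so.1

# Run the daemon against the patched library.
LD_LIBRARY_PATH=. ./nvidia-pstated
```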