[Issue]: rocm-smi --setpoweroverdrive does not allow lowering the power usage anymore #190

Open
acaleechurn opened this issue Aug 14, 2024 · 14 comments

@acaleechurn

Problem Description

rocm-smi --setpoweroverdrive 200 no longer allows lowering the power usage. This worked on 6.1.2, before upgrading. Lowering the power cap reduced temperatures significantly with minimal impact on training times.

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD EPYC 7402P

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.2.0

ROCm Component

amdsmi, rocm_smi_lib

Steps to Reproduce

acaleechurn@svr-ph-ml01:~$ rocm-smi --setpoweroverdrive 200
============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
ERROR: GPU[0] : Unable to set Power OverDrive
ERROR: GPU[0] : Value cannot be less than: 290W
ERROR: GPU[1] : Unable to set Power OverDrive
ERROR: GPU[1] : Value cannot be less than: 290W
ERROR: GPU[2] : Unable to set Power OverDrive
ERROR: GPU[2] : Value cannot be less than: 290W

================================== End of ROCm SMI Log ===================================

Cleaning up the install and running a multi-version install with the kernel-mode-driver from 6.1.2 works as expected. Upgrading to 6.2.0 breaks the functionality.

acaleechurn@svr-ph-ml01:~$ rocm-smi --setpoweroverdrive 200
============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
GPU[0] : Successfully set power to: 200W
GPU[1] : Successfully set power to: 200W
GPU[2] : Successfully set power to: 200W

================================== End of ROCm SMI Log ===================================
OS:
NAME="Ubuntu"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
CPU:
model name : AMD EPYC 7402P 24-Core Processor
GPU:
Name: AMD EPYC 7402P 24-Core Processor
Marketing Name: AMD EPYC 7402P 24-Core Processor
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
acaleechurn@svr-ph-ml01:~$ rocminfo --support
ROCk module version 6.7.0 is loaded
HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.8.5 is loaded
HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

HSA Agents

Agent 1

Name: AMD EPYC 7402P 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402P 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 263781388(0xfb8fc0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 263781388(0xfb8fc0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 263781388(0xfb8fc0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:

Agent 2

Name: gfx908
Uuid: GPU-e336f877361f1399
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 35328
Internal Node ID: 1
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32

Agent 3

Name: gfx908
Uuid: GPU-c13bf411f2279689
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 17920
Internal Node ID: 2
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32

Agent 4

Name: gfx908
Uuid: GPU-34985558949eb94a
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 1280
Internal Node ID: 3
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

ROCm 6.2 with the kernel-mode driver from the same repo does not work, but the same install with the kernel-mode driver from 6.1.2 works as expected.
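
When comparing the two setups, the loaded kernel-mode driver can be identified from the "ROCk module version ... is loaded" line that rocminfo prints first. The following is a minimal sketch, assuming that line format as it appears in the outputs above and allowing for terminal color codes around it; the helper name `rock_module_version` is hypothetical and not part of any ROCm tool.

```python
import re

# Matches ANSI color sequences such as ESC[37m and ESC[0m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Assumed line format, taken from the rocminfo outputs in this thread.
ROCK_LINE = re.compile(r"ROCk module version (\d+(?:\.\d+)*) is loaded")


def rock_module_version(rocminfo_output):
    """Return the ROCk module version as a tuple of ints, or None if absent."""
    clean = ANSI_ESCAPE.sub("", rocminfo_output)
    match = ROCK_LINE.search(clean)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# The two version lines that appear in this issue, one plain and one colorized.
first = rock_module_version("ROCk module version 6.7.0 is loaded")
second = rock_module_version("\x1b[37mROCk module version 6.8.5 is loaded\x1b[0m")
print(first, second)
```

Returning a tuple of integers keeps version comparisons numeric, so `first < second` behaves as expected.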

@OpenMOSE

I'm also having the same problem.

AMD MI100 x 2
AMD Ryzen 5700X3D

Ubuntu 22.04
ROCm 6.2.0

@harkgill-amd
Contributor

Hi @acaleechurn, could you please confirm if you are able to use the --setpoweroverdrive option for values greater than 290W?

@acaleechurn
Author

Hi @harkgill-amd

I have removed the kernel mode driver from 6.1.2 and installed the one from 6.2.0 and I cannot set anything above or under the displayed value.

acaleechurn@svr-ph-ml01:~$ !308
/opt/rocm-6.2.0/bin/rocm-smi --setpoweroverdrive 150

============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
ERROR: GPU[0] : Unable to set Power OverDrive
ERROR: GPU[0] : Value cannot be less than: 290W
ERROR: GPU[1] : Unable to set Power OverDrive
ERROR: GPU[1] : Value cannot be less than: 290W
ERROR: GPU[2] : Unable to set Power OverDrive
ERROR: GPU[2] : Value cannot be less than: 290W

================================== End of ROCm SMI Log ===================================
acaleechurn@svr-ph-ml01:~$ !309
/opt/rocm-6.2.0/bin/rocm-smi --setpoweroverdrive 295

============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
ERROR: GPU[0] : Unable to set Power OverDrive
ERROR: GPU[0] : Value cannot be greater than: 290W
ERROR: GPU[1] : Unable to set Power OverDrive
ERROR: GPU[1] : Value cannot be greater than: 290W
ERROR: GPU[2] : Unable to set Power OverDrive
ERROR: GPU[2] : Value cannot be greater than: 290W

================================== End of ROCm SMI Log ===================================
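
Taken together, the two runs above suggest that the reported minimum and maximum are both 290 W, so no value other than 290 can pass a range check. The following is a minimal sketch of such a check, assuming a simple inclusive range test; it is an illustrative model with a hypothetical function name, not rocm-smi's actual code.

```python
def check_power_cap(requested_w, min_w, max_w):
    """Return None if the request is within [min_w, max_w], else an error string."""
    if requested_w < min_w:
        return f"Value cannot be less than: {min_w}W"
    if requested_w > max_w:
        return f"Value cannot be greater than: {max_w}W"
    return None


# Bounds as implied by the errors above: minimum == maximum == 290 W.
for requested in (150, 295, 290):
    print(requested, check_power_cap(requested, 290, 290))
```

With a collapsed range like this, 150 fails the lower bound and 295 fails the upper bound, matching the two error messages in the logs.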

@harkgill-amd
Contributor

Hi @acaleechurn, quick update: I was able to reproduce this issue internally on an MI100 system. I will continue to investigate and update this thread with relevant details.

@OpenMOSE

Good day, do you have any update?

On ROCm 6.2.2 I still couldn't change the power cap with --setpoweroverdrive.

@harkgill-amd
Contributor

@acaleechurn and @OpenMOSE, this is fixed in ROCm 6.2.4. Could you please give this a try on your end?

@OpenMOSE

OpenMOSE commented Nov 14, 2024

Good day, I still can't change the power limit on ROCm 6.2.4.
Ubuntu 22.04
Kernel 5.15.0-125-generic

If you need any information please let me know.

(base) client@mi100:~$ rocm-smi --setpoweroverdrive 220
[sudo] password for client: 


============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
ERROR: GPU[0]	: Unable to set Power OverDrive
ERROR: GPU[0]		: Value cannot be less than: 290W 
ERROR: GPU[1]	: Unable to set Power OverDrive
ERROR: GPU[1]		: Value cannot be less than: 290W 
==========================================================================================
================================== End of ROCm SMI Log ===================================
(base) client@mi100:~$ rocm-smi


========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)                                                   
====================================================================================================================
0       1     0x738c,   58447  31.0°C  34.0W  N/A, N/A, 0         300Mhz  1200Mhz  0%   auto  290.0W  0%     0%    
1       2     0x738c,   5391   33.0°C  34.0W  N/A, N/A, 0         300Mhz  1200Mhz  0%   auto  290.0W  0%     0%    
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================
(base) client@mi100:~$ rocm-smi -a


============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.8.5
==========================================================================================
=========================================== ID ===========================================
GPU[0]		: Device Name: 		Arcturus GL-XL [Instinct MI100]
GPU[0]		: Device ID: 		0x738c
GPU[0]		: Device Rev: 		0x01
GPU[0]		: Subsystem ID: 	0x0c34
GPU[0]		: GUID: 		58447
GPU[1]		: Device Name: 		Arcturus GL-XL [Instinct MI100]
GPU[1]		: Device ID: 		0x738c
GPU[1]		: Device Rev: 		0x01
GPU[1]		: Subsystem ID: 	0x0c34
GPU[1]		: GUID: 		5391
==========================================================================================
======================================= Unique ID ========================================
GPU[0]		: Unique ID: 0x4eaff7b4ef465d52
GPU[1]		: Unique ID: 0xacfe22f7902a5d23
==========================================================================================
========================================= VBIOS ==========================================
GPU[0]		: VBIOS version: 113-D3431401-100
GPU[1]		: VBIOS version: 113-D3431401-100
==========================================================================================
====================================== Temperature =======================================
GPU[0]		: Temperature (Sensor edge) (C): 31.0
GPU[0]		: Temperature (Sensor junction) (C): 35.0
GPU[0]		: Temperature (Sensor memory) (C): 30.0
GPU[1]		: Temperature (Sensor edge) (C): 32.0
GPU[1]		: Temperature (Sensor junction) (C): 35.0
GPU[1]		: Temperature (Sensor memory) (C): 33.0
==========================================================================================
=============================== Current clock frequencies ================================
GPU[0]		: fclk clock level: 0: (1402Mhz)
GPU[0]		: mclk clock level: 0: (1200Mhz)
GPU[0]		: sclk clock level: 0: (300Mhz)
GPU[0]		: socclk clock level: 0: (1000Mhz)
GPU[0]		: pcie clock level: 0 (16.0GT/s x8)
GPU[1]		: fclk clock level: 0: (1402Mhz)
GPU[1]		: mclk clock level: 0: (1200Mhz)
GPU[1]		: sclk clock level: 0: (300Mhz)
GPU[1]		: socclk clock level: 0: (1000Mhz)
GPU[1]		: pcie clock level: 0 (16.0GT/s x8)
==========================================================================================
=================================== Current Fan Metric ===================================
GPU[0]		: Not supported
GPU[1]		: Not supported
==========================================================================================
================================= Show Performance Level =================================
GPU[0]		: Performance Level: auto
GPU[1]		: Performance Level: auto
==========================================================================================
==================================== OverDrive Level =====================================
GPU[0]		: get_overdrive_level_sclk, Not supported on the given system
GPU[1]		: get_overdrive_level_sclk, Not supported on the given system
==========================================================================================
==================================== OverDrive Level =====================================
GPU[0]		: get_mem_overdrive_level_mclk, Not supported on the given system
GPU[1]		: get_mem_overdrive_level_mclk, Not supported on the given system
==========================================================================================
======================================= Power Cap ========================================
GPU[0]		: Max Graphics Package Power (W): 290.0
GPU[1]		: Max Graphics Package Power (W): 290.0
==========================================================================================
================================== Show Power Profiles ===================================
GPU[0]		: 1. Available power profile (#1 of 7): CUSTOM
GPU[0]		: 2. Available power profile (#2 of 7): VIDEO
GPU[0]		: 3. Available power profile (#3 of 7): POWER SAVING
GPU[0]		: 4. Available power profile (#4 of 7): COMPUTE
GPU[0]		: 5. Available power profile (#7 of 7): BOOTUP DEFAULT*
GPU[1]		: 1. Available power profile (#1 of 7): CUSTOM
GPU[1]		: 2. Available power profile (#2 of 7): VIDEO
GPU[1]		: 3. Available power profile (#3 of 7): POWER SAVING
GPU[1]		: 4. Available power profile (#4 of 7): COMPUTE
GPU[1]		: 5. Available power profile (#7 of 7): BOOTUP DEFAULT*
==========================================================================================
=================================== Power Consumption ====================================
GPU[0]		: Average Graphics Package Power (W): 34.0
GPU[1]		: Average Graphics Package Power (W): 34.0
==========================================================================================
============================== Supported clock frequencies ===============================
GPU[0]		: 
GPU[0]		: Supported fclk frequencies on GPU0
GPU[0]		: 0: 1402Mhz *
GPU[0]		: 
GPU[0]		: Supported mclk frequencies on GPU0
GPU[0]		: 0: 1200Mhz *
GPU[0]		: 
GPU[0]		: Supported sclk frequencies on GPU0
GPU[0]		: 0: 300Mhz *
GPU[0]		: 1: 495Mhz
GPU[0]		: 2: 731Mhz
GPU[0]		: 3: 962Mhz
GPU[0]		: 4: 1029Mhz
GPU[0]		: 5: 1087Mhz
GPU[0]		: 6: 1147Mhz
GPU[0]		: 7: 1189Mhz
GPU[0]		: 8: 1235Mhz
GPU[0]		: 9: 1283Mhz
GPU[0]		: 10: 1319Mhz
GPU[0]		: 11: 1363Mhz
GPU[0]		: 12: 1404Mhz
GPU[0]		: 13: 1430Mhz
GPU[0]		: 14: 1472Mhz
GPU[0]		: 15: 1502Mhz
GPU[0]		: 
GPU[0]		: Supported socclk frequencies on GPU0
GPU[0]		: 0: 1000Mhz *
GPU[0]		: 
GPU[0]		: Supported PCIe frequencies on GPU0
GPU[0]		: 0: 16.0GT/s x8 *
GPU[0]		: 
------------------------------------------------------------------------------------------
GPU[1]		: 
GPU[1]		: Supported fclk frequencies on GPU1
GPU[1]		: 0: 1402Mhz *
GPU[1]		: 
GPU[1]		: Supported mclk frequencies on GPU1
GPU[1]		: 0: 1200Mhz *
GPU[1]		: 
GPU[1]		: Supported sclk frequencies on GPU1
GPU[1]		: 0: 300Mhz *
GPU[1]		: 1: 495Mhz
GPU[1]		: 2: 731Mhz
GPU[1]		: 3: 962Mhz
GPU[1]		: 4: 1029Mhz
GPU[1]		: 5: 1087Mhz
GPU[1]		: 6: 1147Mhz
GPU[1]		: 7: 1189Mhz
GPU[1]		: 8: 1235Mhz
GPU[1]		: 9: 1283Mhz
GPU[1]		: 10: 1319Mhz
GPU[1]		: 11: 1363Mhz
GPU[1]		: 12: 1404Mhz
GPU[1]		: 13: 1430Mhz
GPU[1]		: 14: 1472Mhz
GPU[1]		: 15: 1502Mhz
GPU[1]		: 
GPU[1]		: Supported socclk frequencies on GPU1
GPU[1]		: 0: 1000Mhz *
GPU[1]		: 
GPU[1]		: Supported PCIe frequencies on GPU1
GPU[1]		: 0: 16.0GT/s x8 *
GPU[1]		: 
------------------------------------------------------------------------------------------
==========================================================================================
=================================== % time GPU is busy ===================================
GPU[0]		: GPU use (%): 0
GPU[1]		: GPU use (%): 0
==========================================================================================
=================================== Current Memory Use ===================================
GPU[0]		: GPU Memory Allocated (VRAM%): 0
GPU[0]		: GPU Memory Read/Write Activity (%): 0
GPU[0]		: Memory Activity: N/A
GPU[0]		: Avg. Memory Bandwidth: 0
GPU[1]		: GPU Memory Allocated (VRAM%): 0
GPU[1]		: GPU Memory Read/Write Activity (%): 0
GPU[1]		: Memory Activity: N/A
GPU[1]		: Avg. Memory Bandwidth: 0
==========================================================================================
===================================== Memory Vendor ======================================
GPU[0]		: GPU memory vendor: samsung
GPU[1]		: GPU memory vendor: samsung
==========================================================================================
================================== PCIe Replay Counter ===================================
GPU[0]		: PCIe Replay Count: 0
GPU[1]		: PCIe Replay Count: 0
==========================================================================================
===================================== Serial Number ======================================
GPU[0]		: get_serial_number, Not supported on the given system
GPU[0]		: Serial Number: N/A
GPU[1]		: get_serial_number, Not supported on the given system
GPU[1]		: Serial Number: N/A
==========================================================================================
===================================== KFD Processes ======================================
No KFD PIDs currently running
==========================================================================================
================================== GPUs Indexed by PID ===================================
No KFD PIDs currently running
==========================================================================================
======================= GPU Memory clock frequencies and voltages ========================
GPU[0]		: get_od_volt, Not supported on the given system
GPU[1]		: get_od_volt, Not supported on the given system
==========================================================================================
==================================== Current voltage =====================================
GPU[0]		: Voltage (mV): 662
GPU[1]		: Voltage (mV): 656
==========================================================================================
======================================= PCI Bus ID =======================================
GPU[0]		: PCI Bus: 0000:0C:00.0
GPU[1]		: PCI Bus: 0000:0F:00.0
==========================================================================================
================================== Firmware Information ==================================
GPU[0]		: ASD firmware version: 	0x21000059
GPU[0]		: get_firmware_version_CE, Not supported on the given system
GPU[0]		: get_firmware_version_DMCU, Not supported on the given system
GPU[0]		: get_firmware_version_MC, Not supported on the given system
GPU[0]		: get_firmware_version_ME, Not supported on the given system
GPU[0]		: MEC firmware version: 	67
GPU[0]		: MEC2 firmware version: 	67
GPU[0]		: get_firmware_version_MES, Not supported on the given system
GPU[0]		: get_firmware_version_MES KIQ, Not supported on the given system
GPU[0]		: get_firmware_version_PFP, Not supported on the given system
GPU[0]		: RLC firmware version: 	24
GPU[0]		: get_firmware_version_RLC SRLC, Not supported on the given system
GPU[0]		: get_firmware_version_RLC SRLG, Not supported on the given system
GPU[0]		: get_firmware_version_RLC SRLS, Not supported on the given system
GPU[0]		: SDMA firmware version: 	18
GPU[0]		: SDMA2 firmware version: 	18
GPU[0]		: SMC firmware version: 	00.54.29.00
GPU[0]		: SOS firmware version: 	0x0017004f
GPU[0]		: TA RAS firmware version: 	27.00.01.62
GPU[0]		: TA XGMI firmware version: 	32.00.00.17
GPU[0]		: get_firmware_version_UVD, Not supported on the given system
GPU[0]		: get_firmware_version_VCE, Not supported on the given system
GPU[0]		: VCN firmware version: 	0x01101015
GPU[1]		: ASD firmware version: 	0x21000059
GPU[1]		: get_firmware_version_CE, Not supported on the given system
GPU[1]		: get_firmware_version_DMCU, Not supported on the given system
GPU[1]		: get_firmware_version_MC, Not supported on the given system
GPU[1]		: get_firmware_version_ME, Not supported on the given system
GPU[1]		: MEC firmware version: 	67
GPU[1]		: MEC2 firmware version: 	67
GPU[1]		: get_firmware_version_MES, Not supported on the given system
GPU[1]		: get_firmware_version_MES KIQ, Not supported on the given system
GPU[1]		: get_firmware_version_PFP, Not supported on the given system
GPU[1]		: RLC firmware version: 	24
GPU[1]		: get_firmware_version_RLC SRLC, Not supported on the given system
GPU[1]		: get_firmware_version_RLC SRLG, Not supported on the given system
GPU[1]		: get_firmware_version_RLC SRLS, Not supported on the given system
GPU[1]		: SDMA firmware version: 	18
GPU[1]		: SDMA2 firmware version: 	18
GPU[1]		: SMC firmware version: 	00.54.29.00
GPU[1]		: SOS firmware version: 	0x0017004f
GPU[1]		: TA RAS firmware version: 	27.00.01.62
GPU[1]		: TA XGMI firmware version: 	32.00.00.17
GPU[1]		: get_firmware_version_UVD, Not supported on the given system
GPU[1]		: get_firmware_version_VCE, Not supported on the given system
GPU[1]		: VCN firmware version: 	0x01101015
==========================================================================================
====================================== Product Info ======================================
GPU[0]		: Card Series: 		Arcturus GL-XL [Instinct MI100]
GPU[0]		: Card Model: 		0x738c
GPU[0]		: Card Vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]		: Card SKU: 		D3431401
GPU[0]		: Subsystem ID: 	0x0c34
GPU[0]		: Device Rev: 		0x01
GPU[0]		: Node ID: 		1
GPU[0]		: GUID: 		58447
GPU[0]		: GFX Version: 		gfx9008
GPU[1]		: Card Series: 		Arcturus GL-XL [Instinct MI100]
GPU[1]		: Card Model: 		0x738c
GPU[1]		: Card Vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1]		: Card SKU: 		D3431401
GPU[1]		: Subsystem ID: 	0x0c34
GPU[1]		: Device Rev: 		0x01
GPU[1]		: Node ID: 		2
GPU[1]		: GUID: 		5391
GPU[1]		: GFX Version: 		gfx9008
==========================================================================================
======================================= Pages Info =======================================
==========================================================================================
================================= Show Valid sclk Range ==================================
GPU[0]		: get_od_volt, Not supported on the given system
GPU[1]		: get_od_volt, Not supported on the given system
==========================================================================================
================================= Show Valid mclk Range ==================================
GPU[0]		: get_od_volt, Not supported on the given system
GPU[1]		: get_od_volt, Not supported on the given system
==========================================================================================
================================ Show Valid voltage Range ================================
GPU[0]		: get_od_volt, Not supported on the given system
GPU[1]		: get_od_volt, Not supported on the given system
==========================================================================================
================================== Voltage Curve Points ==================================
GPU[0]		: get_od_volt_info, Not supported on the given system
ERROR: GPU[0]	: Voltage curve Points unsupported.
GPU[1]		: get_od_volt_info, Not supported on the given system
ERROR: GPU[1]	: Voltage curve Points unsupported.
==========================================================================================
==================================== Consumed Energy =====================================
GPU[0]		: Energy counter: 10544
GPU[0]		: Accumulated Energy (uJ): 161323.2
GPU[1]		: Energy counter: 10604
GPU[1]		: Accumulated Energy (uJ): 162241.2
==========================================================================================
=============================== Current Compute Partition ================================
GPU[0]		: Not supported on the given system
GPU[1]		: Not supported on the given system
==========================================================================================
================================ Current Memory Partition ================================
GPU[0]		: Not supported on the given system
GPU[1]		: Not supported on the given system
==========================================================================================
================================== End of ROCm SMI Log ===================================
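
The Power Cap section of the rocm-smi -a output above can be checked from a script, for example to see whether a later release changes the cap. The following is a minimal sketch, assuming the "Max Graphics Package Power (W)" line format shown in that output; the function name `power_caps` is hypothetical, and the sample text is copied from the log rather than produced by running the tool.

```python
import re

# Assumed line format, e.g. "GPU[0]\t\t: Max Graphics Package Power (W): 290.0".
CAP_LINE = re.compile(
    r"GPU\[(\d+)\]\s*:\s*Max Graphics Package Power \(W\):\s*([\d.]+)"
)


def power_caps(smi_output):
    """Map each GPU index to its reported power cap in watts."""
    return {int(gpu): float(watts) for gpu, watts in CAP_LINE.findall(smi_output)}


# Sample lines copied from the Power Cap section of the log above.
sample = (
    "GPU[0]\t\t: Max Graphics Package Power (W): 290.0\n"
    "GPU[1]\t\t: Max Graphics Package Power (W): 290.0\n"
)
caps = power_caps(sample)
print(caps)
```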

@OpenMOSE

I checked the ROCm 6.2.4 release notes.

AMD SMI 24.6.3 ⇒ 24.6.3

24.6.3 to 24.6.3? Does that mean there is no change?

Thank you.

@harkgill-amd
Contributor

Correction, a fix has been submitted that will address this issue in an upcoming ROCm release. The issue was with how the minimum and maximum power limits were derived within the amdgpu driver, not amd-smi. I'll provide more updates on the exact release of the fix as soon as I receive them. Thanks!
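
Because the limits come from the amdgpu driver, they can also be inspected without rocm-smi. The amdgpu driver exposes power cap attributes through hwmon sysfs (power1_cap, power1_cap_min, power1_cap_max, in microwatts). The following is a minimal sketch, assuming those files exist for the device; the helper names are hypothetical, and the demo uses a temporary directory with simulated values that mirror the 290 W bound from the errors in this thread rather than readings from real hardware.

```python
import tempfile
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_power_cap_range(hwmon_dir):
    """Read the current, minimum and maximum power cap (in watts) from a hwmon directory."""
    hwmon = Path(hwmon_dir)
    values = {}
    for key, filename in (
        ("current", "power1_cap"),
        ("min", "power1_cap_min"),
        ("max", "power1_cap_max"),
    ):
        raw = (hwmon / filename).read_text().strip()
        values[key] = int(raw) / MICROWATTS_PER_WATT
    return values


def is_adjustable(cap_range):
    """A cap can only be changed if the driver reports min < max."""
    return cap_range["min"] < cap_range["max"]


# Simulated hwmon directory (assumption): all three values set to 290 W.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("power1_cap", "power1_cap_min", "power1_cap_max"):
        (Path(tmp) / name).write_text("290000000\n")
    demo = read_power_cap_range(tmp)

print(demo, is_adjustable(demo))
```

On a real system the directory would be something like /sys/class/drm/card<N>/device/hwmon/hwmon<M>, where both indices vary per machine.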

@OpenMOSE

Thank you for the information. Once it is updated, I will check it :)

@OpenMOSE

Good day! Is this issue solved in ROCm 6.3.1?

Thank you!

@OpenMOSE

OpenMOSE commented Jan 6, 2025

In ROCm 6.3.1, I still cannot change the power cap with --setpoweroverdrive.

Thank you.

@harkgill-amd
Contributor

@OpenMOSE, the fix has not been released yet. Will update this thread once it's part of an official release.

@OpenMOSE

> @OpenMOSE, the fix has not been released yet. Will update this thread once it's part of an official release.

Got it, thank you so much.
