chore: Add missing caps for CEEMS exporter service
* Fix GPU rules to use the correct GPU memory usage metrics

Signed-off-by: Mahendra Paipuri <[email protected]>
mahendrapaipuri committed Dec 27, 2024
1 parent 3afad20 commit 1ddc319
Showing 3 changed files with 18 additions and 4 deletions.
4 changes: 2 additions & 2 deletions build/package/ceems_exporter/ceems_exporter.service
@@ -18,8 +18,8 @@ StartLimitInterval=0

ProtectHome=read-only

- AmbientCapabilities=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH CAP_SETUID CAP_SETGID CAP_BPF CAP_PERFMON CAP_SYS_RESOURCE
- CapabilityBoundingSet=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH CAP_SETUID CAP_SETGID CAP_BPF CAP_PERFMON CAP_SYS_RESOURCE
+ AmbientCapabilities=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH CAP_SETUID CAP_SETGID CAP_DAC_OVERRIDE CAP_BPF CAP_PERFMON CAP_SYS_RESOURCE
+ CapabilityBoundingSet=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH CAP_SETUID CAP_SETGID CAP_DAC_OVERRIDE CAP_BPF CAP_PERFMON CAP_SYS_RESOURCE

ProtectSystem=strict
ProtectControlGroups=true
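
This hunk adds CAP_DAC_OVERRIDE to both AmbientCapabilities and CapabilityBoundingSet, so the two directives carry the same list. The Python sketch below shows one way to check that they stay in sync. It is not part of the commit. The helper name `parse_caps` and the inline `UNIT` text are assumptions made for illustration, and the parser deliberately ignores systemd's `~` inversion and empty-assignment reset syntax.

```python
def parse_caps(unit_text: str, key: str) -> set[str]:
    """Collect capability names from every `key=` line in a unit file.

    Simplified on purpose: does not handle the `~` prefix or the
    empty-assignment reset that systemd supports for these directives.
    """
    caps: set[str] = set()
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith(key + "="):
            caps.update(line.split("=", 1)[1].split())
    return caps


# The two directives as they read after this commit.
UNIT = """\
AmbientCapabilities=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH CAP_SETUID CAP_SETGID CAP_DAC_OVERRIDE CAP_BPF CAP_PERFMON CAP_SYS_RESOURCE
CapabilityBoundingSet=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH CAP_SETUID CAP_SETGID CAP_DAC_OVERRIDE CAP_BPF CAP_PERFMON CAP_SYS_RESOURCE
"""

ambient = parse_caps(UNIT, "AmbientCapabilities")
bounding = parse_caps(UNIT, "CapabilityBoundingSet")

# Capabilities requested as ambient but absent from the bounding set.
missing = ambient - bounding
print("ambient-only capabilities:", sorted(missing))
```

Run against the real unit file, a non-empty `missing` set would flag a capability that was added to one directive but not the other.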
1 change: 1 addition & 0 deletions etc/nvidia-dcgm-exporter/counters.csv
@@ -38,6 +38,7 @@ DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encounte
# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Frame buffer memory free (in MB).
DCGM_FI_DEV_FB_USED, gauge, Frame buffer memory used (in MB).
+ DCGM_FI_DEV_FB_RESERVED, gauge, Frame buffer memory reserved (in MB).

# ECC
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
17 changes: 15 additions & 2 deletions etc/prometheus/rules/gpu.rules
@@ -38,8 +38,21 @@ groups:
expr: avg by (job)(avg_over_time(DCGM_FI_DEV_GPU_UTIL{job="sample-dcgm"}[1d:15m]))

# Average GPU memory usage during last 1h
- - record: instance:DCGM_FI_DEV_MEM_COPY_UTIL:avg1h
-   expr: avg by (job)(avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{job="sample-dcgm"}[1d:15m]))
+ - record: instance:DCGM_FI_DEV_MEM_UTIL:avg1h
+   expr: |2
+       avg by (job) (
+         avg_over_time(
+           (
+             (sum by (Hostname) (DCGM_FI_DEV_FB_USED{job="sample-dcgm"}))
+             /
+             (
+               sum by (Hostname) (DCGM_FI_DEV_FB_USED{job="sample-dcgm"})
+               +
+               sum by (Hostname) (DCGM_FI_DEV_FB_FREE{job="sample-dcgm"})
+             )
+           )[1h:15m]
+         )
+       )

# Total energy usage during last 1h in kWh
# PUE of 1 is used by default. Use appropriate PUE by replacing 1 by PUE ratio in expr
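
The new memory rule divides used frame buffer by used plus free per Hostname, averages that ratio over 15-minute steps of the last hour, then averages across hosts. The result is a ratio between 0 and 1, and DCGM_FI_DEV_FB_RESERVED does not appear in the expression. The Python sketch below mirrors that arithmetic on made-up numbers. The function name `host_ratio`, the hostnames, and the sample values are hypothetical, and the sketch only approximates PromQL label matching and subquery semantics.

```python
from collections import defaultdict
from statistics import mean


def host_ratio(used_mb, free_mb):
    """Return used / (used + free) per host for one instant.

    Inputs map (hostname, gpu_id) -> MB, standing in for the
    DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE series.
    """
    used, free = defaultdict(float), defaultdict(float)
    for (host, _gpu), mb in used_mb.items():
        used[host] += mb
    for (host, _gpu), mb in free_mb.items():
        free[host] += mb
    # Keep only hosts present on both sides; skip zero totals
    # (PromQL would produce NaN there instead of dropping the host).
    return {
        h: used[h] / (used[h] + free[h])
        for h in used
        if h in free and used[h] + free[h] > 0
    }


# Two hypothetical evaluation steps within the window.
snapshots = [
    (
        {("node1", "0"): 10000, ("node1", "1"): 20000, ("node2", "0"): 8000},
        {("node1", "0"): 30000, ("node1", "1"): 20000, ("node2", "0"): 8000},
    ),
    (
        {("node1", "0"): 20000, ("node1", "1"): 20000, ("node2", "0"): 4000},
        {("node1", "0"): 20000, ("node1", "1"): 20000, ("node2", "0"): 12000},
    ),
]

ratios = [host_ratio(used, free) for used, free in snapshots]
hosts = set.intersection(*(set(r) for r in ratios))

# Time average per host, then average across hosts.
per_host = {h: mean(r[h] for r in ratios) for h in hosts}
overall = mean(per_host.values())
print(per_host, overall)
```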