feat: Add recording rules for prometheus
* Rename configs to etc in repo

Signed-off-by: Mahendra Paipuri <[email protected]>
1 parent b53e768 · commit 38fe313
Showing 11 changed files with 571 additions and 4 deletions.
@@ -0,0 +1,265 @@
---
# Recording rules for nodes with CPU and GPU
#
# This example is applicable to NVIDIA GPUs and the metrics reported by the NVIDIA DCGM
# exporter. The core idea should apply to AMD GPUs as well, using the power usage
# reported by the AMD SMI exporter.
#
# We can leverage recording rules to estimate the power consumption of each compute
# unit. We get node-level power consumption from IPMI DCMI readings and we need to split
# this global power consumption across the individual compute units. The idea is
# to use RAPL readings, which report the power consumption of CPU and DRAM, to split
# the IPMI DCMI power reading into CPU and DRAM power consumption. There is no easy
# way to split the power of the remaining components like network, storage and fans (if they exist).
#
# We can also introduce the Power Usage Effectiveness (PUE) ratio into the power usage
# estimation of compute units in these recording rules.
#
# Energy consumption per compute unit is estimated with the following assumptions:
#
# Firstly, we assume that 90% of the power is consumed by CPU and DRAM and 10% by the network.
# We ignore storage, as data typically comes from external storage clusters and local
# disks are seldom used.
#
# We leverage the RAPL package and DRAM readings to split that 90% of the power between the
# CPU and DRAM components.
#
# At the node level, the power consumed by CPU and DRAM can be estimated as
#
# node_cpu_power = 0.9 * ipmi_power * (rapl_package / (rapl_package + rapl_dram))
# node_dram_power = 0.9 * ipmi_power * (rapl_dram / (rapl_package + rapl_dram))
#
# Now we have node-level power usage for CPU and DRAM. We split it further at the
# compute unit level using the CPU time and DRAM usage of each compute unit. For the network,
# we split the total network power usage equally among all compute units running on the node
# at a given time.
#
# compute_unit_cpu_power = node_cpu_power * (total_compute_unit_cpu_sec / total_node_cpu_sec)
# compute_unit_dram_power = node_dram_power * (total_compute_unit_mem_usage / total_node_mem_usage)
# compute_unit_net_power = 0.1 * ipmi_power / num_compute_units
#
# Total power usage of a compute unit = compute_unit_cpu_power + compute_unit_dram_power + compute_unit_net_power
#
# Finally, we can introduce PUE into the energy estimation by multiplying the
# total compute unit power usage by it:
#
# Total power usage of a compute unit = PUE * (compute_unit_cpu_power + compute_unit_dram_power + compute_unit_net_power)
#
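# As a quick sanity check with made-up numbers (purely illustrative, not taken from any
# real node): if IPMI DCMI reports 600 W for the node, RAPL package reports 300 W and
# RAPL DRAM reports 60 W, then
#
# node_cpu_power  = 0.9 * 600 * (300 / 360) = 450 W
# node_dram_power = 0.9 * 600 * (60 / 360)  =  90 W
#
# and the remaining 0.1 * 600 = 60 W is attributed to the network and shared equally
# among the compute units running on the node.
#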
# IMPORTANT: A rate interval of 2m is used in the rate estimations below. This is based on a
# scrape interval of 30s. It is always desirable to have the rate interval at least 4 times the
# scrape interval. So, if the scrape interval changes, the rate interval in the rules needs
# to change as well.
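# For instance, if the scrape interval were raised to 1m (an assumption, not the value used
# here), the [2m] windows below would need to become at least [4m].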
#
groups:
  - name: sample-gpu
    rules:
      # In some cases, the power reported by IPMI DCMI includes both CPU and GPU power.
      # We need to subtract the power reported by DCGM from IPMI DCMI to get the CPU "only"
      # power consumption.
      #
      # We take the average power over 5m reported by DCGM to smooth out any "unnatural" peaks.
      #
      # IMPORTANT: DCGM reports power usage based on modelling. So there can be times when
      # the power reported by the ensemble of GPUs via DCGM exceeds the power reported by IPMI.
      # In that case the query tends to give negative values, which is not physically possible.
      # To avoid such situations, we add a >= 0 condition to the query so that negative
      # values are filtered from the query result. They will appear as gaps in the TSDB.
      # We can fix that on the Grafana side by choosing to connect the time series gaps in the
      # panel options.
      # Not ideal, but we don't have many options here!
      #
      # A PUE of 1 is used by default. Use an appropriate PUE by replacing 1 with the PUE ratio in expr.
      - record: instance:ceems_ipmi_dcmi_avg_watts:pue_avg
        expr: 1 * (ceems_ipmi_dcmi_avg_watts{job="sample-gpu"} - on(instance) group_left() sum by (instance) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{job="sample-dcgm"}[5m:])) >= 0)
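
      # For example, with a hypothetical PUE of 1.4 (an illustrative value, not one taken from
      # this repository), the same rule would read:
      #
      #   - record: instance:ceems_ipmi_dcmi_avg_watts:pue_avg
      #     expr: 1.4 * (ceems_ipmi_dcmi_avg_watts{job="sample-gpu"} - on(instance) group_left() sum by (instance) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{job="sample-dcgm"}[5m:])) >= 0)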

      # Unit's total CPU energy usage estimated from IPMI DCMI.
      # The A100 partition does not report RAPL CPU and DRAM power usage. So for this
      # partition we make an assumption of 70% CPU, 20% DRAM and 10% network.
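      #
      # Note on the network term below: memory_used / (memory_used * ceems_compute_units)
      # simplifies to 1 / ceems_compute_units; the memory metric is only there so that the
      # per-instance unit count is broadcast onto each compute unit's label set.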
      - record: unit:ceems_compute_unit_cpu_energy_usage:sum
        expr: |2
          0.7 * instance:ceems_ipmi_dcmi_avg_watts:pue_avg{job="sample-gpu"} # Assumption: 70% of power used by CPU, 20% by DRAM and 10% by network.
          * on (instance) group_right () # CPU power usage * (Job CPU time / Total CPU time) -> CPU power usage by "Job"
          (
            (
              rate(ceems_compute_unit_cpu_user_seconds_total{job="sample-gpu"}[2m])
              +
              rate(ceems_compute_unit_cpu_system_seconds_total{job="sample-gpu"}[2m])
            )
            / on (instance) group_left ()
            sum by (instance) (rate(ceems_cpu_seconds_total{job="sample-gpu",mode!~"idle|iowait|steal"}[2m]))
          )
          +
          0.2 * instance:ceems_ipmi_dcmi_avg_watts:pue_avg{job="sample-gpu"}
          * on (instance) group_right () # DRAM power usage * (Job memory used / Total memory used) -> DRAM power usage by "Job"
          (
            ceems_compute_unit_memory_used_bytes{job="sample-gpu"}
            / on (instance) group_left ()
            (
              ceems_meminfo_MemTotal_bytes{job="sample-gpu"}
              - on (instance)
              ceems_meminfo_MemAvailable_bytes{job="sample-gpu"}
            )
          )
          +
          0.1 * instance:ceems_ipmi_dcmi_avg_watts:pue_avg{job="sample-gpu"}
          * on (instance) group_right () # Network power usage / Number of jobs -> Network power usage by "Job"
          (
            ceems_compute_unit_memory_used_bytes{job="sample-gpu"}
            /
            (
              ceems_compute_unit_memory_used_bytes{job="sample-gpu"}
              * on (instance) group_left ()
              ceems_compute_units{job="sample-gpu"}
            )
          )

  # Rule group for aggregate metrics. Here we estimate the CPU usage, CPU memory usage, CPU energy
  # and CPU emissions over the last 24h at 15m intervals.
  # This gives us a nice rolling aggregate of the usage and consumption time series.
  #
  # NOTE that there are two different intervals in these queries:
  # - Evaluation interval: how regularly we evaluate these rules. We don't need to
  #   evaluate these series every, let's say, 30s. We want a global view
  #   of these aggregate metrics, so evaluating every 15m or 30m should be fine.
  #
  # - Query interval: how many data points we use to evaluate the rules. By
  #   default we are using a scrape interval of 30s and hence we have a data point every 30s.
  #   However, we do not need such a fine interval to estimate daily metrics. Using
  #   such an interval would also increase the query time, as the TSDB needs to load a lot of data
  #   into memory to evaluate the query. So we use 15m as the query interval, which is reasonable
  #   for estimating aggregate metrics over 24h.
  - name: sample-gpu-agg
    interval: 15m
    rules:
      # Average CPU usage during the last 24h
      - record: instance:ceems_cpu_usage:avg24h
        expr: |2
          avg by (job)(
            avg_over_time(
              (
                sum by (instance) (rate(ceems_cpu_seconds_total{job="sample-gpu",mode!~"idle|iowait|steal"}[2m]))
                *
                100
                / on (instance) group_right()
                ceems_cpu_count{job="sample-gpu"}
              )[1d:15m]
            )
          )

      # Average CPU memory usage during the last 24h
      - record: instance:ceems_cpu_mem_usage:avg24h
        expr: |2
          avg by (job)(
            avg_over_time(
              (
                (
                  1
                  -
                  (
                    ceems_meminfo_MemAvailable_bytes{job="sample-gpu"} / ceems_meminfo_MemTotal_bytes{job="sample-gpu"}
                  )
                )
              )[1d:15m]
            )
          )
          *
          100

      # Total energy usage during the last 24h in kWh
      # A PUE of 1 is used by default. Use an appropriate PUE by replacing 1 with the PUE ratio in expr.
      # We are summing all the IPMI power readings during the last 24h with an interval of 15m.
      # So, to estimate the total energy, we need to compute sum(P * deltat) / 3.6e9, where P is the
      # power usage in watts and deltat is the interval in milliseconds (1 kWh = 3.6e9 W·ms).
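      #
      # As a rough sanity check with illustrative numbers: a node drawing a constant 500 W yields
      # 96 samples of 500 over 24h at 15m intervals, so
      # sum(P * deltat) / 3.6e9 = 96 * 500 * 900000 / 3.6e9 = 12 kWh,
      # which matches 0.5 kW * 24 h.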
      - record: instance:ceems_cpu_energy_usage:pue_sum24h
        expr: |2
          sum by (job)(
            sum_over_time(
              (
                1
                *
                (
                  ceems_ipmi_dcmi_avg_watts{job="sample-gpu"}
                  - on (instance) group_left ()
                  sum by (instance) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{job="sample-dcgm"}[5m:]))
                ) >= 0
              )[1d:15m]
            )
          )
          *
          15*60000
          /
          3.6e+09

      # Total emissions from CPU during the last 24h in grams, using the emission factor from RTE
      # A PUE of 1 is used by default. Use an appropriate PUE by replacing 1 with the PUE ratio in expr.
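      #
      # The ceems_emissions_gCo2_kWh series does not share an instance label with the power
      # series, so label_replace() is used below to stamp a common throwaway label
      # ("common_label"="mock") onto both sides and join them on that label.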
      - record: instance:ceems_cpu_emissions_rte:pue_sum24h
        expr: |2
          sum by (job) (
            sum_over_time(
              (
                label_replace(
                  1
                  *
                  (
                    (
                      ceems_ipmi_dcmi_avg_watts{job="sample-gpu"}
                      - on (instance) group_left ()
                      sum by (instance) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{job="sample-dcgm"}[5m:]))
                    )
                    >=
                    0
                  )
                  *
                  15*60000
                  /
                  3.6e+09,
                  "common_label",
                  "mock",
                  "hostname",
                  "(.*)"
                )
                * on (common_label) group_left ()
                label_replace(ceems_emissions_gCo2_kWh{provider="rte"}, "common_label", "mock", "hostname", "(.*)")
              )[1d:15m]
            )
          )

      # Total emissions from CPU during the last 24h in grams, using the emission factor from Electricity Maps
      # A PUE of 1 is used by default. Use an appropriate PUE by replacing 1 with the PUE ratio in expr.
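      # This rule uses the same label_replace() join on a common mock label as the RTE rule above,
      # only with the provider="emaps" emission factor.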
      - record: instance:ceems_cpu_emissions_emaps:pue_sum24h
        expr: |2
          sum by (job) (
            sum_over_time(
              (
                label_replace(
                  1
                  *
                  (
                    (
                      ceems_ipmi_dcmi_avg_watts{job="sample-gpu"}
                      - on (instance) group_left ()
                      sum by (instance) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{job="sample-dcgm"}[5m:]))
                    )
                    >=
                    0
                  )
                  *
                  15*60000
                  /
                  3.6e+09,
                  "common_label",
                  "mock",
                  "hostname",
                  "(.*)"
                )
                * on (common_label) group_left ()
                label_replace(
                  ceems_emissions_gCo2_kWh{provider="emaps"},
                  "common_label",
                  "mock",
                  "hostname",
                  "(.*)"
                )
              )[1d:15m]
            )
          )