
changes to collect metrics from Prometheus with benchmark run outside of kepler-model-server #191

Merged · 4 commits into sustainable-computing-io:main on Nov 13, 2023

Conversation

@knarayan (Contributor) commented Nov 10, 2023

Users might want to run a custom benchmark outside of kepler-model-server and then use kepler-model-server to collect the metrics and subsequently train a model.

Collect the metrics (this only queries Prometheus to fetch metrics in the specified time window):

NATIVE="true" ./script.sh custom_collect

Validate the collected metrics:

NATIVE="true" ./script.sh validate customBenchmark

Train a model on the custom benchmark metrics:

NATIVE="true" ./script.sh custom_train

Note that Prometheus is queried to collect both the Kepler-provided power consumption metrics and the node-exporter-provided CPU utilization metrics (such as node_cpu_seconds_total) for each cluster node.
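Conceptually, the collect step reduces to Prometheus range queries over the user-specified window. A minimal sketch, assuming the prometheus_api_client package and a local Prometheus endpoint (the URL, step, and the Kepler metric name are illustrative assumptions, not the script's actual internals):

```python
# A minimal sketch of the collect step, assuming the prometheus_api_client
# package and a local Prometheus endpoint; the endpoint URL, step, and the
# Kepler metric name are illustrative assumptions, not the script's internals.
from datetime import datetime, timezone

from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://localhost:9090", disable_ssl=True)

# The user-supplied benchmark window (UTC).
start = datetime(2023, 11, 10, 9, 0, tzinfo=timezone.utc)
end = datetime(2023, 11, 10, 10, 0, tzinfo=timezone.utc)

# node-exporter CPU utilization, as mentioned above, plus an example Kepler
# power metric; each is fetched as a range query over the window.
for query in ("node_cpu_seconds_total", "kepler_node_package_joules_total"):
    samples = prom.custom_query_range(
        query=query, start_time=start, end_time=end, step="30s"
    )
    print(query, len(samples), "series")
```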

…mark run outside kepler-model-server

Signed-off-by: Krishnasuri Narayanam <[email protected]>
@sunya-ch (Contributor) commented

Thank you for the PR. It is promising to extend the custom metrics for validation and training.

I think we don't need to create a new function for them. We can add an option to specify start-time/end-time directly, in addition to interval. Instead of saving only the query result (like idle.json), we would have custom_benchmark.json and custom_query_response.json when start_time/end_time are specified, with startTimeUTC/endTimeUTC recorded in the JSON as you defined (a rough sketch follows below).
For the custom_metrics, is it the same as the PROM_THIRDPARTY_METRICS introduced in #185?
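A rough, hypothetical sketch of the option suggested above, with invented flag names, just to make the intended interface concrete:

```python
# Hypothetical sketch of the suggested interface: one query path that takes
# either a trailing interval or an explicit start/end time. Flag names and
# file names follow the comment above but are illustrative only.
import argparse
from datetime import datetime, timedelta, timezone

parser = argparse.ArgumentParser()
parser.add_argument("--interval", type=int, default=3600,
                    help="query the last N seconds (existing behavior)")
parser.add_argument("--start-time", help="window start, UTC ISO 8601")
parser.add_argument("--end-time", help="window end, UTC ISO 8601")
args = parser.parse_args()

if args.start_time and args.end_time:
    # Explicit window: results would be saved as custom_benchmark.json and
    # custom_query_response.json with startTimeUTC/endTimeUTC recorded.
    start = datetime.fromisoformat(args.start_time).replace(tzinfo=timezone.utc)
    end = datetime.fromisoformat(args.end_time).replace(tzinfo=timezone.utc)
else:
    # Default path: derive the window from the interval, as today.
    end = datetime.now(timezone.utc)
    start = end - timedelta(seconds=args.interval)
```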

Review thread on src/util/prom_types.py (outdated, resolved)
@sunya-ch (Contributor) left a comment


Thank you so much for the PR. I think this PR is very helpful in introducing a new custom benchmark that is not based on the CPE operator. Without a CPE-defined benchmark, the trainer can directly set the start time and end time (or the last interval) of the benchmark.

However, since it conflicts somewhat with the current design and with the recently merged PR (#185), I would like to have some more discussion on the changes.

Once we reach a conclusion, it would be great if the contributor would also update the instructions in https://github.com/sustainable-computing-io/kepler-model-server/blob/main/contributing.md#introduce-new-benchmarks.

Review thread on model_training/README.md (resolved)
Signed-off-by: Krishnasuri Narayanam <[email protected]>
Review thread on src/util/prom_types.py (outdated, resolved)
Signed-off-by: Krishnasuri Narayanam <[email protected]>
…he benchmark suite

Signed-off-by: Krishnasuri Narayanam <[email protected]>
@sunya-ch (Contributor) left a comment


@knarayan Thank you so much for your contribution.

As there are multiple significant changes here, I would like to summarize the contributions of this PR as follows.
This PR includes:

  • introducing customBenchmark with variable-specified startTime/endTime for non-CPE benchmarks (separating sample, stressng, and customBenchmark)
  • supporting validation for all available queries
  • removing the required-benchmark constraint on the export function so it can be reused with customBenchmark (sketched below)

Please feel free to add or correct my summary.
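For the last point, a hypothetical illustration (the function name and signature are invented; only the idea of making the benchmark optional comes from the PR):

```python
# Hypothetical illustration of relaxing the required-benchmark constraint:
# the benchmark argument becomes optional, so the export path can also serve
# a customBenchmark window. Name and signature are invented for illustration.
from typing import Optional

def export_results(data_path: str, benchmark: Optional[str] = None) -> None:
    # Previously a benchmark name was mandatory; treating None as a custom
    # benchmark lets the same code path handle startTimeUTC/endTimeUTC windows.
    label = benchmark if benchmark is not None else "customBenchmark"
    print(f"exporting {label} results from {data_path}")
```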

@sunya-ch merged commit b1ee3f0 into sustainable-computing-io:main on Nov 13, 2023. 3 checks passed.