
changes to collect metrics from Prometheus with benchmark run outside of kepler-model-server #191

Merged · 4 commits into sustainable-computing-io:main on Nov 13, 2023

Conversation

@knarayan (Contributor) commented Nov 10, 2023

Users might want to run a custom benchmark outside of kepler-model-server and then use kepler-model-server to collect the metrics and subsequently train a model.

Collect the metrics (this only queries Prometheus to fetch metrics in the specified time window):

NATIVE="true" ./script.sh custom_collect

Validate the collected metrics:

NATIVE="true" ./script.sh validate customBenchmark

Train a model on the custom benchmark metrics:

NATIVE="true" ./script.sh custom_train

Note that Prometheus is queried to collect both the Kepler-provided power consumption metrics and the node-exporter-provided CPU utilization metrics (such as node_cpu_seconds_total) for each cluster node.
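Conceptually, the collect step reduces to Prometheus range queries over the user-specified window. A minimal sketch, assuming the prometheus_api_client package and a local Prometheus endpoint (the URL, step, and the Kepler metric name are illustrative assumptions, not the script's actual internals):

```python
# A minimal sketch of the collect step, assuming the prometheus_api_client
# package and a local Prometheus endpoint; the endpoint URL, step, and the
# Kepler metric name are illustrative assumptions, not the script's internals.
from datetime import datetime, timezone

from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://localhost:9090", disable_ssl=True)

# The user-supplied benchmark window (UTC).
start = datetime(2023, 11, 10, 9, 0, tzinfo=timezone.utc)
end = datetime(2023, 11, 10, 10, 0, tzinfo=timezone.utc)

# node-exporter CPU utilization, as mentioned above, plus an example Kepler
# power metric; each is fetched as a range query over the window.
for query in ("node_cpu_seconds_total", "kepler_node_package_joules_total"):
    samples = prom.custom_query_range(
        query=query, start_time=start, end_time=end, step="30s"
    )
    print(query, len(samples), "series")
```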

…mark run outside kepler-model-server

Signed-off-by: Krishnasuri Narayanam <[email protected]>
@sunya-ch (Contributor) commented

Thank you for the PR. It is promising to extend the custom metrics for validation and training.

I think we don't need to create a new function for them. We can add an option to specify start-time/end-time directly, in addition to interval. Instead of saving only the query result (like idle.json), we would have custom_benchmark.json and custom_query_response.json when start_time/end_time are specified, with startTimeUTC/endTimeUTC recorded in the JSON as you defined (a rough sketch follows below).
For the custom_metrics, is it the same as the PROM_THIRDPARTY_METRICS introduced in #185?
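A rough, hypothetical sketch of the option suggested above, with invented flag names, just to make the intended interface concrete:

```python
# Hypothetical sketch of the suggested interface: one query path that takes
# either a trailing interval or an explicit start/end time. Flag names and
# file names follow the comment above but are illustrative only.
import argparse
from datetime import datetime, timedelta, timezone

parser = argparse.ArgumentParser()
parser.add_argument("--interval", type=int, default=3600,
                    help="query the last N seconds (existing behavior)")
parser.add_argument("--start-time", help="window start, UTC ISO 8601")
parser.add_argument("--end-time", help="window end, UTC ISO 8601")
args = parser.parse_args()

if args.start_time and args.end_time:
    # Explicit window: results would be saved as custom_benchmark.json and
    # custom_query_response.json with startTimeUTC/endTimeUTC recorded.
    start = datetime.fromisoformat(args.start_time).replace(tzinfo=timezone.utc)
    end = datetime.fromisoformat(args.end_time).replace(tzinfo=timezone.utc)
else:
    # Default path: derive the window from the interval, as today.
    end = datetime.now(timezone.utc)
    start = end - timedelta(seconds=args.interval)
```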

Review thread on src/util/prom_types.py (outdated, resolved)
@sunya-ch (Contributor) left a comment


Thank you so much for the PR. I think this PR is very helpful in introducing a new custom benchmark that is not based on the CPE operator. Without a CPE-defined benchmark, the trainer can directly set the start time and end time (or the last interval) of the benchmark.

However, since it conflicts somewhat with the current design and with the recently merged PR (#185), I would like to have some more discussion on the changes.

Once we reach a conclusion, it would be great if the contributor would also update the instructions in https://github.com/sustainable-computing-io/kepler-model-server/blob/main/contributing.md#introduce-new-benchmarks.

Review thread on model_training/README.md (resolved)
Signed-off-by: Krishnasuri Narayanam <[email protected]>
Review thread on src/util/prom_types.py (outdated, resolved)
Signed-off-by: Krishnasuri Narayanam <[email protected]>
…he benchmark suite

Signed-off-by: Krishnasuri Narayanam <[email protected]>
@sunya-ch (Contributor) left a comment


@knarayan Thank you so much for your contribution.

As there are multiple significant changes here, I would like to summarize the contributions of this PR as follows.
This PR includes:

  • introducing customBenchmark with variable-specified startTime/endTime for non-CPE benchmarks (separating sample, stressng, and customBenchmark)
  • supporting validation for all available queries
  • removing the required-benchmark constraint on the export function so it can be reused with customBenchmark (sketched below)

Please feel free to add or correct my summary.
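For the last point, a hypothetical illustration (the function name and signature are invented; only the idea of making the benchmark optional comes from the PR):

```python
# Hypothetical illustration of relaxing the required-benchmark constraint:
# the benchmark argument becomes optional, so the export path can also serve
# a customBenchmark window. Name and signature are invented for illustration.
from typing import Optional

def export_results(data_path: str, benchmark: Optional[str] = None) -> None:
    # Previously a benchmark name was mandatory; treating None as a custom
    # benchmark lets the same code path handle startTimeUTC/endTimeUTC windows.
    label = benchmark if benchmark is not None else "customBenchmark"
    print(f"exporting {label} results from {data_path}")
```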

@sunya-ch merged commit b1ee3f0 into sustainable-computing-io:main on Nov 13, 2023. 3 checks passed.