Enhance MCM metrics #872

Open
1 of 5 tasks
unmarshall opened this issue Nov 17, 2023 · 0 comments
Labels
area/control-plane (Control plane related), area/monitoring (Monitoring, including availability monitoring and alerting, related), kind/enhancement (Enhancement, improvement, extension), priority/3 (Priority; lower number equals higher priority)

unmarshall commented Nov 17, 2023

How to categorize this issue?

/area control-plane
/area monitoring
/kind enhancement
/priority 3

What would you like to be added:

Today MCM exposes metrics that have a few shortcomings:

  • Metrics do not follow the best practices/recommendations from Prometheus (refer to this and this). We need to revisit the metrics and the labels that are used on them.
  • Contextual information is missing from the metrics, which prevents correlating metrics captured across different mcm and mcm-provider functions/Provider-API calls.

While we recommend revisiting all the metrics, we also have some concrete improvements for 2 metrics that were recently introduced:

Provider API metrics:

APIRequestDuration:
For this metric we propose to add additional labels which capture the following:

  • Provider API Operation that is invoked. Today we use service to capture that, but we should consider renaming it.
  • Driver Operation under which the provider API is invoked.
  • Machine name for which this API request is made.
  • MCM reconciliation ID or run ID of the machine reconciler. The idea is to introduce a unique identifier for every reconcile run and pass it around to correlate logs and metrics.
  • MCM reconciliation flow name. We could also merge this with the run ID by choosing a naming convention that includes both.

DriverAPIRequestDuration:
For this metric we propose to add additional labels which capture the following:

  • Driver Operation under which the provider API is invoked.
  • Machine name for which this API request is made.
  • MCM reconciliation ID or run ID of the machine reconciler. The idea is to introduce a unique identifier for every reconcile run and pass it around to correlate logs and metrics.
  • MCM reconciliation flow name. We could also merge this with the run ID by choosing a naming convention that includes both.

Provider Implementations:-

Why is this needed:

This allows us to observe metrics at different levels:

  • Driver API methods level
  • Machine level
  • Provider API level
  • Reconcile Flow level
@unmarshall unmarshall added the kind/enhancement Enhancement, improvement, extension label Nov 17, 2023
@gardener-robot gardener-robot added area/control-plane Control plane related area/monitoring Monitoring (including availability monitoring and alerting) related priority/3 Priority (lower number equals higher priority) labels Nov 17, 2023