Contribute to `Gardener Node Agent` for exposing metrics to gain better visibility of node joining timeouts. #837

ashwani2k · 2023-07-31T04:49:25Z

How to categorize this issue?

/area monitoring
/kind enhancement
/priority 3

What would you like to be added:
With the introduction of the gardener-node-agent, a controller-runtime based Go implementation of the cloud-config downloader, it might be possible to get more insight into what happens while a node is processed as it joins, or fails to join, the cluster.
This can help us isolate whether the timeouts happen at the infra layer or whether something goes wrong during node processing within the Kubernetes runtime.

This may require us to expose some metrics from the node-agent, or to enhance its logging so that directed queries against its logs can identify node joining issues.
This will make life easier for MCM operators by letting them identify such issues with more determinism than is possible today.
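
To make the metrics idea more concrete, here is a minimal sketch of recording how long each node bootstrap phase took and rendering it in the Prometheus text exposition format. Everything named here is an assumption for illustration only: the metric name `node_agent_bootstrap_phase_duration_seconds`, the phase names, and the `PhaseRecorder` type do not exist in gardener-node-agent. A real implementation would more likely register collectors with the Prometheus Go client via controller-runtime's metrics registry; this sketch uses only the standard library so it stays self-contained.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
	"time"
)

// metricName is a hypothetical metric name, not one exposed by gardener-node-agent.
const metricName = "node_agent_bootstrap_phase_duration_seconds"

// PhaseRecorder keeps the last observed duration of each bootstrap phase.
type PhaseRecorder struct {
	mu        sync.Mutex
	durations map[string]time.Duration
}

// NewPhaseRecorder returns an empty recorder.
func NewPhaseRecorder() *PhaseRecorder {
	return &PhaseRecorder{durations: make(map[string]time.Duration)}
}

// Observe stores how long the given phase took, overwriting any earlier value.
func (r *PhaseRecorder) Observe(phase string, d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.durations[phase] = d
}

// Render returns all recorded phases as gauge samples in Prometheus text
// exposition format, sorted by phase name so the output is stable.
func (r *PhaseRecorder) Render() string {
	r.mu.Lock()
	defer r.mu.Unlock()

	phases := make([]string, 0, len(r.durations))
	for p := range r.durations {
		phases = append(phases, p)
	}
	sort.Strings(phases)

	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s gauge\n", metricName)
	for _, p := range phases {
		fmt.Fprintf(&b, "%s{phase=%q} %g\n", metricName, p, r.durations[p].Seconds())
	}
	return b.String()
}

func main() {
	rec := NewPhaseRecorder()
	// Hypothetical phase names; the real steps depend on the node-agent's controllers.
	rec.Observe("download_config", 3*time.Second)
	rec.Observe("start_kubelet", 1500*time.Millisecond)
	fmt.Print(rec.Render())
}
```

With per-phase samples like these, an operator could see whether a node that timed out ever started its bootstrap phases at all (pointing to the infra layer) or got stuck in a specific phase (pointing to node processing).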

Why is this needed:
Currently we often struggle to analyze and identify why a node hasn't joined within the default timeout window of 20 minutes.
All we have in the logs is the following:
Machine shoot--<project-name>--<shoot-name>-<worker-pool>-<zone>-865f7-zggql failed to join the cluster in 20m0s minutes.
If the issue persists, the current approach to identifying what has gone wrong is to start with the FAQ entry #my-machine-is-not-joining-the-cluster-why, and it might also require exploring the infra to check the respective instance status and ascertain whether the instance was created successfully but failed to join the cluster.

This is currently a time-consuming task that requires fair knowledge of MCM internals to ascertain the root cause.

ashwani2k added the kind/enhancement Enhancement, improvement, extension label Jul 31, 2023

gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related priority/3 Priority (lower number equals higher priority) labels Jul 31, 2023

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jul 11, 2024
