Contribute to Gardener Node Agent for exposing metrics to gain better visibility of node joining timeouts. #837

Open
ashwani2k opened this issue Jul 31, 2023 · 0 comments
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age) priority/3 Priority (lower number equals higher priority)

Comments

@ashwani2k
Contributor

How to categorize this issue?

/area monitoring
/kind enhancement
/priority 3

What would you like to be added:
With the introduction of gardener-node-agent, a controller-runtime based Go implementation of the cloud-config downloader, it might be possible to get more insight into what happens while a node is processed as it joins the cluster, or fails to join it.
This can help us isolate whether the timeouts happen at the infra layer or whether something goes wrong during node processing within the Kubernetes runtime.

This may require us to expose some metrics from the node-agent, or to enhance its logging so that directed queries against its logs can identify node joining issues.
This will make life easier for MCM operators, who could then identify such issues with more determinism than is possible today.

Why is this needed:
Currently we often struggle to analyze and identify why a node hasn't joined within the default timeout window of 20 minutes.
All we have in the logs is the following:
Machine shoot--<project-name>--<shoot-name>-<worker-pool>-<zone>-865f7-zggql failed to join the cluster in 20m0s minutes.
If the issue persists, the current approach to identifying what has gone wrong requires you to start with the FAQ #my-machine-is-not-joining-the-cluster-why, and might also require you to explore the infra and check the respective instance's status to ascertain whether it was created successfully but failed to join the cluster.

This is currently a time-consuming task that demands a fair knowledge of MCM internals to ascertain the root cause.
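The manual triage described above boils down to a short decision chain, which could be sketched roughly as follows. This is only an illustration of the reasoning, not existing MCM or node-agent code: the `JoinEvidence` fields and the `Stage` values are hypothetical, and the sketch assumes the evidence has already been gathered from the infra provider and from node logs or metrics.

```go
package main

import "fmt"

// Stage names the layer at which a node join appears to have stalled.
// Hypothetical classification, for illustration only.
type Stage string

const (
	StageInfra        Stage = "infra"        // instance not created or not running
	StageBoot         Stage = "boot"         // instance runs, node agent never started
	StageNodeSetup    Stage = "node-setup"   // node agent started, kubelet did not
	StageRegistration Stage = "registration" // kubelet started, node never registered
	StageJoined       Stage = "joined"       // node registered with the cluster
)

// JoinEvidence collects the facts an operator checks by hand today.
// All fields are assumptions about what could be observed.
type JoinEvidence struct {
	InstanceRunning bool // the infra provider reports a running instance
	AgentStarted    bool // the node agent reported that it started
	KubeletStarted  bool // the kubelet was started on the node
	NodeRegistered  bool // a Node object exists for the machine
}

// Classify returns the earliest stage for which evidence is missing.
func Classify(e JoinEvidence) Stage {
	switch {
	case !e.InstanceRunning:
		return StageInfra
	case !e.AgentStarted:
		return StageBoot
	case !e.KubeletStarted:
		return StageNodeSetup
	case !e.NodeRegistered:
		return StageRegistration
	default:
		return StageJoined
	}
}

func main() {
	// Example: the instance exists and the agent ran, but the kubelet never came up.
	e := JoinEvidence{InstanceRunning: true, AgentStarted: true}
	fmt.Println("join stalled at stage:", Classify(e))
}
```

Only the first check needs access to the infra provider; the remaining ones are exactly the signals that metrics or structured logs from the node-agent could supply.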

@ashwani2k ashwani2k added the kind/enhancement Enhancement, improvement, extension label Jul 31, 2023
@gardener-robot gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related priority/3 Priority (lower number equals higher priority) labels Jul 31, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jul 11, 2024
2 participants