Improve Resource Status Messages Upon Failure #3322
Comments
This is a good idea. I suspect it might not be quite as simple as just grouping together messages based on time, given that outside factors might be at play. We'll talk this through at our next regular sync.
We should look into making sure that we publish events that contain details about these errors in the Kubernetes event log for this resource. @mehighlow have you seen this problem, where the "real" error is obscured by a subsequent retry, for other resources? Which resource types? It might be worth us talking to the CosmosDB team and asking them if they could include the "cause of failure" in this error message as well, as it would lift all boats (ASO, Terraform, etc.) which may re-apply a resource to retry it.
I've looked into this some. As far as I know, Kubernetes groups events only if the text is exactly the same. In cases like this, where we've gotten errors that contain activityIDs or other dynamically generated fields (such as timestamps), they won't be grouped in the event viewer. We could redact those fields to make the events uniform and groupable, but if we do that then the events are less useful, as they no longer contain IDs that you could take to Azure support. I think the right fix here is to just mark the initial error:
as fatal. That way we don't retry and get the less-helpful error. This results in:
which I think is more helpful than what we were doing before. That change is in PR #3906.
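To make the redaction trade-off mentioned in the comment above concrete, here is a minimal sketch in Go. It is not ASO's code and not how Kubernetes aggregates events; the function name `normalizeForGrouping`, the regular expressions, and the sample messages are all assumptions made up for illustration. It only shows that replacing GUID-shaped IDs and timestamps makes two otherwise different messages textually identical, at the cost of dropping the IDs you would hand to Azure support.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// GUID-shaped tokens, such as activity IDs (assumed format).
	guidPattern = regexp.MustCompile(`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)
	// RFC 3339-style timestamps, e.g. 2024-03-01T12:34:56Z (assumed format).
	timestampPattern = regexp.MustCompile(`\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})`)
)

// normalizeForGrouping is a hypothetical helper that replaces dynamic
// fields with fixed placeholders so repeated errors share the same text.
func normalizeForGrouping(msg string) string {
	msg = guidPattern.ReplaceAllString(msg, "<id>")
	msg = timestampPattern.ReplaceAllString(msg, "<time>")
	return msg
}

func main() {
	// Made-up messages; real Azure error text will differ.
	a := "Provisioning failed. ActivityId: 11111111-2222-3333-4444-555555555555, at 2024-03-01T12:34:56Z"
	b := "Provisioning failed. ActivityId: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee, at 2024-03-01T12:39:10Z"

	// After normalization the two messages compare equal, so an
	// exact-text grouping rule would treat them as the same event.
	fmt.Println(normalizeForGrouping(a) == normalizeForGrouping(b))
	fmt.Println(normalizeForGrouping(a))
}
```

The normalized text no longer carries the activity ID, which is exactly why the comment above prefers marking the initial error as fatal over redacting.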
@mehighlow if you know of other resources you've seen similar behavior on, please let us know; we may be able to improve their experiences as well.
@matthchr, you bet! Thanks for the improvement!
Describe the current behavior
When a resource creation fails, ASO exposes the last status Azure provides, which might not be informative.
E.g.
CosmosDB provisioning failed due to capacity constraints.
This message provides clear instructions for re-creating the DatabaseAccount, while the earlier message in close proximity exposes another reason for the failure:
Describe the improvement
Group events that occur within roughly 5 minutes of each other, and extend status messages to provide more detail, for example by adding an event category.
Additional context
I've encountered these issues not only with CosmosDB but also with different Azure objects.