Improve Resource Status Messages Upon Failure #3322

Closed
mehighlow opened this issue Sep 19, 2023 · 5 comments · Fixed by #3906

@mehighlow
Contributor

Describe the current behavior
When resource creation fails, ASO exposes only the last status that Azure provides, which might not be informative.

For example, CosmosDB provisioning failed due to capacity constraints:

Status:
  Conditions:
    Last Transition Time:  2023-09-19T00:21:23Z
    Message:               DatabaseAccount XXXXXXX is in a failed provisioning state because the previous attempt to create it was not successful. Please delete the previous instance before attempting to recreate this account.
ActivityId: 00000000-0000-0000-0000-000000000000, Microsoft.Azure.Documents.Common/2.14.0: PUT https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/XXXXXXX/providers/Microsoft.DocumentDB/databaseAccounts/XXXXXXX
--------------------------------------------------------------------------------
RESPONSE 400: 400 BadRequest
ERROR CODE: BadRequest
--------------------------------------------------------------------------------
{
  "code": "BadRequest",
  "message": "DatabaseAccount XXXXXXX is in a failed provisioning state because the previous attempt to create it was not successful. Please delete the previous instance before attempting to recreate this account.\r\nActivityId: 00000000-0000-0000-0000-000000000000, Microsoft.Azure.Documents.Common/2.14.0"
}
--------------------------------------------------------------------------------
[Image: Screenshot 2023-09-19 at 11 09 35 AM]

This message gives clear instructions for re-creating the DatabaseAccount, but an earlier message, emitted shortly before it, shows a different reason for the failure:

[Image: Screenshot 2023-09-19 at 11 09 24 AM]

Describe the improvement
Add an event category so that events occurring within roughly 5 minutes of each other can be grouped and status messages can be extended with more detail.
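
One possible reading of this proposal, as a minimal Python sketch. It is illustrative only and not ASO code; the `group_by_proximity` helper, the `(timestamp, message)` event shape, and the sample timestamps are assumptions made for this example. A new group starts whenever the gap to the previous event exceeds the window.

```python
from datetime import datetime, timedelta

# The "~5-minute proximity" from the proposal; the exact value is an assumption.
WINDOW = timedelta(minutes=5)


def group_by_proximity(events, window=WINDOW):
    """Group (timestamp, message) pairs by time proximity.

    Events are sorted by timestamp; an event joins the current group if it
    occurred within `window` of the previous event, otherwise it starts a
    new group.
    """
    groups = []
    for ts, msg in sorted(events):
        if groups and ts - groups[-1][-1][0] <= window:
            groups[-1].append((ts, msg))
        else:
            groups.append([(ts, msg)])
    return groups


# Hypothetical events: two close together, one much later.
events = [
    (datetime(2023, 9, 19, 0, 16), "capacity error"),
    (datetime(2023, 9, 19, 0, 19), "failed provisioning state"),
    (datetime(2023, 9, 19, 0, 40), "failed provisioning state"),
]
groups = group_by_proximity(events)
```

With a grouping like this, the status message could in principle include every message from the most recent group rather than only the last one.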

Additional context
I've encountered this not only with CosmosDB but also with other Azure resources.

@theunrepentantgeek
Member

This is a good idea.

I suspect it might not be quite as simple as just grouping together messages based on time, given that outside factors might be at play. We'll talk this through at our next regular sync.

@matthchr
Member

We should look into making sure that we publish events that contain details about these errors in the Kubernetes event log for this resource.

@mehighlow have you seen this problem, where the "real" error is obscured by a subsequent retry, with other resources? If so, which resource types? It might be worth us talking to the CosmosDB team and asking whether they could include the "cause of failure" in this error message as well, as it would lift all boats (ASO, Terraform, etc.) that may re-apply a resource to retry it.

@matthchr self-assigned this Sep 25, 2023
@matthchr added this to the v2.4.0 milestone Sep 25, 2023
@matthchr modified the milestones: v2.4.0, v2.5.0 Oct 23, 2023
@theunrepentantgeek modified the milestones: v2.6.0, v2.7.0 Dec 11, 2023
@matthchr modified the milestone: v2.7.0 Feb 22, 2024
@matthchr
Member

matthchr commented Apr 4, 2024

I've looked into this some. As far as I know, Kubernetes groups events only if the text is exactly the same. In cases like this, where the errors contain activity IDs or other dynamically generated fields (such as timestamps), they won't be grouped in the event viewer. We could redact those fields to make the events uniform and groupable, but then the events would be less useful, as they would no longer contain the IDs that you could take to Azure support.
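
To make this trade-off concrete, here is a minimal Python sketch. It is illustrative only and not ASO code (ASO itself is written in Go); the `redact` helper and both regular expressions are assumptions made for this example. Two messages that differ only in their ActivityId and timestamp become identical after redaction, so they could be grouped, but the IDs are gone.

```python
import re

# Assumed patterns for the dynamic fields seen in the errors in this thread.
GUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
HTTP_DATE = re.compile(
    r"[A-Z][a-z]{2}, \d{2} [A-Z][a-z]{2} \d{4} \d{2}:\d{2}:\d{2} GMT"
)


def redact(message: str) -> str:
    """Replace GUIDs and HTTP-style dates with fixed placeholders."""
    message = GUID.sub("<id>", message)
    return HTTP_DATE.sub("<time>", message)


# Shortened versions of the two messages quoted in this thread.
first = (
    "cannot fulfill your request at this time Thu, 04 Apr 2024 17:35:21 GMT. "
    "ActivityId: 8c3213a5-c570-4a8a-b337-e05de991eb9e"
)
second = (
    "cannot fulfill your request at this time Thu, 04 Apr 2024 18:20:43 GMT. "
    "ActivityId: 901bb9c4-21e5-43ad-b4ab-17a24cdcee47"
)

# The raw messages differ, but their redacted forms are equal.
same_after_redaction = redact(first) == redact(second)
```

The redacted text is uniform, but a user can no longer copy the ActivityId from the event, which is the loss of usefulness described above.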

I think the right fix here is to just mark the initial error:

  Warning  CreateOrUpdateActionError  29m   DatabaseAccountController  Reason: ServiceUnavailable, Severity: Warning, RetryClassification: RetrySlow, Cause: Database account creation failed. Operation Id: 1403e78b-ab8a-4bb3-85b4-285aab8a96f8, Error : Message: {"code":"ServiceUnavailable","message":"Sorry, we are currently experiencing high demand in West US 3 region, and cannot fulfill your request at this time Thu, 04 Apr 2024 17:35:21 GMT. To request region access for your subscription, please follow this link https://aka.ms/cosmosdbquota for more details on how to create a region access request.\r\nActivityId: 8c3213a5-c570-4a8a-b337-e05de991eb9e, Microsoft.Azure.Documents.Common/2.14.0"}

as fatal.

That way we don't retry and get the less-helpful error.

This results in:

NAME               READY   SEVERITY   REASON               MESSAGE
matthchr-db-acct   False   Error      ServiceUnavailable   Database account creation failed. Operation Id: 768740a2-a70f-4dd0-8c8e-e4fe75ed0bf1, Error : Message: {"code":"ServiceUnavailable","message":"Sorry, we are currently experiencing high demand in West US 3 region, and cannot fulfill your request at this time Thu, 04 Apr 2024 18:20:43 GMT. To request region access for your subscription, please follow this link https://aka.ms/cosmosdbquota for more details on how to create a region access request.\r\nActivityId: 901bb9c4-21e5-43ad-b4ab-17a24cdcee47, Microsoft.Azure.Documents.Common/2.14.0"}, Request URI: /serviceReservation, RequestStats: , SDK: Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0: GET https://management.azure.com/subscriptions/4ef44fef-c51d-4d7c-a6ff-8635c02848b1/providers/Microsoft.DocumentDB/locations/westus3/operationsStatus/07c29785-548d-40d3-bb8d-a925ed207f11...

which I think is more helpful than what we were doing before. That change is in PR #3906.
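
The effect of marking the initial error as fatal can be sketched as follows. This is a hypothetical Python model, not the change made in #3906: `Classification`, `CloudError`, `classify`, `surfaced_error`, and the choice of which codes count as fatal are all names and assumptions invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    RETRY = "retry"
    FATAL = "fatal"


@dataclass(frozen=True)
class CloudError:
    code: str
    message: str


# Assumption: for this resource type, ServiceUnavailable is treated as terminal.
FATAL_CODES = frozenset({"ServiceUnavailable"})


def classify(err: CloudError, fatal_codes=FATAL_CODES) -> Classification:
    """Classify an error as fatal if its code is in `fatal_codes`, else retryable."""
    return Classification.FATAL if err.code in fatal_codes else Classification.RETRY


def surfaced_error(attempts, fatal_codes=FATAL_CODES):
    """Return the error that ends up on the resource status.

    `attempts` is the sequence of errors successive retries would produce.
    Retrying stops at the first fatal error, so later (less helpful) errors
    never overwrite it.
    """
    last = None
    for err in attempts:
        last = err
        if classify(err, fatal_codes) is Classification.FATAL:
            break
    return last


attempts = [
    CloudError("ServiceUnavailable", "high demand in West US 3 region"),
    CloudError("BadRequest", "DatabaseAccount is in a failed provisioning state"),
]
with_fix = surfaced_error(attempts)                               # ServiceUnavailable
without_fix = surfaced_error(attempts, fatal_codes=frozenset())   # BadRequest
```

In this model, treating the first error as fatal keeps the capacity message on the status, whereas retrying replaces it with the generic "failed provisioning state" error.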

@matthchr
Member

matthchr commented Apr 4, 2024

@mehighlow if you know of other resources where you've seen similar behavior, please let us know; we may be able to improve the experience for those as well.

@github-project-automation bot moved this from Backlog to Recently Completed in Azure Service Operator Roadmap Apr 8, 2024
@mehighlow
Contributor Author

@matthchr, you bet! Thanks for the improvement!

@matthchr moved this from Recently Completed to Ready for Release in Azure Service Operator Roadmap Apr 8, 2024