Improve Resource Status Messages Upon Failure #3322

mehighlow · 2023-09-19T18:39:02Z

Describe the current behavior
When a resource creation fails, ASO exposes the last status Azure provides, which might not be informative.

E.g.
CosmosDB provisioning failed due to capacity constraints.

Status:
  Conditions:
    Last Transition Time:  2023-09-19T00:21:23Z
    Message:               DatabaseAccount XXXXXXX is in a failed provisioning state because the previous attempt to create it was not successful. Please delete the previous instance before attempting to recreate this account.
ActivityId: 00000000-0000-0000-0000-000000000000, Microsoft.Azure.Documents.Common/2.14.0: PUT https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/XXXXXXX/providers/Microsoft.DocumentDB/databaseAccounts/XXXXXXX
--------------------------------------------------------------------------------
RESPONSE 400: 400 BadRequest
ERROR CODE: BadRequest
--------------------------------------------------------------------------------
{
  "code": "BadRequest",
  "message": "DatabaseAccount XXXXXXX is in a failed provisioning state because the previous attempt to create it was not successful. Please delete the previous instance before attempting to recreate this account.\r\nActivityId: 00000000-0000-0000-0000-000000000000, Microsoft.Azure.Documents.Common/2.14.0"
}
--------------------------------------------------------------------------------

This message gives clear instructions for re-creating the DatabaseAccount, but an earlier message, logged shortly before it, reveals a different reason for the failure:

Describe the improvement
Add an event category so that events occurring within roughly 5 minutes of each other can be grouped, and extend status messages to provide more details.
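As a rough illustration of the time-based grouping proposed here, the sketch below collects events into a group while each one falls within a 5-minute window of the previous one. It is not ASO code: the function name `group_by_proximity`, the `(timestamp, message)` tuple shape, and the fixed window are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def group_by_proximity(events, window=timedelta(minutes=5)):
    """Group (timestamp, message) pairs by time proximity.

    Hypothetical helper: an event joins the current group when it occurs
    within `window` of the previous event in that group; otherwise it
    starts a new group.
    """
    groups = []
    for ts, msg in sorted(events, key=lambda e: e[0]):
        if groups and ts - groups[-1][-1][0] <= window:
            groups[-1].append((ts, msg))
        else:
            groups.append([(ts, msg)])
    return groups
```

With this approach, a capacity error followed a few minutes later by the "failed provisioning state" error would land in the same group, so a status message built from the group could surface both.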

Additional context
I've encountered these issues not only with CosmosDB but also with other Azure resource types.

theunrepentantgeek · 2023-09-19T20:41:13Z

This is a good idea.

I suspect it might not be quite as simple as just grouping together messages based on time, given that outside factors might be at play. We'll talk this through at our next regular sync.

matthchr · 2023-09-25T22:56:49Z

We should look into making sure that we publish events that contain details about these errors in the Kubernetes event log for this resource.

@mehighlow have you seen this problem, where the "real" error is obscured by a subsequent retry, with other resources? If so, which resource types? It might be worth us talking to the CosmosDB team and asking them if they could include the "cause of failure" in this error message as well, as it would lift all boats (ASO, Terraform, etc.) which may re-apply a resource to retry it.

matthchr · 2024-04-04T18:35:29Z

I've looked into this some. As far as I know, Kubernetes groups events only if the text is exactly the same. In cases like this where we've gotten errors that contain activityIDs or other dynamically generated fields (timestamps), they'll not be grouped in event viewer. We could redact those fields to make the events uniform and groupable, but if we do that then the events are less useful as they don't actually contain IDs that you could take to Azure support.
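To make the redaction trade-off concrete, here is a minimal sketch that replaces GUIDs and HTTP-style timestamps with placeholders so that two otherwise identical error texts become equal. The function name `normalize`, the placeholder strings, and the two regular expressions are assumptions for illustration; this is not how ASO or Kubernetes handles events.

```python
import re

# Matches GUIDs such as operation IDs and ActivityIds.
GUID_RE = re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}")

# Matches dates shaped like "Thu, 04 Apr 2024 17:35:21 GMT".
HTTP_DATE_RE = re.compile(
    r"[A-Z][a-z]{2}, \d{2} [A-Z][a-z]{2} \d{4} \d{2}:\d{2}:\d{2} GMT"
)


def normalize(message):
    """Replace dynamically generated fields with fixed placeholders."""
    message = GUID_RE.sub("<id>", message)
    return HTTP_DATE_RE.sub("<time>", message)
```

As the comment above points out, the normalized text is uniform but no longer contains the IDs you would take to Azure support.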

I think the right fix here is to just mark the initial error:

  Warning  CreateOrUpdateActionError  29m   DatabaseAccountController  Reason: ServiceUnavailable, Severity: Warning, RetryClassification: RetrySlow, Cause: Database account creation failed. Operation Id: 1403e78b-ab8a-4bb3-85b4-285aab8a96f8, Error : Message: {"code":"ServiceUnavailable","message":"Sorry, we are currently experiencing high demand in West US 3 region, and cannot fulfill your request at this time Thu, 04 Apr 2024 17:35:21 GMT. To request region access for your subscription, please follow this link https://aka.ms/cosmosdbquota for more details on how to create a region access request.\r\nActivityId: 8c3213a5-c570-4a8a-b337-e05de991eb9e, Microsoft.Azure.Documents.Common/2.14.0"}

as fatal.

That way we don't retry and get the less-helpful error.

This results in:

NAME               READY   SEVERITY   REASON               MESSAGE
matthchr-db-acct   False   Error      ServiceUnavailable   Database account creation failed. Operation Id: 768740a2-a70f-4dd0-8c8e-e4fe75ed0bf1, Error : Message: {"code":"ServiceUnavailable","message":"Sorry, we are currently experiencing high demand in West US 3 region, and cannot fulfill your request at this time Thu, 04 Apr 2024 18:20:43 GMT. To request region access for your subscription, please follow this link https://aka.ms/cosmosdbquota for more details on how to create a region access request.\r\nActivityId: 901bb9c4-21e5-43ad-b4ab-17a24cdcee47, Microsoft.Azure.Documents.Common/2.14.0"}, Request URI: /serviceReservation, RequestStats: , SDK: Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0: GET https://management.azure.com/subscriptions/4ef44fef-c51d-4d7c-a6ff-8635c02848b1/providers/Microsoft.DocumentDB/locations/westus3/operationsStatus/07c29785-548d-40d3-bb8d-a925ed207f11...

which I think is more helpful than what we were doing before. That change is in PR #3906.
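The idea of treating the initial capacity error as fatal can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: the function name `classify`, the classification strings, and the rule of matching on the `ServiceUnavailable` code plus the "high demand" text are assumptions based on the error shown above.

```python
def classify(code, message, default="RetrySlow"):
    """Return a retry classification for an Azure error.

    Hypothetical rule: a ServiceUnavailable error whose message mentions
    high demand is a capacity problem that retrying will not fix, so it is
    reported as fatal. Every other error keeps the default classification.
    """
    if code == "ServiceUnavailable" and "high demand" in message:
        return "Fatal"
    return default
```

Stopping at the first error keeps the informative capacity message on the resource instead of letting a retry replace it with the generic "failed provisioning state" response.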

matthchr · 2024-04-04T18:36:05Z

@mehighlow if you know of other resources where you've seen similar behavior, please let us know. We may be able to improve their experiences as well.

mehighlow · 2024-04-08T22:32:51Z

@matthchr, you bet! Thanks for the improvement!

github-project-automation bot added this to Azure Service Operator Roadmap Sep 19, 2023

github-project-automation bot moved this to Backlog in Azure Service Operator Roadmap Sep 19, 2023

github-actions bot added the needs-triage 🔍 label Sep 19, 2023

matthchr self-assigned this Sep 25, 2023

matthchr added this to the v2.4.0 milestone Sep 25, 2023

matthchr removed the needs-triage 🔍 label Sep 25, 2023

matthchr modified the milestones: v2.4.0, v2.5.0 Oct 23, 2023

theunrepentantgeek modified the milestones: v2.6.0, v2.7.0 Dec 11, 2023

matthchr modified the milestone: v2.7.0 Feb 22, 2024

matthchr mentioned this issue Apr 4, 2024

Mark capacity errors for documentdb as fatal #3906

Merged

3 tasks

matthchr closed this as completed in #3906 Apr 8, 2024

github-project-automation bot moved this from Backlog to Recently Completed in Azure Service Operator Roadmap Apr 8, 2024

matthchr moved this from Recently Completed to Ready for Release in Azure Service Operator Roadmap Apr 8, 2024
