☂️ Enable MCM providers to force delete machines stuck in Terminal state #810

Open
@himanshu-kun

Description

How to categorize this issue?

/area quality
/area robustness
/kind enhancement
/priority 3

What would you like to be added:

MCM should be able to force delete machines that are stuck in a terminal state (an irrecoverable state in which API calls other than force delete do not work).

Why is this needed:

We have seen issues (currently on Azure only) where the VM was stuck in a terminal state (refer to Live Ticket # 2946).

VM deletion failed due to - machine codes error: 
code = [Internal] message = [Code="OSProvisioningTimedOut" Message="
OS Provisioning failure has reached terminal state and is non-recoverable for VM 
'shoot--hc-can-az--prod-az-haas-hana-vsmp4-z2-8667b-5frj5'. 
Consider deleting and recreating this virtual machine. 

The above error in Azure can be reproduced possibly by

MCM fails to recover from this situation because, on Azure, we first detach the disks (an Update operation) and only then call DeleteVM(). Since disk detachment is never triggered because of the terminal state, the situation becomes irrecoverable and the Delete flow of MCM keeps repeating.

Similar situations could be seen in other providers where normal Delete won't work and a force delete might be needed.

Example ticket canary # 4358
Proposal:

  • Have an alternate Force Delete flow, which is triggered if the normal Delete flow fails a threshold number of times.
    • The error from the provider needs to be checked first, as a force delete shouldn't be triggered for errors where a backoff should be done instead. For example:
      • API rate limits (often seen in CCloud)
      • invalid credentials
  • If an annotation is placed on the machine object, then we can trigger a force delete, which might vary from provider to provider.

Providers:

    Labels

    area/quality: Output qualification (tests, checks, scans, automation in general, etc.) related
    area/robustness: Robustness, reliability, resilience related
    kind/enhancement: Enhancement, improvement, extension
    lifecycle/rotten: Nobody worked on this for 12 months (final aging stage)
    priority/3: Priority (lower number equals higher priority)