Description
How to categorize this issue?
/area quality
/area robustness
/kind enhancement
/priority 3
What would you like to be added:
MCM should be able to force delete machines that are stuck in a terminal state (an irrecoverable state where API calls other than force delete won't work).
Why is this needed:
We have seen issues (currently only on Azure) where a VM was stuck in a terminal state (see Live Ticket # 2946).
VM deletion failed due to - machine codes error:
code = [Internal] message = [Code="OSProvisioningTimedOut" Message="
OS Provisioning failure has reached terminal state and is non-recoverable for VM
'shoot--hc-can-az--prod-az-haas-hana-vsmp4-z2-8667b-5frj5'.
Consider deleting and recreating this virtual machine.
The above error can possibly be reproduced in Azure as described in:
- https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/troubleshoot-deployment-new-vm-linux#issue-custom-image-provisioning-errors
- https://learn.microsoft.com/en-us/answers/questions/1112332/osprovisioningtimedout
MCM fails to recover from this situation because, in Azure, we detach the disks first (an Update operation) and only then call DeleteVM(). Since disk detachment is never triggered because of the terminal state, the situation becomes irrecoverable and MCM's Delete flow keeps repeating.
Similar situations could be seen in other providers where normal Delete won't work and a force delete might be needed.
Example: canary ticket # 4358
Proposal:
- Have an alternate Force Delete flow, which is triggered if the normal Delete flow fails a threshold number of times.
- The error returned by the provider needs to be checked first, as force delete shouldn't be triggered for errors that call for a backoff, e.g.:
  - API rate limits (often seen in CCloud)
  - invalid credentials
- If an annotation is placed on the machine object, we can trigger a force delete; how it is carried out might vary from provider to provider.
Providers:
- mcm-provider-azure (see ☂️ Enable MCM providers to force delete machines stuck in Terminal state #810 (comment))
- mcm-provider-aws
- mcm-provider-gcp
- mcm-provider-openstack
- mcm-provider-alicloud
- mcm-provider-local