☂️  Enable MCM providers to force delete machines stuck in `Terminal` state

**How to categorize this issue?**

/area quality
/area robustness
/kind enhancement
/priority 3

**What would you like to be added**:

MCM should be able to force delete machines that are stuck in a terminal state (an irrecoverable state in which API calls other than force delete won't work).

**Why is this needed**:

We have seen issues (currently on Azure only) where a VM was stuck in a terminal state (refer to Live Ticket # 2946):
```
VM deletion failed due to - machine codes error: 
code = [Internal] message = [Code="OSProvisioningTimedOut" Message="
OS Provisioning failure has reached terminal state and is non-recoverable for VM 
'shoot--hc-can-az--prod-az-haas-hana-vsmp4-z2-8667b-5frj5'. 
Consider deleting and recreating this virtual machine. 
```

The above error in Azure can possibly be reproduced by following:
- https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/troubleshoot-deployment-new-vm-linux#issue-custom-image-provisioning-errors
- https://learn.microsoft.com/en-us/answers/questions/1112332/osprovisioningtimedout

MCM fails to recover from this situation because in Azure we detach the disks first (an `Update` operation) and only then call `DeleteVM()`. Since the disk detachment never succeeds while the VM is in the terminal state, `DeleteVM()` is never reached, the situation becomes irrecoverable, and the `Delete` flow of MCM keeps repeating.

Similar situations could be seen in other providers where a normal Delete won't work and a force delete might be needed.

Example ticket canary # 4358

Proposal:

- Have an alternate Force Delete flow, which is triggered if the normal Delete flow fails a threshold number of times.
   - The error from the provider needs to be checked, as force delete shouldn't be triggered for errors where a backoff should be done, e.g.:
     - API rate limits (often seen in CCloud)
     - invalid credentials
- If an annotation is placed on the machine object, we can trigger a force delete. How the force delete is carried out might vary from provider to provider.
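
The decision logic proposed above could be sketched roughly as follows. This is only an illustration under assumptions, not the MCM implementation: the `ErrorCode` values, the `DeleteFailure` type, the annotation key `example.com/force-delete`, and the function names are all hypothetical and would need to be mapped to the real machine error codes and a real annotation agreed upon for MCM.

```go
package main

// ErrorCode is a hypothetical, simplified stand-in for the error code
// a provider returns when a Delete call fails.
type ErrorCode string

const (
	CodeInternal          ErrorCode = "Internal"          // e.g. VM in terminal state
	CodeResourceExhausted ErrorCode = "ResourceExhausted" // e.g. API rate limit hit
	CodeUnauthenticated   ErrorCode = "Unauthenticated"   // e.g. invalid credentials
)

// DeleteFailure records one failed attempt of the normal Delete flow.
type DeleteFailure struct {
	Code    ErrorCode
	Message string
}

// forceDeleteAnnotation is a hypothetical annotation key; the real key
// is not defined in this issue.
const forceDeleteAnnotation = "example.com/force-delete"

// isBackoffError reports whether a failure should lead to a backoff and
// retry rather than count towards the force delete threshold.
func isBackoffError(f DeleteFailure) bool {
	switch f.Code {
	case CodeResourceExhausted, CodeUnauthenticated:
		return true
	}
	return false
}

// shouldForceDelete returns true if the machine carries the force delete
// annotation, or if the normal Delete flow has failed at least threshold
// times with errors that are not backoff errors.
func shouldForceDelete(annotations map[string]string, failures []DeleteFailure, threshold int) bool {
	if annotations[forceDeleteAnnotation] == "true" {
		return true
	}
	if threshold <= 0 {
		// Treat a non-positive threshold as "automatic force delete disabled".
		return false
	}
	count := 0
	for _, f := range failures {
		if !isBackoffError(f) {
			count++
		}
	}
	return count >= threshold
}
```

In this sketch, rate-limit and credential errors never push a machine towards force delete, so a temporarily throttled provider API would not cause VMs to be force deleted prematurely.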

Providers:
- [x] mcm-provider-azure (See https://github.com/gardener/machine-controller-manager/issues/810#issuecomment-1829160620)
- [ ] mcm-provider-aws
- [ ] mcm-provider-gcp
- [ ] mcm-provider-openstack
- [ ] mcm-provider-alicloud
- [ ] mcm-provider-local
