Improve deletion flow to eliminate the possibility of leaving orphan resources #850
Labels
- `area/performance`: Performance (across all domains, such as control plane, networking, storage, etc.) related
- `area/usability`: Usability related
- `exp/beginner`: Issue that requires only basic skills
- `kind/bug`: Bug
- `lifecycle/stale`: Nobody worked on this for 6 months (will further age)
- `priority/2`: Priority (lower number equals higher priority)
How to categorize this issue?
/area performance
/area usability
/area productivity
/kind bug
/priority 1
What happened:
The deletion flow of MCM currently proceeds as follows:
But according to the contract of `GetMachineStatus()`, `NOTFOUND` should be returned only if the VM is not found; the contract says nothing about NICs or disks. In some cases this leaves the deletion flow deleting the machine object but NOT cleaning up the orphan NICs and disks.

So the **proposed flow** is to try `DeleteMachine()` even after `NOTFOUND` is returned. This will ensure removal of orphan NICs and disks.

Note: We can't rely on the orphan collection logic, as that logic is limited to MCM. So in cases where MCM is removed after the last machine object is deleted in a shoot cluster (as in Gardener setups), and that last machine object hit the corner case described above, its disks and NICs would stay behind, further blocking infra deletion (subnet / resource group deletion, for example).
What you expected to happen:
Delete flow should not leave any orphan resources
The following changes are required: the `DeleteMachine()` driver implementation must also follow the contract. For example, GCP returns a `NOTFOUND` error if the VM is not there, but according to the contract it shouldn't return any error; the call should be a no-op.

NOTE: An MCM provider should vendor the MCM with the proposed change only after its `DeleteMachine()` starts following the contract, otherwise the delete flow could get stuck on the Delete machine step.
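A provider-side sketch of the required contract behaviour, under the assumption of a generic cloud SDK. All names (`cloudAPI`, `errVMNotFound`, `goneAPI`) are hypothetical, not the real GCP provider code:

```go
package main

import (
	"errors"
	"fmt"
)

// errVMNotFound stands in for a cloud SDK "instance not found" error.
var errVMNotFound = errors.New("instance not found")

// cloudAPI is a minimal stand-in for a provider SDK client.
type cloudAPI interface {
	DeleteInstance(name string) error
}

// deleteMachine follows the driver contract: a missing VM is NOT an error;
// the call becomes a no-op so the MCM delete flow can proceed past this step.
func deleteMachine(api cloudAPI, name string) error {
	err := api.DeleteInstance(name)
	if errors.Is(err, errVMNotFound) {
		return nil // contract: no error when the VM is already gone
	}
	return err
}

// goneAPI simulates a provider whose VM has already been deleted.
type goneAPI struct{}

func (goneAPI) DeleteInstance(name string) error { return errVMNotFound }

func main() {
	// With the contract followed, a missing VM no longer surfaces as an error.
	fmt.Println("error:", deleteMachine(goneAPI{}, "machine-0"))
}
```

Without this, the proposed always-call-`DeleteMachine()` flow would loop on the spurious `NOTFOUND` error instead of finishing deletion.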
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
There are many canary and live tickets where such orphan disks and NICs are seen. The root cause may not be the same as described above, but since this is one such code path, we need to fix it:
- canary #3637
- live #730
- live #2263
- live #2273
Environment:
- mcm <= 0.49.3
- Kubernetes version (use `kubectl version`):