Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve deletion flow to finish possiblity of leaving orphan resources #850

Open
7 tasks
himanshu-kun opened this issue Sep 13, 2023 · 1 comment
Open
7 tasks
Labels
area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related area/usability Usability related exp/beginner Issue that requires only basic skills kind/bug Bug lifecycle/stale Nobody worked on this for 6 months (will further age) priority/2 Priority (lower number equals higher priority)

Comments

@himanshu-kun
Copy link
Contributor

himanshu-kun commented Sep 13, 2023

How to categorize this issue?

/area performance
/area usability
/area productivity
/kind bug
/priority 1

What happened:

The deletion flow of MCM currently follows the following flow:
image

But acc. to the contract of GetMachineStatus() , NOTFOUND should be returned only if VM not found, it doesn't mention nics/disks. This leaves the deletion flow , deleting a machine object but NOT cleaning up orphan nics , disks, in some cases

So the **proposed flow** is to try DeleteMachine() even after NOTFOUND is returned. This will ensure removal of orphan nics, disks.
image

Note: We can't rely on orphan collection logic as the logic is limited to MCM . So in cases where MCM is removed after the last machine obj is deleted in a shoot cluster (like in gardener cases) , and the last machine obj satisfied the above described corner case, its disks and nics would stay, further blocking infra deletion (subnet / resource group deletion for example)

What you expected to happen:

Delete flow should not leave any orphan resources

The following changes are required:

  1. The mcm-providers should be updated , so that the DeleteMachine() driver implementation also follows the contract. For example , gcp returns NOTFOUND error if VM not there, but acc. to contract it shouldn't return any error, it should be a no-op.
  • aws
  • azure
  • gcp
  • openstack
  • alicloud
  • vshpere
  • equinix-metal
  1. MCM delete flow should be updated to match the proposed flow above.

NOTE: MCM-provider should vendor the MCM with proposed change only after their DeleteMachine() starts following
contract, otherwise delete flow could get stuck on the Delete machine step

How to reproduce it (as minimally and precisely as possible):

  1. Create single machine obj
  2. delete the VM such that disks and nics still remain. This can be achieved by turning the cascade delete option to false
  3. Put deletion timestamp on the machine obj
  4. As soon as the machine deletes , scale down MCM . This is to make sure orphan collection doesn't run. This is how higher level gardener controllers scale-down MCM.
  5. query the infra for disks and nics.

Anything else we need to know?:

There are many canary and live tickets , where such orphan disks and nics are seen. The reason may not be the same as described above, but since this is one such codepath, we need to fix it

canary # 3637
live # 730
live # 2263
live # 2273

Environment:
mcm <= 0.49.3

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related area/usability Usability related priority/1 Priority (lower number equals higher priority) labels Sep 13, 2023
@gardener-robot
Copy link

@himanshu-kun Label area/productivity does not exist.

@himanshu-kun himanshu-kun added the exp/beginner Issue that requires only basic skills label Sep 13, 2023
@himanshu-kun himanshu-kun added priority/2 Priority (lower number equals higher priority) and removed priority/1 Priority (lower number equals higher priority) labels Oct 6, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related area/usability Usability related exp/beginner Issue that requires only basic skills kind/bug Bug lifecycle/stale Nobody worked on this for 6 months (will further age) priority/2 Priority (lower number equals higher priority)
Projects
None yet
Development

No branches or pull requests

2 participants