
update vm delete to not wait for datadisk detachment #95

Closed
wants to merge 2 commits into from

Conversation

kon-angelo
Contributor

What this PR does / why we need it:
Update VM delete to not wait until all data disks are detached. This may help prevent situations where a failed VM cannot be deleted (because detach, or any other operation, is prohibited on it).
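The proposed change can be sketched as follows. This is illustrative Go only, not the provider's actual code: the `vmClient` interface, both delete helpers, and the `failedVM` fake are hypothetical, chosen to show why detach-first deletion can wedge on a failed VM.

```go
package main

import (
	"errors"
	"fmt"
)

// vmClient abstracts the two Azure operations relevant here.
// Hypothetical interface for illustration, not the provider's real API.
type vmClient interface {
	DetachDataDisks(vmName string) error
	DeleteVM(vmName string) error
}

// deleteWithDetachWait is the previous flow: detach all data disks
// first, then delete the VM. A failed VM that rejects the detach call
// can never be deleted this way.
func deleteWithDetachWait(c vmClient, vmName string) error {
	if err := c.DetachDataDisks(vmName); err != nil {
		return fmt.Errorf("detach failed, VM not deleted: %w", err)
	}
	return c.DeleteVM(vmName)
}

// deleteDirect is the flow proposed in this PR: issue the delete
// immediately and let the Azure API detach the disks as part of it.
func deleteDirect(c vmClient, vmName string) error {
	return c.DeleteVM(vmName)
}

// failedVM simulates a VM in a failed state where detach is
// prohibited but delete still succeeds.
type failedVM struct{}

func (failedVM) DetachDataDisks(string) error {
	return errors.New("operation prohibited on failed VM")
}
func (failedVM) DeleteVM(string) error { return nil }

func main() {
	fmt.Println(deleteWithDetachWait(failedVM{}, "vm-0") != nil) // old flow blocks: true
	fmt.Println(deleteDirect(failedVM{}, "vm-0") == nil)         // new flow succeeds: true
}
```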

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:


@kon-angelo kon-angelo requested review from a team as code owners April 18, 2023 12:39
@gardener-robot gardener-robot added the needs/rebase Needs git rebase label Apr 18, 2023
@gardener-robot

@kon-angelo You need to rebase this pull request onto the latest master branch. Please check.

@gardener-robot gardener-robot added needs/review Needs review size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Apr 18, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 18, 2023
@kon-angelo
Contributor Author

/test

@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 18, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 18, 2023
Comment on lines +604 to +605
if deleteErr := DeleteVM(ctx, clients, resourceGroupName, VMName); deleteErr != nil && !NotFound(deleteErr) {
return deleteErr
Contributor

If we don't wait for disk detachment (which could take 10 min), then we'll move directly to the disk deletion below while the disks are still detaching. This could make things complicated, as Azure can't be trusted.

Contributor

Sorry, after taking another look I realize that we are not even issuing a detach call now and are going straight for a delete.
This solution would be deterministic only if DeleteVM returns after the detach and not before; otherwise we will go on to disk deletion, which might fail and lead to inconsistencies.

Contributor

@himanshu-kun himanshu-kun left a comment

Could you provide more context regarding this change and how you think it helps?
A PR by @unmarshall is already in progress for issue #91, where this could be handled.
Do you feel this change is urgent?

@himanshu-kun
Contributor

/assign @himanshu-kun

@elankath
Contributor

elankath commented Apr 21, 2023

Please note that when we move to the new Azure SDK in #91, we will be changing the code to do cascade create and cascade delete, i.e. ONE call to create NIC+DISK+VM and ONE call to delete NIC+DISK+VM.

@unmarshall
Contributor

Please note that when we move to the new Azure SDK in #91, we will be changing the code to do cascade create and cascade delete, i.e. ONE call to create NIC+DISK+VM and ONE call to delete NIC+DISK+VM.

We are currently testing these APIs. Tarun is right that there are options to cascade delete. For creation, we are testing whether we need to create NICs separately. Once we have done the testing, we will know more.
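For reference, in the new SDK (the `armcompute` module of `github.com/Azure/azure-sdk-for-go`), cascade deletion is configured per attached resource via `DeleteOption` fields on the VM model set at create/update time. The fragment below is a sketch of what that configuration could look like, not the actual #91 implementation; field and constant names follow the current SDK and should be verified against the version actually vendored.

```go
// Configuration fragment only: assumes armcompute and the azcore/to
// helper package are imported. With DeleteOption set to "Delete", the
// OS disk, data disks, and NIC are deleted together with the VM in a
// single delete call, instead of being detached and deleted separately.
vm := armcompute.VirtualMachine{
	Properties: &armcompute.VirtualMachineProperties{
		StorageProfile: &armcompute.StorageProfile{
			OSDisk: &armcompute.OSDisk{
				// Delete the OS disk together with the VM.
				DeleteOption: to.Ptr(armcompute.DiskDeleteOptionTypesDelete),
			},
			DataDisks: []*armcompute.DataDisk{{
				Lun: to.Ptr[int32](0),
				// Cascade-delete the data disk as well.
				DeleteOption: to.Ptr(armcompute.DiskDeleteOptionTypesDelete),
			}},
		},
		NetworkProfile: &armcompute.NetworkProfile{
			NetworkInterfaces: []*armcompute.NetworkInterfaceReference{{
				Properties: &armcompute.NetworkInterfaceReferenceProperties{
					// Cascade-delete the NIC with the VM.
					DeleteOption: to.Ptr(armcompute.DeleteOptionsDelete),
				},
			}},
		},
	},
}
_ = vm
```

Note that because these options live in the VM parameters, they only take effect on VMs that were created (or updated) with them, which is the concern about existing VMs raised below.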

@kon-angelo
Contributor Author

kon-angelo commented Apr 24, 2023

Thank you all for the review/responses.

@himanshu-kun

I realize that we are not even issuing a detach call now, and going for a direct delete.
this solution would be deterministic only if DeleteVM returns after the detach and not before, otherwise we will go for disk deletion and it might fail and lead to inconsistencies

DeleteVM should return after the VM is deleted, in which case there is nothing to detach the disks from. In practice, the Azure API should handle the detach in the background, but for some reason we chose to detach the disks in the foreground ourselves.

could you provide more context regarding this change, and how you think this helps?

I think this is how it should be done (in an ideal world). The API allows deleting the machine with data disks attached, so let it take care of the process itself. If there is an actual reason we did it this way (a past incident or issue), then please point me to the documentation; otherwise let's not optimize without a good reason/data.

Do you feel this change is urgent ?

Not urgent. Just curious why we handle VMs in a "special" way.

@elankath

Please note that when we move to the new Azure SDK in #91, we will be changing the code to do cascade create and cascade delete, i.e. ONE call to create NIC+DISK+VM and ONE call to delete NIC+DISK+VM.

👍. However, the cascade options are part of the VM parameters, so you have to consider existing VMs and how you will handle them, given that MCM does not support VM updates. Will you update existing VMs before deleting them? How will you deal with failed VMs that do not allow updates? These questions are part of your future PR, but they become relevant here if you can't reliably update all machines with the cascading options and are forced to keep both code paths (old and new) for a period of time.
If you are okay with deleting the VM with a cascade option (hence skipping the detachment of the data disks), it means you are already okay with issuing a delete on the VM and letting the Azure API do the detaching, as in this PR (even if you then have to do the disk deletion yourself).

Regardless, merging this PR does not take anything away from, or stop you from proceeding with, the "one call" approach in the future.

TL;DR: we don't "have to" proceed with this PR. But I would like to know why we do things the way we do, since we go out of our way to do our own "optimization".

@gardener-robot gardener-robot added size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) and removed size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Apr 24, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 24, 2023
@himanshu-kun
Contributor

But I would like to know why we do things the way we do since we go out of our way to do our own "optimization".

I searched for the reason behind this optimisation and came across gardener/machine-controller-manager#248, which introduced the wait for data disk detachment.

There is not much info in the issue that #248 tried to fix, but from what I can make out, there were issues with VM deletion because the PVs for pods were still attached to the VM (there was no force draining of pods at the time, so the VM was deleted with PVs attached). Since the Azure API vmClient.Delete() was not able to handle this situation, the wait on our side was introduced.

A related issue was seen some time ago on Azure where, in a particular situation, we tried to delete a VM with PVs attached; because of this wait for disk detachment, only downtime was seen, and no other inconsistencies.

So I think we need to try out the behaviour of the Azure API when deleting a VM with disks attached, as I asked here, to see whether the Azure API is better now or not.

@himanshu-kun
Contributor

After discussing internally, we decided that these Azure APIs are not that reliable, and we would rather not disturb the currently working code flow until we move to the latest APIs and see them functioning correctly.
To deal with the problem of a VM stuck in a terminal state, see the tracking issue gardener/machine-controller-manager#810.
/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label May 4, 2023