
update vm delete to not wait for datadisk detachment #95

Closed
wants to merge 2 commits into from

Conversation

kon-angelo
Contributor

What this PR does / why we need it:
Update VM delete to not wait until all data disks are detached. This may help prevent situations where a failed VM cannot be deleted (because detach, or any other operation, is prohibited on it).
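The proposed change can be sketched as follows. This is illustrative Go only, not the provider's actual code: the `vmClient` interface, both delete helpers, and the `failedVM` fake are hypothetical, chosen to show why detach-first deletion can wedge on a failed VM.

```go
package main

import (
	"errors"
	"fmt"
)

// vmClient abstracts the two Azure operations relevant here.
// Hypothetical interface for illustration, not the provider's real API.
type vmClient interface {
	DetachDataDisks(vmName string) error
	DeleteVM(vmName string) error
}

// deleteWithDetachWait is the previous flow: detach all data disks
// first, then delete the VM. A failed VM that rejects the detach call
// can never be deleted this way.
func deleteWithDetachWait(c vmClient, vmName string) error {
	if err := c.DetachDataDisks(vmName); err != nil {
		return fmt.Errorf("detach failed, VM not deleted: %w", err)
	}
	return c.DeleteVM(vmName)
}

// deleteDirect is the flow proposed in this PR: issue the delete
// immediately and let the Azure API detach the disks as part of it.
func deleteDirect(c vmClient, vmName string) error {
	return c.DeleteVM(vmName)
}

// failedVM simulates a VM in a failed state where detach is
// prohibited but delete still succeeds.
type failedVM struct{}

func (failedVM) DetachDataDisks(string) error {
	return errors.New("operation prohibited on failed VM")
}
func (failedVM) DeleteVM(string) error { return nil }

func main() {
	fmt.Println(deleteWithDetachWait(failedVM{}, "vm-0") != nil) // old flow blocks: true
	fmt.Println(deleteDirect(failedVM{}, "vm-0") == nil)         // new flow succeeds: true
}
```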

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:


@kon-angelo kon-angelo requested review from a team as code owners April 18, 2023 12:39
@gardener-robot gardener-robot added the needs/rebase Needs git rebase label Apr 18, 2023
@gardener-robot

@kon-angelo You need to rebase this pull request onto the latest master branch. Please check.

@gardener-robot gardener-robot added needs/review Needs review size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Apr 18, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 18, 2023
@kon-angelo
Contributor Author

/test

@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 18, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 18, 2023
Comment on lines +604 to +605
if deleteErr := DeleteVM(ctx, clients, resourceGroupName, VMName); deleteErr != nil && !NotFound(deleteErr) {
return deleteErr
Contributor

If we don't wait for disk detachment (which could take 10 min), then we'll move directly to the disk deletion below while the disks are still detaching. This could make things complicated, as Azure can't be trusted.

Contributor

Sorry, after taking another look I realize that we are not even issuing a detach call now and are going straight for a delete.
This solution would be deterministic only if DeleteVM returns after the detach and not before; otherwise we will go on to disk deletion, which might fail and lead to inconsistencies.

Contributor

@himanshu-kun himanshu-kun left a comment

Could you provide more context regarding this change and how you think it helps?
A PR by @unmarshall is already in progress for issue #91, where this could be handled.
Do you feel this change is urgent?

@himanshu-kun
Contributor

/assign @himanshu-kun

@elankath
Contributor

elankath commented Apr 21, 2023

Please note that when we move to the new Azure SDK in #91, we will be changing the code to do cascade create and cascade delete, i.e. ONE call to create NIC+DISK+VM and ONE call to delete NIC+DISK+VM.

@unmarshall
Contributor

Please note that when we move to the new Azure SDK in #91, we will be changing the code to do cascade create and cascade delete, i.e. ONE call to create NIC+DISK+VM and ONE call to delete NIC+DISK+VM.

We are currently testing these APIs. Tarun is right that there are options to cascade delete. For creation, we are testing whether we need to create NICs separately. Once we have done the testing, we will know more.
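For reference, in the new SDK (the `armcompute` module of `github.com/Azure/azure-sdk-for-go`), cascade deletion is configured per attached resource via `DeleteOption` fields on the VM model set at create/update time. The fragment below is a sketch of what that configuration could look like, not the actual #91 implementation; field and constant names follow the current SDK and should be verified against the version actually vendored.

```go
// Configuration fragment only: assumes armcompute and the azcore/to
// helper package are imported. With DeleteOption set to "Delete", the
// OS disk, data disks, and NIC are deleted together with the VM in a
// single delete call, instead of being detached and deleted separately.
vm := armcompute.VirtualMachine{
	Properties: &armcompute.VirtualMachineProperties{
		StorageProfile: &armcompute.StorageProfile{
			OSDisk: &armcompute.OSDisk{
				// Delete the OS disk together with the VM.
				DeleteOption: to.Ptr(armcompute.DiskDeleteOptionTypesDelete),
			},
			DataDisks: []*armcompute.DataDisk{{
				Lun: to.Ptr[int32](0),
				// Cascade-delete the data disk as well.
				DeleteOption: to.Ptr(armcompute.DiskDeleteOptionTypesDelete),
			}},
		},
		NetworkProfile: &armcompute.NetworkProfile{
			NetworkInterfaces: []*armcompute.NetworkInterfaceReference{{
				Properties: &armcompute.NetworkInterfaceReferenceProperties{
					// Cascade-delete the NIC with the VM.
					DeleteOption: to.Ptr(armcompute.DeleteOptionsDelete),
				},
			}},
		},
	},
}
_ = vm
```

Note that because these options live in the VM parameters, they only take effect on VMs that were created (or updated) with them, which is the concern about existing VMs raised below.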

@kon-angelo
Contributor Author

kon-angelo commented Apr 24, 2023

Thank you all for the review/responses.

@himanshu-kun

I realize that we are not even issuing a detach call now, and going for a direct delete.
this solution would be deterministic only if DeleteVM returns after the detach and not before, otherwise we will go for disk deletion and it might fail and lead to inconsistencies

DeleteVM should return after the VM is deleted, in which case there is nothing to detach the disks from. In practice, the Azure API should handle the detach in the background, but for some reason we chose to detach the disks in the foreground ourselves.

could you provide more context regarding this change, and how you think this helps?

I think this is how it should be done (in an ideal world). The API allows deleting the machine with data disks attached, so let it take care of the process itself. If there is an actual reason we did it this way (a past incident or issue), then please point me to the documentation; otherwise let's not optimize without a good reason/data.

Do you feel this change is urgent ?

Not urgent. Just curious why we handle VMs in a "special" way.

@elankath

Please note that when we move to the new Azure SDK in #91, we will be changing the code to do cascade create and cascade delete, i.e. ONE call to create NIC+DISK+VM and ONE call to delete NIC+DISK+VM.

👍. However, the cascade options are part of the VM parameters, so you have to consider existing VMs and how you will handle them, given that MCM does not support VM updates. Will you update existing VMs before deleting them? How will you deal with failed VMs that do not allow updates? These questions are part of your future PR, but they become relevant here if you can't reliably update all machines with the cascading options and are forced to keep both code paths (old and new) for a period of time.
If you are okay with deleting the VM with a cascade option (hence skipping the detachment of the data disks), it means you are already okay with issuing a delete on the VM and letting the Azure API do the detaching, as in this PR (even if you then have to do the disk deletion yourself).

Regardless, merging this PR does not take anything away from, or stop you from proceeding with, the "one call" approach in the future.

TL;DR: we don't "have to" proceed with this PR. But I would like to know why we do things the way we do, since we go out of our way to do our own "optimization".

@gardener-robot gardener-robot added size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) and removed size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Apr 24, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 24, 2023
@himanshu-kun
Contributor

But I would like to know why we do things the way we do since we go out of our way to do our own "optimization".

I searched for the reason behind this optimisation and came across gardener/machine-controller-manager#248, which introduced the wait for data disk detachment.

There is not much info in the issue that #248 tried to fix, but from what I can make out, there were issues with VM deletion because the PVs for pods were still attached to the VM (there was no force draining of pods at the time, so the VM was deleted with PVs attached). Since the Azure API vmClient.Delete() was not able to handle this situation, the wait on our side was introduced.

A related issue was seen some time ago on Azure where, in a particular situation, we tried to delete a VM with PVs attached; because of this wait for disk detachment, only downtime was seen, and no other inconsistencies.

So I think we need to try out the behaviour of the Azure API when deleting a VM with disks attached, as I asked here, to see whether the Azure API is better now or not.

@himanshu-kun
Contributor

After discussing internally, we decided that these Azure APIs are not that reliable, and we would rather not disturb the currently working code flow until we move to the latest APIs and see them functioning correctly.
To deal with the problem of a VM stuck in a terminal state, see the tracking issue gardener/machine-controller-manager#810.
/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label May 4, 2023