fix: removing nic after vm creation fails #68
Conversation
…efactor of vmName, nicName to be one resourceName
/test
Overall looks really good and I like your approach here. Just have a few questions about your unit tests.
Also wondering if you've been able to test this in a cluster where you've previously seen NICs hanging around to demonstrate that this change is effective at preventing that.
@rakechill I did a scale to 750 nodes and a scale down, and NICs were mostly cleaned up. We do see 4 NICs that were not, and all of them failed due to the same error, so I wanted to cut in this PR first to get us out of the hundreds of remaining ghost NICs.
Did another run: scaled to 300 or so, then deleted all the nodes. We are left with one NIC that remains due to NicReservedForAnotherVm.
```diff
@@ -25,6 +25,8 @@ import (
 	"github.com/Azure/karpenter/pkg/auth"
 )
+
+var MaxRetries = 20
```
Is this used anywhere?
Fixes #67
Description
When we issue a VM.Create call and the VM fails to create (which happens frequently early in Karpenter's lifecycle, when it's trying offerings that turn out to be unavailable), we don't delete the network interface.
We also do not delete the NIC on the InstanceProvider.Delete() call, but we should be attempting to delete NICs when the NodeClaim is deleted, since we can get into a scenario where we return ICE errors and the launch reconciler then deletes the NodeClaim. In the case of an ICE error for TotalRegionalCoresQuota, we will have created only the NIC for that NodeClaim; if we then delete that NodeClaim, our provider will attempt to delete only the VM, leaving the NIC behind.
This PR adds a new cleanup function, p.cleanupAzureResources, that we will use to delete all associated Azure resources; a rough sketch of the idea is shown below.
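For illustration only, here is a minimal sketch of what such a cleanup path might look like. Everything in it is an assumption, not the PR's actual code: the Provider type, the deleteVirtualMachine/deleteNetworkInterface helpers, and the single shared resourceName (per the vmName/nicName refactor mentioned in the commit message) are hypothetical stand-ins.

```go
package instance

import (
	"context"
	"fmt"
)

// Provider is a hypothetical stand-in for the instance provider; the two
// delete helpers below would wrap the Azure SDK VM and NIC delete calls.
type Provider struct{ /* Azure clients elided */ }

func (p *Provider) deleteVirtualMachine(ctx context.Context, name string) error   { return nil }
func (p *Provider) deleteNetworkInterface(ctx context.Context, name string) error { return nil }

// cleanupAzureResources deletes all Azure resources associated with a
// nodeclaim. Because vmName and nicName were refactored into one
// resourceName, a single name is enough to locate both resources.
func (p *Provider) cleanupAzureResources(ctx context.Context, resourceName string) error {
	// Delete the VM first: a NIC generally cannot be deleted while it is
	// still attached to a VM.
	if err := p.deleteVirtualMachine(ctx, resourceName); err != nil {
		return fmt.Errorf("deleting VM %q: %w", resourceName, err)
	}
	// Then delete the NIC. This also covers the case where the VM was never
	// created (e.g. a TotalRegionalCoresQuota ICE error) and only the NIC
	// exists for the nodeclaim.
	if err := p.deleteNetworkInterface(ctx, resourceName); err != nil {
		return fmt.Errorf("deleting NIC %q: %w", resourceName, err)
	}
	return nil
}
```

With a function along these lines, both failure paths described above can share one code path: call it when VM.Create fails after the NIC already exists, and call it from InstanceProvider.Delete() instead of deleting only the VM.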
How was this change tested?
TODO
Does this change impact docs?
Release Note