
Fix Race conditions in Targeted Deletion of machines by CA #341

Merged

Conversation


@rishabh-11 commented Dec 20, 2024

What this PR does / why we need it:
This PR fixes the issues noticed in live issues 6120 and 6101. We introduce a mutex in the NodeGroupImpl struct (node group implementation) which must be acquired before performing a scale-down or scale-up operation.

We also introduce an annotation on the MachineDeployment, TriggerDeletionByMCM = "node.machine.sapcloud.io/trigger-deletion-by-mcm", whose value lists the machines the CA wants to remove; the CA updates the replica count and this annotation atomically as one single step. This makes it possible to recognize machines for which the CA has already reduced the MachineDeployment replicas and prevents the reduction from being duplicated, so race conditions are avoided when CA scale-downs occur in parallel. The MCM is also updated to check this TriggerDeletionByMCM annotation on the MachineDeployment.

The CA no longer directly sets the MachinePriority=1 annotation on the Machine objects. This is now done by the MCM controller when reconciling the updated MachineDeployment.
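
As a rough illustration of the mechanism (not the merged code), the CA-side change can be thought of as building one revised object that carries both the reduced replica count and the trigger-deletion list, and then issuing a single update for it:

```go
package mcm

import (
	"strings"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// TriggerDeletionByMCM is the annotation described above; its value is a
// comma-separated list of machine names the CA wants removed.
const TriggerDeletionByMCM = "node.machine.sapcloud.io/trigger-deletion-by-mcm"

// markMachinesForDeletion is a hypothetical helper: it returns a copy of the
// MachineDeployment with the replica count reduced and the trigger-deletion
// annotation set, so that both changes can be applied in one update.
func markMachinesForDeletion(md *v1alpha1.MachineDeployment, machineNames []string) *v1alpha1.MachineDeployment {
	clone := md.DeepCopy()
	if clone.Annotations == nil {
		clone.Annotations = map[string]string{}
	}
	clone.Annotations[TriggerDeletionByMCM] = strings.Join(machineNames, ",")
	clone.Spec.Replicas = md.Spec.Replicas - int32(len(machineNames))
	return clone
}
```

A single update of this clone lands the replica reduction and the deletion list together, which is what removes the race with parallel scale-downs.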

Which issue(s) this PR fixes:
Fixes #342

Special notes for your reviewer:

Release note:

Fix for data races during concurrent CA scale-downs and CA restarts

@gardener-robot gardener-robot added needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Dec 20, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 20, 2024
@rishabh-11 rishabh-11 marked this pull request as ready for review December 24, 2024 08:31
@rishabh-11 rishabh-11 requested review from unmarshall and a team as code owners December 24, 2024 08:31
klog.Infof("machine deployment %s is under rolling update, skipping", machineDeployment.Name)
return nil
}
markedMachines := sets.New(strings.Split(mcd.Annotations[machinesMarkedByCAForDeletion], ",")...)


Please move this to a small function, mcm.getMachinesMarkedByCAForDeletion(mcd) (machineNames sets.Set[string]), which is unit-testable and can be consumed by both mcm_cloud_provider.go and mcm_manager.go.

klog.Errorf("[Refresh] failed to get machines for machine deployment %s, hence skipping it. Err: %v", machineDeployment.Name, err.Error())
return err
}
var incorrectlyMarkedMachines []*Ref


I believe we can omit the Ref struct and just use types.NamespacedName, which is already being used in this file anyway.

@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Dec 24, 2024
@rishabh-11 rishabh-11 requested a review from elankath December 24, 2024 12:23
@gardener-robot gardener-robot removed the size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) label Dec 24, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 24, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 24, 2024
@@ -539,12 +567,17 @@ func buildMachineDeploymentFromSpec(value string, mcmManager *McmManager) (*Mach
return machinedeployment, nil
}

func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] {
@elankath Dec 26, 2024

Perhaps using the "comma, ok" idiom here for mcd.Annotations[machinesMarkedByCAForDeletion] would be good for clarity and would avoid an unnecessary split of an empty string.

rishabh-11 (Author):

Isn't it allowed to read from a nil map? Only writes will cause panic, right?


getMachinesMarkedByCAForDeletion->getMachineNamesMarkedByCAForDeletion
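
A minimal sketch combining both suggestions above (the rename plus the "comma, ok" idiom); the annotation-key constant mirrors the one used in this diff and its value is assumed from the PR description, so treat this as illustrative rather than the merged code:

```go
package mcm

import (
	"strings"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// machinesMarkedByCAForDeletion mirrors the annotation key constant used in the
// diff above; the string value is assumed from the PR description.
const machinesMarkedByCAForDeletion = "node.machine.sapcloud.io/trigger-deletion-by-mcm"

// getMachineNamesMarkedByCAForDeletion returns the machine names recorded in the
// annotation. The "comma, ok" check avoids splitting an empty string, which would
// otherwise produce a set containing a single empty name.
func getMachineNamesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] {
	value, ok := mcd.Annotations[machinesMarkedByCAForDeletion]
	if !ok || value == "" {
		return sets.New[string]()
	}
	return sets.New(strings.Split(value, ",")...)
}
```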

// ignore the machine deployment if it is in rolling update
if !isRollingUpdateFinished(mcd) {
klog.Infof("machine deployment %s is under rolling update, skipping", machineDeployment.Name)
return nil


Shouldn't we return an error here? We do so for the other cases of !isRollingUpdateFinished.

rishabh-11 (Author):

I think we should remove this check. Even if a machineDeployment is under rolling update, we should allow the annotation update if needed. wdyt?

// addNodeGroup adds node group defined in string spec. Format:
// minNodes:maxNodes:namespace.machineDeploymentName
func (m *McmManager) addNodeGroup(spec string) error {
machineDeployment, err := buildMachineDeploymentFromSpec(spec, m)


buildMachineDeploymentFromSpec should also be moved to mcm_manager.go.

func (machineDeployment *MachineDeployment) Refresh() error {
machineDeployment.scalingMutex.Lock()
defer machineDeployment.scalingMutex.Unlock()
mcd, err := machineDeployment.mcmManager.machineDeploymentLister.MachineDeployments(machineDeployment.Namespace).Get(machineDeployment.Name)
@elankath Dec 26, 2024

This is repeated nearly a dozen times, including the common error handling: machineDeployment.mcmManager.machineDeploymentLister.MachineDeployments(machineDeployment.Namespace).Get(machineDeployment.Name). Move it to a method on mcmManager called GetMachineDeploymentResource which returns a formatted error that can simply be returned, so that the error message is fully consistent across all uses. We already have methods like mcmManager.getMachinesForMachineDeployment, so this matches the existing convention.
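
A sketch of the suggested accessor, assuming the machineDeploymentLister field seen elsewhere in this diff and the usual generated lister package path; the signature and error wording are guesses, not the merged code:

```go
package mcm

import (
	"fmt"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
	machinelisters "github.com/gardener/machine-controller-manager/pkg/client/listers/machine/v1alpha1"
)

// mcmManagerSketch stands in for the real McmManager; only the field this sketch needs.
type mcmManagerSketch struct {
	machineDeploymentLister machinelisters.MachineDeploymentLister
}

// GetMachineDeploymentResource centralizes the repeated lister lookup and wraps the
// error once, so every caller returns the same message shape.
func (m *mcmManagerSketch) GetMachineDeploymentResource(namespace, name string) (*v1alpha1.MachineDeployment, error) {
	md, err := m.machineDeploymentLister.MachineDeployments(namespace).Get(name)
	if err != nil {
		return nil, fmt.Errorf("unable to fetch MachineDeployment object %s: %w", name, err)
	}
	return md, nil
}
```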

if machine.Status.CurrentStatus.Phase == v1alpha1.MachineTerminating || machine.Status.CurrentStatus.Phase == v1alpha1.MachineFailed {
continue
}
if annotValue, ok := machine.Annotations[machinePriorityAnnotation]; ok && annotValue == priorityValueForCandidateMachines && !markedMachines.Has(machine.Name) {


Out of curiosity, why is this called priorityValueForCandidateMachines? Shouldn't it be defined as const PriorityDeletionValueForCandidateMachine = 1?

rishabh-11 (Author):

I think we can rename it to PriorityValueForDeletionCandidateMachine. wdyt?


We should also add a comment here to explain what we are doing, so the next person making a patch doesn't have to scratch their head.

}
}
clone := mcd.DeepCopy()
clone.Annotations[machinesMarkedByCAForDeletion] = strings.Join(updatedMarkedMachines, ",")


This strings.Join to construct the annotation value is repeated elsewhere. Make a utility method: getMarkedForDeletionAnnotationValue(machineNames []string) string.
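
The suggested utility is essentially a named wrapper around the join (a sketch, not the merged code):

```go
package mcm

import "strings"

// getMarkedForDeletionAnnotationValue builds the annotation value from machine names,
// keeping the comma-separated format used throughout this PR in one place.
func getMarkedForDeletionAnnotationValue(machineNames []string) string {
	return strings.Join(machineNames, ",")
}
```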

klog.Errorf("[Refresh] failed to get machines for machine deployment %s, hence skipping it. Err: %v", machineDeployment.Name, err.Error())
return err
}
var incorrectlyMarkedMachines []*types.NamespacedName


minor (need not correct): types.NamespacedName is a simple struct, i.e. a value object. Modern programming practice discourages pointers to value objects, so that data stays inline for processor cache performance. Pointers should be reserved for structs holding active resources (locks, files, etc.) or for class/service objects. Newer code should use a plain []types.NamespacedName with types.NamespacedName elements instead of *types.NamespacedName.
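
A small illustration of the value-object style being recommended (the function name and values are made up for the example):

```go
package mcm

import "k8s.io/apimachinery/pkg/types"

func collectIncorrectlyMarked() []types.NamespacedName {
	// Value slice keeps the structs inline in memory; no per-element pointer chasing.
	var incorrectlyMarkedMachines []types.NamespacedName
	incorrectlyMarkedMachines = append(incorrectlyMarkedMachines,
		types.NamespacedName{Namespace: "shoot--example", Name: "machine-1"}, // illustrative values
	)
	return incorrectlyMarkedMachines
	// Pointer form discouraged by the comment for plain value objects:
	// var incorrectlyMarkedMachines []*types.NamespacedName
}
```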

@@ -539,12 +567,17 @@ func buildMachineDeploymentFromSpec(value string, mcmManager *McmManager) (*Mach
return machinedeployment, nil
}

func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] {


Rename the function to getMachineNamesMarkedByCAForDeletion.

@@ -508,27 +474,32 @@ func (m *McmManager) DeleteMachines(targetMachineRefs []*Ref) error {
if !isRollingUpdateFinished(md) {
return fmt.Errorf("MachineDeployment %s is under rolling update , cannot reduce replica count", commonMachineDeployment.Name)
}
markedMachines := getMachinesMarkedByCAForDeletion(md)
@elankath Dec 26, 2024

markedMachines -> machineNamesMarkedByCA

@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Feb 3, 2025
nodeNames := getNodeNames(nodes)
klog.V(0).Infof("Received request to delete nodes:- %v", nodeNames)
size, err := machinedeployment.mcmManager.GetMachineDeploymentSize(machinedeployment)
func (ngImpl *nodeGroup) DeleteNodes(nodes []*apiv1.Node) error {
rishabh-11 (Author):

Consider multiple calls to DeleteNodes for the same node object N1. I am not able to find the place where we are making sure that this does not cause multiple scale downs. Can you please show me where we have handled this case?


@rishabh-11 I have now separated the computation logic into computeScaledownData inside mcm_manager.go and added unit tests for duplicate and overlapping scale-downs.

@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Feb 3, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 3, 2025
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 3, 2025

elankath commented Feb 3, 2025

@rishabh-11 Unfortunately the PR is failing the license/cla check since this was originally your pull request. Can you check whether re-accepting https://cla-assistant.io/gardener/autoscaler?pullRequest=341 solves it?

@@ -1075,3 +1075,35 @@ func filterExtendedResources(allResources v1.ResourceList) (extendedResources v1
})
return
}

// computeScaledownData computes fresh scaleDownData for the given MachineDeployment given the machineNamesForDeletion
func computeScaledownData(md *v1alpha1.MachineDeployment, machineNamesForDeletion []string) (data scaleDownData) {


Can you rename the method to computeScaleDownData?


corrected.

alreadyMarkedSet := sets.New(getMachineNamesTriggeredForDeletion(md)...)

uniqueForDeletionSet := forDeletionSet.Difference(alreadyMarkedSet)
toBeMarkedSet := alreadyMarkedSet.Union(forDeletionSet)


If you are using sets of strings, why do you need to take a difference and then a union later on?
You can just take the existing set and insert all the new names; sets ensure there are no duplicates anyway.

@elankath Feb 4, 2025

I don't understand: it makes no difference, since sets.Insert(slice...) will return a new set anyway. We are using Union because the parameter is a set and not a slice.
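
For readers following the thread, a tiny self-contained illustration (machine names made up) of why the two formulations end up with the same set, and why the Difference is still computed separately:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	alreadyMarked := sets.New("machine-1", "machine-2")
	forDeletion := sets.New("machine-2", "machine-3")

	// Union vs. cloning and inserting: same resulting set either way.
	byUnion := alreadyMarked.Union(forDeletion)
	byInsert := alreadyMarked.Clone().Insert(forDeletion.UnsortedList()...)
	fmt.Println(byUnion.Equal(byInsert)) // true

	// The Difference serves a different purpose: only the newly requested
	// machines count towards the revised scale-down amount.
	fmt.Println(sets.List(forDeletion.Difference(alreadyMarked))) // [machine-3]
}
```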

toBeMarkedSet := alreadyMarkedSet.Union(forDeletionSet)

data.RevisedToBeDeletedNames = uniqueForDeletionSet
data.RevisedScaledownAmount = uniqueForDeletionSet.Len()


What is the purpose of RevisedScaledownAmount if it's always going to be the length of RevisedToBeDeletedNames?

@elankath Feb 4, 2025

It was just for legibility, so callers can simply use scaleDownData.RevisedScaledownAmount. It's just a single int in the output type, there to offer clarity in consumption and logging.

} else if expectedReplicas < 0 {
klog.Errorf("Cannot delete machines in MachineDeployment %q, expected decrease in replicas: %d is more than current replicas: %d", md.Name, data.RevisedScaledownAmount, md.Spec.Replicas)
} else {
mdCopy := md.DeepCopy()


Is there a need to create a MachineDeployment copy and keep it in the scaleDownData struct?
This could easily have been done by the caller, and given that there is a single caller of this method, it was not really necessary.

@elankath Feb 4, 2025

Yes, I believe this is needed for computational symmetry and to coalesce responsibility. The function takes an input MachineDeployment and the requested deletion names, and returns a scaleDownData output that contains the revised MachineDeployment. The caller's job is only to update the store, not to construct or modify the MachineDeployment. If there are further changes to the revised MachineDeployment, they will be fully encapsulated within computeScaleDownData.
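
Putting the fragments quoted in this conversation together, a hedged reconstruction of computeScaleDownData could look like the following; the RevisedMachineDeployment field name, the local annotation constant standing in for machineutils.TriggerDeletionByMCM, and the helper at the bottom are assumptions for illustration, not the merged code:

```go
package mcm

import (
	"strings"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/klog/v2"
)

// triggerDeletionByMCM stands in for machineutils.TriggerDeletionByMCM; value per the PR description.
const triggerDeletionByMCM = "node.machine.sapcloud.io/trigger-deletion-by-mcm"

// scaleDownData carries the outcome of planning a scale-down for one MachineDeployment.
type scaleDownData struct {
	RevisedToBeDeletedNames  sets.Set[string]
	RevisedScaledownAmount   int
	RevisedMachineDeployment *v1alpha1.MachineDeployment // nil when nothing needs to change
}

// computeScaleDownData computes fresh scale-down data for the given MachineDeployment
// and the machine names requested for deletion. Machines already recorded in the
// trigger-deletion annotation are ignored, which makes duplicate and overlapping
// DeleteNodes calls idempotent.
func computeScaleDownData(md *v1alpha1.MachineDeployment, machineNamesForDeletion []string) (data scaleDownData) {
	forDeletionSet := sets.New(machineNamesForDeletion...)
	alreadyMarkedSet := sets.New(getMachineNamesTriggeredForDeletion(md)...)

	uniqueForDeletionSet := forDeletionSet.Difference(alreadyMarkedSet)
	toBeMarkedSet := alreadyMarkedSet.Union(forDeletionSet)

	data.RevisedToBeDeletedNames = uniqueForDeletionSet
	data.RevisedScaledownAmount = uniqueForDeletionSet.Len()

	expectedReplicas := md.Spec.Replicas - int32(data.RevisedScaledownAmount)
	if expectedReplicas == md.Spec.Replicas {
		// Every requested machine is already marked: a duplicate request, nothing to update.
	} else if expectedReplicas < 0 {
		klog.Errorf("Cannot delete machines in MachineDeployment %q, expected decrease in replicas: %d is more than current replicas: %d", md.Name, data.RevisedScaledownAmount, md.Spec.Replicas)
	} else {
		mdCopy := md.DeepCopy()
		mdCopy.Spec.Replicas = expectedReplicas
		if mdCopy.Annotations == nil {
			mdCopy.Annotations = map[string]string{}
		}
		mdCopy.Annotations[triggerDeletionByMCM] = strings.Join(toBeMarkedSet.UnsortedList(), ",")
		data.RevisedMachineDeployment = mdCopy
	}
	return
}

// getMachineNamesTriggeredForDeletion reads the annotation back into a slice (assumed helper).
func getMachineNamesTriggeredForDeletion(md *v1alpha1.MachineDeployment) []string {
	value, ok := md.Annotations[triggerDeletionByMCM]
	if !ok || value == "" {
		return nil
	}
	return strings.Split(value, ",")
}
```

As described at the top of the PR, the MCM controller then reconciles the updated MachineDeployment and sets the MachinePriority=1 annotation on the listed machines.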

@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 4, 2025
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 4, 2025
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Feb 4, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 4, 2025
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 4, 2025

elankath commented Feb 4, 2025

CA Integration Test Run

Integration Test Run of specs for standard scale-up, scale-down, test scale-down disabled

Will run 5 of 5 specs
------------------------------
[BeforeSuite] 
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:26
Scaling Cluster Autoscaler to 0 replicas
  STEP: Adjusting node groups to initial required size @ 02/05/25 01:54:25.024
  STEP: Marking nodes present before the tests as unschedulable @ 02/05/25 01:59:00.273
  STEP: Taint node ip-10-180-24-5.eu-west-1.compute.internal @ 02/05/25 01:59:00.469
  STEP: Starting Cluster Autoscaler.... @ 02/05/25 01:59:01.534
[BeforeSuite] PASSED [279.858 seconds]
------------------------------
Cluster Autoscaler test Scale up and down nodes by deploying new workload requesting more resources should not lead to any errors and add 1 more node in target cluster
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:96
  STEP: Deploying workload... @ 02/05/25 01:59:01.534
  STEP: Validating Scale up @ 02/05/25 01:59:01.731
• [104.484 seconds]
------------------------------
Cluster Autoscaler test Scale up and down nodes by scaling deployed workload to 3 replicas should not lead to any errors and add 3 more node in target cluster
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:111
  STEP: Scaling up workload to 3 replicas... @ 02/05/25 02:00:46.019
  STEP: Validating Scale up @ 02/05/25 02:00:46.403
• [106.381 seconds]
------------------------------
Cluster Autoscaler test Scale up and down nodes by scaling down the deployed workload to 0 should not lead to any errors and 3 nodes to be removed from the target cluster
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:126
  STEP: Scaling down workload to zero... @ 02/05/25 02:02:32.401
  STEP: Validating Scale down @ 02/05/25 02:02:33.192
  STEP: Deleting workload @ 02/05/25 02:03:47.252
  STEP: Checking that number of Ready nodes is equal to initial @ 02/05/25 02:03:47.444
• [75.240 seconds]
------------------------------
Cluster Autoscaler test testing annotation to skip scaledown by adding annotation and then scaling the workload to zero should not scale down the extra node and should log correspondingly
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:143
  STEP: adding the annotation after deploy workload to 1 @ 02/05/25 02:03:47.642
  STEP: Validating Scale up @ 02/05/25 02:03:47.838
  STEP: getting the latest added node and adding annotation to it. @ 02/05/25 02:05:32.014
  STEP: Scaling down workload to zero... @ 02/05/25 02:05:32.416
• [105.185 seconds]
------------------------------
Cluster Autoscaler test testing annotation to skip scaledown by adding annotation and then scaling the workload to zero Should remove the unwanted node once scale down disable annotation is removed
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:168
  STEP: getting the latest added node and removing annotation from it. @ 02/05/25 02:05:33.033
  STEP: Validating Scale down @ 02/05/25 02:05:33.431
  STEP: Deleting workload @ 02/05/25 02:06:36.89
  STEP: Checking that number of Ready nodes is equal to initial @ 02/05/25 02:06:37.098
• [64.474 seconds]
------------------------------
[AfterSuite] 
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:27
  STEP: Running CleanUp @ 02/05/25 02:06:37.302
  STEP: Deleting workload @ 02/05/25 02:06:37.302
  STEP: Adjusting node groups to initial required size @ 02/05/25 02:06:39.047
  STEP: Deleting workload @ 02/05/25 02:06:39.424
  STEP: Turning nodes present before the tests, back to schedulable @ 02/05/25 02:06:39.613
  STEP: Removing taint(s) from node ip-10-180-24-5.eu-west-1.compute.internal @ 02/05/25 02:06:39.81
  STEP: Removing taint(s) from node ip-10-180-74-58.eu-west-1.compute.internal @ 02/05/25 02:06:40.209
  STEP: Scaling CA back up to 1 in the Shoot namespace @ 02/05/25 02:06:40.629
Scaling Cluster Autoscaler to 1 replicas
[AfterSuite] PASSED [4.178 seconds]
------------------------------

Ran 5 of 5 Specs in 739.801 seconds
SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestIntegration (739.80s)
PASS


elankath commented Feb 4, 2025

CA Integration Test Run

CA Integration Test Run output for NodeGroup min-max, limits, taint-causing-scaling, etc

Will run 6 of 6 specs
------------------------------
[BeforeSuite] 
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:25
Scaling Cluster Autoscaler to 0 replicas
  STEP: Adjusting node groups to initial required size @ 02/05/25 02:08:30.115
  STEP: Marking nodes present before the tests as unschedulable @ 02/05/25 02:08:30.496
  STEP: Taint node ip-10-180-24-5.eu-west-1.compute.internal @ 02/05/25 02:08:30.691
  STEP: Taint node ip-10-180-74-58.eu-west-1.compute.internal @ 02/05/25 02:08:31.086
  STEP: Starting Cluster Autoscaler.... @ 02/05/25 02:08:31.832
[BeforeSuite] PASSED [5.155 seconds]
------------------------------
Cluster Autoscaler test testing min and max limit for Cluster autoscaler by increasing the workload to above max shouldn't scale beyond max number of workers
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:188
  STEP: Deploying workload with replicas = max+4 @ 02/05/25 02:08:31.833
  STEP: Validating Scale up @ 02/05/25 02:08:32.03
• [105.186 seconds]
------------------------------
Cluster Autoscaler test testing min and max limit for Cluster autoscaler by decreasing the workload to below min shouldn't scale down beyond min number of workers
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:200
  STEP: Scaling down workload to zero... @ 02/05/25 02:10:17.019
  STEP: Validating Scale down @ 02/05/25 02:10:17.403
  STEP: Deleting workload @ 02/05/25 02:11:31.516
  STEP: Checking that number of Ready nodes is equal to initial @ 02/05/25 02:11:31.709
• [74.891 seconds]
------------------------------
Cluster Autoscaler test testing scaling due to taints make current nodes unschedulable should spawn more nodes for accommodating new workload
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:216
  STEP: making the only node available, to be unschedulable @ 02/05/25 02:11:31.911
  STEP: Taint node ip-10-180-74-58.eu-west-1.compute.internal @ 02/05/25 02:11:32.102
  STEP: Increasing the workload @ 02/05/25 02:11:32.502
  STEP: Validating Scale up @ 02/05/25 02:11:32.698
• [104.870 seconds]
------------------------------
Cluster Autoscaler test testing scaling due to taints make current nodes unschedulable should remove the node as the taint has been removed and node has low utilization
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:233
  STEP: making the node available to be schedulable @ 02/05/25 02:13:16.782
  STEP: Removing taint(s) from node ip-10-180-74-58.eu-west-1.compute.internal @ 02/05/25 02:13:16.978
  STEP: Validating Scale down @ 02/05/25 02:13:17.38
  STEP: Deleting workload @ 02/05/25 02:14:20.897
  STEP: Checking that number of Ready nodes is equal to initial @ 02/05/25 02:14:21.092
• [64.509 seconds]
------------------------------
Cluster Autoscaler test testing scaling due to volume pending create a volume in a zone with no node and a pod requesting it should create a node in the zone with volume and hence scale up
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:252
  STEP: Creating StorageClass with topology restrictions @ 02/05/25 02:14:21.291
  STEP: deploying PVC in zone with no nodes @ 02/05/25 02:14:21.483
  STEP: deploying the workload which requires the volume @ 02/05/25 02:14:21.678
  STEP: Validation scale up to +1 in a new Zone @ 02/05/25 02:14:21.872
  STEP: Removing pvc created earlier @ 02/05/25 02:15:55.578
  STEP: Removing storage Class created earlier @ 02/05/25 02:15:55.771
  STEP: Deleting workload @ 02/05/25 02:15:55.964
  STEP: Checking that number of Ready nodes is equal to initial @ 02/05/25 02:15:56.165
• [158.437 seconds]
------------------------------
Cluster Autoscaler test testing not able to scale due to excess demand create a pod requiring more resources than a single machine can provide shouldn't scale up and log the error
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:320
  STEP: Deploying the workload @ 02/05/25 02:16:59.669
  STEP: checking that scale up didn't trigger because of no machine satisfying the requirement @ 02/05/25 02:16:59.866
  STEP: Deleting workload @ 02/05/25 02:17:09.911
  STEP: Checking that number of Ready nodes is equal to initial @ 02/05/25 02:17:10.119
• [10.842 seconds]
------------------------------
[AfterSuite] 
/Users/I034796/go/src/k8s.io/autoscaler/cluster-autoscaler/integration/integration_test.go:26
  STEP: Running CleanUp @ 02/05/25 02:17:10.511
  STEP: Deleting workload @ 02/05/25 02:17:10.511
  STEP: Adjusting node groups to initial required size @ 02/05/25 02:17:12.652
  STEP: Deleting workload @ 02/05/25 02:17:13.04
  STEP: Turning nodes present before the tests, back to schedulable @ 02/05/25 02:17:13.229
  STEP: Removing taint(s) from node ip-10-180-24-5.eu-west-1.compute.internal @ 02/05/25 02:17:13.432
  STEP: Removing taint(s) from node ip-10-180-74-58.eu-west-1.compute.internal @ 02/05/25 02:17:13.836
  STEP: Scaling CA back up to 1 in the Shoot namespace @ 02/05/25 02:17:14.24
Scaling Cluster Autoscaler to 1 replicas
[AfterSuite] PASSED [4.570 seconds]
------------------------------

Ran 6 of 6 Specs in 528.462 seconds
SUCCESS! -- 6 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestIntegration (528.46s)
PASS

@aaronfern aaronfern merged commit edd7d34 into gardener:machine-controller-manager-provider Feb 4, 2025
9 of 10 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Feb 4, 2025
aaronfern pushed a commit to aaronfern/autoscaler that referenced this pull request Feb 4, 2025

* [WIP] fix CA marking machines for deletion

* [WIP] add mutex for machine deployment

* initialise machinedeployment map in mcmManager

* add Refresh method in nodegrp implementation

* address review comments

* address review comments - part 2

* update unit tests, misc code changes

* review comments addressed

* fixed broken test after refactor

* fixed broken test

* alternate solution using single annotation for deletion by CA

* fixed use of Go 1.23 functions

* fixed test

* added godoc for AcquireScalingMutex

* correct godoc for machinesMarkedByCAForDeletionAnnotation

* changed to machineutils.TriggerDeletionByMCM

* removed var shadowing, added TODO to godoc for util fns

* corr var names, clear mutex acquire, releasing logs

* ensured IncreaseSize/DecreaseTargetSize logged  delta in mutex acquire/release log

* review comments addressed: fixed unit test, adjusted log, removed redundant fn

* all code review comments addressed

* review feedback: unexport belongs, enforce interface

* addreseed review comment, refactored added test for computeScaleDownData

* review comment: preset capacity for slices

* review feedback: ->computeScaleDownData and revised godoc

* cordon nodes fix for when node is disabled for scaledown

* fixed unit test string quoting issue

---------

Co-authored-by: Rishabh Patel <[email protected]>
Co-authored-by: elankath <[email protected]>
aaronfern added a commit that referenced this pull request Feb 4, 2025
* [WIP] fix CA marking machines for deletion

* [WIP] add mutex for machine deployment

* initialise machinedeployment map in mcmManager

* add Refresh method in nodegrp implementation

* address review comments

* address review comments - part 2

* update unit tests, misc code changes

* review comments addressed

* fixed broken test after refactor

* fixed broken test

* alternate solution using single annotation for deletion by CA

* fixed use of Go 1.23 functions

* fixed test

* added godoc for AcquireScalingMutex

* correct godoc for machinesMarkedByCAForDeletionAnnotation

* changed to machineutils.TriggerDeletionByMCM

* removed var shadowing, added TODO to godoc for util fns

* corr var names, clear mutex acquire, releasing logs

* ensured IncreaseSize/DecreaseTargetSize logged  delta in mutex acquire/release log

* review comments addressed: fixed unit test, adjusted log, removed redundant fn

* all code review comments addressed

* review feedback: unexport belongs, enforce interface

* addreseed review comment, refactored added test for computeScaleDownData

* review comment: preset capacity for slices

* review feedback: ->computeScaleDownData and revised godoc

* cordon nodes fix for when node is disabled for scaledown

* fixed unit test string quoting issue

---------

Co-authored-by: Rishabh Patel <[email protected]>
Co-authored-by: Rishabh Patel <[email protected]>
Co-authored-by: elankath <[email protected]>
Labels
needs/changes Needs (more) changes
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD)
needs/review Needs review
needs/second-opinion Needs second review by someone else
size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py)
status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CA Issues invalid scale downs due to scale-down processing delay in the MCM
9 participants