Fix Race conditions in Targeted Deletion of machines by CA #341
Conversation
klog.Infof("machine deployment %s is under rolling update, skipping", machineDeployment.Name)
return nil
}
markedMachines := sets.New(strings.Split(mcd.Annotations[machinesMarkedByCAForDeletion], ",")...) |
Please move this to a small function mcm.getMachinesMarkedByCAForDeletion(mcd) (machineNames sets.Set[string]) which is unit testable and can be consumed by both mcm_cloud_provider.go and mcm_manager.go.
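A minimal sketch of such a helper, assuming the existing machinesMarkedByCAForDeletion annotation key constant and the generic sets package from k8s.io/apimachinery (the import paths shown are assumptions):

```go
import (
	"strings"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// getMachinesMarkedByCAForDeletion returns the machine names stored in the
// machinesMarkedByCAForDeletion annotation of the given MachineDeployment.
func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] {
	return sets.New(strings.Split(mcd.Annotations[machinesMarkedByCAForDeletion], ",")...)
}
```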
klog.Errorf("[Refresh] failed to get machines for machine deployment %s, hence skipping it. Err: %v", machineDeployment.Name, err.Error())
return err
}
var incorrectlyMarkedMachines []*Ref |
I believe we can omit the Ref struct and just use types.NamespacedName, which is already being used in this file.
@@ -539,12 +567,17 @@ func buildMachineDeploymentFromSpec(value string, mcmManager *McmManager) (*Mach
return machinedeployment, nil
}

func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] { |
Perhaps using the "comma, ok" idiom here for mcd.Annotations[machinesMarkedByCAForDeletion] would be good for clarity, to avoid an unnecessary split of the empty string.
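A sketch of the same helper using the "comma, ok" idiom, so a missing or empty annotation yields an empty set rather than a set containing one empty string (same assumed imports as above):

```go
func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] {
	annotValue, ok := mcd.Annotations[machinesMarkedByCAForDeletion]
	if !ok || annotValue == "" {
		return sets.New[string]()
	}
	return sets.New(strings.Split(annotValue, ",")...)
}
```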
Isn't it allowed to read from a nil map? Only writes will cause panic, right?
getMachinesMarkedByCAForDeletion -> getMachineNamesMarkedByCAForDeletion
// ignore the machine deployment if it is in rolling update
if !isRollingUpdateFinished(mcd) {
	klog.Infof("machine deployment %s is under rolling update, skipping", machineDeployment.Name)
return nil |
Shouldn't we return an error here? We are doing it for the other cases of !isRollingUpdateFinished.
I think we should remove this check. Even if a machineDeployment is under rolling update, we should allow the annotation update if needed. wdyt?
// addNodeGroup adds node group defined in string spec. Format:
// minNodes:maxNodes:namespace.machineDeploymentName
func (m *McmManager) addNodeGroup(spec string) error {
machineDeployment, err := buildMachineDeploymentFromSpec(spec, m) |
buildMachineDeploymentFromSpec should also be moved to mcm_manager.go.
func (machineDeployment *MachineDeployment) Refresh() error {
	machineDeployment.scalingMutex.Lock()
	defer machineDeployment.scalingMutex.Unlock()
mcd, err := machineDeployment.mcmManager.machineDeploymentLister.MachineDeployments(machineDeployment.Namespace).Get(machineDeployment.Name) |
This is repeated nearly a dozen times everywhere, including the common error handling: machineDeployment.mcmManager.machineDeploymentLister.MachineDeployments(machineDeployment.Namespace).Get(machineDeployment.Name). Move it to a method on mcmManager called GetMachineDeploymentResource which returns a formatted error that can simply be returned, so that the error message is fully consistent across all uses. We already have methods like mcmManager.getMachinesForMachineDeployment, so this matches the existing convention.
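A rough sketch of the suggested accessor; only the method name comes from the comment above, the parameters and error wording are illustrative and assume the standard fmt import:

```go
// GetMachineDeploymentResource fetches the MachineDeployment with the given namespace/name
// from the lister and wraps a lookup failure in a consistently formatted error.
func (m *McmManager) GetMachineDeploymentResource(namespace, name string) (*v1alpha1.MachineDeployment, error) {
	mcd, err := m.machineDeploymentLister.MachineDeployments(namespace).Get(name)
	if err != nil {
		return nil, fmt.Errorf("unable to fetch MachineDeployment %s/%s: %w", namespace, name, err)
	}
	return mcd, nil
}
```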
if machine.Status.CurrentStatus.Phase == v1alpha1.MachineTerminating || machine.Status.CurrentStatus.Phase == v1alpha1.MachineFailed {
	continue
}
if annotValue, ok := machine.Annotations[machinePriorityAnnotation]; ok && annotValue == priorityValueForCandidateMachines && !markedMachines.Has(machine.Name) { |
Out of curiosity, why is this called priorityValueForCandidateMachines? Shouldn't it be defined as const PriorityDeletionValueForCandidateMachine = 1?
I think we can rename it to PriorityValueForDeletionCandidateMachine. wdyt?
We should also add a comment here to explain what we are doing, so the next person making a patch doesn't have to scratch their head.
	}
}
clone := mcd.DeepCopy()
clone.Annotations[machinesMarkedByCAForDeletion] = strings.Join(updatedMarkedMachines, ",") |
This strings.Join to construct the annotation value is repeated elsewhere. Make a utility method called getMarkedForDeletionAnnotationValue(machineNames []string) string.
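A sketch of the suggested utility, the counterpart of the split helper discussed earlier (assumes the standard strings import):

```go
// getMarkedForDeletionAnnotationValue builds the comma-separated annotation value
// from the given machine names.
func getMarkedForDeletionAnnotationValue(machineNames []string) string {
	return strings.Join(machineNames, ",")
}
```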
klog.Errorf("[Refresh] failed to get machines for machine deployment %s, hence skipping it. Err: %v", machineDeployment.Name, err.Error())
return err
}
var incorrectlyMarkedMachines []*types.NamespacedName |
minor (need not correct): types.NamespacedName is a simple struct - a value object. Modern programming practice discourages using pointers to value objects, so that the data stays inline for processor cache performance. Pointers should be used only for structs holding an active resource (like locks/files/etc) or for class/service objects. Using a simple []types.NamespacedName, with types.NamespacedName instead of *types.NamespacedName, should be followed for newer code.
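For illustration, a value-based variant of the collection above; the helper name and the Machine slice parameter are assumptions, types.NamespacedName comes from k8s.io/apimachinery/pkg/types:

```go
// namespacedNamesOf returns NamespacedNames for the given machines as plain values,
// keeping the slice entries inline rather than behind pointers.
func namespacedNamesOf(machines []*v1alpha1.Machine) []types.NamespacedName {
	names := make([]types.NamespacedName, 0, len(machines))
	for _, machine := range machines {
		names = append(names, types.NamespacedName{Namespace: machine.Namespace, Name: machine.Name})
	}
	return names
}
```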
@@ -539,12 +567,17 @@ func buildMachineDeploymentFromSpec(value string, mcmManager *McmManager) (*Mach
return machinedeployment, nil
}

func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] { |
Rename the function to getMachineNamesMarkedByCAForDeletion.
@@ -508,27 +474,32 @@ func (m *McmManager) DeleteMachines(targetMachineRefs []*Ref) error {
	if !isRollingUpdateFinished(md) {
		return fmt.Errorf("MachineDeployment %s is under rolling update , cannot reduce replica count", commonMachineDeployment.Name)
	}
markedMachines := getMachinesMarkedByCAForDeletion(md) |
markedMachines -> machineNamesMarkedByCA
nodeNames := getNodeNames(nodes)
klog.V(0).Infof("Received request to delete nodes:- %v", nodeNames)
size, err := machinedeployment.mcmManager.GetMachineDeploymentSize(machinedeployment)
func (ngImpl *nodeGroup) DeleteNodes(nodes []*apiv1.Node) error { |
Consider multiple calls to DeleteNodes for the same node object N1. I am not able to find the place where we are making sure that this does not cause multiple scale-downs. Can you please show me where we have handled this case?
@rishabh-11 I have now separated the computation logic into computeScaledownData inside mcm_manager.go and added unit tests for duplicate and overlapping scale-downs.
@rishabh-11 unfortunately the PR is failing due to …
@@ -1075,3 +1075,35 @@ func filterExtendedResources(allResources v1.ResourceList) (extendedResources v1
	})
	return
}

// computeScaledownData computes fresh scaleDownData for the given MachineDeployment given the machineNamesForDeletion
func computeScaledownData(md *v1alpha1.MachineDeployment, machineNamesForDeletion []string) (data scaleDownData) { |
Can you rename the method to computeScaleDownData?
corrected.
alreadyMarkedSet := sets.New(getMachineNamesTriggeredForDeletion(md)...)

uniqueForDeletionSet := forDeletionSet.Difference(alreadyMarkedSet)
toBeMarkedSet := alreadyMarkedSet.Union(forDeletionSet) |
if you are using sets of strings then why do you need to get a difference and then take a union later on?
You can just take the existing and insert all new ones. Sets will anyways ensure that there are no duplicates.
I don't understand - it makes no difference, since sets.Insert(slice...) will return a new set anyways. We are using Union since the parameter is a set and not a slice.
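For context, a hedged sketch of the set bookkeeping being debated here, extracted from the snippets above; getMachineNamesTriggeredForDeletion and the parameter names follow the diff, the wrapper function itself is illustrative:

```go
// scaleDownSets shows why both Difference and Union are taken: the difference yields
// the names that newly reduce the replica count, while the union yields the complete
// set of names that must end up in the trigger-deletion annotation.
func scaleDownSets(md *v1alpha1.MachineDeployment, machineNamesForDeletion []string) (unique, toBeMarked sets.Set[string]) {
	alreadyMarkedSet := sets.New(getMachineNamesTriggeredForDeletion(md)...)
	forDeletionSet := sets.New(machineNamesForDeletion...)
	// names not already marked are the only ones that should further reduce replicas
	unique = forDeletionSet.Difference(alreadyMarkedSet)
	// the annotation keeps previously marked names and adds the new ones
	toBeMarked = alreadyMarkedSet.Union(forDeletionSet)
	return unique, toBeMarked
}
```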
toBeMarkedSet := alreadyMarkedSet.Union(forDeletionSet)

data.RevisedToBeDeletedNames = uniqueForDeletionSet
data.RevisedScaledownAmount = uniqueForDeletionSet.Len() |
What is the purpose of RevisedScaledownAmount if it's always going to be the length of RevisedToBeDeletedNames?
It was just for documentation and legibility, so callers can simply use scaleDownData.RevisedScaledownAmount. It's just a single int in the output type, added for clarity in consumption and logging.
} else if expectedReplicas < 0 {
	klog.Errorf("Cannot delete machines in MachineDeployment %q, expected decrease in replicas: %d is more than current replicas: %d", md.Name, data.RevisedScaledownAmount, md.Spec.Replicas)
} else {
mdCopy := md.DeepCopy() |
Is there a need to create a MachineDeployment copy and keep it in the scaleDownData struct? This could have been easily done by the caller, and given that there is a single caller of this method, it was not really necessary.
Yes, I believe this is needed for computational symmetry and coalescing of responsibility. This function takes an input MachineDeployment and an input parameter, and returns an output parameter scaleDownData that contains a revised MachineDeployment. The caller's job is only to update the store, not to do any construction/modification of the MachineDeployment. If there are further changes to the revised MachineDeployment, they will be fully encapsulated within computeScaleDownData.
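For reference, a hedged sketch of what such an output struct might look like; RevisedToBeDeletedNames and RevisedScaledownAmount come from the snippets above, the field name for the revised MachineDeployment copy is illustrative:

```go
// scaleDownData is the result of computeScaleDownData: the names newly marked for
// deletion, the corresponding reduction in replicas, and the already revised
// MachineDeployment copy that the caller only needs to write back to the store.
type scaleDownData struct {
	RevisedToBeDeletedNames  sets.Set[string]
	RevisedScaledownAmount   int
	RevisedMachineDeployment *v1alpha1.MachineDeployment
}
```

With this shape, the single caller would only persist RevisedMachineDeployment (when non-nil) and log RevisedScaledownAmount, keeping all construction of the revised object inside computeScaleDownData.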
CA Integration Test Run: output of specs for standard scale-up, scale-down, and scale-down disabled
CA Integration Test Run: output for NodeGroup min-max, limits, taint-causing-scaling, etc.
edd7d34 into gardener:machine-controller-manager-provider
* [WIP] fix CA marking machines for deletion
* [WIP] add mutex for machine deployment
* initialise machinedeployment map in mcmManager
* add Refresh method in nodegrp implementation
* address review comments
* address review comments - part 2
* update unit tests, misc code changes
* review comments addressed
* fixed broken test after refactor
* fixed broken test
* alternate solution using single annotation for deletion by CA
* fixed use of Go 1.23 functions
* fixed test
* added godoc for AcquireScalingMutex
* correct godoc for machinesMarkedByCAForDeletionAnnotation
* changed to machineutils.TriggerDeletionByMCM
* removed var shadowing, added TODO to godoc for util fns
* corr var names, clear mutex acquire, releasing logs
* ensured IncreaseSize/DecreaseTargetSize logged delta in mutex acquire/release log
* review comments addressed: fixed unit test, adjusted log, removed redundant fn
* all code review comments addressed
* review feedback: unexport belongs, enforce interface
* addreseed review comment, refactored added test for computeScaleDownData
* review comment: preset capacity for slices
* review feedback: ->computeScaleDownData and revised godoc
* cordon nodes fix for when node is disabled for scaledown
* fixed unit test string quoting issue

---------

Co-authored-by: Rishabh Patel <[email protected]>
Co-authored-by: Rishabh Patel <[email protected]>
Co-authored-by: elankath <[email protected]>
What this PR does / why we need it:
This PR fixes the issues noticed in live issues 6120 and 6101. We introduce a mutex in the NodeGroupImpl struct (the node group implementation) which must be acquired before performing a scale-down or scale-up operation. We also introduce an annotation on the MachineDeployment, TriggerDeletionByMCM = "node.machine.sapcloud.io/trigger-deletion-by-mcm", whose value denotes the machines CA wants to remove; the replica count and the annotation are then updated atomically as one single step. This helps recognize machines for which CA has already reduced the replicas of the MachineDeployment and prevents duplication - race conditions are avoided when CA scale-downs occur in parallel. The MCM is also updated to check this TriggerDeletionByMCM annotation on the MachineDeployment. The CA no longer directly sets the MachinePriority=1 annotation on the Machine objects; this is now done by the MCM controller when reconciling the updated MachineDeployment.
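A simplified sketch of that atomic step, assuming a typed MachineDeployment client on the manager (m.machineClient) and inlining the annotation key; names and fields are illustrative, not the exact code in this PR:

```go
// markMachinesForDeletion records the machines CA wants removed in the
// TriggerDeletionByMCM annotation and lowers the replica count in the same
// Update call, so annotation and replicas cannot diverge even when
// scale-downs run in parallel.
func (m *McmManager) markMachinesForDeletion(ctx context.Context, md *v1alpha1.MachineDeployment, toDelete []string) error {
	clone := md.DeepCopy()
	if clone.Annotations == nil {
		clone.Annotations = map[string]string{}
	}
	clone.Annotations["node.machine.sapcloud.io/trigger-deletion-by-mcm"] = strings.Join(toDelete, ",")
	clone.Spec.Replicas -= int32(len(toDelete))
	_, err := m.machineClient.MachineDeployments(clone.Namespace).Update(ctx, clone, metav1.UpdateOptions{})
	return err
}
```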
Which issue(s) this PR fixes:
Fixes #342
Special notes for your reviewer:
Release note: