
Scale down of machineset during rolling update #826

Closed
mattburgess opened this issue Jun 6, 2023 · 2 comments
Labels
area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related kind/bug Bug priority/2 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments

@mattburgess

How to categorize this issue?

/area auto-scaling
/kind bug
/priority 2

What happened:

During a rolling update of our MachineDeployments, we saw a MachineSet very rapidly scale down to 0. This is very similar to #802 but happened in a slightly more gradual manner:

I0606 10:36:39.331659 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 0, deleting 3
I0606 10:36:26.699210 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 3, deleting 1
I0606 10:36:20.485488 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 4, deleting 7
I0606 10:36:09.951122 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 11, deleting 1
I0606 10:36:03.741863 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 12, deleting 9
I0606 10:35:44.584218 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 21, deleting 3
I0606 10:35:39.340453 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 24, deleting 5
I0606 10:35:32.284172 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 29, deleting 1
I0606 10:35:27.788659 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 30, deleting 1

What you expected to happen:

Cluster capacity should be maintained during a rolling update.

How to reproduce it (as minimally and precisely as possible):

Simply triggering a machine deployment is all that caused this. We can grab more contextual logs to help with a reproducer if need be.

Anything else we need to know?:

I'm not sure whether this is a CA bug rather than an MCM one, as I'm led to believe there have been/are a number of bugs whereby MCM and CA can step on each other's toes during a rolling update. We're currently unable to upgrade CA because we first have to migrate away from AWSMachineClasses to MachineClasses. That migration will itself cause us to run into this issue, which is a little frustrating.

Environment:

  • Kubernetes version (use kubectl version): 1.22.17
  • Cloud provider or hardware configuration: AWS
  • Others: MCM-0.48.2, CA-0.18.0
@mattburgess mattburgess added the kind/bug Bug label Jun 6, 2023
@gardener-robot gardener-robot added area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related priority/2 Priority (lower number equals higher priority) labels Jun 6, 2023
@mattburgess
Author

After some more investigation, it definitely looks like we're being hit by gardener/autoscaler#118 (and, as a consequence, gardener/autoscaler#181). In our particular scenario, during a MachineDeployment rolling update, AWS is unable to provision instances due to capacity issues in eu-central-1a. After 10 minutes those instances are detected as unregistered by CA, which then hits those bugs. Closing on this side; we'll eagerly watch those CA bugs for updates.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 8, 2023
@himanshu-kun
Contributor

History

There was an issue caused by CA-MCM not being able to correctly remove an unregistered machine in certain corner cases (a shortcoming of our CA-MCM interaction for targeted removal of a machine). If two MachineSets are present for a MachineDeployment (as during a rolling update), and CA reduces the replicas of the MachineDeployment to remove a particular machine, then MCM could scale down either MachineSet. This was dangerous and sometimes removed nodes that had already rolled to the latest version.
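The corner case above can be illustrated with a toy model (an illustrative sketch only; the names and the size-based victim selection are simplified assumptions, not MCM's actual reconciliation code): when a replica decrement on the MachineDeployment is distributed across two MachineSets without knowledge of which machine CA targeted, the new MachineSet can lose a freshly rolled node.

```python
# Toy model: a MachineDeployment backed by two MachineSets during a
# rolling update. CA decrements the deployment's replica count intending
# to remove one *old* machine, but the controller picks a MachineSet to
# shrink without that context. (Hypothetical sketch, not MCM's logic.)

def scale_down(machine_sets, target_replicas):
    """Remove machines until total replicas match the deployment spec.

    Shrinks the largest MachineSet first, which during a rollout may be
    the *new* one -- the dangerous behaviour described above.
    """
    total = sum(len(ms["machines"]) for ms in machine_sets)
    while total > target_replicas:
        victim_set = max(machine_sets, key=lambda ms: len(ms["machines"]))
        removed = victim_set["machines"].pop()
        print(f"removed {removed} from {victim_set['name']}")
        total -= 1
    return machine_sets

old_ms = {"name": "ms-old", "machines": ["old-0", "old-1"]}
new_ms = {"name": "ms-new", "machines": ["new-0", "new-1", "new-2"]}

# CA wanted to delete "old-1", but the scale-down removes a machine from
# the larger (new) MachineSet instead, evicting a freshly rolled node.
scale_down([old_ms, new_ms], target_replicas=4)
```

In this toy run the decrement lands on ms-new rather than the old MachineSet CA intended to shrink, which is why targeted removal needs both the CA-side and MCM-side changes.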

Steps taken

We tried to deal with this in the best way possible by making changes on two levels:

In your case, you are using a CA-MCM combination where the MCM change is present but the CA change is absent, so CA is scaling down during the rolling update and MCM is only removing from the old MachineSet. So it's not a recurrence of gardener/autoscaler#118.

We actively support the latest three CA versions, so kindly update to one of them, and your problem should be resolved.
