
Scale down of machineset during rolling update #826

Closed
mattburgess opened this issue Jun 6, 2023 · 2 comments
Labels
area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related kind/bug Bug priority/2 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments

@mattburgess

How to categorize this issue?

/area auto-scaling
/kind bug
/priority 2

What happened:

During a rolling update of our MachineDeployments, we saw a MachineSet very rapidly scale down to 0. This is very similar to #802 but happened in a slightly more gradual manner:

I0606 10:36:39.331659 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 0, deleting 3
I0606 10:36:26.699210 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 3, deleting 1
I0606 10:36:20.485488 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 4, deleting 7
I0606 10:36:09.951122 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 11, deleting 1
I0606 10:36:03.741863 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 12, deleting 9
I0606 10:35:44.584218 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 21, deleting 3
I0606 10:35:39.340453 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 24, deleting 5
I0606 10:35:32.284172 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 29, deleting 1
I0606 10:35:27.788659 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 30, deleting 1

What you expected to happen:

Cluster capacity should be maintained during a rolling update.

How to reproduce it (as minimally and precisely as possible):

Simply triggering a machine deployment is all that caused this. We can grab more contextual logs to help with a reproducer if need be.

Anything else we need to know?:

I'm not sure whether this is a CA bug rather than an MCM one, as I'm led to believe there have been/are a number of bugs whereby MCM and CA can step on each other's toes during a rolling update. We're currently unable to upgrade CA because we first have to migrate away from AWSMachineClasses to MachineClasses. That migration will itself cause us to run into this issue, which is a little frustrating.

Environment:

  • Kubernetes version (use kubectl version): 1.22.17
  • Cloud provider or hardware configuration: AWS
  • Others: MCM-0.48.2, CA-0.18.0
@mattburgess mattburgess added the kind/bug Bug label Jun 6, 2023
@gardener-robot gardener-robot added area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related priority/2 Priority (lower number equals higher priority) labels Jun 6, 2023
@mattburgess
Author

After some more investigation, it definitely looks like we're being hit by gardener/autoscaler#118 (and, as a consequence, gardener/autoscaler#181). In our particular scenario, during a MachineDeployment rolling update, AWS is unable to provision instances due to capacity issues in eu-central-1a. After 10 minutes those instances are detected as unregistered by CA, which then hits those bugs. Closing on this side; we'll eagerly watch those CA bugs for updates.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 8, 2023
@himanshu-kun
Contributor

History

There was an issue caused by CA-MCM not being able to correctly remove an unregistered machine in certain corner cases (a shortcoming of our CA-MCM interaction for targeted removal of a machine). If two MachineSets are present for a MachineDeployment (as during a rolling update), and CA reduces the replicas of the MachineDeployment to remove a particular machine, then MCM could scale down either MachineSet. This was dangerous and sometimes removed nodes that had already rolled to the latest version.
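The corner case above can be illustrated with a toy model (an illustrative sketch only; the names and the size-based victim selection are simplified assumptions, not MCM's actual reconciliation code): when a replica decrement on the MachineDeployment is distributed across two MachineSets without knowledge of which machine CA targeted, the new MachineSet can lose a freshly rolled node.

```python
# Toy model: a MachineDeployment backed by two MachineSets during a
# rolling update. CA decrements the deployment's replica count intending
# to remove one *old* machine, but the controller picks a MachineSet to
# shrink without that context. (Hypothetical sketch, not MCM's logic.)

def scale_down(machine_sets, target_replicas):
    """Remove machines until total replicas match the deployment spec.

    Shrinks the largest MachineSet first, which during a rollout may be
    the *new* one -- the dangerous behaviour described above.
    """
    total = sum(len(ms["machines"]) for ms in machine_sets)
    while total > target_replicas:
        victim_set = max(machine_sets, key=lambda ms: len(ms["machines"]))
        removed = victim_set["machines"].pop()
        print(f"removed {removed} from {victim_set['name']}")
        total -= 1
    return machine_sets

old_ms = {"name": "ms-old", "machines": ["old-0", "old-1"]}
new_ms = {"name": "ms-new", "machines": ["new-0", "new-1", "new-2"]}

# CA wanted to delete "old-1", but the scale-down removes a machine from
# the larger (new) MachineSet instead, evicting a freshly rolled node.
scale_down([old_ms, new_ms], target_replicas=4)
```

In this toy run the decrement lands on ms-new rather than the old MachineSet CA intended to shrink, which is why targeted removal needs both the CA-side and MCM-side changes.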

Steps taken

We tried to deal with this in the best way possible by making changes on two levels:

In your case, you are using a CA-MCM combination where the MCM change is present but the CA change is absent, so CA is scaling down during the rolling update and MCM is only removing from the old MachineSet. So it's not a recurrence of gardener/autoscaler#118.

We actively support the latest three CA versions, so kindly update to one of them, and your problem should be resolved.
