Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider failed machine as terminating #125

Conversation

himanshu-kun
Copy link

What this PR does / why we need it:
CA will now consider Failed machine as Terminating to avoid race with MCM in case of machine not joining in 20min.
Which issue(s) this PR fixes:
Fixes #118

Special notes for your reviewer:

Release note:

Race b/w MCM and Cluster Autoscaler in case of node joining timeout is fixed

/invite @unmarshall

@himanshu-kun himanshu-kun requested review from hardikdr and a team as code owners June 14, 2022 08:40
@gardener-robot gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Jun 14, 2022
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 14, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 14, 2022
@gardener-robot
Copy link

@hardikdr, @unmarshall You have pull request review open invite, please check

@himanshu-kun
Copy link
Author

himanshu-kun commented Jun 19, 2022

Logs like this would be shown for a node whose removal is skipped by CA to avoid race (machineDeployment replicas before is 2)

I0619 16:06:43.553497   21618 static_autoscaler.go:343] Some unregistered nodes were removed, skipping iteration
I0619 16:06:53.900302   21618 static_autoscaler.go:231] Starting main loop
I0619 16:06:53.902507   21618 clusterstate.go:602] Found longUnregistered Nodes [requested://shoot--i544024--ca-test-worker-1-z1-dff8d-tdcll]
I0619 16:06:53.902537   21618 static_autoscaler.go:335] 1 unregistered nodes present
I0619 16:06:53.902543   21618 static_autoscaler.go:611] Removing unregistered node requested://shoot--i544024--ca-test-worker-1-z1-dff8d-tdcll
I0619 16:06:53.902581   21618 mcm_manager.go:875] Machine "shoot--i544024--ca-test-worker-1-z1-dff8d-tdcll" is already being terminated, and hence skipping the deletion
I0619 16:06:53.902587   21618 mcm_manager.go:481] Machine "shoot--i544024--ca-test-worker-1-z1-dff8d-tdcll" priority is already set to 1, hence skipping the update
I0619 16:06:53.902600   21618 mcm_manager.go:525] MachineDeployment "shoot--i544024--ca-test-worker-1-z1" is already set to 2, skipping the update
I0619 16:06:53.902608   21618 mcm_manager.go:546] MachineDeployment shoot--i544024--ca-test-worker-1-z1 size decreased to 2
I0619 16:06:53.902629   21618 static_autoscaler.go:343] Some unregistered nodes were removed, skipping iteration

Earlier it used to be like this

I0619 16:19:56.158200   22928 static_autoscaler.go:231] Starting main loop
I0619 16:19:56.160045   22928 clusterstate.go:602] Found longUnregistered Nodes [requested://shoot--i544024--ca-test-worker-1-z1-dff8d-tdcll]
I0619 16:19:56.160208   22928 static_autoscaler.go:335] 1 unregistered nodes present
I0619 16:19:56.160225   22928 static_autoscaler.go:611] Removing unregistered node requested://shoot--i544024--ca-test-worker-1-z1-dff8d-tdcll
I0619 16:19:56.160306   22928 mcm_manager.go:481] Machine "shoot--i544024--ca-test-worker-1-z1-dff8d-tdcll" priority is already set to 1, hence skipping the update
I0619 16:19:56.343652   22928 mcm_manager.go:546] MachineDeployment shoot--i544024--ca-test-worker-1-z1 size decreased to 1
I0619 16:19:56.343686   22928 static_autoscaler.go:343] Some unregistered nodes were removed, skipping iteration

So the machineDeployment replicas is not reduced now.

@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 19, 2022
@himanshu-kun
Copy link
Author

/close as this is more of a bandaid fix which doesn't address the actual problem.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 22, 2022
@himanshu-kun himanshu-kun deleted the deal-with-machine-failure branch July 12, 2022 05:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider failed machine as terminating
4 participants