Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add pre-stop hook to unregister a compute node before its pod's termination #71

Merged
merged 2 commits into from
May 7, 2024

Conversation

arkbriar
Copy link
Collaborator

@arkbriar arkbriar commented Apr 5, 2024

In addition,

  • set the default terminationGracePeriodSeconds of compute pods to 10s to give the hook some time,
  • bump the RisingWave to v1.8.0.

@arkbriar
Copy link
Collaborator Author

arkbriar commented Apr 5, 2024

Tested locally.

@shanicky
Copy link

shanicky commented Apr 5, 2024

Unregistering workers will deregister the worker from meta. The corresponding worker will receive a notification that it has expired and then terminate itself. It is unclear whether early unregistering of the worker may cause the worker to quickly restart as a new one, rejoin, and then get killed by k8s 🤔

@arkbriar
Copy link
Collaborator Author

arkbriar commented Apr 5, 2024

It is unclear whether early unregistering of the worker may cause the worker to quickly restart as a new one, rejoin, and then get killed by k8s 🤔

Actually it's well defined in the doc. It should be triggered when and only when k8s thinks the container needs to be terminated. It means the case could exist when a Pod fails to pass probe and gets restarted, and will quickly rejoin the cluster. IMO, it should be fine since the cluster will never be stable in any circumstances whenever a pod's killed. And also a cluster should converge toward a stable point when the pods are all running.

This hook is called immediately before a container is terminated due to an API request or management event such as a liveness/startup probe failure, preemption, resource contention and others.

https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks

What I really concern about is the cost of recovery, as unregistration will definitely trigger one. There's no guarantee of how many times unregistration and recovery will happen before the it arrives the stable point.

Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I believe this needs @shanicky's consent

@arkbriar arkbriar force-pushed the feat/pre-stop-unregister-hook branch from 6337cfb to 78a5f50 Compare May 7, 2024 05:41
@arkbriar
Copy link
Collaborator Author

arkbriar commented May 7, 2024

Added an option computeComponent.autoDeregistration.enabled to control the behavior.

Copy link

@shanicky shanicky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arkbriar arkbriar added this pull request to the merge queue May 7, 2024
Merged via the queue into main with commit 9d74eb4 May 7, 2024
1 check passed
@arkbriar arkbriar deleted the feat/pre-stop-unregister-hook branch May 7, 2024 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants