Revive prowjob when node is terminated (enabled by default) #117
base: main
Conversation
(force-pushed from db2fd80 to cb50220)
Hello @inteon
Please take a look! 😃
The code still seems to be used in prow-controller-manager: prow/cmd/prow-controller-manager/main.go, lines 202 to 204 (at 0a35181).
(force-pushed from da96973 to 4cb9105)
/label tide/merge-method-squash
Info: we use this change on our cert-manager prow cluster https://prow.infra.cert-manager.io/ (see https://github.com/cert-manager/testing/tree/0b5dfa456f691e849bb0b3c40f3e00bd8d607127/images/prow-controller-manager-spot).
This change seems like a nice feature, but I am not sure you are covering all the generic cases of pod termination, not only the ones related to GCP. Also, this issue can be resolved by adding a pod disruption budget to your cluster. In my opinion, we shouldn't enable this feature by default. Instead, we can make the specific reasons for recreating the prowjobs configurable, which will allow users to be more specific about their infrastructure.
(force-pushed from 4cb9105 to 0bf2380)
I am not confident about enabling this feature by default. The implementation covers only a GCP case, and a pod's termination is deeply dependent on the infrastructure. This means that in many cases a prowjob could run forever in a loop. Perhaps keep it disabled by default and allow the user to enable it in the global config? Also, to avoid infinite runs, perhaps it's better to keep track of the number of retries and let the user control the threshold. @petr-muller @stevekuznetsov @cgwalters @BenTheElder Do you have any input?
(force-pushed from 0bf2380 to e416d6f)
I updated the PR and added a RetryCount, which is now incremented every time the pod is re-created (it also counts the other retries that were already present in the code). There will be a hard failure once 3 retries have been reached (we might want to make this configurable in the future).
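For illustration only, a minimal sketch of the mechanics described in that comment, assuming a RetryCount field on the job status and the hard-coded limit of 3; the type and state names here are stand-ins, not the PR's actual API:

```go
package main

import "fmt"

// maxRetries mirrors the hard-coded limit of 3 mentioned above.
const maxRetries = 3

// jobStatus is a stand-in for the ProwJob status fields touched by the PR.
type jobStatus struct {
	RetryCount int
	State      string // e.g. "pending", "failure", "error"
}

// reviveOrFail resets the job so a fresh pod is created, unless the retry
// budget is exhausted, in which case the job hard-fails.
func reviveOrFail(s *jobStatus) {
	if s.RetryCount >= maxRetries {
		s.State = "error"
		return
	}
	s.RetryCount++
	s.State = "pending" // the controller re-creates the pod on the next sync
}

func main() {
	s := &jobStatus{}
	for attempt := 1; attempt <= 5; attempt++ {
		reviveOrFail(s)
		fmt.Printf("attempt %d: retryCount=%d state=%s\n", attempt, s.RetryCount, s.State)
	}
}
```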
(force-pushed from e416d6f to 6a85753)
@droslean I don't think a PDB would help here, since you don't get a choice as to when spot instances get taken away from you.
Yep. My only concern is whether we should allow this feature to be enabled by default. Since Prow doesn't directly know what the job will do, there can be cases where the job costs triple if we allow this by default. I would prefer to make this configurable and let the user decide based on their infrastructure.
pkg/plank/reconciler.go
// On GCP, before a new spot instance is started, the old pods are garbage
// collected (if they have not been already by the Kubernetes PodGC):
// https://github.com/kubernetes/cloud-provider-gcp/blob/25e5dcc715781316bc5e39f8b17c0d5b313453f7/cmd/gcp-controller-manager/node_csr_approver.go#L1035-L1058
if condition.Reason == "DeletionByGCPControllerManager" { |
Since this is GCP-related only, why not make the reasons configurable too, to allow the user to add more cases based on their infrastructure?
I made the settings configurable: 648600a
I discovered there is already a lot of logic to restart pods (and no limit): eviction, NodeUnreachable, a PodUnknown phase. So I added a global limit of 3 and applied it to all restarts, including the existing logic. I haven't (yet) made the restarts on a spot instance shutdown disabled by default, because I think the retryCount limit is a better solution.
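As a rough sketch of what funnelling all of those restart cases through one decision might look like; the "Evicted" and "NodeLost" reason strings and the configurable reason set are assumptions for illustration, not the exact code in this PR:

```go
import corev1 "k8s.io/api/core/v1"

// shouldRevive reports whether a finished pod should be re-created instead of
// failing the prowjob outright; every case here would count against the same
// global retry limit.
func shouldRevive(pod *corev1.Pod, terminationReasons map[string]bool) bool {
	// Pre-existing cases: the pod was evicted, its node became unreachable,
	// or the kubelet lost track of it (phase Unknown).
	if pod.Status.Reason == "Evicted" || pod.Status.Reason == "NodeLost" ||
		pod.Status.Phase == corev1.PodUnknown {
		return true
	}
	// New case: a DisruptionTarget condition whose reason says the node was
	// shut down, e.g. a reclaimed spot instance.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.DisruptionTarget && terminationReasons[cond.Reason] {
			return true
		}
	}
	return false
}
```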
I would prefer to let the user decide the list of reasons to restart the prowjob and the retry count limit. For example, in my infrastructure we run costly jobs, and this feature can potentially increase the cost, since it reruns them by default for those specific reasons. Your solution is good, but I would prefer to make it configurable so the user won't be limited to hardcoded termination reasons and retry limits. @stevekuznetsov WDYT?
Getting the config semantics right might be hard, but I'm for it.
(force-pushed from 648600a to 5cc1fc8)
Yeah, I share some concerns about the restart implications. For example, with 5k-node scale tests we may prefer to simply take the failure and leave boskos to clean up rather than attempt another run, and yet with the many, many CI jobs we have it would be difficult to properly identify and opt out all of the relevant jobs.

Also, even as a GCP employee, I think we should prefer to use portable Kubernetes, but I guess this is at least somewhat more portable now ... do any of the other major vendors with spot instances set a similar status that can be matched, or do we need a different mechanism entirely?

What's the use case where the next periodic attempt and/or ...
I suspect for most jobs this is better, if bounded, but it's still a surprising behavior change and there's limited bandwidth to go comb through jobs and opt them in/out. For Kubernetes we could probably opt out anything referencing the boskos scalability pools and be in OK shape. I don't think we want another best-practice field that everyone has to opt in to, though, either ... (like ...)

What if we have a global option to enable this, in addition to per-job opt-out? We could wait to turn this on until we've opted out any jobs with cost concerns?

I'm also not sure that having a single retryCount is the best approach, but I haven't put much thought into it yet. E.g. failure to schedule is pretty distinct from the other modes. (Though I think we already handle that separately?)
(force-pushed from 5cc1fc8 to 0695225)
(force-pushed from 0695225 to 9c6dd67)
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs, so this bot triages PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
/remove-lifecycle stale
If we do merge this behavior default-enabled, please help us warn scalability / coordinate opting out sensitive Kubernetes-project CI. I kinda feel like the instance-wide default should be gated behind a config option. (If that has changed, please update the title; I'm attempting to keep up with the conversation, but I'm not an active code reviewer here.)
(force-pushed from 9c6dd67 to f5bda75)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: inteon. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
I split this PR into 2 PRs: this PR and #412. In #412, I improve the existing revival logic (revival = what I call restarting the pod when it is in an error state) and add a limit. This PR is now blocked by #412. Let's get that one merged first.
(force-pushed from f5bda75 to a7ac722)
Signed-off-by: Tim Ramlot <[email protected]>
(force-pushed from a7ac722 to daa5678)
With #412 merged, this PR is now unblocked and ready to be re-reviewed.
(force-pushed from daa5678 to c87d991)
Recreate pods when we detect they failed due to their node shutting down.
This saves us from having to rerun jobs that were terminated due to a spot instance shutdown.
Current behavior
Currently, when a node is terminated, the prowjob fails (the PJ ends up in FailureState).
New behavior
The prowjob is now deleted and recreated (revived) when the node is terminated.
If the pod is revived more times than allowed by the MaxRevivals config value, the prowjob errors (the PJ ends up in ErrorState).
Additionally, the ErrorOnTermination option can be used to error a prowjob immediately when the pod is terminated (the PJ ends up in ErrorState).
Detecting node termination
The pod status that we see on pods terminated due to a spot instance shutdown looks like this (several example statuses were shown):
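As an illustration of the common shape of those statuses, using the k8s.io/api/core/v1 types (the exact reasons and messages vary by environment and are assumed here):

```go
// Illustrative only: a pod whose node is reclaimed ends up Failed and carries
// a DisruptionTarget condition identifying which controller deleted it.
status := corev1.PodStatus{
	Phase: corev1.PodFailed,
	Conditions: []corev1.PodCondition{{
		Type:   corev1.DisruptionTarget,
		Status: corev1.ConditionTrue,
		// One of the reasons matched by default (see below):
		Reason:  "DeletionByGCPControllerManager", // or "DeletionByPodGC"
		Message: "example text; the real message depends on the controller",
	}},
}
```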
The TerminationConditionReasons option allows users to modify which pod condition reason values are used to detect that the node is being terminated (defaults to "DeletionByPodGC" and "DeletionByGCPControllerManager").
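And a minimal sketch of how these options might be declared, assuming prow's usual pattern of json-tagged config structs; only the option names come from this description, while the struct name and its placement in the config tree are assumptions:

```go
// NodeTerminationOptions is a hypothetical grouping of the options described
// above; in the real config they may sit directly under the plank section.
type NodeTerminationOptions struct {
	// MaxRevivals caps how many times the pod is re-created before the
	// prowjob is moved to ErrorState.
	MaxRevivals int `json:"max_revivals,omitempty"`
	// ErrorOnTermination sends the prowjob straight to ErrorState when its
	// node is terminated, instead of reviving the pod.
	ErrorOnTermination bool `json:"error_on_termination,omitempty"`
	// TerminationConditionReasons lists the pod condition reasons treated as
	// node termination; defaults to DeletionByPodGC and
	// DeletionByGCPControllerManager.
	TerminationConditionReasons []string `json:"termination_condition_reasons,omitempty"`
}
```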