Revive prowjob when node is terminated (enabled by default) #117
base: main
Conversation
(force-pushed from db2fd80 to cb50220)
Hello @inteon
Please take a look! 😃
The code still seems to be used in prow-controller-manager: prow/cmd/prow-controller-manager/main.go, lines 202 to 204 (at 0a35181).
(force-pushed from da96973 to 4cb9105)
/label tide/merge-method-squash
Info: we use this change on our cert-manager prow cluster https://prow.infra.cert-manager.io/ (see https://github.com/cert-manager/testing/tree/0b5dfa456f691e849bb0b3c40f3e00bd8d607127/images/prow-controller-manager-spot).
This change seems like a nice feature, but I am not sure you are covering all the generic cases of pod termination, not only the ones related to GCP. Also, this issue can be resolved by adding a pod disruption budget to your cluster. In my opinion, we shouldn't enable this feature by default. Instead, we can make the specific reasons for recreating the prowjobs configurable, which will allow users to be more specific about their infrastructure.
(force-pushed from 4cb9105 to 0bf2380)
I am not confident about enabling this feature by default. The implementation covers only a GCP case, and a pod's termination is deeply dependent on the infrastructure. This means that in many cases a prowjob could run forever in a loop. Perhaps keep it disabled by default and allow the user to enable it in the global config? Also, to avoid infinite runs, perhaps it's better to keep track of the number of retries and let the user control the threshold. @petr-muller @stevekuznetsov @cgwalters @BenTheElder Do you have any input?
(force-pushed from 0bf2380 to e416d6f)
I updated the PR and added a RetryCount, which is now incremented every time the pod is re-created (it also counts the other retries that were already present in the code). There will be a hard failure once 3 retries have been reached (we might want to make this configurable in the future).
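For illustration only, a minimal sketch of the mechanics described in that comment, assuming a RetryCount field on the job status and the hard-coded limit of 3; the type and state names here are stand-ins, not the PR's actual API:

```go
package main

import "fmt"

// maxRetries mirrors the hard-coded limit of 3 mentioned above.
const maxRetries = 3

// jobStatus is a stand-in for the ProwJob status fields touched by the PR.
type jobStatus struct {
	RetryCount int
	State      string // e.g. "pending", "failure", "error"
}

// reviveOrFail resets the job so a fresh pod is created, unless the retry
// budget is exhausted, in which case the job hard-fails.
func reviveOrFail(s *jobStatus) {
	if s.RetryCount >= maxRetries {
		s.State = "error"
		return
	}
	s.RetryCount++
	s.State = "pending" // the controller re-creates the pod on the next sync
}

func main() {
	s := &jobStatus{}
	for attempt := 1; attempt <= 5; attempt++ {
		reviveOrFail(s)
		fmt.Printf("attempt %d: retryCount=%d state=%s\n", attempt, s.RetryCount, s.State)
	}
}
```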
(force-pushed from e416d6f to 6a85753)
@droslean I don't think a PDB would help here, since you don't get a choice as to when spot instances get taken away from you.
Yep. My only concern is whether we should allow this feature to be enabled by default. Since Prow doesn't directly know what the job will do, there can be cases where the job costs triple if we allow this by default. I would prefer to make this configurable and let the user decide based on their infrastructure.
pkg/plank/reconciler.go
// On GCP, before a new spot instance is started, the old pods are garbage
// collected (if they have not been already by the Kubernetes PodGC):
// https://github.com/kubernetes/cloud-provider-gcp/blob/25e5dcc715781316bc5e39f8b17c0d5b313453f7/cmd/gcp-controller-manager/node_csr_approver.go#L1035-L1058
if condition.Reason == "DeletionByGCPControllerManager" { |
Since this is GCP-related only, why not make the reasons configurable too, to allow the user to add more cases based on their infrastructure?
I made the settings configurable: 648600a
I discovered there is already a lot of logic to restart pods (and no limit): eviction, NodeUnreachable, a PodUnknown phase. So I added a global limit of 3 and applied it to all restarts, including the existing logic. I haven't (yet) made the restarts on a spot instance shutdown disabled by default, because I think the retryCount limit is a better solution.
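As a rough sketch of what funnelling all of those restart cases through one decision might look like; the "Evicted" and "NodeLost" reason strings and the configurable reason set are assumptions for illustration, not the exact code in this PR:

```go
import corev1 "k8s.io/api/core/v1"

// shouldRevive reports whether a finished pod should be re-created instead of
// failing the prowjob outright; every case here would count against the same
// global retry limit.
func shouldRevive(pod *corev1.Pod, terminationReasons map[string]bool) bool {
	// Pre-existing cases: the pod was evicted, its node became unreachable,
	// or the kubelet lost track of it (phase Unknown).
	if pod.Status.Reason == "Evicted" || pod.Status.Reason == "NodeLost" ||
		pod.Status.Phase == corev1.PodUnknown {
		return true
	}
	// New case: a DisruptionTarget condition whose reason says the node was
	// shut down, e.g. a reclaimed spot instance.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.DisruptionTarget && terminationReasons[cond.Reason] {
			return true
		}
	}
	return false
}
```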
I would prefer to let the user decide the list of reasons to restart the prowjob and the retry count limit. For example, in my infrastructure we run costly jobs, and this feature can potentially increase the cost, since it reruns them by default for those specific reasons. Your solution is good, but I would prefer to make it configurable so the user won't be limited to hardcoded termination reasons and retry limits. @stevekuznetsov WDYT?
Getting the config semantics right might be hard, but I'm for it.
(force-pushed from 648600a to 5cc1fc8)
Yeah, I share some concerns about the restart implications. For example, with 5k-node scale tests we may prefer to simply take the failure and leave boskos to clean up rather than attempt another run, and yet with the many, many CI jobs we have it would be difficult to properly identify and opt out all of the relevant jobs.

Also, even as a GCP employee, I think we should prefer to use portable Kubernetes, but I guess this is at least somewhat more portable now ... do any of the other major vendors with spot instances set a similar status that can be matched, or do we need a different mechanism entirely?

What's the use case where the next periodic attempt and/or ...
I suspect for most jobs this is better, if bounded, but it's still a surprising behavior change and there's limited bandwidth to go comb through jobs and opt them in/out. For Kubernetes we could probably opt out anything referencing the boskos scalability pools and be in OK shape. I don't think we want another best-practice field that everyone has to opt in to, though, either ... (like ...)

What if we have a global option to enable this, in addition to per-job opt-out? We could wait to turn this on until we've opted out any jobs with cost concerns?

I'm also not sure that having a single retryCount is the best approach, but I haven't put much thought into it yet. E.g. failure to schedule is pretty distinct from the other modes. (Though I think we already handle that separately?)
(force-pushed from 5cc1fc8 to 0695225)
(force-pushed from 0695225 to 9c6dd67)
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs, so this bot triages PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
/remove-lifecycle stale
If we do merge this behavior default-enabled, please help us warn scalability / coordinate opting out sensitive Kubernetes-project CI. I kinda feel like the instance-wide default should be gated behind a config option. (If that has changed, please update the title; I'm attempting to keep up with the conversation, but I'm not an active code reviewer here.)
(force-pushed from 9c6dd67 to f5bda75)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: inteon. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
I split this PR into 2 PRs: this PR and #412. In #412, I improve the existing revival logic (revival = what I call restarting the pod when it is in an error state) and add a limit. This PR is now blocked by #412. Let's get that one merged first.
(force-pushed from f5bda75 to a7ac722)
Signed-off-by: Tim Ramlot <[email protected]>
(force-pushed from a7ac722 to daa5678)
With #412 merged, this PR is now unblocked and ready to be re-reviewed.
(force-pushed from daa5678 to c87d991)
Recreate pods when we detect they failed due to their node shutting down.
This saves us from having to rerun jobs that were terminated due to a spot instance shutdown.
Current behavior
Currently, when a node is terminated, the prowjob fails (the PJ ends up in FailureState).
New behavior
The prowjob is now deleted and recreated (revived) when the node is terminated.
If the pod is revived more times than allowed by the MaxRevivals config value, the prowjob errors (the PJ ends up in ErrorState).
Additionally, the ErrorOnTermination option can be used to error a prowjob immediately when the pod is terminated (the PJ ends up in ErrorState).
Detecting node termination
The pod status that we see on pods terminated due to a spot instance shutdown looks like this (several example statuses were shown):
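As an illustration of the common shape of those statuses, using the k8s.io/api/core/v1 types (the exact reasons and messages vary by environment and are assumed here):

```go
// Illustrative only: a pod whose node is reclaimed ends up Failed and carries
// a DisruptionTarget condition identifying which controller deleted it.
status := corev1.PodStatus{
	Phase: corev1.PodFailed,
	Conditions: []corev1.PodCondition{{
		Type:   corev1.DisruptionTarget,
		Status: corev1.ConditionTrue,
		// One of the reasons matched by default (see below):
		Reason:  "DeletionByGCPControllerManager", // or "DeletionByPodGC"
		Message: "example text; the real message depends on the controller",
	}},
}
```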
The TerminationConditionReasons option allows users to modify which pod condition reason values are used to detect that the node is being terminated (defaults to "DeletionByPodGC" and "DeletionByGCPControllerManager").
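And a minimal sketch of how these options might be declared, assuming prow's usual pattern of json-tagged config structs; only the option names come from this description, while the struct name and its placement in the config tree are assumptions:

```go
// NodeTerminationOptions is a hypothetical grouping of the options described
// above; in the real config they may sit directly under the plank section.
type NodeTerminationOptions struct {
	// MaxRevivals caps how many times the pod is re-created before the
	// prowjob is moved to ErrorState.
	MaxRevivals int `json:"max_revivals,omitempty"`
	// ErrorOnTermination sends the prowjob straight to ErrorState when its
	// node is terminated, instead of reviving the pod.
	ErrorOnTermination bool `json:"error_on_termination,omitempty"`
	// TerminationConditionReasons lists the pod condition reasons treated as
	// node termination; defaults to DeletionByPodGC and
	// DeletionByGCPControllerManager.
	TerminationConditionReasons []string `json:"termination_condition_reasons,omitempty"`
}
```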