
wait_until_pods_running should be able to rule out error pods that are not related to the final state of the owners #1611

Open
chizhg opened this issue Jan 6, 2020 · 8 comments · Fixed by #2440
Labels
bug Something isn't working lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@chizhg
Member

chizhg commented Jan 6, 2020

In `function wait_until_pods_running()` in library.sh, wait_until_pods_running will only succeed if all pods in the given namespace are in the Running or Completed state.

But since k8s has some retry logic (e.g. a K8s Job can create a new pod if one errors out), a single error pod does not necessarily mean the Job failed; https://prow.knative.dev/view/gcs/knative-prow/pr-logs/pull/knative_serving/6440/pull-knative-serving-integration-tests/1214219978266382337 is an example. In such a scenario, wait_until_pods_running will report a failure that is not necessarily real.
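For context, the current check behaves roughly like the sketch below (a simplified illustration, not the verbatim library.sh code; the poll count and sleep interval are assumptions): any pod whose status is neither Running nor Completed, including an Error pod left behind by a Job retry, fails the wait.

```bash
# Simplified sketch of the current behavior (not the verbatim library.sh code).
function wait_until_pods_running() {
  local namespace="$1"
  for _ in {1..150}; do  # poll for up to ~5 minutes (assumed values)
    # Any pod whose STATUS column is neither Running nor Completed counts as
    # not ready, even if its owner (e.g. a Job) will still end up succeeding.
    local not_ready
    not_ready=$(kubectl get pods -n "${namespace}" --no-headers 2>/dev/null \
      | awk '$3 != "Running" && $3 != "Completed"')
    [[ -z "${not_ready}" ]] && return 0
    sleep 2
  done
  return 1
}
```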

This function should be general enough to consider and rule out error pods that are not related to the final state of their owners, e.g. (a hedged sketch of item 2 follows the list):

  1. For Deployments, all pods should be X/X Running
  2. For Jobs, it should depend on the success criteria
    ...
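As a hedged sketch of item 2 (assuming the namespace is in `${namespace}`, that this runs inside a helper like wait_until_pods_running, and that a 300s timeout is acceptable; none of this is an agreed design), Jobs could be judged by their own Complete condition instead of by the phase of every pod they ever created:

```bash
# Judge each Job by its Complete condition, so pods that errored and were
# retried by the Job controller do not count against the final result.
for job in $(kubectl get jobs -n "${namespace}" -o name); do
  kubectl wait "${job}" -n "${namespace}" --for=condition=complete --timeout=300s || return 1
done
```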

FYI @mattmoor

@adrcunha
Contributor

adrcunha commented Jan 6, 2020

Please clarify what the specific requirements are for unambiguously ruling a collection of pods as "running". If the answer is "the requirements depend on what we're waiting to be running", then this function should be removed from library.sh and the required variants implemented in each place it's used.

@adrcunha adrcunha added bug Something isn't working kind/good-first-issue Denotes an issue ready for a new contributor. labels Jan 6, 2020
@steuhs
Contributor

steuhs commented Jan 7, 2020

Alternatively, we could delete the failed pods once the retry has started; then we wouldn't need to make this change here.
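(A hedged illustration of this alternative, not an agreed change: failed pods could be cleaned up before the existing check runs, assuming `${namespace}` holds the target namespace.)

```bash
# Remove pods that have already failed, so leftovers from a Job retry do not
# trip the existing Running/Completed check (illustrative only).
kubectl delete pods -n "${namespace}" --field-selector=status.phase=Failed
```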

@chizhg
Member Author

chizhg commented Mar 14, 2020

We are going to reimplement this function with Go and won't make any incremental changes to this function.

/remove-kind good-first-issue
/lifecycle frozen

@knative-prow-robot knative-prow-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed kind/good-first-issue Denotes an issue ready for a new contributor. labels Mar 14, 2020
@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2020
@chizhg
Member Author

chizhg commented Jul 17, 2020

/reopen

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2020
@mattmoor
Member

Seems to me like this entire function could just be:

kubectl wait pod --for=condition=Ready -n $1 -l '!job-name'

This doesn't properly check jobs, but what we have today is fairly hit or miss.
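In context, that would make the function roughly the sketch below (the --timeout flag is an assumption added for illustration, not part of the comment):

```bash
function wait_until_pods_running() {
  local namespace="$1"
  # Pods created by Jobs carry the job-name label; '!job-name' excludes them.
  kubectl wait pod --for=condition=Ready -n "${namespace}" -l '!job-name' --timeout=300s
}
```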

@chizhg
Member Author

chizhg commented Sep 18, 2020

Thanks @mattmoor! I have created #2440

@chizhg
Member Author

chizhg commented Sep 20, 2020

#2440 will be partially reverted by #2443, so reopening this issue.

@chizhg chizhg reopened this Sep 20, 2020