OCPBUGS-55967: Add hot loop detection in the boot image controller #5037
@djoshy: This pull request references Jira Issue OCPBUGS-55967, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Pre-merge verification: Verified on IPI-based AWS and GCP 4.20 clusters. 1. Manually change the boot image for a MachineSet more than 3 times:
/label qe-approved
/lgtm
Sufficient solution as a safety, will at least allow us some detection. Do you think it's worth adding metrics?
} else {
    hotLoopCount := 1
    // If the controller is updating to a value that was previously updated to, increase the hot loop counter
    if bytes.Equal(bis.value, machineSet.Spec.Template.Spec.ProviderSpec.Value.Raw) {
Hmm, there should be no valid scenario where this would be updating back right? Or e.g. a fake "update" from the client?
Addressed in Slack, but basically we should only be calling the check here if there is a diff between the old and new boot image, so we should be covered against empty syncs.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: djoshy, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
I think there might be value in that as a follow-up! I'll make a card for it.
@djoshy: all tests passed!
@djoshy: An error was encountered updating to the MODIFIED state for bug OCPBUGS-55967 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details. Full error message.
No response returned: Post "https://issues.redhat.com/rest/api/2/issue/OCPBUGS-55967/transitions": POST https://issues.redhat.com/rest/api/2/issue/OCPBUGS-55967/transitions giving up after 5 attempt(s)
Please contact an administrator to resolve this issue, then request a bug refresh with In response to this:
[ART PR BUILD NOTIFIER] Distgit: ose-machine-config-operator
/cherry-pick release-4.19
@djoshy: new pull request created: #5050 In response to this:
- What I did
I added a simple hot loop detection counter in the boot image controller. If a MachineSet is updated to the same target boot image more than 3 times, the MachineSet boot image controller (MSBIC) will error and degrade the cluster. To fix this, one could:
- How to verify it
disk.image field in the providerSpec and on AWS, this would be the AMI.ID field in the providerSpec. The MSBIC should immediately update the boot image back to the correct value.
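For readers skimming this thread, the counter described under "What I did" could be sketched roughly as follows. This is a minimal standalone illustration, not the code merged in this PR: the names hotLoopDetector, recordUpdate, and hotLoopLimit are hypothetical, and the only details taken from the diff above are the bytes.Equal comparison against the previously written providerSpec value and a counter that starts at 1. The threshold of 3 comes from the PR description.

```go
package main

import (
	"bytes"
	"fmt"
)

// hotLoopLimit is an assumed threshold based on the PR description
// ("more than 3 times"); the real controller may define it differently.
const hotLoopLimit = 3

// bootImageState is a hypothetical record of the last providerSpec value
// the controller wrote for a MachineSet and how often it has repeated.
type bootImageState struct {
	value        []byte
	hotLoopCount int
}

// hotLoopDetector is a hypothetical helper that keeps per-MachineSet state.
type hotLoopDetector struct {
	states map[string]bootImageState
}

func newHotLoopDetector() *hotLoopDetector {
	return &hotLoopDetector{states: map[string]bootImageState{}}
}

// recordUpdate is meant to be called only when the controller is about to
// patch a real diff. It returns an error once the same target value has
// been written more than hotLoopLimit times in a row.
func (d *hotLoopDetector) recordUpdate(name string, target []byte) error {
	count := 1
	// Updating to the value that was previously written suggests something
	// else keeps reverting it, so the counter is increased.
	if prev, ok := d.states[name]; ok && bytes.Equal(prev.value, target) {
		count = prev.hotLoopCount + 1
	}
	// Store a copy so later changes to the caller's slice cannot alter state.
	d.states[name] = bootImageState{
		value:        append([]byte(nil), target...),
		hotLoopCount: count,
	}
	if count > hotLoopLimit {
		return fmt.Errorf("hot loop detected: machineset %q updated to the same boot image %d times", name, count)
	}
	return nil
}

func main() {
	d := newHotLoopDetector()
	target := []byte(`{"ami":{"id":"ami-correct"}}`) // made-up providerSpec payload
	for i := 1; i <= 4; i++ {
		if err := d.recordUpdate("worker-a", target); err != nil {
			fmt.Println("degraded:", err)
			return
		}
		fmt.Println("update", i, "ok")
	}
}
```

In this sketch, an update to a different target value resets the counter to 1, which matches the diff's hotLoopCount := 1 default; whether the merged controller resets in exactly the same way is an assumption.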