OCPBUGS-55967: Add hot loop detection in the boot image controller #5037

djoshy · 2025-05-08T20:06:55Z

- What I did
I added a simple hot loop detection counter to the boot image controller. If a machineset is updated more than 3 times to the same target boot image, the MSBIC (machine set boot image controller) will error and degrade the cluster. To fix this, one could:

1. Opt the cluster out of boot image updates. You can stop here if you do not want or care about boot image updates.
2. Fix the other actor in your cluster that is also reconciling the boot image. This will vary depending on the platform.
3. Opt the cluster back in for boot image updates.
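
As a rough illustration of the counter described above, here is a minimal sketch. It is not the code from this PR: `hotLoopTracker`, `bootImageState`, `recordUpdate`, and `reset` are hypothetical names, and the assumption is that the controller remembers the last providerSpec it wrote per machineset and compares it with the next target.

```go
package main

import (
	"bytes"
	"fmt"
)

// hotLoopThreshold is the number of repeated updates to the same target that
// is tolerated. The value 3 is taken from the PR description.
const hotLoopThreshold = 3

// bootImageState is a hypothetical per-machineset record: the last
// providerSpec the controller wrote and how many times in a row it wrote it.
type bootImageState struct {
	value        []byte
	hotLoopCount int
}

// hotLoopTracker is a hypothetical in-memory tracker keyed by machineset name.
type hotLoopTracker struct {
	states map[string]bootImageState
}

func newHotLoopTracker() *hotLoopTracker {
	return &hotLoopTracker{states: make(map[string]bootImageState)}
}

// recordUpdate is meant to be called when the controller is about to patch a
// machineset to target. It returns an error once the same target has been
// written more than hotLoopThreshold times in a row.
func (t *hotLoopTracker) recordUpdate(name string, target []byte) error {
	count := 1
	if prev, ok := t.states[name]; ok && bytes.Equal(prev.value, target) {
		// Same target as last time: another actor probably reverted our change.
		count = prev.hotLoopCount + 1
	}
	// Store a copy so later mutation of target cannot affect the record.
	t.states[name] = bootImageState{value: append([]byte(nil), target...), hotLoopCount: count}
	if count > hotLoopThreshold {
		return fmt.Errorf("refusing to reconcile machineset %s, hot loop detected", name)
	}
	return nil
}

// reset drops all recorded state, for example when the cluster opts out of
// boot image updates (assumed behavior for this sketch).
func (t *hotLoopTracker) reset() {
	t.states = make(map[string]bootImageState)
}
```

With this sketch, three consecutive updates to the same target succeed and the fourth returns the error; a different target or a call to `reset` starts the count again.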

- How to verify it

1. Bring up a cluster on GCP/AWS.
2. Update the boot image to a different value. On GCP, this is the disk.image field in the providerSpec; on AWS, it is the ami.id field in the providerSpec. The MSBIC should immediately update the boot image back to the correct value.
3. Repeat this 3 more times. This should cause the MSBIC to error and degrade the operator.
4. Now, opt out of boot image management. This should clear the degrade.

openshift-ci-robot · 2025-05-08T20:07:03Z

@djoshy: This pull request references Jira Issue OCPBUGS-55967, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.20.0) matches configured target version for branch (4.20.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

ptalgulk01 · 2025-05-13T07:13:29Z

Pre-merge verification:

Verified on IPI-based AWS and GCP 4.20 clusters.

1. Manually change the boot image for a MachineSet more than 3 times:

On GCP: Change the disk.image in the providerSpec.
On AWS: Change the ami.id in the providerSpec.

$ oc edit machinesets.machine.openshift.io -n openshift-machine-api ci-ln-wrsgd3k-76ef8-hsbfn-worker-us-east-2b
....
      providerSpec:
       .....
          disks:
        .....
            image: projects/rhcos-cloud/global/images/rhcos-abcd

The first few times, the value is reverted to the original. After that, the edited value persists and the machine-config operator becomes degraded.

$ oc get co machine-config 
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.20.0-0-2025-05-12-144624-test-ci-ln-dlgwm9b-latest   True        False         True       110m    Failed to resync 4.20.0-0-2025-05-12-144624-test-ci-ln-dlgwm9b-latest because: bootimage update failed: 1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ppt-12-20-h69rh-worker-a: refusing to reconcile machineset ppt-12-20-h69rh-worker-a, hot loop detected. Please opt-out of boot image updates, adjust your machine provisioning workflow to prevent hot loops and opt back in to resume boot image updates

$ oc get machineconfigurations -o yaml
  - lastTransitionTime: "2025-05-13T06:58:52Z"
    message: '1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      | Error(s): error syncing MAPI MachineSet ppt-12-20-h69rh-worker-a: refusing
      to reconcile machineset ppt-12-20-h69rh-worker-a, hot loop detected. Please
      opt-out of boot image updates, adjust your machine provisioning workflow to
      prevent hot loops and opt back in to resume boot image updates'
    reason: MAPIMachinesetUpdated
    status: "True"
    type: BootImageUpdateDegraded

2. Opt out of boot image updates by editing the MachineConfiguration:

....
  spec:
    logLevel: Normal
    managedBootImages:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: None
    managementState: Managed
    operatorLogLevel: Normal
  status:
....
    managedBootImagesStatus:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: None
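
For readers unfamiliar with the opt-out shown above, the following is a minimal sketch of how a `mode: None` entry could be checked in code. The types here are hypothetical local stand-ins that only mirror the YAML fields in this comment; they are not the actual operator API types, and the `All` mode value is an assumption included only for contrast.

```go
package main

// selectionMode mirrors the selection.mode field from the YAML above.
type selectionMode string

const (
	modeNone selectionMode = "None" // value shown in the YAML above
	modeAll  selectionMode = "All"  // assumed value, used here only for contrast
)

// machineManager is a hypothetical flattened view of one machineManagers entry.
type machineManager struct {
	APIGroup string
	Resource string
	Mode     selectionMode
}

// explicitlyOptedOut reports whether the given resource has an entry whose
// selection mode is None. A missing entry is treated as "not explicitly
// opted out"; what the operator does by default is outside this sketch.
func explicitlyOptedOut(managers []machineManager, apiGroup, resource string) bool {
	for _, m := range managers {
		if m.APIGroup == apiGroup && m.Resource == resource {
			return m.Mode == modeNone
		}
	}
	return false
}
```

For the spec above, `explicitlyOptedOut(managers, "machine.openshift.io", "machinesets")` would return true.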

/label qe-approved

openshift-ci-robot · 2025-05-13T07:13:37Z

@djoshy: This pull request references Jira Issue OCPBUGS-55967, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.20.0) matches configured target version for branch (4.20.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

yuqi-zhang

/lgtm

Sufficient solution as a safety, will at least allow us some detection. Do you think it's worth adding metrics?

yuqi-zhang · 2025-05-13T20:52:58Z

pkg/controller/machine-set-boot-image/machine_set_boot_image_controller.go

+	} else {
+		hotLoopCount := 1
+		// If the controller is updating to a value that was previously updated to, increase the hot loop counter
+		if bytes.Equal(bis.value, machineSet.Spec.Template.Spec.ProviderSpec.Value.Raw) {


Hmm, there should be no valid scenario where this would be updating back, right? Or, e.g., a fake "update" from the client?

Addressed in Slack, but basically we should only be calling the check here if there is a diff between the old and new boot image, so we should be covered from empty syncs.
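
To make the point about empty syncs concrete, here is a minimal sketch, assuming the controller compares the current providerSpec bytes with the desired ones before doing any bookkeeping. `syncCounter` and `reconcile` are hypothetical names, not identifiers from this PR.

```go
package main

import "bytes"

// syncCounter is a hypothetical stand-in for the controller's hot loop
// bookkeeping; it only counts real updates.
type syncCounter struct {
	updates int
}

// reconcile reports whether a patch is needed. The counter is touched only
// when current and desired differ, so a sync with no diff never counts
// toward hot loop detection.
func (c *syncCounter) reconcile(current, desired []byte) bool {
	if bytes.Equal(current, desired) {
		return false // empty sync: nothing to patch, nothing to count
	}
	c.updates++
	return true
}
```

Under this assumption, a resync that finds the machineset already at the desired boot image leaves the counter unchanged.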

openshift-ci · 2025-05-13T20:55:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [djoshy,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

djoshy · 2025-05-13T23:00:57Z

> /lgtm
>
> Sufficient solution as a safety, will at least allow us some detection. Do you think it's worth adding metrics?

I think there might be value in that as a follow-up! I'll make a card for it.

openshift-ci · 2025-05-14T03:37:39Z

@djoshy: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2025-05-14T03:40:25Z

@djoshy: An error was encountered updating to the MODIFIED state for bug OCPBUGS-55967 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message.


No response returned: Post "https://issues.redhat.com/rest/api/2/issue/OCPBUGS-55967/transitions": POST https://issues.redhat.com/rest/api/2/issue/OCPBUGS-55967/transitions giving up after 5 attempt(s)

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

openshift-bot · 2025-05-14T08:31:23Z

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.20.0-202505140744.p0.ga09f116.assembly.stream.el9.
All builds following this will include this PR.

djoshy · 2025-05-14T11:20:24Z

/cherry-pick release-4.19

openshift-cherrypick-robot · 2025-05-14T11:21:05Z

@djoshy: new pull request created: #5050

msbic: add hot loop detection

d7c647a

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 8, 2025

openshift-ci bot requested review from dkhater-redhat and RishabhSaini May 8, 2025 20:07

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2025

djoshy changed the title ~~OCPBUGS-55967: Add hot loop detection in the boot image controlelr~~ OCPBUGS-55967: Add hot loop detection in the boot image controller May 8, 2025

openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 13, 2025

yuqi-zhang approved these changes May 13, 2025

openshift-ci bot assigned yuqi-zhang May 13, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2025

openshift-merge-bot bot merged commit a09f116 into openshift:main May 14, 2025
18 checks passed

openshift-cherrypick-robot mentioned this pull request May 14, 2025

[release-4.19] OCPBUGS-56180: Add hot loop detection in the boot image controller #5050

Merged

djoshy deleted the hot-loop-detect branch May 15, 2025 20:38
