OCPBUGS-25860: Disable managed interrupts for smartpqi #915

Open
wants to merge 2 commits into
base: master

Conversation

MarSik
Contributor

@MarSik MarSik commented Jan 12, 2024

This driver does not obey the isolcpus=managed_irq hint and is causing interference.

This kernel argument makes sure the interrupt affinity can be managed by userspace services.

The alternative approach using /etc/modprobe.d/smartpqi.conf with option smartpqi ... does not work, because the driver is loaded early at the initrd stage and we would have to rebuild the RHCOS initrd. That is much heavier than a simple kernel arg.
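For illustration, a minimal sketch of how one might confirm that such a kernel argument actually reached the running kernel. The parameter name `smartpqi.disable_managed_interrupts` is taken from the profile quoted later in this conversation; the helper name `module_param_from_cmdline` is hypothetical, and the sample command line is made up for the example.

```python
from typing import Optional


def module_param_from_cmdline(cmdline: str, module: str, param: str) -> Optional[str]:
    """Return the value of `module.param` on a kernel command line, or None.

    Hypothetical helper for illustration only; it is not part of TuneD.
    The command line is split on whitespace, so quoted values that contain
    spaces are not handled. If the argument is repeated, the last one wins.
    """
    wanted = f"{module}.{param}"
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == wanted:
            value = val if sep else ""
    return value


# Made-up sample command line; on a real node the text would come from /proc/cmdline.
sample = "BOOT_IMAGE=/vmlinuz isolcpus=managed_irq,2-7 smartpqi.disable_managed_interrupts=1"
print(module_param_from_cmdline(sample, "smartpqi", "disable_managed_interrupts"))
```

On a Linux host the same function could be fed the contents of `/proc/cmdline` instead of the sample string.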

openshift-ci-robot added the following labels on Jan 12, 2024:
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
@openshift-ci-robot
Contributor

@MarSik: This pull request references Jira Issue OCPBUGS-25860, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Contributor

openshift-ci bot commented Jan 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MarSik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jan 12, 2024.
@MarSik
Contributor Author

MarSik commented Jan 12, 2024

@jmencak @yarda Could you please check my syntax for the udev regex?

@jmencak
Contributor

jmencak commented Jan 12, 2024

@jmencak @yarda Could you please check my syntax for the udev regex?

Hmm, I'm getting the following:

2024-01-12 08:30:32,715 ERROR    tuned.units.manager: failed to initialize plugin bootloader_smartpqi
2024-01-12 08:30:32,716 ERROR    tuned.units.manager: No module named 'tuned.plugins.plugin_bootloader_smartpqi'
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/tuned/units/manager.py", line 88, in create
    plugin = self._plugins_repository.create(plugin_name)
  File "/usr/lib/python3.9/site-packages/tuned/plugins/repository.py", line 34, in create
    plugin_cls = self.load_plugin(plugin_name)
  File "/usr/lib/python3.9/site-packages/tuned/utils/plugin_loader.py", line 32, in load_plugin
    return self._get_class(module_name)
  File "/usr/lib/python3.9/site-packages/tuned/utils/plugin_loader.py", line 36, in _get_class
    module = __import__(module_name)
ModuleNotFoundError: No module named 'tuned.plugins.plugin_bootloader_smartpqi'
2024-01-12 08:30:32,718 INFO     tuned.daemon.daemon: static tuning from profile 'bz' applied

With:

[main]
summary=PoC

[bootloader_smartpqi]
devices_udev_regex==DRIVER=smartpqi
cmdline_smartpqi=+smartpqi.disable_managed_interrupts=1

Did you test this?
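One plausible reading of the traceback, sketched below under an assumption about TuneD's behaviour: when a profile section carries no explicit `type=` option, the section name itself is treated as the plugin name, so `[bootloader_smartpqi]` is looked up as `tuned.plugins.plugin_bootloader_smartpqi`. The function `plugin_module_for` is hypothetical and only mimics that lookup; the standard-library `configparser` is used here purely for illustration and is not claimed to be what TuneD uses.

```python
import configparser

# The profile quoted above, reproduced verbatim.
PROFILE = """\
[main]
summary=PoC

[bootloader_smartpqi]
devices_udev_regex==DRIVER=smartpqi
cmdline_smartpqi=+smartpqi.disable_managed_interrupts=1
"""


def plugin_module_for(section: str, options: dict) -> str:
    """Guess the module that would be imported for a profile section.

    Assumption: an explicit `type=` option names the plugin; otherwise the
    section name is used as the plugin name.
    """
    plugin_name = options.get("type", section)
    return f"tuned.plugins.plugin_{plugin_name}"


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(PROFILE)

for section in parser.sections():
    if section == "main":
        continue  # [main] holds profile metadata, not a plugin instance
    print(section, "->", plugin_module_for(section, dict(parser[section])))
```

Under the same assumption, adding `type=bootloader` to the section would make the lookup resolve to `tuned.plugins.plugin_bootloader` while keeping `bootloader_smartpqi` as the instance name.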

@MarSik
Contributor Author

MarSik commented Jan 12, 2024

Not yet, I should have added WIP to the name.
I know what the issue is and will fix it soon.

@jmencak
Contributor

jmencak commented Jan 12, 2024

This seems to somewhat work now. But I can also see the /etc/tuned/bootcmdline set to:

TUNED_BOOT_CMDLINE="smartpqi.disable_managed_interrupts=1"

even when the smartpqi module is not loaded. Is this intended? I suspect the devices_udev_regex==DRIVER=smartpqi doesn't have any effect. WDYT?
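One way to reproduce the observation above is to compare the arguments in /etc/tuned/bootcmdline with the modules present under /sys/module. The sketch below assumes the file contains a shell-style `TUNED_BOOT_CMDLINE="..."` assignment as quoted above. The function names are hypothetical, and the `module.param` check is only a heuristic, since not every dotted kernel argument names a module.

```python
import shlex
from pathlib import Path


def parse_bootcmdline(text: str) -> list:
    """Extract the kernel arguments from a TUNED_BOOT_CMDLINE="..." line."""
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name == "TUNED_BOOT_CMDLINE":
            # shlex.split removes the shell-style quoting; the result is
            # then split again on whitespace into individual arguments.
            return " ".join(shlex.split(value)).split()
    return []


def module_present(module: str, sys_module: Path = Path("/sys/module")) -> bool:
    """True if <sys_module>/<module> exists as a directory."""
    return (sys_module / module).is_dir()


def args_without_module(text: str, sys_module: Path = Path("/sys/module")) -> list:
    """List `module.param[=value]` arguments whose module directory is absent.

    Heuristic only: any dotted argument is treated as a module parameter.
    """
    missing = []
    for arg in parse_bootcmdline(text):
        key = arg.partition("=")[0]
        module, dot, _param = key.partition(".")
        if dot and not module_present(module, sys_module):
            missing.append(arg)
    return missing


example = 'TUNED_BOOT_CMDLINE="smartpqi.disable_managed_interrupts=1"'
print(parse_bootcmdline(example))
```

On a node, `args_without_module(Path("/etc/tuned/bootcmdline").read_text())` would list the arguments that were written even though the corresponding module directory is missing.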

@MarSik
Contributor Author

MarSik commented Jan 25, 2024

/hold

The udev regex matching does not work for the bootloader plugin.

openshift-ci bot added the "do-not-merge/hold" label (Indicates that a PR should not merge because someone has issued a /hold command.) on Jan 25, 2024.
@yanirq
Contributor

yanirq commented Feb 10, 2024

@MarSik let's rebase and check for the right udev rule

openshift-merge-robot added the "needs-rebase" label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Feb 10, 2024.
openshift-merge-robot removed the "needs-rebase" label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Mar 4, 2024.
@yanirq
Contributor

yanirq commented Mar 4, 2024

/hold
Depends on #980 to pass the ci/prow/e2e-no-cluster tests.

@yanirq
Contributor

yanirq commented Mar 19, 2024

/hold cancel

openshift-ci bot removed the "do-not-merge/hold" label (Indicates that a PR should not merge because someone has issued a /hold command.) on Mar 19, 2024.
@MarSik
Contributor Author

MarSik commented Apr 18, 2024

/retest

@yanirq
Contributor

yanirq commented Apr 18, 2024

@MarSik you'll probably need to run render sync

@MarSik
Contributor Author

MarSik commented Jun 12, 2024

/retest-required

Contributor

openshift-ci bot commented Jun 12, 2024

@MarSik: all tests passed!

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the "lifecycle/stale" label (Denotes an issue or PR has remained open with no activity and has become stale.) on Sep 11, 2024.
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the "lifecycle/rotten" label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the "lifecycle/stale" label (Denotes an issue or PR has remained open with no activity and has become stale.) on Oct 11, 2024.
@MarSik
Contributor Author

MarSik commented Oct 18, 2024

/remove-lifecycle rotten

openshift-ci bot removed the "lifecycle/rotten" label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) on Oct 18, 2024.
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
6 participants