Skip to content
This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

[PATCH] SCX: scx_ops_bypass() can't grab a mutex #198

Merged
merged 1 commit into from
May 5, 2024

Conversation

htejun
Copy link
Collaborator

@htejun htejun commented May 4, 2024

This fixes a bug introduced by 34c6075 ("scx: Fix scx_ops_bypass_depth going out of sync"). It added mutex_lock(scx_ops_enable_mutex) to scx_ops_bypass(), which can obviously lead to problem if scx_ops_bypass() is invoked from the disable path while someone is holding the mutex and scheduled out.

[ 246.906433] INFO: task sched_ext_ops_h:1531 blocked for more than 122 seconds.
[ 246.907862] Not tainted 6.9.0-rc5-work-02067-g7f2a15fc2eb2-dirty #34
[ 246.909213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.910973] task:sched_ext_ops_h state:D stack:0 pid:1531 tgid:1531 ppid:2 flags:0x00004000
[ 246.913004] Sched_ext: qmap (enabled+all)
[ 246.913007] Call Trace:
[ 246.914486]
[ 246.914956] __schedule+0x750/0xcd0
[ 246.915770] schedule+0x41/0xb0
[ 246.916503] schedule_preempt_disabled+0x15/0x20
[ 246.917542] __mutex_lock+0x511/0xb00
[ 246.918379] ? scx_ops_bypass+0x54/0x1a0
[ 246.919232] mutex_lock_nested+0x1b/0x20
[ 246.920110] scx_ops_bypass+0x54/0x1a0
[ 246.920961] ? irqentry_exit+0x59/0x80
[ 246.921814] ? sysvec_irq_work+0x41/0x80
[ 246.922705] scx_ops_disable_workfn+0xd1/0x920
[ 246.923714] ? lock_acquire+0xcb/0x230
[ 246.924567] ? kthread_worker_fn+0x80/0x2a0
[ 246.925504] ? lock_release+0xd8/0x2d0
[ 246.926355] ? kthread_worker_fn+0xbe/0x2a0
[ 246.927262] kthread_worker_fn+0x10a/0x2a0
[ 246.928182] ? scx_dump_task+0x2f0/0x2f0
[ 246.929080] kthread+0xec/0x110
[ 246.929814] ? __kthread_init_worker+0x100/0x100
[ 246.930855] ? kthread_blkcg+0x40/0x40
[ 246.931718] ret_from_fork+0x36/0x40
[ 246.932545] ? kthread_blkcg+0x40/0x40
[ 246.933411] ret_from_fork_asm+0x11/0x20
[ 246.934311]
[ 246.934798]
[ 246.934798] Showing all locks held in the system:
[ 246.936169] 1 lock held by khungtaskd/60:
[ 246.937081] #0: ffffffff8335b128 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x30/0x190
[ 246.939051] 1 lock held by scx_qmap/1529:
[ 246.939963] 1 lock held by sched_ext_ops_h/1531:
[ 246.940999] #0: ffffffff83272e28 (scx_ops_enable_mutex){+.+.}-{3:3}, at: scx_ops_bypass+0x54/0x1a0

There's no reason to grab the mutex there. We can just test scx_enabled() without holding the mutex. Drop the broken locking.

This fixes a bug introduced by 34c6075 ("scx: Fix scx_ops_bypass_depth
going out of sync"). It added mutex_lock(scx_ops_enable_mutex) to
scx_ops_bypass(), which can obviously lead to problem if scx_ops_bypass() is
invoked from the disable path while someone is holding the mutex and
scheduled out.

  [  246.906433] INFO: task sched_ext_ops_h:1531 blocked for more than 122 seconds.
  [  246.907862]       Not tainted 6.9.0-rc5-work-02067-g7f2a15fc2eb2-dirty #34
  [  246.909213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  246.910973] task:sched_ext_ops_h state:D stack:0     pid:1531  tgid:1531  ppid:2      flags:0x00004000
  [  246.913004] Sched_ext: qmap (enabled+all)
  [  246.913007] Call Trace:
  [  246.914486]  <TASK>
  [  246.914956]  __schedule+0x750/0xcd0
  [  246.915770]  schedule+0x41/0xb0
  [  246.916503]  schedule_preempt_disabled+0x15/0x20
  [  246.917542]  __mutex_lock+0x511/0xb00
  [  246.918379]  ? scx_ops_bypass+0x54/0x1a0
  [  246.919232]  mutex_lock_nested+0x1b/0x20
  [  246.920110]  scx_ops_bypass+0x54/0x1a0
  [  246.920961]  ? irqentry_exit+0x59/0x80
  [  246.921814]  ? sysvec_irq_work+0x41/0x80
  [  246.922705]  scx_ops_disable_workfn+0xd1/0x920
  [  246.923714]  ? lock_acquire+0xcb/0x230
  [  246.924567]  ? kthread_worker_fn+0x80/0x2a0
  [  246.925504]  ? lock_release+0xd8/0x2d0
  [  246.926355]  ? kthread_worker_fn+0xbe/0x2a0
  [  246.927262]  kthread_worker_fn+0x10a/0x2a0
  [  246.928182]  ? scx_dump_task+0x2f0/0x2f0
  [  246.929080]  kthread+0xec/0x110
  [  246.929814]  ? __kthread_init_worker+0x100/0x100
  [  246.930855]  ? kthread_blkcg+0x40/0x40
  [  246.931718]  ret_from_fork+0x36/0x40
  [  246.932545]  ? kthread_blkcg+0x40/0x40
  [  246.933411]  ret_from_fork_asm+0x11/0x20
  [  246.934311]  </TASK>
  [  246.934798]
  [  246.934798] Showing all locks held in the system:
  [  246.936169] 1 lock held by khungtaskd/60:
  [  246.937081]  #0: ffffffff8335b128 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x30/0x190
  [  246.939051] 1 lock held by scx_qmap/1529:
  [  246.939963] 1 lock held by sched_ext_ops_h/1531:
  [  246.940999]  #0: ffffffff83272e28 (scx_ops_enable_mutex){+.+.}-{3:3}, at: scx_ops_bypass+0x54/0x1a0

There's no reason to grab the mutex there. We can just test scx_enabled()
without holding the mutex. Drop the broken locking.
@Byte-Lab Byte-Lab merged commit 1d02538 into sched_ext May 5, 2024
1 check passed
@Byte-Lab Byte-Lab deleted the htejun/fix-stop-hang branch May 5, 2024 02:51
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants