[PATCH] SCX: scx_ops_bypass() can't grab a mutex #198

htejun · 2024-05-04T16:07:56Z

This fixes a bug introduced by 34c6075 ("scx: Fix scx_ops_bypass_depth going out of sync"). It added mutex_lock(scx_ops_enable_mutex) to scx_ops_bypass(), which can obviously lead to problem if scx_ops_bypass() is invoked from the disable path while someone is holding the mutex and scheduled out.

[ 246.906433] INFO: task sched_ext_ops_h:1531 blocked for more than 122 seconds.
[ 246.907862] Not tainted 6.9.0-rc5-work-02067-g7f2a15fc2eb2-dirty #34
[ 246.909213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.910973] task:sched_ext_ops_h state:D stack:0 pid:1531 tgid:1531 ppid:2 flags:0x00004000
[ 246.913004] Sched_ext: qmap (enabled+all)
[ 246.913007] Call Trace:
[ 246.914486]
[ 246.914956] __schedule+0x750/0xcd0
[ 246.915770] schedule+0x41/0xb0
[ 246.916503] schedule_preempt_disabled+0x15/0x20
[ 246.917542] __mutex_lock+0x511/0xb00
[ 246.918379] ? scx_ops_bypass+0x54/0x1a0
[ 246.919232] mutex_lock_nested+0x1b/0x20
[ 246.920110] scx_ops_bypass+0x54/0x1a0
[ 246.920961] ? irqentry_exit+0x59/0x80
[ 246.921814] ? sysvec_irq_work+0x41/0x80
[ 246.922705] scx_ops_disable_workfn+0xd1/0x920
[ 246.923714] ? lock_acquire+0xcb/0x230
[ 246.924567] ? kthread_worker_fn+0x80/0x2a0
[ 246.925504] ? lock_release+0xd8/0x2d0
[ 246.926355] ? kthread_worker_fn+0xbe/0x2a0
[ 246.927262] kthread_worker_fn+0x10a/0x2a0
[ 246.928182] ? scx_dump_task+0x2f0/0x2f0
[ 246.929080] kthread+0xec/0x110
[ 246.929814] ? __kthread_init_worker+0x100/0x100
[ 246.930855] ? kthread_blkcg+0x40/0x40
[ 246.931718] ret_from_fork+0x36/0x40
[ 246.932545] ? kthread_blkcg+0x40/0x40
[ 246.933411] ret_from_fork_asm+0x11/0x20
[ 246.934311]
[ 246.934798]
[ 246.934798] Showing all locks held in the system:
[ 246.936169] 1 lock held by khungtaskd/60:
[ 246.937081] #0: ffffffff8335b128 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x30/0x190
[ 246.939051] 1 lock held by scx_qmap/1529:
[ 246.939963] 1 lock held by sched_ext_ops_h/1531:
[ 246.940999] #0: ffffffff83272e28 (scx_ops_enable_mutex){+.+.}-{3:3}, at: scx_ops_bypass+0x54/0x1a0

There's no reason to grab the mutex there. We can just test scx_enabled() without holding the mutex. Drop the broken locking.

This fixes a bug introduced by 34c6075 ("scx: Fix scx_ops_bypass_depth going out of sync"). It added mutex_lock(scx_ops_enable_mutex) to scx_ops_bypass(), which can obviously lead to problem if scx_ops_bypass() is invoked from the disable path while someone is holding the mutex and scheduled out. [ 246.906433] INFO: task sched_ext_ops_h:1531 blocked for more than 122 seconds. [ 246.907862] Not tainted 6.9.0-rc5-work-02067-g7f2a15fc2eb2-dirty #34 [ 246.909213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 246.910973] task:sched_ext_ops_h state:D stack:0 pid:1531 tgid:1531 ppid:2 flags:0x00004000 [ 246.913004] Sched_ext: qmap (enabled+all) [ 246.913007] Call Trace: [ 246.914486] <TASK> [ 246.914956] __schedule+0x750/0xcd0 [ 246.915770] schedule+0x41/0xb0 [ 246.916503] schedule_preempt_disabled+0x15/0x20 [ 246.917542] __mutex_lock+0x511/0xb00 [ 246.918379] ? scx_ops_bypass+0x54/0x1a0 [ 246.919232] mutex_lock_nested+0x1b/0x20 [ 246.920110] scx_ops_bypass+0x54/0x1a0 [ 246.920961] ? irqentry_exit+0x59/0x80 [ 246.921814] ? sysvec_irq_work+0x41/0x80 [ 246.922705] scx_ops_disable_workfn+0xd1/0x920 [ 246.923714] ? lock_acquire+0xcb/0x230 [ 246.924567] ? kthread_worker_fn+0x80/0x2a0 [ 246.925504] ? lock_release+0xd8/0x2d0 [ 246.926355] ? kthread_worker_fn+0xbe/0x2a0 [ 246.927262] kthread_worker_fn+0x10a/0x2a0 [ 246.928182] ? scx_dump_task+0x2f0/0x2f0 [ 246.929080] kthread+0xec/0x110 [ 246.929814] ? __kthread_init_worker+0x100/0x100 [ 246.930855] ? kthread_blkcg+0x40/0x40 [ 246.931718] ret_from_fork+0x36/0x40 [ 246.932545] ? kthread_blkcg+0x40/0x40 [ 246.933411] ret_from_fork_asm+0x11/0x20 [ 246.934311] </TASK> [ 246.934798] [ 246.934798] Showing all locks held in the system: [ 246.936169] 1 lock held by khungtaskd/60: [ 246.937081] #0: ffffffff8335b128 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x30/0x190 [ 246.939051] 1 lock held by scx_qmap/1529: [ 246.939963] 1 lock held by sched_ext_ops_h/1531: [ 246.940999] #0: ffffffff83272e28 (scx_ops_enable_mutex){+.+.}-{3:3}, at: scx_ops_bypass+0x54/0x1a0 There's no reason to grab the mutex there. We can just test scx_enabled() without holding the mutex. Drop the broken locking.

htejun assigned Byte-Lab May 4, 2024

htejun force-pushed the htejun/fix-stop-hang branch from 0dd603c to a95e0bf Compare May 4, 2024 16:08

Byte-Lab approved these changes May 5, 2024

View reviewed changes

Byte-Lab merged commit 1d02538 into sched_ext May 5, 2024
1 check passed

Byte-Lab deleted the htejun/fix-stop-hang branch May 5, 2024 02:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PATCH] SCX: scx_ops_bypass() can't grab a mutex #198

[PATCH] SCX: scx_ops_bypass() can't grab a mutex #198

htejun commented May 4, 2024 •

edited

Loading

[PATCH] SCX: scx_ops_bypass() can't grab a mutex #198

[PATCH] SCX: scx_ops_bypass() can't grab a mutex #198

Conversation

htejun commented May 4, 2024 • edited Loading

htejun commented May 4, 2024 •

edited

Loading