kernel: rcu: INFO: rcu_preempt self-detected stall on CPU in cleanup_net tasklet #20368

Open
vivekrnv opened this issue Sep 28, 2024 · 1 comment · May be fixed by sonic-net/sonic-linux-kernel#437
Labels: Issue for 202405, NVIDIA, Triaged

vivekrnv commented Sep 28, 2024

Description

We have an internal test that triggers a health event and checks for techsupport. Triggering a health event causes the Docker containers to restart.

Very rarely (only twice so far), we observed an RCU stall on a CPU during docker stop. It looks like a user-space process removes a network namespace, which queues the cleanup_net work item on the netns workqueue (see "Workqueue: netns cleanup_net" in the log), and that work item then gets stuck in devlinks_xa_find_get.

syzkaller has also reported this multiple times while fuzzing the Linux 6.1.y branch: https://syzkaller.appspot.com/bug?extid=e361cbd0ce11d6558816. The crash reported there is also in devlinks_xa_find_get, called from cleanup_net.

2024 Aug 28 00:39:46.066192 r-anaconda-15 WARNING kernel: [  572.183244] 	(t=5297 jiffies g=165593 q=95908 ncpus=4)
2024 Aug 28 00:39:46.066194 r-anaconda-15 WARNING kernel: [  572.183248] CPU: 1 PID: 36 Comm: kworker/u8:2 Tainted: G           O       6.1.0-11-2-amd64 #1  Debian 6.1.38-4
2024 Aug 28 00:39:46.066195 r-anaconda-15 WARNING kernel: [  572.183252] Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0005, BIOS 5.11 07/12/2021
2024 Aug 28 00:39:46.066196 r-anaconda-15 WARNING kernel: [  572.183255] Workqueue: netns cleanup_net
2024 Aug 28 00:39:46.066197 r-anaconda-15 WARNING kernel: [  572.183263] RIP: 0010:xas_find_marked+0x7a/0x300
2024 Aug 28 00:39:46.066199 r-anaconda-15 WARNING kernel: [  572.183270] Code: c5 c0 ff ff ff 48 c7 c3 ff ff ff ff 4c 8d 14 cd 28 02 00 00 0f b6 70 12 48 8b 78 18 40 80 fe 40 0f 84 86 01 00 00 4e 8d 2c 17 <45> 84 c9 0f 85 59 01 00 00 89 f1 48 83 c1 04 4c 8b 64 cf 08 89 f1
2024 Aug 28 00:39:46.066213 r-anaconda-15 WARNING kernel: [  572.183272] RSP: 0018:ffffafc580187d40 EFLAGS: 00000283
2024 Aug 28 00:39:46.066214 r-anaconda-15 WARNING kernel: [  572.183276] RAX: ffffafc580187d70 RBX: ffffffffffffffff RCX: 0000000000000001
2024 Aug 28 00:39:46.066215 r-anaconda-15 WARNING kernel: [  572.183278] RDX: 0000000000000004 RSI: 0000000000000004 RDI: ffff8ab34e681da8
2024 Aug 28 00:39:46.066216 r-anaconda-15 WARNING kernel: [  572.183280] RBP: ffffffffffffffc0 R08: ffffffffffffffff R09: 0000000000000000
2024 Aug 28 00:39:46.066218 r-anaconda-15 WARNING kernel: [  572.183282] R10: 0000000000000230 R11: 0000000000000001 R12: ffffffffffffffff
2024 Aug 28 00:39:46.066219 r-anaconda-15 WARNING kernel: [  572.183284] R13: ffff8ab34e681fd8 R14: ffffafc580187e48 R15: ffff8ab460d30060
2024 Aug 28 00:39:46.066220 r-anaconda-15 WARNING kernel: [  572.183286] FS:  0000000000000000(0000) GS:ffff8ab5b7c80000(0000) knlGS:0000000000000000
2024 Aug 28 00:39:46.066221 r-anaconda-15 WARNING kernel: [  572.183289] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024 Aug 28 00:39:46.066230 r-anaconda-15 WARNING kernel: [  572.183291] CR2: 00007fa4f8010c78 CR3: 00000001213c2006 CR4: 00000000003706e0
2024 Aug 28 00:39:46.066231 r-anaconda-15 WARNING kernel: [  572.183294] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024 Aug 28 00:39:46.066232 r-anaconda-15 WARNING kernel: [  572.183296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024 Aug 28 00:39:46.066233 r-anaconda-15 WARNING kernel: [  572.183298] Call Trace:
2024 Aug 28 00:39:46.066234 r-anaconda-15 WARNING kernel: [  572.183302]  <IRQ>
2024 Aug 28 00:39:46.066235 r-anaconda-15 WARNING kernel: [  572.183304]  ? rcu_dump_cpu_stacks+0xc8/0x100
2024 Aug 28 00:39:46.066236 r-anaconda-15 WARNING kernel: [  572.183309]  ? rcu_sched_clock_irq.cold+0x69/0x2fb
2024 Aug 28 00:39:46.066239 r-anaconda-15 WARNING kernel: [  572.183313]  ? sched_slice+0x87/0x140
2024 Aug 28 00:39:46.066240 r-anaconda-15 WARNING kernel: [  572.183319]  ? perf_event_task_tick+0x64/0x370
2024 Aug 28 00:39:46.066241 r-anaconda-15 WARNING kernel: [  572.183325]  ? nohz_balance_exit_idle+0x16/0xc0
2024 Aug 28 00:39:46.066242 r-anaconda-15 WARNING kernel: [  572.183327]  ? account_process_tick+0xd2/0x170
2024 Aug 28 00:39:46.066243 r-anaconda-15 WARNING kernel: [  572.183331]  ? update_process_times+0x77/0xb0
2024 Aug 28 00:39:46.066244 r-anaconda-15 WARNING kernel: [  572.183335]  ? tick_sched_handle+0x22/0x60
2024 Aug 28 00:39:46.066253 r-anaconda-15 WARNING kernel: [  572.183338]  ? tick_sched_timer+0x6f/0x80
2024 Aug 28 00:39:46.066254 r-anaconda-15 WARNING kernel: [  572.183340]  ? tick_sched_do_timer+0xa0/0xa0
2024 Aug 28 00:39:46.066255 r-anaconda-15 WARNING kernel: [  572.183342]  ? __hrtimer_run_queues+0x112/0x2b0
2024 Aug 28 00:39:46.066256 r-anaconda-15 WARNING kernel: [  572.183345]  ? hrtimer_interrupt+0xfe/0x220
2024 Aug 28 00:39:46.066257 r-anaconda-15 WARNING kernel: [  572.183349]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
2024 Aug 28 00:39:46.066257 r-anaconda-15 WARNING kernel: [  572.183354]  ? sysvec_apic_timer_interrupt+0x99/0xc0
2024 Aug 28 00:39:46.066261 r-anaconda-15 WARNING kernel: [  572.183359]  </IRQ>
2024 Aug 28 00:39:46.066262 r-anaconda-15 WARNING kernel: [  572.183360]  <TASK>
2024 Aug 28 00:39:46.066263 r-anaconda-15 WARNING kernel: [  572.183361]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
2024 Aug 28 00:39:46.066264 r-anaconda-15 WARNING kernel: [  572.183366]  ? xas_find_marked+0x7a/0x300
2024 Aug 28 00:39:46.066265 r-anaconda-15 WARNING kernel: [  572.183369]  xa_find+0x72/0xe0
2024 Aug 28 00:39:46.066266 r-anaconda-15 WARNING kernel: [  572.183373]  ? xas_find+0x1d0/0x1d0
2024 Aug 28 00:39:46.066267 r-anaconda-15 WARNING kernel: [  572.183376]  devlinks_xa_find_get.constprop.0+0x35/0x90
2024 Aug 28 00:39:46.066271 r-anaconda-15 WARNING kernel: [  572.183380]  devlink_pernet_pre_exit+0x44/0xf0
2024 Aug 28 00:39:46.066272 r-anaconda-15 WARNING kernel: [  572.183384]  cleanup_net+0x1df/0x3b0
2024 Aug 28 00:39:46.066273 r-anaconda-15 WARNING kernel: [  572.183387]  process_one_work+0x1c7/0x380
2024 Aug 28 00:39:46.066274 r-anaconda-15 WARNING kernel: [  572.183391]  worker_thread+0x4d/0x380
2024 Aug 28 00:39:46.066275 r-anaconda-15 WARNING kernel: [  572.183394]  ? rescuer_thread+0x3a0/0x3a0
2024 Aug 28 00:39:46.066276 r-anaconda-15 WARNING kernel: [  572.183398]  kthread+0xe9/0x110
2024 Aug 28 00:39:46.066280 r-anaconda-15 WARNING kernel: [  572.183401]  ? kthread_complete_and_exit+0x20/0x20
2024 Aug 28 00:39:46.066281 r-anaconda-15 WARNING kernel: [  572.183404]  ret_from_fork+0x22/0x30
2024 Aug 28 00:39:46.066281 r-anaconda-15 WARNING kernel: [  572.183409]  </TASK>
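
When reading the trace above, note that frames prefixed with `?` are addresses the unwinder found on the stack but could not confirm as part of the call chain, so the unmarked frames are the ones to trust. The following is a minimal sketch of filtering a syslog-style trace down to those frames. The function name `reliable_frames` and the regular expression are illustrative assumptions, not part of any SONiC or kernel tooling, and the sample input is a handful of lines copied from the log above.

```python
import re

# Matches the symbol of a call-trace line such as
#   "... kernel: [  572.183369]  xa_find+0x72/0xe0"
# Group 1 is set when the frame carries the "?" (unreliable) marker.
FRAME_RE = re.compile(
    r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def reliable_frames(log_text):
    """Return the symbols of call-trace frames not marked with '?'."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True  # only consider lines after the trace header
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames


# A few lines copied from the log in this issue.
SAMPLE = """\
2024 Aug 28 00:39:46.066233 r-anaconda-15 WARNING kernel: [  572.183298] Call Trace:
2024 Aug 28 00:39:46.066264 r-anaconda-15 WARNING kernel: [  572.183366]  ? xas_find_marked+0x7a/0x300
2024 Aug 28 00:39:46.066265 r-anaconda-15 WARNING kernel: [  572.183369]  xa_find+0x72/0xe0
2024 Aug 28 00:39:46.066266 r-anaconda-15 WARNING kernel: [  572.183373]  ? xas_find+0x1d0/0x1d0
2024 Aug 28 00:39:46.066267 r-anaconda-15 WARNING kernel: [  572.183376]  devlinks_xa_find_get.constprop.0+0x35/0x90
2024 Aug 28 00:39:46.066271 r-anaconda-15 WARNING kernel: [  572.183380]  devlink_pernet_pre_exit+0x44/0xf0
2024 Aug 28 00:39:46.066272 r-anaconda-15 WARNING kernel: [  572.183384]  cleanup_net+0x1df/0x3b0
"""

if __name__ == "__main__":
    for symbol in reliable_frames(SAMPLE):
        print(symbol)
```

For the sample lines this leaves the chain xa_find, devlinks_xa_find_get.constprop.0, devlink_pernet_pre_exit, cleanup_net, which matches the path described in the issue. The interrupted instruction itself is given by the RIP line (xas_find_marked), which the sketch does not parse.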
vivekrnv commented Oct 1, 2024

This turned out to be a generic kernel issue, and the patch is under review here: https://lore.kernel.org/stable/[email protected]/t/#u

I will raise a PR to port the fix to SONiC.

vivekrnv added the Triaged label on Oct 4, 2024