
Kernel oops page fault triggered by Docker in arc_prune #16324

Open
maxpoulin64 opened this issue Jul 6, 2024 · 82 comments
Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@maxpoulin64

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | ArchLinux |
| Distribution Version | Latest |
| Kernel Version | 6.8.9-zen1-1-zen |
| Architecture | x86_64 |
| OpenZFS Version | 2.2.4-2 |

I'm holding at 6.8.9 specifically to stay within the officially supported kernel versions.
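A quick way to sanity-check this pairing is to compare the running kernel against the highest kernel series the installed OpenZFS release supports (assumed here to be the 6.8 series for OpenZFS 2.2.4). A minimal sketch; the kernel version is hard-coded for illustration, but on a live system it would come from `uname -r`:

```shell
#!/bin/sh
# Sketch: check whether a kernel version is within OpenZFS 2.2.4's
# supported range (assumed maximum: the 6.8 series).
# KVER is hard-coded for illustration; normally KVER="$(uname -r)".
KVER="6.8.9-zen1-1-zen"
MAJOR=${KVER%%.*}            # -> 6
REST=${KVER#*.}              # -> 8.9-zen1-1-zen
MINOR=${REST%%.*}            # -> 8
if [ "$MAJOR" -gt 6 ] || { [ "$MAJOR" -eq 6 ] && [ "$MINOR" -gt 8 ]; }; then
    echo "kernel $KVER is newer than the supported 6.8 series"
else
    echo "kernel $KVER is within the supported range"
fi
```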

Describe the problem you're observing

Extracting large container images in Docker causes ZFS to hit an unhandled page fault, which locks up the filesystem until reboot. `sync` never completes, and a normal shutdown doesn't complete either.

Describe how to reproduce the problem

Running this particular container reliably hangs ZFS on my system during extraction, using Docker's ZFS storage driver.

docker run -it --rm -p 8080:8080 --gpus all --name localai quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas

It gets stuck on a line such as the one below and never completes; killing the Docker daemon turns it into a zombie, and I/O is completely hosed.

6ddbee975253: Extracting  352.2MB/352.2MB
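The hung state is easy to spot mechanically: the layer's extracted size equals its total size, yet the step never finishes. A small illustrative sketch that flags this condition (the status line is hard-coded from the output above; in practice it would be read from `docker pull` output):

```shell
#!/bin/sh
# Sketch: flag a Docker layer whose "Extracting" progress has reached its
# total size but never completes, matching the hang described above.
line='6ddbee975253: Extracting  352.2MB/352.2MB'
progress=${line##* }        # -> 352.2MB/352.2MB
extracted=${progress%%/*}   # -> 352.2MB
total=${progress##*/}       # -> 352.2MB
if [ "$extracted" = "$total" ]; then
    echo "layer appears fully extracted but never completed"
fi
```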

Include any warning/errors/backtraces from the system logs

[184791.050957] BUG: unable to handle page fault for address: 00000000208db6e0
[184791.050969] #PF: supervisor instruction fetch in kernel mode
[184791.050972] #PF: error_code(0x0010) - not-present page
[184791.050975] PGD 0 P4D 0 
[184791.050981] Oops: 0010 [#1] PREEMPT SMP NOPTI
[184791.050985] CPU: 11 PID: 482 Comm: arc_prune Tainted: P        W  OE      6.8.9-zen1-1-zen #1 b3e4ad3c9dbde87c9fb9d46fb90ca62a28a66a12
[184791.050992] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.B0 08/09/2018
[184791.050995] RIP: 0010:0x208db6e0
[184791.051042] Code: Unable to access opcode bytes at 0x208db6b6.
[184791.051045] RSP: 0018:ffffb417d2293ce0 EFLAGS: 00010246
[184791.051049] RAX: 00000000208db6e0 RBX: ffffb417d2293d94 RCX: 0000000000000000
[184791.051052] RDX: 0000000000000000 RSI: ffffb417d2293d30 RDI: ffff97e1ac586a80
[184791.051056] RBP: 0000000000003ae0 R08: 0000000000006d66 R09: ffff97e4860e2e90
[184791.051059] R10: ffff97e4860e2e80 R11: ffff97e1f96c0000 R12: ffff97e538d00000
[184791.051063] R13: ffff97e48bf9d780 R14: ffff97e4860e2e28 R15: ffff97e1ac586a80
[184791.051066] FS:  0000000000000000(0000) GS:ffff97e46e4c0000(0000) knlGS:0000000000000000
[184791.051070] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[184791.051074] CR2: 00000000208db6e0 CR3: 000000019b3e6000 CR4: 00000000003506f0
[184791.051077] Call Trace:
[184791.051082]  <TASK>
[184791.051085]  ? __die+0x10f/0x120
[184791.051092]  ? page_fault_oops+0x171/0x4e0
[184791.051101]  ? exc_page_fault+0x7f/0x180
[184791.051107]  ? asm_exc_page_fault+0x26/0x30
[184791.051119]  ? zfs_prune+0xb0/0x4e0 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051438]  ? zpl_prune_sb+0x36/0x60 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051653]  ? arc_prune_task+0x22/0x40 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051880]  ? taskq_thread+0x2d4/0x6f0 [spl 44541b25f59ba0491e81482257bd475148318e14]
[184791.051901]  ? srso_return_thunk+0x5/0x5f
[184791.051907]  ? finish_task_switch.isra.0+0x94/0x2f0
[184791.051914]  ? __pfx_default_wake_function+0x10/0x10
[184791.051924]  ? __pfx_taskq_thread+0x10/0x10 [spl 44541b25f59ba0491e81482257bd475148318e14]
[184791.051941]  ? kthread+0xe8/0x120
[184791.051946]  ? __pfx_kthread+0x10/0x10
[184791.051951]  ? ret_from_fork+0x34/0x50
[184791.051955]  ? __pfx_kthread+0x10/0x10
[184791.051960]  ? ret_from_fork_asm+0x1b/0x30
[184791.051969]  </TASK>
[184791.051971] Modules linked in: xt_conntrack nf_conntrack_netlink xfrm_user xfrm_algo ip6table_nat ip6table_filter ip6_tables xt_addrtype br_netfilter overlay rfcomm snd_seq_dummy snd_hrtimer snd_seq wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel bridge stp llc uhid cmac algif_hash algif_skcipher af_alg xt_MASQUERADE bnep xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc32c_generic iptable_filter dm_crypt cbc encrypted_keys vfat fat intel_rapl_msr intel_rapl_common btusb snd_hda_codec_realtek btrtl crct10dif_pclmul snd_hda_codec_generic btintel crc32_pclmul iwlmvm snd_hda_codec_hdmi btbcm crc32c_intel snd_usb_audio btmtk polyval_clmulni snd_hda_intel snd_usbmidi_lib polyval_generic mac80211 gf128mul libarc4 snd_intel_dspcfg snd_ump ghash_clmulni_intel snd_intel_sdw_acpi bluetooth snd_rawmidi sha512_ssse3 joydev snd_seq_device snd_hda_codec sha256_ssse3 ecdh_generic iwlwifi mousedev sha1_ssse3 mc
[184791.052084]  razerkbd(OE) crc16 aesni_intel snd_hda_core crypto_simd snd_hwdep cryptd snd_pcm igb cfg80211 rapl ptp snd_timer sp5100_tco pps_core gpio_amdpt snd dca wmi_bmof rfkill pcspkr soundcore gpio_generic mxm_wmi i2c_piix4 k10temp mac_hid kvmfr(OE) sg crypto_user loop nfnetlink ip_tables x_tables hid_steam ff_memless hid_logitech_hidpp hid_logitech_dj hid_generic trusted asn1_encoder tee dm_mod usbhid amdgpu vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd video amdxcp i2c_algo_bit drm_ttm_helper ttm kvm_amd drm_exec gpu_sched drm_suballoc_helper kvm nvme drm_buddy nvme_core drm_display_helper xhci_pci irqbypass cec ccp nvme_auth xhci_pci_renesas wmi zfs(POE) spl(OE) vendor_reset(OE) nct6775 nct6775_core hwmon_vid i2c_dev
[184791.052189] CR2: 00000000208db6e0
[184791.052193] ---[ end trace 0000000000000000 ]---
[184791.052196] RIP: 0010:0x208db6e0
[184791.052216] Code: Unable to access opcode bytes at 0x208db6b6.
[184791.052219] RSP: 0018:ffffb417d2293ce0 EFLAGS: 00010246
[184791.052223] RAX: 00000000208db6e0 RBX: ffffb417d2293d94 RCX: 0000000000000000
[184791.052226] RDX: 0000000000000000 RSI: ffffb417d2293d30 RDI: ffff97e1ac586a80
[184791.052229] RBP: 0000000000003ae0 R08: 0000000000006d66 R09: ffff97e4860e2e90
[184791.052232] R10: ffff97e4860e2e80 R11: ffff97e1f96c0000 R12: ffff97e538d00000
[184791.052235] R13: ffff97e48bf9d780 R14: ffff97e4860e2e28 R15: ffff97e1ac586a80
[184791.052238] FS:  0000000000000000(0000) GS:ffff97e46e4c0000(0000) knlGS:0000000000000000
[184791.052241] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[184791.052244] CR2: 00000000208db6e0 CR3: 000000019b3e6000 CR4: 00000000003506f0
[184791.052248] note: arc_prune[482] exited with irqs disabled

The stack trace is always the same. The disk passes a scrub with 0 errors after rebooting.
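Since the pool itself scrubs clean, the kernel log is the most reliable indicator that this bug was hit. A minimal sketch that scans a log excerpt for this issue's signature (an `arc_prune` task plus an unhandled page fault); the excerpt is embedded here for illustration, but on a live system the input would be `dmesg` or `journalctl -k` output:

```shell
#!/bin/sh
# Sketch: detect the arc_prune page-fault signature in a kernel log excerpt.
log='BUG: unable to handle page fault for address: 00000000208db6e0
Comm: arc_prune Tainted: P W OE 6.8.9-zen1-1-zen
note: arc_prune[482] exited with irqs disabled'
if printf '%s\n' "$log" | grep -q 'unable to handle page fault' &&
   printf '%s\n' "$log" | grep -q 'arc_prune'; then
    matched=yes
    echo "arc_prune page-fault signature present"
else
    matched=no
fi
```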

@maxpoulin64 maxpoulin64 added the Type: Defect (Incorrect behavior, e.g. crash, hang) label Jul 6, 2024
@TheUbuntuGuy

I've seen this several times on several systems running ZFS 2.2.4 and Linux 6.8; reverting to Linux 6.6 is stable.
We also use Docker with ZFS heavily. Our systems are 24- to 128-thread machines running software build jobs in parallel in Docker containers, with heavy CPU usage and I/O over NVMe.

I have a trace that looks similar to yours (arc_prune). At the same time, I also started seeing a NULL pointer dereference in a different stack trace referencing iptables. They always happen together, so I'll include that trace here too, though I don't know whether it is actually related.

An example of arc_prune:

May 09 12:08:04 pingu kernel: BUG: unable to handle page fault for address: 0000000200000000
May 09 12:08:04 pingu kernel: #PF: supervisor instruction fetch in kernel mode
May 09 12:08:04 pingu kernel: #PF: error_code(0x0010) - not-present page
May 09 12:08:04 pingu kernel: PGD 15e5cce067 P4D 15e5cce067 PUD 0 
May 09 12:08:04 pingu kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
May 09 12:08:04 pingu kernel: CPU: 102 PID: 1366 Comm: arc_prune Tainted: P           O       6.8.9 #1
May 09 12:08:04 pingu kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO 10G (MS-7C60), BIOS 1.60 05/13/2020
May 09 12:08:04 pingu kernel: RIP: 0010:0x200000000
May 09 12:08:04 pingu kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
May 09 12:08:04 pingu kernel: RSP: 0018:ffffb59e438cfd08 EFLAGS: 00010246
May 09 12:08:04 pingu kernel: RAX: 0000000200000000 RBX: ffff9c0fab2f0000 RCX: 0000000000000000
May 09 12:08:04 pingu kernel: RDX: 0000000000000001 RSI: ffffb59e438cfd58 RDI: ffff9bfb360f7180
May 09 12:08:04 pingu kernel: RBP: ffffb59e438cfdbc R08: 000000000000fdd9 R09: ffff9c38fe7ba4c0
May 09 12:08:04 pingu kernel: R10: 000000000000050a R11: 0000000000000066 R12: 000000000000fdd9
May 09 12:08:04 pingu kernel: R13: ffff9bfaf9386000 R14: ffff9c0fab2f00f8 R15: ffff9bfb360f7180
May 09 12:08:04 pingu kernel: FS:  0000000000000000(0000) GS:ffff9c38fe780000(0000) knlGS:0000000000000000
May 09 12:08:04 pingu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 09 12:08:04 pingu kernel: CR2: 0000000200000000 CR3: 000000027fb1a000 CR4: 0000000000350ef0
May 09 12:08:04 pingu kernel: Call Trace:
May 09 12:08:04 pingu kernel:  <TASK>
May 09 12:08:04 pingu kernel:  ? __die_body+0x1b/0x60
May 09 12:08:04 pingu kernel:  ? page_fault_oops+0x15d/0x470
May 09 12:08:04 pingu kernel:  ? __mod_node_page_state+0x82/0xc0
May 09 12:08:04 pingu kernel:  ? exc_page_fault+0x74/0x170
May 09 12:08:04 pingu kernel:  ? asm_exc_page_fault+0x22/0x30
May 09 12:08:04 pingu kernel:  ? _raw_spin_unlock+0x15/0x30
May 09 12:08:04 pingu kernel:  ? zfs_prune+0x9c/0x4a0 [zfs]
May 09 12:08:04 pingu kernel:  ? autoremove_wake_function+0xe/0x30
May 09 12:08:04 pingu kernel:  ? zpl_prune_sb+0x34/0x50 [zfs]
May 09 12:08:04 pingu kernel:  ? arc_prune_task+0x1b/0x30 [zfs]
May 09 12:08:04 pingu kernel:  ? taskq_thread+0x26e/0x470 [spl]
May 09 12:08:04 pingu kernel:  ? wake_up_state+0x10/0x10
May 09 12:08:04 pingu kernel:  ? task_done+0x90/0x90 [spl]
May 09 12:08:04 pingu kernel:  ? kthread+0xee/0x120
May 09 12:08:04 pingu kernel:  ? kthread_complete_and_exit+0x20/0x20
May 09 12:08:04 pingu kernel:  ? ret_from_fork+0x2d/0x50
May 09 12:08:04 pingu kernel:  ? kthread_complete_and_exit+0x20/0x20
May 09 12:08:04 pingu kernel:  ? ret_from_fork_asm+0x11/0x20
May 09 12:08:04 pingu kernel:  </TASK>
May 09 12:08:04 pingu kernel: Modules linked in: xt_nat macvtap macvlan rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs vhost_net vhost vhost_iotlb tap iptable_nat iptable_filter veth tcp_diag udp_diag inet_diag nf_conntrack_netlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_chain_nat nf_nat bridge stp llc xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink nvme_fabrics overlay bonding tls cfg80211 sunrpc binfmt_misc amdgpu drm_exec amdxcp drm_buddy gpu_sched intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 snd_hda_codec_hdmi sha1_ssse3 radeon aesni_intel nls_iso8859_1 snd_hda_intel snd_intel_dspcfg crypto_simd snd_intel_sdw_acpi drm_suballoc_helper cryptd drm_ttm_helper snd_hda_codec ttm snd_hda_core drm_display_helper snd_pcsp snd_hwdep cec snd_pcm rc_core rapl snd_timer drm_kms_helper wmi_bmof snd video
May 09 12:08:04 pingu kernel:  soundcore ccp mxm_wmi sp5100_tco k10temp joydev mac_hid nct6775 nct6775_core hwmon_vid drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) hid_generic usbhid hid crc32_pclmul ixgbe igb ahci xfrm_algo i2c_algo_bit mdio libahci dca xhci_pci xhci_pci_renesas i2c_piix4 wmi
May 09 12:08:04 pingu kernel: CR2: 0000000200000000
May 09 12:08:04 pingu kernel: ---[ end trace 0000000000000000 ]---
May 09 12:08:04 pingu kernel: RIP: 0010:0x200000000
May 09 12:08:04 pingu kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
May 09 12:08:04 pingu kernel: RSP: 0018:ffffb59e438cfd08 EFLAGS: 00010246
May 09 12:08:04 pingu kernel: RAX: 0000000200000000 RBX: ffff9c0fab2f0000 RCX: 0000000000000000
May 09 12:08:04 pingu kernel: RDX: 0000000000000001 RSI: ffffb59e438cfd58 RDI: ffff9bfb360f7180
May 09 12:08:04 pingu kernel: RBP: ffffb59e438cfdbc R08: 000000000000fdd9 R09: ffff9c38fe7ba4c0
May 09 12:08:04 pingu kernel: R10: 000000000000050a R11: 0000000000000066 R12: 000000000000fdd9
May 09 12:08:04 pingu kernel: R13: ffff9bfaf9386000 R14: ffff9c0fab2f00f8 R15: ffff9bfb360f7180
May 09 12:08:04 pingu kernel: FS:  0000000000000000(0000) GS:ffff9c38fe780000(0000) knlGS:0000000000000000
May 09 12:08:04 pingu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 09 12:08:04 pingu kernel: CR2: 0000000200000000 CR3: 000000027fb1a000 CR4: 0000000000350ef0
May 09 12:08:04 pingu kernel: note: arc_prune[1366] exited with irqs disabled

An example of iptables on the same system, same kernel:

May 09 15:03:05 pingu kernel: BUG: kernel NULL pointer dereference, address: 0000000000000015
May 09 15:03:05 pingu kernel: #PF: supervisor write access in kernel mode
May 09 15:03:05 pingu kernel: #PF: error_code(0x0002) - not-present page
May 09 15:03:05 pingu kernel: PGD 0 P4D 0 
May 09 15:03:05 pingu kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
May 09 15:03:05 pingu kernel: CPU: 65 PID: 232279 Comm: iptables Tainted: P           O       6.8.9 #1
May 09 15:03:05 pingu kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO 10G (MS-7C60), BIOS 1.60 05/13/2020
May 09 15:03:05 pingu kernel: RIP: 0010:iptable_nat_table_init+0xe6/0x170 [iptable_nat]
May 09 15:03:05 pingu kernel: Code: 48 89 ef 48 89 0c 24 e8 98 e9 d7 ff 48 8b 0c 24 85 c0 41 89 c4 75 30 41 83 c7 01 48 83 c1 28 41 83 ff 04 75 d4 48 8b 44 24 08 <4c> 89 30 4c 89 ef e8 3f ec 3c ee 48 83 c4 10 44 89 e0 5b 5d 41 5c
May 09 15:03:05 pingu kernel: RSP: 0018:ffffb71601a93bc8 EFLAGS: 00010246
May 09 15:03:05 pingu kernel: RAX: 0000000000000015 RBX: ffff99c83a710120 RCX: ffff99bba6d73960
May 09 15:03:05 pingu kernel: RDX: 0000000000000000 RSI: 0000000000000064 RDI: ffffffffc1edf1e0
May 09 15:03:05 pingu kernel: RBP: ffff99bd0a29ec00 R08: ffff99f47efd3000 R09: 0000000000000000
May 09 15:03:05 pingu kernel: R10: ffffb71601a93b20 R11: ffff99b696c19088 R12: 0000000000000000
May 09 15:03:05 pingu kernel: R13: ffff99cc9cd6a400 R14: ffff99bba6d738c0 R15: 0000000000000004
May 09 15:03:05 pingu kernel: FS:  00007f9c8074bb48(0000) GS:ffff99f37de40000(0000) knlGS:0000000000000000
May 09 15:03:05 pingu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 09 15:03:05 pingu kernel: CR2: 0000000000000015 CR3: 00000017f53e2000 CR4: 0000000000350ef0
May 09 15:03:05 pingu kernel: Call Trace:
May 09 15:03:05 pingu kernel:  <TASK>
May 09 15:03:05 pingu kernel:  ? __die_body+0x1b/0x60
May 09 15:03:05 pingu kernel:  ? page_fault_oops+0x15d/0x470
May 09 15:03:05 pingu kernel:  ? do_user_addr_fault+0x65/0x830
May 09 15:03:05 pingu kernel:  ? __kmalloc_node+0x3d3/0x3e0
May 09 15:03:05 pingu kernel:  ? __nf_hook_entries_try_shrink+0x140/0x140
May 09 15:03:05 pingu kernel:  ? exc_page_fault+0x74/0x170
May 09 15:03:05 pingu kernel:  ? asm_exc_page_fault+0x22/0x30
May 09 15:03:05 pingu kernel:  ? iptable_nat_table_init+0xe6/0x170 [iptable_nat]
May 09 15:03:05 pingu kernel:  ? iptable_nat_table_init+0xc8/0x170 [iptable_nat]
May 09 15:03:05 pingu kernel:  xt_find_table_lock+0x128/0x1b0 [x_tables]
May 09 15:03:05 pingu kernel:  xt_request_find_table_lock+0x1b/0x70 [x_tables]
May 09 15:03:05 pingu kernel:  get_info+0x82/0x300 [ip_tables]
May 09 15:03:05 pingu kernel:  ? mntput_no_expire+0x4a/0x240
May 09 15:03:05 pingu kernel:  ? __local_bh_enable_ip+0x37/0x80
May 09 15:03:05 pingu kernel:  ? do_ip_getsockopt+0x85d/0xca0
May 09 15:03:05 pingu kernel:  do_ipt_get_ctl+0x6c/0x330 [ip_tables]
May 09 15:03:05 pingu kernel:  ? obj_cgroup_charge+0xf0/0x110
May 09 15:03:05 pingu kernel:  ? kmem_cache_alloc+0x122/0x2a0
May 09 15:03:05 pingu kernel:  nf_getsockopt+0x44/0x70
May 09 15:03:05 pingu kernel:  ip_getsockopt+0x82/0xc0
May 09 15:03:05 pingu kernel:  do_sock_getsockopt+0x9b/0x220
May 09 15:03:05 pingu kernel:  __sys_getsockopt+0x72/0xc0
May 09 15:03:05 pingu kernel:  __x64_sys_getsockopt+0x21/0x30
May 09 15:03:05 pingu kernel:  do_syscall_64+0x44/0xd0
May 09 15:03:05 pingu kernel:  entry_SYSCALL_64_after_hwframe+0x4b/0x53
May 09 15:03:05 pingu kernel: RIP: 0033:0x7f9c806ed484
May 09 15:03:05 pingu kernel: Code: 31 c9 b8 33 00 00 00 0f 05 48 89 c7 e8 da 29 fe ff 5a c3 49 89 ca 50 48 63 d2 48 63 f6 48 63 ff 45 31 c9 b8 37 00 00 00 0f 05 <48> 63 f8 e8 b9 29 fe ff 5a c3 64 48 8b 04 25 00 00 00 00 48 83 78
May 09 15:03:05 pingu kernel: RSP: 002b:00007ffcccafdb40 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
May 09 15:03:05 pingu kernel: RAX: ffffffffffffffda RBX: 00007ffcccafdf18 RCX: 00007f9c806ed484
May 09 15:03:05 pingu kernel: RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000004
May 09 15:03:05 pingu kernel: RBP: 0000000000000004 R08: 00007ffcccafdb6c R09: 0000000000000000
May 09 15:03:05 pingu kernel: R10: 00007ffcccafdb74 R11: 0000000000000246 R12: 00007ffcccafdb74
May 09 15:03:05 pingu kernel: R13: 0000000000000002 R14: 00007f9c8074b940 R15: 0000000000000000
May 09 15:03:05 pingu kernel:  </TASK>

@bpwats

bpwats commented Jul 6, 2024

I have the same problem on the latest Unraid 7.0.0-beta-1 prerelease.
kernel: Linux version 6.8.12-Unraid (root@Develop) (gcc (GCC) 13.2.0, GNU ld version 2.42-slack151) #3 SMP PREEMPT_DYNAMIC Tue Jun 18 07:52:57 PDT 2024

Jul 4 21:21:40 Fractal kernel: BUG: unable to handle page fault for address: 0000000200000002
Jul 4 21:21:40 Fractal kernel: #PF: supervisor instruction fetch in kernel mode
Jul 4 21:21:40 Fractal kernel: #PF: error_code(0x0010) - not-present page
Jul 4 21:21:40 Fractal kernel: PGD 24af2f067 P4D 24af2f067 PUD 0
Jul 4 21:21:40 Fractal kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jul 4 21:21:40 Fractal kernel: CPU: 5 PID: 1324 Comm: arc_prune Tainted: P O 6.8.12-Unraid #3
Jul 4 21:21:40 Fractal kernel: Hardware name: ASUS System Product Name/PRIME H510M-E, BIOS 2402 12/18/2023
Jul 4 21:21:40 Fractal kernel: RIP: 0010:0x200000002
Jul 4 21:21:40 Fractal kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jul 4 21:21:40 Fractal kernel: RSP: 0018:ffffc9000098fd30 EFLAGS: 00010246
Jul 4 21:21:40 Fractal kernel: RAX: 0000000200000002 RBX: ffff8884f4070000 RCX: 0000000000000011
Jul 4 21:21:40 Fractal kernel: RDX: ffffffffa0cc54b8 RSI: ffffc9000098fd68 RDI: ffff8881c13ac580
Jul 4 21:21:40 Fractal kernel: RBP: ffffc9000098fdcc R08: 0000000000000000 R09: 00000000001d001c
Jul 4 21:21:40 Fractal kernel: R10: 0000000000000000 R11: ffffc9002186fee8 R12: 000000000000bbda
Jul 4 21:21:40 Fractal kernel: R13: ffff8881c13ac580 R14: ffff8881c84bfc00 R15: ffff88811176a100
Jul 4 21:21:40 Fractal kernel: FS: 0000000000000000(0000) GS:ffff88883e740000(0000) knlGS:0000000000000000
Jul 4 21:21:40 Fractal kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 4 21:21:40 Fractal kernel: CR2: 0000000200000002 CR3: 0000000251da4006 CR4: 00000000003706f0
Jul 4 21:21:40 Fractal kernel: Call Trace:
Jul 4 21:21:40 Fractal kernel: <TASK>
Jul 4 21:21:40 Fractal kernel: ? __die_body+0x1a/0x5c
Jul 4 21:21:40 Fractal kernel: ? page_fault_oops+0x332/0x37f
Jul 4 21:21:40 Fractal kernel: ? put_cpu_partial+0x62/0x8e
Jul 4 21:21:40 Fractal kernel: ? spl_kmem_cache_free+0x3a/0x180 [spl]
Jul 4 21:21:40 Fractal kernel: ? exc_page_fault+0xf9/0x116
Jul 4 21:21:40 Fractal kernel: ? asm_exc_page_fault+0x22/0x30
Jul 4 21:21:40 Fractal kernel: ? zfs_prune+0xec/0x2ec [zfs]
Jul 4 21:21:40 Fractal kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jul 4 21:21:40 Fractal kernel: ? arc_prune_task+0x1b/0x2e [zfs]
Jul 4 21:21:40 Fractal kernel: ? taskq_thread+0x2d4/0x3c1 [spl]
Jul 4 21:21:40 Fractal kernel: ? __pfx_default_wake_function+0x10/0x10
Jul 4 21:21:40 Fractal kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jul 4 21:21:40 Fractal kernel: ? kthread+0xf4/0xff
Jul 4 21:21:40 Fractal kernel: ? __pfx_kthread+0x10/0x10
Jul 4 21:21:40 Fractal kernel: ? ret_from_fork+0x21/0x36
Jul 4 21:21:40 Fractal kernel: ? __pfx_kthread+0x10/0x10
Jul 4 21:21:40 Fractal kernel: ? ret_from_fork_asm+0x1b/0x30
Jul 4 21:21:40 Fractal kernel: </TASK>
Jul 4 21:21:40 Fractal kernel: Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip_set nf_tables xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter bridge stp llc nfsd auth_rpcgss oid_registry lockd grace sunrpc bluetooth ecdh_generic ecc md_mod tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap intel_rapl_common x86_pkg_temp_thermal i915 intel_powerclamp coretemp zfs(PO) kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 iosf_mbi sha256_ssse3 drm_buddy sha1_ssse3 ttm aesni_intel crypto_simd i2c_algo_bit cryptd drm_display_helper drm_kms_helper input_leds rapl spl(O) mei_hdcp mei_pxp intel_cstate wmi_bmof drm nvme intel_uncore e1000e hid_apple led_class nvme_core mei_me intel_gtt i2c_i801 agpgart i2c_smbus mei ahci
Jul 4 21:21:40 Fractal kernel: i2c_core libahci thermal fan tpm_crb video tpm_tis tpm_tis_core tpm wmi backlight button acpi_tad acpi_pad
Jul 4 21:21:40 Fractal kernel: CR2: 0000000200000002
Jul 4 21:21:40 Fractal kernel: ---[ end trace 0000000000000000 ]---
Jul 4 21:21:40 Fractal kernel: pstore: backend (efi_pstore) writing error (-5)
Jul 4 21:21:40 Fractal kernel: RIP: 0010:0x200000002
Jul 4 21:21:40 Fractal kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jul 4 21:21:40 Fractal kernel: RSP: 0018:ffffc9000098fd30 EFLAGS: 00010246
Jul 4 21:21:40 Fractal kernel: RAX: 0000000200000002 RBX: ffff8884f4070000 RCX: 0000000000000011
Jul 4 21:21:40 Fractal kernel: RDX: ffffffffa0cc54b8 RSI: ffffc9000098fd68 RDI: ffff8881c13ac580
Jul 4 21:21:40 Fractal kernel: RBP: ffffc9000098fdcc R08: 0000000000000000 R09: 00000000001d001c
Jul 4 21:21:40 Fractal kernel: R10: 0000000000000000 R11: ffffc9002186fee8 R12: 000000000000bbda
Jul 4 21:21:40 Fractal kernel: R13: ffff8881c13ac580 R14: ffff8881c84bfc00 R15: ffff88811176a100
Jul 4 21:21:40 Fractal kernel: FS: 0000000000000000(0000) GS:ffff88883e740000(0000) knlGS:0000000000000000
Jul 4 21:21:40 Fractal kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 4 21:21:40 Fractal kernel: CR2: 0000000200000002 CR3: 0000000251da4006 CR4: 00000000003706f0
Jul 4 21:21:40 Fractal kernel: note: arc_prune[1324] exited with irqs disabled
Jul 4 21:21:50 Fractal kernel: veth95a1b70: renamed from eth0

@1JorgeB

1JorgeB commented Jul 7, 2024

I'm a mod on the Unraid forums, and we have seen multiple users hit this issue with Docker on ZFS since kernel 6.8 (OpenZFS 2.2.4-1). There was also one report with kernel 6.7 during beta testing. The call traces all look very similar; here are some examples in case they help.

Jul  4 21:21:40 Fractal kernel: BUG: unable to handle page fault for address: 0000000200000002
Jul  4 21:21:40 Fractal kernel: #PF: supervisor instruction fetch in kernel mode
Jul  4 21:21:40 Fractal kernel: #PF: error_code(0x0010) - not-present page
Jul  4 21:21:40 Fractal kernel: PGD 24af2f067 P4D 24af2f067 PUD 0 
Jul  4 21:21:40 Fractal kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jul  4 21:21:40 Fractal kernel: CPU: 5 PID: 1324 Comm: arc_prune Tainted: P           O       6.8.12-Unraid #3
Jul  4 21:21:40 Fractal kernel: Hardware name: ASUS System Product Name/PRIME H510M-E, BIOS 2402 12/18/2023
Jul  4 21:21:40 Fractal kernel: RIP: 0010:0x200000002
Jul  4 21:21:40 Fractal kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jul  4 21:21:40 Fractal kernel: RSP: 0018:ffffc9000098fd30 EFLAGS: 00010246
Jul  4 21:21:40 Fractal kernel: RAX: 0000000200000002 RBX: ffff8884f4070000 RCX: 0000000000000011
Jul  4 21:21:40 Fractal kernel: RDX: ffffffffa0cc54b8 RSI: ffffc9000098fd68 RDI: ffff8881c13ac580
Jul  4 21:21:40 Fractal kernel: RBP: ffffc9000098fdcc R08: 0000000000000000 R09: 00000000001d001c
Jul  4 21:21:40 Fractal kernel: R10: 0000000000000000 R11: ffffc9002186fee8 R12: 000000000000bbda
Jul  4 21:21:40 Fractal kernel: R13: ffff8881c13ac580 R14: ffff8881c84bfc00 R15: ffff88811176a100
Jul  4 21:21:40 Fractal kernel: FS:  0000000000000000(0000) GS:ffff88883e740000(0000) knlGS:0000000000000000
Jul  4 21:21:40 Fractal kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  4 21:21:40 Fractal kernel: CR2: 0000000200000002 CR3: 0000000251da4006 CR4: 00000000003706f0
Jul  4 21:21:40 Fractal kernel: Call Trace:
Jul  4 21:21:40 Fractal kernel: <TASK>
Jul  4 21:21:40 Fractal kernel: ? __die_body+0x1a/0x5c
Jul  4 21:21:40 Fractal kernel: ? page_fault_oops+0x332/0x37f
Jul  4 21:21:40 Fractal kernel: ? put_cpu_partial+0x62/0x8e
Jul  4 21:21:40 Fractal kernel: ? spl_kmem_cache_free+0x3a/0x180 [spl]
Jul  4 21:21:40 Fractal kernel: ? exc_page_fault+0xf9/0x116
Jul  4 21:21:40 Fractal kernel: ? asm_exc_page_fault+0x22/0x30
Jul  4 21:21:40 Fractal kernel: ? zfs_prune+0xec/0x2ec [zfs]
Jul  4 21:21:40 Fractal kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jul  4 21:21:40 Fractal kernel: ? arc_prune_task+0x1b/0x2e [zfs]
Jul  4 21:21:40 Fractal kernel: ? taskq_thread+0x2d4/0x3c1 [spl]
Jul  4 21:21:40 Fractal kernel: ? __pfx_default_wake_function+0x10/0x10
Jul  4 21:21:40 Fractal kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jul  4 21:21:40 Fractal kernel: ? kthread+0xf4/0xff
Jul  4 21:21:40 Fractal kernel: ? __pfx_kthread+0x10/0x10
Jul  4 21:21:40 Fractal kernel: ? ret_from_fork+0x21/0x36
Jul  4 21:21:40 Fractal kernel: ? __pfx_kthread+0x10/0x10
Jul  4 21:21:40 Fractal kernel: ? ret_from_fork_asm+0x1b/0x30
Jul  4 21:21:40 Fractal kernel: </TASK>
Jul  4 21:21:40 Fractal kernel: Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip_set nf_tables xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter bridge stp llc nfsd auth_rpcgss oid_registry lockd grace sunrpc bluetooth ecdh_generic ecc md_mod tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap intel_rapl_common x86_pkg_temp_thermal i915 intel_powerclamp coretemp zfs(PO) kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 iosf_mbi sha256_ssse3 drm_buddy sha1_ssse3 ttm aesni_intel crypto_simd i2c_algo_bit cryptd drm_display_helper drm_kms_helper input_leds rapl spl(O) mei_hdcp mei_pxp intel_cstate wmi_bmof drm nvme intel_uncore e1000e hid_apple led_class nvme_core mei_me intel_gtt i2c_i801 agpgart i2c_smbus mei ahci
Jul  4 21:21:40 Fractal kernel: i2c_core libahci thermal fan tpm_crb video tpm_tis tpm_tis_core tpm wmi backlight button acpi_tad acpi_pad
Jul  4 21:21:40 Fractal kernel: CR2: 0000000200000002
Jul  4 21:21:40 Fractal kernel: ---[ end trace 0000000000000000 ]---
Jul  4 21:21:40 Fractal kernel: pstore: backend (efi_pstore) writing error (-5)
Jul  4 21:21:40 Fractal kernel: RIP: 0010:0x200000002
Jul  4 21:21:40 Fractal kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jul  4 21:21:40 Fractal kernel: RSP: 0018:ffffc9000098fd30 EFLAGS: 00010246
Jul  4 21:21:40 Fractal kernel: RAX: 0000000200000002 RBX: ffff8884f4070000 RCX: 0000000000000011
Jul  4 21:21:40 Fractal kernel: RDX: ffffffffa0cc54b8 RSI: ffffc9000098fd68 RDI: ffff8881c13ac580
Jul  4 21:21:40 Fractal kernel: RBP: ffffc9000098fdcc R08: 0000000000000000 R09: 00000000001d001c
Jul  4 21:21:40 Fractal kernel: R10: 0000000000000000 R11: ffffc9002186fee8 R12: 000000000000bbda
Jul  4 21:21:40 Fractal kernel: R13: ffff8881c13ac580 R14: ffff8881c84bfc00 R15: ffff88811176a100
Jul  4 21:21:40 Fractal kernel: FS:  0000000000000000(0000) GS:ffff88883e740000(0000) knlGS:0000000000000000
Jul  4 21:21:40 Fractal kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  4 21:21:40 Fractal kernel: CR2: 0000000200000002 CR3: 0000000251da4006 CR4: 00000000003706f0
Jul  4 21:21:40 Fractal kernel: note: arc_prune[1324] exited with irqs disabled
Jul  5 06:58:09 Sirius kernel: BUG: unable to handle page fault for address: 0000000200000000
Jul  5 06:58:09 Sirius kernel: #PF: supervisor instruction fetch in kernel mode
Jul  5 06:58:09 Sirius kernel: #PF: error_code(0x0010) - not-present page
Jul  5 06:58:09 Sirius kernel: PGD 8000000170a48067 P4D 8000000170a48067 PUD 0 
Jul  5 06:58:09 Sirius kernel: Oops: 0010 [#1] PREEMPT SMP PTI
Jul  5 06:58:09 Sirius kernel: CPU: 3 PID: 1079 Comm: arc_prune Tainted: P     U     O       6.8.12-Unraid #3
Jul  5 06:58:09 Sirius kernel: Hardware name: Gigabyte Technology Co., Ltd. C246N-WU2/C246N-WU2-CF, BIOS F2 11/09/2021
Jul  5 06:58:09 Sirius kernel: RIP: 0010:0x200000000
Jul  5 06:58:09 Sirius kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Jul  5 06:58:09 Sirius kernel: RSP: 0018:ffffc900005c7d30 EFLAGS: 00010246
Jul  5 06:58:09 Sirius kernel: RAX: 0000000200000000 RBX: ffff8883cb33a000 RCX: 0000000000000011
Jul  5 06:58:09 Sirius kernel: RDX: ffffffffa0fe34b8 RSI: ffffc900005c7d68 RDI: ffff8882ad5d7380
Jul  5 06:58:09 Sirius kernel: RBP: ffffc900005c7dcc R08: 0000000000000000 R09: 0000000000000000
Jul  5 06:58:09 Sirius kernel: R10: 0000000000017c78 R11: 000000000000f40e R12: 000000000000cb06
Jul  5 06:58:09 Sirius kernel: R13: ffff8882ad5d7380 R14: ffff8881363196c0 R15: ffff8881060c8000
Jul  5 06:58:09 Sirius kernel: FS:  0000000000000000(0000) GS:ffff88884e580000(0000) knlGS:0000000000000000
Jul  5 06:58:09 Sirius kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  5 06:58:09 Sirius kernel: CR2: 0000000200000000 CR3: 0000000170508004 CR4: 00000000003706f0
Jul  5 06:58:09 Sirius kernel: Call Trace:
Jul  5 06:58:09 Sirius kernel: <TASK>
Jul  5 06:58:09 Sirius kernel: ? __die_body+0x1a/0x5c
Jul  5 06:58:09 Sirius kernel: ? page_fault_oops+0x332/0x37f
Jul  5 06:58:09 Sirius kernel: ? call_rcu+0x530/0x5e6
Jul  5 06:58:09 Sirius kernel: ? exc_page_fault+0xf9/0x116
Jul  5 06:58:09 Sirius kernel: ? asm_exc_page_fault+0x22/0x30
Jul  5 06:58:09 Sirius kernel: ? zfs_prune+0xec/0x2ec [zfs]
Jul  5 06:58:09 Sirius kernel: ? __schedule+0x69c/0x6e8
Jul  5 06:58:09 Sirius kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jul  5 06:58:09 Sirius kernel: ? arc_prune_task+0x1b/0x2e [zfs]
Jul  5 06:58:09 Sirius kernel: ? taskq_thread+0x2d4/0x3c1 [spl]
Jul  5 06:58:09 Sirius kernel: ? __pfx_default_wake_function+0x10/0x10
Jul  5 06:58:09 Sirius kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jul  5 06:58:09 Sirius kernel: ? kthread+0xf4/0xff
Jul  5 06:58:09 Sirius kernel: ? __pfx_kthread+0x10/0x10
Jul  5 06:58:09 Sirius kernel: ? ret_from_fork+0x21/0x36
Jul  5 06:58:09 Sirius kernel: ? __pfx_kthread+0x10/0x10
Jul  5 06:58:09 Sirius kernel: ? ret_from_fork_asm+0x1b/0x30
Jul  5 06:58:09 Sirius kernel: </TASK>
Jul  5 06:58:09 Sirius kernel: Modules linked in: bluetooth ecdh_generic ecc xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter md_mod tcp_diag inet_diag kvmgt mdev i915 drm_buddy ttm drm_display_helper drm_kms_helper drm intel_gtt agpgart ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs af_packet 8021q garp mrp bridge stp llc bonding tls e1000e igb i2c_algo_bit intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp zfs(PO) kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mei_pxp mei_hdcp rapl spl(O) gigabyte_wmi wmi_bmof i2c_i801 mei_me intel_cstate intel_uncore i2c_smbus nvme mei joydev i2c_core ahci input_leds led_class nvme_core libahci intel_pch_thermal fan thermal video tpm_crb wmi backlight tpm_tis tpm_tis_core tpm button acpi_pad [last unloaded: e1000e]
Jul  5 06:58:09 Sirius kernel: CR2: 0000000200000000
Jul  5 06:58:09 Sirius kernel: ---[ end trace 0000000000000000 ]---
Jul  5 06:58:09 Sirius kernel: RIP: 0010:0x200000000
Jul  5 06:58:09 Sirius kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Jul  5 06:58:09 Sirius kernel: RSP: 0018:ffffc900005c7d30 EFLAGS: 00010246
Jul  5 06:58:09 Sirius kernel: RAX: 0000000200000000 RBX: ffff8883cb33a000 RCX: 0000000000000011
Jul  5 06:58:09 Sirius kernel: RDX: ffffffffa0fe34b8 RSI: ffffc900005c7d68 RDI: ffff8882ad5d7380
Jul  5 06:58:09 Sirius kernel: RBP: ffffc900005c7dcc R08: 0000000000000000 R09: 0000000000000000
Jul  5 06:58:09 Sirius kernel: R10: 0000000000017c78 R11: 000000000000f40e R12: 000000000000cb06
Jul  5 06:58:09 Sirius kernel: R13: ffff8882ad5d7380 R14: ffff8881363196c0 R15: ffff8881060c8000
Jul  5 06:58:09 Sirius kernel: FS:  0000000000000000(0000) GS:ffff88884e580000(0000) knlGS:0000000000000000
Jul  5 06:58:09 Sirius kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  5 06:58:09 Sirius kernel: CR2: 0000000200000000 CR3: 0000000170508004 CR4: 00000000003706f0
Jul  5 06:58:09 Sirius kernel: note: arc_prune[1079] exited with irqs disabled
Jul  1 09:09:32 SofaKing kernel: BUG: unable to handle page fault for address: 0000000200000000
Jul  1 09:09:32 SofaKing kernel: #PF: supervisor instruction fetch in kernel mode
Jul  1 09:09:32 SofaKing kernel: #PF: error_code(0x0010) - not-present page
Jul  1 09:09:32 SofaKing kernel: PGD 0 P4D 0 
Jul  1 09:09:32 SofaKing kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jul  1 09:09:32 SofaKing kernel: CPU: 8 PID: 1547 Comm: arc_prune Tainted: P           O       6.8.12-Unraid #3
Jul  1 09:09:32 SofaKing kernel: Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS (WI-FI), BIOS 4602 02/23/2023
Jul  1 09:09:32 SofaKing kernel: RIP: 0010:0x200000000
Jul  1 09:09:32 SofaKing kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Jul  1 09:09:32 SofaKing kernel: RSP: 0018:ffffc90001267d30 EFLAGS: 00010246
Jul  1 09:09:32 SofaKing kernel: RAX: 0000000200000000 RBX: ffff8884254b0000 RCX: 0000000000000011
Jul  1 09:09:32 SofaKing kernel: RDX: ffffffffa43294b8 RSI: ffffc90001267d68 RDI: ffff88866784b780
Jul  1 09:09:32 SofaKing kernel: RBP: ffffc90001267dcc R08: 0000000000000000 R09: 00000000001d001c
Jul  1 09:09:32 SofaKing kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000c1d1
Jul  1 09:09:32 SofaKing kernel: R13: ffff88866784b780 R14: ffff88810160a0c0 R15: ffff88811025e180
Jul  1 09:09:32 SofaKing kernel: FS:  0000000000000000(0000) GS:ffff888feea00000(0000) knlGS:0000000000000000
Jul  1 09:09:32 SofaKing kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  1 09:09:32 SofaKing kernel: CR2: 0000000200000000 CR3: 0000000210406000 CR4: 0000000000750ef0
Jul  1 09:09:32 SofaKing kernel: PKRU: 55555554
Jul  1 09:09:32 SofaKing kernel: Call Trace:
Jul  1 09:09:32 SofaKing kernel: <TASK>
Jul  1 09:09:32 SofaKing kernel: ? __die_body+0x1a/0x5c
Jul  1 09:09:32 SofaKing kernel: ? page_fault_oops+0x332/0x37f
Jul  1 09:09:32 SofaKing kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Jul  1 09:09:32 SofaKing kernel: ? spl_kmem_cache_free+0x3a/0x180 [spl]
Jul  1 09:09:32 SofaKing kernel: ? exc_page_fault+0xf9/0x116
Jul  1 09:09:32 SofaKing kernel: ? asm_exc_page_fault+0x22/0x30
Jul  1 09:09:32 SofaKing kernel: ? zfs_prune+0xef/0x2ec [zfs]
Jul  1 09:09:32 SofaKing kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jul  1 09:09:32 SofaKing kernel: ? arc_prune_task+0x1e/0x2e [zfs]
Jul  1 09:09:32 SofaKing kernel: ? taskq_thread+0x2d7/0x3c1 [spl]
Jul  1 09:09:32 SofaKing kernel: ? __pfx_default_wake_function+0x10/0x10
Jul  1 09:09:32 SofaKing kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jul  1 09:09:32 SofaKing kernel: ? kthread+0xf7/0xff
Jul  1 09:09:32 SofaKing kernel: ? __pfx_kthread+0x10/0x10
Jul  1 09:09:32 SofaKing kernel: ? ret_from_fork+0x24/0x36
Jul  1 09:09:32 SofaKing kernel: ? __pfx_kthread+0x10/0x10
Jul  1 09:09:32 SofaKing kernel: ? ret_from_fork_asm+0x1b/0x30
Jul  1 09:09:32 SofaKing kernel: </TASK>
Jul  1 09:09:32 SofaKing kernel: Modules linked in: xt_nat veth nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_mangle iptable_mangle vhost_net vhost vhost_iotlb nvidia_uvm(PO) nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod xt_tcpudp xt_mark tun nf_tables nfnetlink ip6table_nat tcp_diag inet_diag nct6775 nct6775_core hwmon_vid iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs 8021q garp mrp macvtap macvlan tap bridge stp llc atlantic r8169 realtek edac_mce_amd edac_core intel_rapl_common iosf_mbi nvidia_drm(PO) nvidia_modeset(PO) kvm_amd zfs(PO) nvidia(PO) video kvm drm_kms_helper spl(O) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel btusb sha512_ssse3 drm sha256_ssse3 sha1_ssse3 btrtl aesni_intel btbcm
Jul  1 09:09:32 SofaKing kernel: crypto_simd btintel cp210x cryptd wmi_bmof rapl joydev bluetooth input_leds acpi_cpufreq i2c_piix4 backlight k10temp usbserial nvme ccp ahci i2c_core ecdh_generic libahci ecc nvme_core led_class wmi tpm_crb tpm_tis tpm_tis_core tpm button [last unloaded: atlantic]
Jul  1 09:09:32 SofaKing kernel: CR2: 0000000200000000
Jul  1 09:09:32 SofaKing kernel: ---[ end trace 0000000000000000 ]---
Jul  1 09:09:32 SofaKing kernel: RIP: 0010:0x200000000
Jul  1 09:09:32 SofaKing kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Jul  1 09:09:32 SofaKing kernel: RSP: 0018:ffffc90001267d30 EFLAGS: 00010246
Jul  1 09:09:32 SofaKing kernel: RAX: 0000000200000000 RBX: ffff8884254b0000 RCX: 0000000000000011
Jul  1 09:09:32 SofaKing kernel: RDX: ffffffffa43294b8 RSI: ffffc90001267d68 RDI: ffff88866784b780
Jul  1 09:09:32 SofaKing kernel: RBP: ffffc90001267dcc R08: 0000000000000000 R09: 00000000001d001c
Jul  1 09:09:32 SofaKing kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000c1d1
Jul  1 09:09:32 SofaKing kernel: R13: ffff88866784b780 R14: ffff88810160a0c0 R15: ffff88811025e180
Jul  1 09:09:32 SofaKing kernel: FS:  0000000000000000(0000) GS:ffff888feea00000(0000) knlGS:0000000000000000
Jul  1 09:09:32 SofaKing kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  1 09:09:32 SofaKing kernel: CR2: 0000000200000000 CR3: 0000000210406000 CR4: 0000000000750ef0
Jul  1 09:09:32 SofaKing kernel: PKRU: 55555554
Jul  1 09:09:32 SofaKing kernel: note: arc_prune[1547] exited with irqs disabled
Jun 30 03:00:45 unRAID kernel: BUG: unable to handle page fault for address: 0000000200000002
Jun 30 03:00:45 unRAID kernel: #PF: supervisor instruction fetch in kernel mode
Jun 30 03:00:45 unRAID kernel: #PF: error_code(0x0010) - not-present page
Jun 30 03:00:45 unRAID kernel: PGD c1420c067 P4D c1420c067 PUD 0 
Jun 30 03:00:45 unRAID kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jun 30 03:00:45 unRAID kernel: CPU: 3 PID: 9677 Comm: arc_prune Tainted: P           O       6.8.12-Unraid #3
Jun 30 03:00:45 unRAID kernel: Hardware name: ASRock Z690 PG Riptide/Z690 PG Riptide, BIOS 18.04 06/07/2024
Jun 30 03:00:45 unRAID kernel: RIP: 0010:0x200000002
Jun 30 03:00:45 unRAID kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jun 30 03:00:45 unRAID kernel: RSP: 0018:ffffc9000086bd30 EFLAGS: 00010246
Jun 30 03:00:45 unRAID kernel: RAX: 0000000200000002 RBX: ffff8885cf7f4000 RCX: 0000000000000011
Jun 30 03:00:45 unRAID kernel: RDX: ffffffffa0fd84b8 RSI: ffffc9000086bd68 RDI: ffff888294f3f780
Jun 30 03:00:45 unRAID kernel: RBP: ffffc9000086bdcc R08: 0000000000000000 R09: ffff88810b63f548
Jun 30 03:00:45 unRAID kernel: R10: 000000000000065a R11: 0000000000000699 R12: 000000000002040e
Jun 30 03:00:45 unRAID kernel: R13: ffff888294f3f780 R14: ffff88813cdfbe00 R15: ffff88813beba080
Jun 30 03:00:45 unRAID kernel: FS:  0000000000000000(0000) GS:ffff88904f2c0000(0000) knlGS:0000000000000000
Jun 30 03:00:45 unRAID kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 30 03:00:45 unRAID kernel: CR2: 0000000200000002 CR3: 00000002c443a000 CR4: 0000000000752ef0
Jun 30 03:00:45 unRAID kernel: PKRU: 55555554
Jun 30 03:00:45 unRAID kernel: Call Trace:
Jun 30 03:00:45 unRAID kernel: <TASK>
Jun 30 03:00:45 unRAID kernel: ? __die_body+0x1a/0x5c
Jun 30 03:00:45 unRAID kernel: ? page_fault_oops+0x332/0x37f
Jun 30 03:00:45 unRAID kernel: ? call_rcu+0x530/0x5e6
Jun 30 03:00:45 unRAID kernel: ? exc_page_fault+0xf9/0x116
Jun 30 03:00:45 unRAID kernel: ? asm_exc_page_fault+0x22/0x30
Jun 30 03:00:45 unRAID kernel: ? zfs_prune+0xec/0x2ec [zfs]
Jun 30 03:00:45 unRAID kernel: ? autoremove_wake_function+0xe/0x33
Jun 30 03:00:45 unRAID kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jun 30 03:00:45 unRAID kernel: ? arc_prune_task+0x1b/0x2e [zfs]
Jun 30 03:00:45 unRAID kernel: ? taskq_thread+0x2d4/0x3c1 [spl]
Jun 30 03:00:45 unRAID kernel: ? __pfx_default_wake_function+0x10/0x10
Jun 30 03:00:45 unRAID kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jun 30 03:00:45 unRAID kernel: ? kthread+0xf4/0xff
Jun 30 03:00:45 unRAID kernel: ? __pfx_kthread+0x10/0x10
Jun 30 03:00:45 unRAID kernel: ? ret_from_fork+0x21/0x36
Jun 30 03:00:45 unRAID kernel: ? __pfx_kthread+0x10/0x10
Jun 30 03:00:45 unRAID kernel: ? ret_from_fork_asm+0x1b/0x30
Jun 30 03:00:45 unRAID kernel: </TASK>
Jun 30 03:00:45 unRAID kernel: Modules linked in: veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_nat xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter bridge dm_crypt dm_mod nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) spl(O) tcp_diag inet_diag nct6775 nct6775_core hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap 8021q garp mrp stp llc mlx4_en xe drm_gpuvm drm_exec gpu_sched drm_ttm_helper drm_suballoc_helper x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm drm_buddy crct10dif_pclmul crc32_pclmul ttm crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 i2c_algo_bit sha1_ssse3 drm_display_helper aesni_intel crypto_simd cryptd drm_kms_helper mei_hdcp mei_pxp rapl drm processor_thermal_device_pci i2c_i801 mei_me processor_thermal_device intel_cstate
Jun 30 03:00:45 unRAID kernel: processor_thermal_wt_hint wmi_bmof mlx4_core intel_gtt intel_uncore processor_thermal_rfim nvme processor_thermal_rapl intel_rapl_common processor_thermal_wt_req i2c_smbus ahci agpgart mei input_leds processor_thermal_power_floor nvme_core libahci joydev processor_thermal_mbox led_class i2c_core int340x_thermal_zone iosf_mbi tpm_crb tpm_tis video tpm_tis_core wmi tpm int3400_thermal backlight acpi_thermal_rel acpi_pad acpi_tad button
Jun 30 03:00:45 unRAID kernel: CR2: 0000000200000002
Jun 30 03:00:45 unRAID kernel: ---[ end trace 0000000000000000 ]---
Jun 30 03:00:45 unRAID kernel: RIP: 0010:0x200000002
Jun 30 03:00:45 unRAID kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jun 30 03:00:45 unRAID kernel: RSP: 0018:ffffc9000086bd30 EFLAGS: 00010246
Jun 30 03:00:45 unRAID kernel: RAX: 0000000200000002 RBX: ffff8885cf7f4000 RCX: 0000000000000011
Jun 30 03:00:45 unRAID kernel: RDX: ffffffffa0fd84b8 RSI: ffffc9000086bd68 RDI: ffff888294f3f780
Jun 30 03:00:45 unRAID kernel: RBP: ffffc9000086bdcc R08: 0000000000000000 R09: ffff88810b63f548
Jun 30 03:00:45 unRAID kernel: R10: 000000000000065a R11: 0000000000000699 R12: 000000000002040e
Jun 30 03:00:45 unRAID kernel: R13: ffff888294f3f780 R14: ffff88813cdfbe00 R15: ffff88813beba080
Jun 30 03:00:45 unRAID kernel: FS:  0000000000000000(0000) GS:ffff88904f2c0000(0000) knlGS:0000000000000000
Jun 30 03:00:45 unRAID kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 30 03:00:45 unRAID kernel: CR2: 0000000200000002 CR3: 00000002c443a000 CR4: 0000000000752ef0
Jun 30 03:00:45 unRAID kernel: PKRU: 55555554
Jun 30 03:00:45 unRAID kernel: note: arc_prune[9677] exited with irqs disabled

@webdock-io

webdock-io commented Jul 23, 2024

We've been hitting what looks to be this issue ever since we launched our new infrastructure, all running on the latest Ubuntu kernel and ZFS. Once we had migrated some hundreds of container workloads, we started experiencing crashes. We've been very unfortunate in our crash dump collection, but we've ascertained that the crashes we are seeing are very similar to this one; please see the LXCFS issue linked above.

Symptoms as we see them:

  1. Triggered by read-heavy workloads. The more reads you have, the more likely this is to happen.
  2. Raising the ARC cache to the maximum we are able to has greatly mitigated these crashes. It seems like reading from ARC in RAM prevents this from happening to some extent. (We also spread out read-heavy workloads across different hosts as much as we could, which also seems to have helped.)
  3. The issue seems to affect LXCFS, likely due to memory corruption.

ZFS versions where we've seen this: zfs-2.2.2-0ubuntu9 to zfs-2.2.4-1 / zfs-kmod-2.2.4-1 (Zabbly)
Kernel versions where we've seen this: 6.8.0-31-generic #31-Ubuntu to 6.9.10-zabbly+ #ubuntu24.04

These workloads were all stable for years on older kernels.

This is a real issue and I would not be surprised to learn that a lot of zfs users out there are being affected by this now. It took us a long time to track down the source of our crashes, and I expect others may be in the same situation. I believe this issue warrants immediate attention, especially since upgrading to the latest mainline-ish kernel and zfs does not seem to resolve this.

@mihalicyn

Taking into account that this crash happens in shrinker-related code, I can make a wild guess that this issue should be provokable by something like echo 3 > /proc/sys/vm/drop_caches (don't try this on production systems!).

@maxpoulin64
Author

Taking into account that this crash happens in shrinker-related code, I can make a wild guess that this issue should be provokable by something like echo 3 > /proc/sys/vm/drop_caches (don't try this on production systems!).

I tried that and it did not crash my system or trigger the issue. AFAIK the ZFS ARC is separate.

My guess would be that it's hitting a race condition of sorts under heavy I/O, where it has to evict a lot out of the ARC.

@danieldietsch

I have a similar issue with kernel 6.8.12, Docker 27.0.3, and ZFS 2.2.4 (zfs-2.2.4-r0-gentoo) on Gentoo. My trace does not include arc_prune but does include zfs_prune.

[Thu Jul 25 10:11:50 2024] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[Thu Jul 25 10:11:50 2024] BUG: unable to handle page fault for address: ffff888209535180
[Thu Jul 25 10:11:50 2024] #PF: supervisor instruction fetch in kernel mode
[Thu Jul 25 10:11:50 2024] #PF: error_code(0x0011) - permissions violation
[Thu Jul 25 10:11:50 2024] PGD 3001067 P4D 3001067 PUD 81f5f2067 PMD 40b61e063 PTE 8000000209535063
[Thu Jul 25 10:11:50 2024] Oops: 0011 [#1] SMP PTI
[Thu Jul 25 10:11:50 2024] CPU: 2 PID: 3297 Comm: arc_prune Tainted: P           O       6.8.12-gentoo #1
[Thu Jul 25 10:11:50 2024] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Z77X-D3H, BIOS F18i 01/06/2014
[Thu Jul 25 10:11:50 2024] RIP: 0010:0xffff888209535180
[Thu Jul 25 10:11:50 2024] Code: 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 85 e8 4b 8d 82 88
[Thu Jul 25 10:11:50 2024] RSP: 0018:ffffc900004ebd20 EFLAGS: 00010246
[Thu Jul 25 10:11:50 2024] RAX: ffff8881031a5c40 RBX: ffff8883be0ce000 RCX: 00000000031a5c31
[Thu Jul 25 10:11:50 2024] RDX: 0000000000000000 RSI: ffffc900004ebd68 RDI: ffff88828d4be880
[Thu Jul 25 10:11:50 2024] RBP: ffff88828d4be880 R08: 0000000000000000 R09: 0000000000000000
[Thu Jul 25 10:11:50 2024] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000206
[Thu Jul 25 10:11:50 2024] R13: ffffc900004ebdcc R14: ffff8883be0ce0f8 R15: ffff888108505200
[Thu Jul 25 10:11:50 2024] FS:  0000000000000000(0000) GS:ffff8887ff300000(0000) knlGS:0000000000000000
[Thu Jul 25 10:11:50 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jul 25 10:11:50 2024] CR2: ffff888209535180 CR3: 000000000262e001 CR4: 00000000001706f0
[Thu Jul 25 10:11:50 2024] Call Trace:
[Thu Jul 25 10:11:50 2024]  <TASK>
[Thu Jul 25 10:11:50 2024]  ? __die+0x1a/0x60
[Thu Jul 25 10:11:50 2024]  ? page_fault_oops+0x158/0x430
[Thu Jul 25 10:11:50 2024]  ? search_extable+0x22/0x30
[Thu Jul 25 10:11:50 2024]  ? search_module_extables+0x9/0x30
[Thu Jul 25 10:11:50 2024]  ? fixup_exception+0x1d/0x240
[Thu Jul 25 10:11:50 2024]  ? exc_page_fault+0x28a/0x580
[Thu Jul 25 10:11:50 2024]  ? asm_exc_page_fault+0x22/0x30
[Thu Jul 25 10:11:50 2024]  ? zfs_prune+0x9b/0x3f0 [zfs]
[Thu Jul 25 10:11:50 2024]  ? __switch_to_asm+0x3a/0x60
[Thu Jul 25 10:11:50 2024]  ? __switch_to_asm+0x34/0x60
[Thu Jul 25 10:11:50 2024]  ? zpl_prune_sb+0x2f/0x1780 [zfs]
[Thu Jul 25 10:11:50 2024]  ? arc_getbuf_func+0x26/0x340 [zfs]
[Thu Jul 25 10:11:50 2024]  ? taskq_dispatch+0x48f/0x680 [spl]
[Thu Jul 25 10:11:50 2024]  ? wake_up_state+0x10/0x10
[Thu Jul 25 10:11:50 2024]  ? taskq_dispatch+0x240/0x680 [spl]
[Thu Jul 25 10:11:50 2024]  ? kthread+0xc4/0xf0
[Thu Jul 25 10:11:50 2024]  ? kthread_complete_and_exit+0x20/0x20
[Thu Jul 25 10:11:50 2024]  ? ret_from_fork+0x28/0x40
[Thu Jul 25 10:11:50 2024]  ? kthread_complete_and_exit+0x20/0x20
[Thu Jul 25 10:11:50 2024]  ? ret_from_fork_asm+0x11/0x20
[Thu Jul 25 10:11:50 2024]  </TASK>
[Thu Jul 25 10:11:50 2024] Modules linked in: wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 br_netfilter em28xx_rc si2157 si2168 bridge stp llc xt_MASQUERADE xt_addrtype zfs(PO) spl(O) xt_LOG nf_log_syslog ip6t_REJECT nf_reject_ipv6 em28xx_alsa ip6table_filter ip6_tables drxk em28xx_dvb snd_hda_codec_hdmi snd_hda_codec_via snd_hda_codec_generic led_class snd_hda_intel adm1021 snd_intel_dspcfg snd_hda_codec em28xx x86_pkg_temp_thermal snd_hda_core i915 cdc_acm tveeprom snd_pcm atl1c mpt3sas snd_timer i2c_algo_bit raid_class drm_buddy scsi_transport_sas drm_display_helper fan ttm video evdev wmi
[Thu Jul 25 10:11:50 2024] CR2: ffff888209535180
[Thu Jul 25 10:11:50 2024] ---[ end trace 0000000000000000 ]---
[Thu Jul 25 10:11:50 2024] RIP: 0010:0xffff888209535180
[Thu Jul 25 10:11:50 2024] Code: 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 85 e8 4b 8d 82 88
[Thu Jul 25 10:11:50 2024] RSP: 0018:ffffc900004ebd20 EFLAGS: 00010246
[Thu Jul 25 10:11:50 2024] RAX: ffff8881031a5c40 RBX: ffff8883be0ce000 RCX: 00000000031a5c31
[Thu Jul 25 10:11:50 2024] RDX: 0000000000000000 RSI: ffffc900004ebd68 RDI: ffff88828d4be880
[Thu Jul 25 10:11:50 2024] RBP: ffff88828d4be880 R08: 0000000000000000 R09: 0000000000000000
[Thu Jul 25 10:11:50 2024] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000206
[Thu Jul 25 10:11:50 2024] R13: ffffc900004ebdcc R14: ffff8883be0ce0f8 R15: ffff888108505200
[Thu Jul 25 10:11:50 2024] FS:  0000000000000000(0000) GS:ffff8887ff300000(0000) knlGS:0000000000000000
[Thu Jul 25 10:11:50 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jul 25 10:11:50 2024] CR2: ffff888209535180 CR3: 000000000262e001 CR4: 00000000001706f0
[Thu Jul 25 10:11:50 2024] note: arc_prune[3297] exited with irqs disabled

@andrebrait

@1JorgeB any chance switching storage drivers might be a reliable workaround until this gets resolved, versus using a btrfs image?

@Xynonners

Xynonners commented Aug 1, 2024

Same issue here, on 6.8.9. It crashes after an AI training workload runs for a few hours.

Will revert to 6.6.

EDIT: 6.6 is stable (I am not using Docker, just standard filesystem reading and writing).

@satmandu
Contributor

satmandu commented Aug 1, 2024

FYI, OpenZFS has supported Docker's overlay2 storage driver since 2.2. See moby/moby#46337 (comment)

I gave up on the Docker zfs storage driver some time ago, as it was pretty buggy.

If you are starting Docker with systemd, you can modify the startup line to be:

ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock -s overlay2 

Using overlay2, I don't have any issues running docker run -it --rm -p 8080:8080 --gpus all --name localai quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas on a kernel newer than 6.9 with ZFS built from #16359. In fairness, I'm also adding some patches from PRs on top of that so that I can be on kernel 6.11.0-rc1...
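If you'd rather not edit the systemd unit, the same switch can be expressed through Docker's documented storage-driver key in /etc/docker/daemon.json (a sketch of the equivalent config; restart the Docker daemon after editing, and note that images and containers created under the zfs driver won't be visible while running under overlay2):

```json
{
  "storage-driver": "overlay2"
}
```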

@satmandu
Contributor

satmandu commented Aug 1, 2024

Maybe #16401 and #16404 might also help track down the issue?

@putnam

putnam commented Aug 29, 2024

Similar situation here, this time with 6.10.4 and zfs 2.2.5-1 on Debian.

This occurred during a docker compose pull while some image layers were being extracted in tandem. I am also using the zfs driver.

I was forced to power cycle after this, which incidentally upgraded my kernel to 6.10.6, and the same pull succeeded fine afterward. But the ARC conditions would also be entirely different after a fresh boot, so I doubt the kernel upgrade mattered. Just some info.

FWIW, my kernel is in lockdown mode due to secure boot.

Aug 29 04:25:08 server kernel: PGD 0 P4D 0
Aug 29 04:25:08 server kernel: Oops: Oops: 0010 [#1] PREEMPT SMP NOPTI
Aug 29 04:25:08 server kernel: CPU: 8 PID: 1168 Comm: arc_prune Tainted: P           O       6.10.4-amd64 #1  Debian 6.10.4-1
Aug 29 04:25:08 server kernel: Hardware name: Supermicro Super Server/H12SSL-CT, BIOS 2.8 02/27/2024
Aug 29 04:25:08 server kernel: RIP: 0010:0x200000000
Aug 29 04:25:08 server kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Aug 29 04:25:08 server kernel: RSP: 0018:ffffbdfb4be4fce8 EFLAGS: 00010246
Aug 29 04:25:08 server kernel: RAX: 0000000200000000 RBX: ffffbdfb4be4fd9c RCX: 0000000000000000
Aug 29 04:25:08 server kernel: RDX: 0000000000000000 RSI: ffffbdfb4be4fd38 RDI: ffff961b87569400
Aug 29 04:25:08 server kernel: RBP: 0000000000023f7e R08: ffff96279b820000 R09: ffff9618fdb7da28
Aug 29 04:25:08 server kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff961b87569400
Aug 29 04:25:08 server kernel: R13: ffff96190575ff70 R14: ffff9618fdb7da90 R15: ffff96279b8200f8
Aug 29 04:25:08 server kernel: FS:  0000000000000000(0000) GS:ffff96378dc00000(0000) knlGS:0000000000000000
Aug 29 04:25:08 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 04:25:08 server kernel: CR2: 0000000200000000 CR3: 0000000f6e890000 CR4: 0000000000350ef0
Aug 29 04:25:08 server kernel: Call Trace:
Aug 29 04:25:08 server kernel:  <TASK>
Aug 29 04:25:08 server kernel:  ? __die+0x23/0x70
Aug 29 04:25:08 server kernel:  ? page_fault_oops+0x173/0x5a0
Aug 29 04:25:08 server kernel:  ? spl_kmem_cache_free+0x130/0x1e0 [spl]
Aug 29 04:25:08 server kernel:  ? exc_page_fault+0x7e/0x180
Aug 29 04:25:08 server kernel:  ? asm_exc_page_fault+0x26/0x30
Aug 29 04:25:08 server kernel:  ? zfs_prune+0xba/0x4e0 [zfs]
Aug 29 04:25:08 server kernel:  ? finish_task_switch.isra.0+0x97/0x2c0
Aug 29 04:25:08 server kernel:  ? srso_return_thunk+0x5/0x5f
Aug 29 04:25:08 server kernel:  ? zpl_prune_sb+0x38/0x60 [zfs]
Aug 29 04:25:08 server kernel:  ? arc_prune_task+0x22/0x40 [zfs]
Aug 29 04:25:08 server kernel:  ? taskq_thread+0x2ba/0x500 [spl]
Aug 29 04:25:08 server kernel:  ? __pfx_default_wake_function+0x10/0x10
Aug 29 04:25:08 server kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
Aug 29 04:25:08 server kernel:  ? kthread+0xd2/0x100
Aug 29 04:25:08 server kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 04:25:08 server kernel:  ? ret_from_fork+0x34/0x50
Aug 29 04:25:08 server kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 04:25:08 server kernel:  ? ret_from_fork_asm+0x1a/0x30
Aug 29 04:25:08 server kernel:  </TASK>
Aug 29 04:25:08 server kernel: Modules linked in: cpuid udp_diag tcp_diag inet_diag wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip>
Aug 29 04:25:08 server kernel:  watchdog k10temp ipmi_msghandler joydev evdev sg nvme_fabrics drm efi_pstore configfs nfnetlink ip_tables x_tables autofs4 zfs(PO) spl(O) efivarfs raid10 raid0 raid>
Aug 29 04:25:08 server kernel: CR2: 0000000200000000
Aug 29 04:25:08 server kernel: ---[ end trace 0000000000000000 ]---
Aug 29 04:25:08 server kernel: RIP: 0010:0x200000000
Aug 29 04:25:08 server kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Aug 29 04:25:08 server kernel: RSP: 0018:ffffbdfb4be4fce8 EFLAGS: 00010246
Aug 29 04:25:08 server kernel: RAX: 0000000200000000 RBX: ffffbdfb4be4fd9c RCX: 0000000000000000
Aug 29 04:25:08 server kernel: RDX: 0000000000000000 RSI: ffffbdfb4be4fd38 RDI: ffff961b87569400
Aug 29 04:25:08 server kernel: RBP: 0000000000023f7e R08: ffff96279b820000 R09: ffff9618fdb7da28
Aug 29 04:25:08 server kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff961b87569400
Aug 29 04:25:08 server kernel: R13: ffff96190575ff70 R14: ffff9618fdb7da90 R15: ffff96279b8200f8
Aug 29 04:25:08 server kernel: FS:  0000000000000000(0000) GS:ffff96378dc00000(0000) knlGS:0000000000000000
Aug 29 04:25:08 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 04:25:08 server kernel: CR2: 0000000200000000 CR3: 0000000f6e890000 CR4: 0000000000350ef0
Aug 29 04:25:08 server kernel: note: arc_prune[1168] exited with irqs disabled

@maxpoulin64
Author

Still reproducing in ZFS 2.2.6 and kernel 6.10.8-zen1-1-zen.

@mihalicyn

I was playing with this one today and trying to reproduce it:

# uname -a
Linux hoodin 6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

# modinfo zfs
filename:       /lib/modules/6.8.0-45-generic/kernel/zfs/zfs.ko.zst
version:        2.2.2-0ubuntu9

I've also enabled SLUB debugging (in /etc/default/grub)

GRUB_CMDLINE_LINUX_DEFAULT="slub_debug=FZPU"

and KFENCE:

echo 1 > /sys/module/kfence/parameters/sample_interval
echo 10 > /sys/module/kfence/parameters/skip_covered_thresh

With no results, unfortunately. I also tried limiting the amount of physical memory from 256 GiB to 128 GiB, 64 GiB, and 32 GiB, with the same outcome. No crashes on docker run -it --rm quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas.

Likely, it's a tricky race condition.
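For anyone else attempting this, both debug options can be made persistent across reboots from the kernel command line; kfence.sample_interval and kfence.skip_covered_thresh are the boot-time equivalents of the sysfs knobs above (a sketch of /etc/default/grub; run update-grub and reboot after editing):

```
GRUB_CMDLINE_LINUX_DEFAULT="slub_debug=FZPU kfence.sample_interval=1 kfence.skip_covered_thresh=10"
```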

@andrebrait

@mihalicyn maybe a silly question, but did you use Docker's zfs storage driver? On OpenZFS 2.2, it defaults to overlay2 IIRC

@mihalicyn

mihalicyn commented Oct 9, 2024

@mihalicyn maybe a silly question, but did you use Docker's zfs storage driver? On OpenZFS 2.2, it defaults to overlay2 IIRC

not a silly question at all ;-) When debugging stuff everything must be checked twice!

Yeah, I do have ZFS storage driver enabled manually:

# docker info | grep zfs
 Storage Driver: zfs
# cat /proc/self/mountinfo | grep docker
48 31 0:36 / /var/lib/docker rw,relatime shared:46 - zfs zpool-docker rw,xattr,noacl,casesensitive
# zpool status
  pool: zpool-docker
 state: ONLINE
config:

	NAME          STATE     READ WRITE CKSUM
	zpool-docker  ONLINE       0     0     0
	  sdb         ONLINE       0     0     0

errors: No known data errors

@maxpoulin64
Author

That is very odd, it crashes every single time for me. It's a guaranteed crash whenever I run that command.

I'll see if I can make a VM that reproduces. In the meantime, I can provide any debug log asked.

@mihalicyn

mihalicyn commented Oct 9, 2024

That is very odd, it crashes every single time for me. It's a guaranteed crash whenever I run that command.

Can you please try enabling SLUB debugging with the slub_debug=FZPU kernel parameter, and KFENCE (as I've described above)?
This can help identify where the issue is. KFENCE is safe to enable in a production environment.

I'll see if I can make a VM that reproduces. In the meantime, I can provide any debug log asked.

That would be awesome too!

@maxpoulin64
Author

maxpoulin64 commented Oct 10, 2024

Alright, I've been working on this for a bit and I haven't been able to reproduce on my laptop, nor in a VM on said laptop. I then did a quick sanity check and yup, it still crashes on the first try on my desktop.

So I decided to instrument the desktop, and then... nothing. It seems like slub_debug=FZPU slows the system down too much; the extraction in Docker was much slower and it just never crashed. I wiped the whole Docker state and its dangling datasets, rebooted without slub_debug=FZPU, and it crashed on the first try. I'm unable to reproduce with KFENCE enabled either, and I notice that with it enabled the extraction process is also rather slow.

The trace I got this time, however, is different: the kernel tried to execute an NX-protected page.

The only thing of note I can think of that might be a contributor is that this is a Threadripper 1950X system, 16C/32T with 32 GB of RAM, so it's a NUMA system with a relatively high core count, which leaves a lot of room for a race condition. Maybe if others in here share their specs we can correlate some things.

[  270.176092] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[  270.176099] BUG: unable to handle page fault for address: ffff94c9c44fc400
[  270.176101] #PF: supervisor instruction fetch in kernel mode
[  270.176103] #PF: error_code(0x0011) - permissions violation
[  270.176106] PGD 5de601067 P4D 5de601067 PUD 1266f6063 PMD 172372063 PTE 80000001c44fc163
[  270.176112] Oops: Oops: 0011 [#1] PREEMPT SMP NOPTI
[  270.176116] CPU: 18 PID: 469 Comm: arc_prune Tainted: P        W  OE      6.10.10-zen1-1-zen #1 bb2e27e975e263b999d5cf1514b44f0d982487fe
[  270.176119] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.B0 08/09/2018
[  270.176122] RIP: 0010:0xffff94c9c44fc400
[  270.176155] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 80 3e 45 00 c9 94 ff ff 00 00 00 00 01 01
[  270.176158] RSP: 0018:fffface8d0517cd8 EFLAGS: 00010246
[  270.176161] RAX: ffff94c9c44fc400 RBX: ffff94c96940e000 RCX: 0000000000000000
[  270.176163] RDX: 0000000000000000 RSI: fffface8d0517d20 RDI: ffff94c95b798000
[  270.176166] RBP: fffface8d0517d84 R08: 0000000000000000 R09: 0000000000000000
[  270.176168] R10: 0000000000000012 R11: 0000000000000001 R12: 00000000000007a3
[  270.176170] R13: ffff94c95b798000 R14: ffff94c96940e0f8 R15: ffff94cc842d0628
[  270.176172] FS:  0000000000000000(0000) GS:ffff94d06f700000(0000) knlGS:0000000000000000
[  270.176174] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.176177] CR2: ffff94c9c44fc400 CR3: 000000055dc9c000 CR4: 00000000003506f0
[  270.176179] Call Trace:
[  270.176182]  <TASK>
[  270.176184]  ? __die_body.cold+0x8/0x12
[  270.176190]  ? page_fault_oops+0x15a/0x2d0
[  270.176196]  ? exc_page_fault+0x18a/0x190
[  270.176200]  ? asm_exc_page_fault+0x26/0x30
[  270.176209]  ? zfs_prune+0xa9/0x4d0 [zfs a183bb1cfd40f08b536c8d802e186e2d90463da3]
[  270.176341]  ? zpl_prune_sb+0x36/0x60 [zfs a183bb1cfd40f08b536c8d802e186e2d90463da3]
[  270.176458]  ? arc_prune_task+0x22/0x40 [zfs a183bb1cfd40f08b536c8d802e186e2d90463da3]
[  270.176589]  ? taskq_thread+0x2e3/0x5d0 [spl 242ab66a9bc0261b2d10aafc27fd8fb6601368f0]
[  270.176600]  ? finish_task_switch.isra.0+0x99/0x2e0
[  270.176606]  ? __pfx_default_wake_function+0x10/0x10
[  270.176613]  ? __pfx_taskq_thread+0x10/0x10 [spl 242ab66a9bc0261b2d10aafc27fd8fb6601368f0]
[  270.176623]  ? kthread+0xd2/0x100
[  270.176627]  ? __pfx_kthread+0x10/0x10
[  270.176631]  ? ret_from_fork+0x34/0x50
[  270.176634]  ? __pfx_kthread+0x10/0x10
[  270.176637]  ? ret_from_fork_asm+0x1a/0x30
[  270.176644]  </TASK>
[  270.176646] Modules linked in: vhost_net vhost vhost_iotlb tap tun xt_conntrack nf_conntrack_netlink xfrm_user xfrm_algo ip6table_nat ip6table_filter ip6_tables xt_addrtype overlay rfcomm snd_seq_dummy snd_hrtimer snd_seq wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel bridge stp llc uhid cmac algif_hash algif_skcipher af_alg xt_MASQUERADE bnep xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc32c_generic iptable_filter vfat fat intel_rapl_msr amd_atl intel_rapl_common iwlmvm snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul dm_crypt crc32_pclmul snd_hda_scodec_component snd_hda_codec_hdmi crc32c_intel btusb mac80211 polyval_clmulni snd_usb_audio btrtl snd_hda_intel polyval_generic btintel snd_intel_dspcfg gf128mul snd_usbmidi_lib ghash_clmulni_intel btbcm snd_intel_sdw_acpi snd_ump btmtk sha512_ssse3 libarc4 snd_hda_codec snd_rawmidi joydev sha256_ssse3 bluetooth snd_hda_core
[  270.176729]  snd_seq_device sha1_ssse3 iwlwifi mc snd_hwdep aesni_intel snd_pcm crypto_simd mousedev cryptd razerkbd(OE) crc16 wmi_bmof snd_timer rapl igb cbc cfg80211 snd ptp encrypted_keys pcspkr sp5100_tco mxm_wmi pps_core soundcore gpio_amdpt rfkill dca i2c_piix4 k10temp gpio_generic mac_hid kvmfr(OE) sg crypto_user loop nfnetlink ip_tables x_tables hid_logitech_hidpp hid_steam ff_memless hid_logitech_dj hid_generic trusted asn1_encoder tee dm_mod usbhid amdgpu vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd video amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper kvm_amd nvme drm_buddy drm_display_helper nvme_core xhci_pci kvm cec ccp nvme_auth xhci_pci_renesas wmi zfs(POE) spl(OE) vendor_reset(OE) nct6775 nct6775_core hwmon_vid i2c_dev
[  270.176812] CR2: ffff94c9c44fc400
[  270.176815] ---[ end trace 0000000000000000 ]---
[  270.176817] RIP: 0010:0xffff94c9c44fc400
[  270.176825] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 80 3e 45 00 c9 94 ff ff 00 00 00 00 01 01
[  270.176827] RSP: 0018:fffface8d0517cd8 EFLAGS: 00010246
[  270.176830] RAX: ffff94c9c44fc400 RBX: ffff94c96940e000 RCX: 0000000000000000
[  270.176832] RDX: 0000000000000000 RSI: fffface8d0517d20 RDI: ffff94c95b798000
[  270.176834] RBP: fffface8d0517d84 R08: 0000000000000000 R09: 0000000000000000
[  270.176836] R10: 0000000000000012 R11: 0000000000000001 R12: 00000000000007a3
[  270.176839] R13: ffff94c95b798000 R14: ffff94c96940e0f8 R15: ffff94cc842d0628
[  270.176841] FS:  0000000000000000(0000) GS:ffff94d06f700000(0000) knlGS:0000000000000000
[  270.176843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.176845] CR2: ffff94c9c44fc400 CR3: 000000055dc9c000 CR4: 00000000003506f0
[  270.176848] note: arc_prune[469] exited with irqs disabled

@maxpoulin64
Author

maxpoulin64 commented Oct 10, 2024

Additional information that I think could help narrow it down: I think it's possibly related to the creation and destruction of datasets and snapshots. I never see it die during the extraction itself; it dies at the very end, when Docker commits the layer, whatever exactly it's doing there.

6ddbee975253: Extracting  352.2MB/352.2MB  <-- that's reporting 100% but never completes

That smells like a use-after-free race condition triggered by changes in datasets and snapshots. Is there an existing stress test for that I could try? I'll see if I can write one later this weekend when I have more time.
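For what it's worth, a churn loop along these lines could serve as a starting point for such a stress test. This is only a sketch under assumptions: it presumes a dedicated scratch pool named `tank` (hypothetical), and it is destructive, so don't point it at a pool holding real data.

```shell
#!/bin/sh
# Hypothetical dataset/snapshot churn test (scratch pool "tank" assumed).
# Creates, writes, snapshots and destroys datasets concurrently to stress
# the create/destroy paths suspected above.
set -u
for i in $(seq 1 50); do
    (
        zfs create "tank/churn$i" &&
        dd if=/dev/urandom of="/tank/churn$i/blob" bs=1M count=32 2>/dev/null &&
        zfs snapshot "tank/churn$i@s1" &&
        zfs destroy -r "tank/churn$i"
    ) &
done
wait
```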

@mihalicyn

mihalicyn commented Oct 10, 2024

> Alright, I've been working on this for a bit and I haven't been able to reproduce on my laptop, nor in a VM on said laptop. I then did a quick sanity check and yup, still crashes first time on my desktop.

Thanks for doing that!

> So I decided to instrument the desktop, and then... nothing. It seems like slub_debug=FZPU slows the system too much, the extracting was much slower in Docker and it just never crashed. I wiped the whole Docker state and its dangling datasets, rebooted without the slub_debug=FZPU and it crashed on the first try.

Yeah, it can make things way too slow. You can play with the slub_debug parameters and consider another combination; for example, ZP can be very useful.

> I'm also unable to reproduce with KFENCE enabled, and I also notice with that one enabled that the extraction process is rather slow.

You can try setting the kfence.sample_interval parameter to a bigger value instead of 1, say 50 or 100. That makes KFENCE less precise but may help reproduce the issue.
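For reference, kfence.sample_interval can be set on the kernel command line or adjusted at runtime through sysfs (paths as documented in the kernel's KFENCE docs):

```shell
# On the kernel command line (milliseconds between samples):
#   kfence.sample_interval=100
# Or at runtime, without rebooting:
echo 100 > /sys/module/kfence/parameters/sample_interval
cat /sys/module/kfence/parameters/sample_interval   # verify the new value
```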

@arcenik

arcenik commented Oct 14, 2024

I've observed the same error.

Context:

  • debian trixie
  • kernel 6.10.11-amd64
  • zfs 2.2.6-1
  • docker-ce 27.3.1
  • AMD Opteron(tm) X3421 APU
[Oct13 01:06] docker0: port 1(veth27f1644) entered disabled state
[  +0.000085] veth6b86ee8: renamed from eth0
[  +0.027785] docker0: port 1(veth27f1644) entered disabled state
[  +0.001856] veth27f1644 (unregistering): left allmulticast mode
[  +0.000010] veth27f1644 (unregistering): left promiscuous mode
[  +0.000025] docker0: port 1(veth27f1644) entered disabled state
[ +22.849012] BUG: unable to handle page fault for address: 0000000200000000
[  +0.000012] #PF: supervisor instruction fetch in kernel mode
[  +0.000003] #PF: error_code(0x0010) - not-present page
[  +0.000003] PGD 0 P4D 0 
[  +0.000004] Oops: Oops: 0010 [#1] PREEMPT SMP NOPTI
[  +0.000006] CPU: 3 PID: 983 Comm: arc_prune Tainted: P           OE      6.10.11-amd64 #1  Debian 6.10.11-1
[  +0.000004] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 02/19/2020
[  +0.000003] RIP: 0010:0x200000000
[  +0.000026] Code: Unable to access opcode bytes at 0x1ffffffd6.
[  +0.000002] RSP: 0018:ffffa84e80c7fce8 EFLAGS: 00010246
[  +0.000004] RAX: 0000000200000000 RBX: ffffa84e80c7fd9c RCX: 0000000000000000
[  +0.000002] RDX: 0000000000000000 RSI: ffffa84e80c7fd38 RDI: ffff995e5c8e0100
[  +0.000003] RBP: 0000000000000f23 R08: ffff995d13300000 R09: ffff995e025c8428
[  +0.000002] R10: 0000000000000001 R11: 0000000000000000 R12: ffff995e5c8e0100
[  +0.000002] R13: ffff995e02a1f4f0 R14: ffff995e025c8490 R15: ffff995d133000f8
[  +0.000002] FS:  0000000000000000(0000) GS:ffff9960ef580000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 0000000200000000 CR3: 00000001719a6000 CR4: 00000000001506f0
[  +0.000003] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  ? __die+0x23/0x70
[  +0.000008]  ? page_fault_oops+0x173/0x5a0
[  +0.000006]  ? exc_page_fault+0x7e/0x180
[  +0.000006]  ? asm_exc_page_fault+0x26/0x30
[  +0.000010]  ? zfs_prune+0xba/0x4e0 [zfs]
[  +0.000631]  ? finish_task_switch.isra.0+0x97/0x2c0
[  +0.000012]  ? zpl_prune_sb+0x38/0x60 [zfs]
[  +0.000579]  ? arc_prune_task+0x22/0x40 [zfs]
[  +0.000519]  ? taskq_thread+0x2ba/0x500 [spl]
[  +0.000025]  ? __pfx_default_wake_function+0x10/0x10
[  +0.000008]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[  +0.000015]  ? kthread+0xd2/0x100
[  +0.000004]  ? __pfx_kthread+0x10/0x10
[  +0.000004]  ? ret_from_fork+0x34/0x50
[  +0.000004]  ? __pfx_kthread+0x10/0x10
[  +0.000003]  ? ret_from_fork_asm+0x1a/0x30
[  +0.000005]  </TASK>
[  +0.000002] Modules linked in: xt_nat xt_tcpudp veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtyp>
[  +0.000091]  crct10dif_common usbcore crc32c_intel scsi_mod libphy usb_common i2c_piix4 scsi_common video wmi
[  +0.000016] CR2: 0000000200000000
[  +0.000004] ---[ end trace 0000000000000000 ]---

@TheUbuntuGuy

> The only thing of note I can think of that might be a contributor is this is a Threadripper 1950X system, 16C/32T with 32 GB of RAM, so it's a NUMA system and relatively high core count, which leaves a lot of room for a race condition. Maybe if others in here can share their specs we can correlate some things.

I've reproduced on every system I have in production, which are all "above average" core count systems. AMD Threadripper 3990X (64c/128t), Ryzen 3900X, 5900X (12c/24t). I can't try slub_debug or KFENCE in production, but I could try within KVM on the same systems.

@webdock-io

> The only thing of note I can think of that might be a contributor is this is a Threadripper 1950X system, 16C/32T with 32 GB of RAM, so it's a NUMA system and relatively high core count, which leaves a lot of room for a race condition. Maybe if others in here can share their specs we can correlate some things.

We also observed this across many high core count systems, in our case dual Xeon Platinum systems with 56C/112T

@mihalicyn

Hi @TheUbuntuGuy,

> I can't try slub_debug or KFENCE in production, but I could try within KVM on the same systems.

actually, you can enable KFENCE in production. It is designed to be enabled in production to debug issues like this one.

@TheUbuntuGuy

I reverted the kernel on my production systems to v6.6 because they were unusable due to this bug. I meant that I couldn't test using those options because it's a production system and I can't afford to crash it. I will try in a VM using the same CPU layout when I get the chance, probably with some automation to trigger the crash over and over, since KFENCE is sampled and may not catch the issue quickly.

@snajpa
Contributor

snajpa commented Oct 30, 2024

@TheUbuntuGuy thanks a lot, can you please try with compression=off? there seems to also be a bug in lz4 lurking that's best to rule out when we can :)

@snajpa
Contributor

snajpa commented Oct 30, 2024

either I'm just hilariously wandering in the dark and creating more problems along the way than I'm solving, or we're really uncovering bug after bug in low-memory conditions...

Update: the compression=off test is not needed; I shot myself in the foot and forgot to zhold in zfs_mknode when not allocating the znode.

@TheOnlyMarkus

This bug occurred to me today on my backup server which is not running any docker containers.

The backup server is based on Ubuntu 22.04.5 LTS

~# uname -a
Linux bkp1 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

The Intel Core i7-4770 is rather old featuring only 4C/8T and 32GB DDR3 RAM.

I installed zfs using apt install zfsutils-linux. Currently running version 2.1.5-1ubuntu6~22.04.4, which makes me wonder if the bug has been in ZFS for a while now.

dirvish is handling my backups which uses rsync with --link-dest for hardlink deduplication.

The crash happened 24 hours (+- 15 minutes according to dmesg) after I started to move very big directories with lots of hard linked files from my base pool to freshly created datasets.

How the bug occurred:
First I created the datasets for each server using:

  • zfs create backup/dirvish
  • zfs create backup/dirvish/srv1
  • zfs create backup/dirvish/srv2
    ...

Then I moved the directories to the newly created datasets:

  • mv /backup/srv1/* /backup/dirvish/srv1 &
  • mv /backup/srv2/* /backup/dirvish/srv2 &
    ...

I did this for all 11 servers. This, of course, caused a very heavy iowait of up to 70%.
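The steps above, run for all 11 servers, amount to something like the following sketch (server names are illustrative):

```shell
#!/bin/sh
# One dataset per server, then move each server's backups into it.
# The mv processes run in parallel, which produced the ~70% iowait.
for i in $(seq 1 11); do
    zfs create "backup/dirvish/srv$i"
    mv "/backup/srv$i/"* "/backup/dirvish/srv$i" &
done
wait
```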

Currently, there are numerous processes stalled in D state waiting for disk IO, and the whole server is idling with no disk activity. Besides the mv processes, there are 3 kernel threads stalled in the same D state: dbuf_evict, z_wr_int_1 and txg_sync.

I've managed to dump the entire system memory to my /root directory on ext4 using makedumpfile and analyze it using crash along with the installed dbgsyms of the currently running kernel. Here are the back traces of each kernel thread:

crash> bt 879
PID: 879    TASK: ffff9bb686fd9980  CPU: 3   COMMAND: "dbuf_evict"
 #0 [ffffb31680e03ce0] __schedule at ffffffff89fd585e
 #1 [ffffb31680e03d68] schedule at ffffffff89fd5c09
 #2 [ffffb31680e03d88] schedule_preempt_disabled at ffffffff89fd606e
 #3 [ffffb31680e03d98] __mutex_lock.constprop.0 at ffffffff89fd6db7
 #4 [ffffb31680e03e10] __mutex_lock_slowpath at ffffffff89fd70d3
 #5 [ffffb31680e03e20] mutex_lock at ffffffff89fd7118
 #6 [ffffb31680e03e38] arc_buf_destroy at ffffffffc1089acd [zfs]
 #7 [ffffb31680e03e68] dbuf_destroy at ffffffffc109cbf1 [zfs]
 #8 [ffffb31680e03eb0] dbuf_evict_one at ffffffffc109d0a7 [zfs]
 #9 [ffffb31680e03ed8] dbuf_evict_thread at ffffffffc109d1b2 [zfs]
#10 [ffffb31680e03ee8] thread_generic_wrapper at ffffffffc0dd5b44 [spl]
#11 [ffffb31680e03f10] kthread at ffffffff892f13ea
#12 [ffffb31680e03f50] ret_from_fork at ffffffff89204ca2
crash> bt 1000
PID: 1000   TASK: ffff9bb68c970000  CPU: 4   COMMAND: "z_wr_int_1"
 #0 [ffffb31682007a20] __schedule at ffffffff89fd585e
 #1 [ffffb31682007aa8] schedule at ffffffff89fd5c09
 #2 [ffffb31682007ac8] schedule_preempt_disabled at ffffffff89fd606e
 #3 [ffffb31682007ad8] __mutex_lock.constprop.0 at ffffffff89fd6db7
 #4 [ffffb31682007b50] __mutex_lock_slowpath at ffffffff89fd70d3
 #5 [ffffb31682007b60] mutex_lock at ffffffff89fd7118
 #6 [ffffb31682007b78] buf_hash_find at ffffffffc1081703 [zfs]
 #7 [ffffb31682007bc0] arc_freed at ffffffffc108e635 [zfs]
 #8 [ffffb31682007be8] zio_free_sync at ffffffffc119f6b9 [zfs]
 #9 [ffffb31682007c30] zio_free at ffffffffc119f840 [zfs]
#10 [ffffb31682007c58] dsl_free at ffffffffc10fa781 [zfs]
#11 [ffffb31682007c68] dsl_dataset_block_kill at ffffffffc10d64a7 [zfs]
#12 [ffffb31682007cd0] dbuf_write_done at ffffffffc109dbb1 [zfs]
#13 [ffffb31682007d10] arc_write_done at ffffffffc1087cff [zfs]
#14 [ffffb31682007d58] zio_done at ffffffffc11a2f0d [zfs]
#15 [ffffb31682007dc8] zio_execute at ffffffffc119bb92 [zfs]
#16 [ffffb31682007e00] taskq_thread at ffffffffc0dd4fd1 [spl]
#17 [ffffb31682007f10] kthread at ffffffff892f13ea
#18 [ffffb31682007f50] ret_from_fork at ffffffff89204ca2
crash> bt 2781
PID: 2781   TASK: ffff9bb69bd7cc80  CPU: 7   COMMAND: "txg_sync"
 #0 [ffffb3168e32bb48] __schedule at ffffffff89fd585e
 #1 [ffffb3168e32bbd0] schedule at ffffffff89fd5c09
 #2 [ffffb3168e32bbf0] schedule_timeout at ffffffff89fd9fb7
 #3 [ffffb3168e32bc50] io_schedule_timeout at ffffffff89fd55e1
 #4 [ffffb3168e32bc70] __cv_timedwait_common at ffffffffc0dcd71c [spl]
 #5 [ffffb3168e32bce0] __cv_timedwait_io at ffffffffc0dcd799 [spl]
 #6 [ffffb3168e32bcf0] zio_wait at ffffffffc11a0096 [zfs]
 #7 [ffffb3168e32bd30] dsl_pool_sync at ffffffffc10ef436 [zfs]
 #8 [ffffb3168e32bdb8] spa_sync_iterate_to_convergence at ffffffffc111eb70 [zfs]
 #9 [ffffb3168e32be20] spa_sync at ffffffffc111f23c [zfs]
#10 [ffffb3168e32be58] txg_sync_thread at ffffffffc1139d66 [zfs]
#11 [ffffb3168e32bee8] thread_generic_wrapper at ffffffffc0dd5b44 [spl]
#12 [ffffb3168e32bf10] kthread at ffffffff892f13ea
#13 [ffffb3168e32bf50] ret_from_fork at ffffffff89204ca2

I will let the system running in its bugged state for a few days and I'm happy to provide any additional information if needed.

FYI: the stack trace shows some aes and crypt lines because of ZFS-based disk encryption.

[85931.523608] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[85931.523615] BUG: unable to handle page fault for address: ffffb316815fbce4
[85931.523616] #PF: supervisor instruction fetch in kernel mode
[85931.523618] #PF: error_code(0x0011) - permissions violation
[85931.523620] PGD 100000067 P4D 100000067 PUD 1001fd067 PMD 11a9de067 PTE 800000010511c063
[85931.523624] Oops: 0011 [#1] SMP PTI
[85931.523627] CPU: 5 PID: 996 Comm: z_rd_int_1 Tainted: P           O      5.15.0-124-generic #134-Ubuntu
[85931.523630] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
[85931.523632] RIP: 0010:0xffffb316815fbce4
[85931.523634] Code: 56 89 ff ff ff ff f0 bc 5f 81 16 b3 ff ff 00 80 9f 9b b6 9b ff ff b8 97 5e 5f b9 9b ff ff 00 00 00 00 46 31 b0 53 fe 39 b2 e0 <ce> 7e e1 be b4 b4 21 a0 89 d3 04 a4 54 64 69 19 24 60 55 ad 3a bf
[85931.523638] RSP: 0018:ffffb316815fb588 EFLAGS: 00010002
[85931.523640] RAX: 02000000a404d389 RBX: 0000000000007fe0 RCX: ffffb316815fb718
[85931.523642] RDX: 01000000a404d389 RSI: ffffb316815fbce4 RDI: 0000000000000000
[85931.523643] RBP: ffffb316815fb5e0 R08: ffffb316815fb718 R09: 0000000000000010
[85931.523645] R10: a404d389a021b4b4 R11: a021b4b4bee17ece R12: 0000000000000000
[85931.523647] R13: ffffb316815fb708 R14: ffffb316815fb690 R15: 0000000000000000
[85931.523648] FS:  0000000000000000(0000) GS:ffff9bbd7fb40000(0000) knlGS:0000000000000000
[85931.523650] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[85931.523652] CR2: ffffb316815fbce4 CR3: 0000000138810003 CR4: 00000000001706e0
[85931.523654] Call Trace:
[85931.523655]  <TASK>
[85931.523657]  ? show_trace_log_lvl+0x1d6/0x2ea
[85931.523662]  ? show_trace_log_lvl+0x1d6/0x2ea
[85931.523665]  ? gcm_init_ctx+0xb2/0x1b0 [icp]
[85931.523676]  ? show_regs.part.0+0x23/0x29
[85931.523679]  ? __die_body.cold+0x8/0xd
[85931.523681]  ? __die+0x2b/0x37
[85931.523683]  ? page_fault_oops+0x13b/0x170
[85931.523688]  ? kernelmode_fixup_or_oops.constprop.0+0x61/0x80
[85931.523691]  ? __bad_area_nosemaphore+0x155/0x190
[85931.523694]  ? bad_area_nosemaphore+0x16/0x20
[85931.523696]  ? do_kern_addr_fault+0x62/0x80
[85931.523699]  ? exc_page_fault+0xe7/0x170
[85931.523702]  ? asm_exc_page_fault+0x27/0x30
[85931.523707]  ? aes_copy_block+0xb0/0xb0 [icp]
[85931.523715]  gcm_init_ctx+0xb2/0x1b0 [icp]
[85931.523721]  ? icp_aes_impl_set+0x20/0x20 [icp]
[85931.523728]  ? icp_aes_impl_get+0x120/0x120 [icp]
[85931.523735]  aes_common_init_ctx+0x244/0x260 [icp]
[85931.523741]  aes_decrypt_atomic+0xd2/0x3f0 [icp]
[85931.523748]  crypto_decrypt+0x2eb/0x440 [icp]
[85931.523753]  ? i_mod_hash+0x21/0x50 [icp]
[85931.523759]  ? mod_hash_strkey_cmp+0xe/0x20 [icp]
[85931.523765]  ? i_mod_hash_find_nosync+0x47/0x80 [icp]
[85931.523772]  ? RW_WRITE_HELD+0x9/0x30 [icp]
[85931.523778]  ? mod_hash_find+0x5e/0x70 [icp]
[85931.523784]  ? kcf_mech_hash_find+0x2f/0x60 [icp]
[85931.523790]  zio_do_crypt_uio+0x227/0x310 [zfs]
[85931.523883]  ? spl_kmem_alloc_impl+0x80/0xd0 [spl]
[85931.523892]  zio_do_crypt_data+0x228/0x4e0 [zfs]
[85931.523971]  ? RW_WRITE_HELD+0x9/0x30 [zfs]
[85931.524028]  ? spa_keystore_lookup_key+0x81/0x120 [zfs]
[85931.524084]  spa_do_crypt_abd+0x1cd/0x320 [zfs]
[85931.524139]  zio_decrypt+0x11c/0x470 [zfs]
[85931.524218]  ? kmem_cache_free+0x272/0x290
[85931.524221]  zio_pop_transforms+0x74/0xa0 [zfs]
[85931.524299]  zio_done+0x24c/0xe60 [zfs]
[85931.524376]  ? spl_kmem_free_impl+0x29/0x40 [spl]
[85931.524382]  zio_execute+0x92/0xe0 [zfs]
[85931.524460]  taskq_thread+0x221/0x400 [spl]
[85931.524467]  ? wake_up_q+0x90/0x90
[85931.524472]  ? zio_execute_stack_check.constprop.0+0x20/0x20 [zfs]
[85931.524587]  ? param_set_taskq_kick+0xf0/0xf0 [spl]
[85931.524594]  kthread+0x12a/0x150
[85931.524597]  ? set_kthread_struct+0x50/0x50
[85931.524600]  ret_from_fork+0x22/0x30
[85931.524605]  </TASK>
[85931.524606] Modules linked in: cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs cpuid tls nvme_fabrics binfmt_misc intel_rapl_msr mei_hdcp intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm rapl zfs(PO) intel_cstate zunicode(PO) zzstd(O) zlua(O) zavl(PO) nls_iso8859_1 snd_hda_codec_realtek at24 snd_hda_codec_generic icp(PO) ledtrig_audio zcommon(PO) snd_hda_codec_hdmi znvpair(PO) spl(O) snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep mei_me snd_pcm snd_timer snd soundcore mei mac_hid acpi_pad dm_multipath sch_fq_codel scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_ssse3 i2c_i801 sha1_ssse3 aesni_intel raid1 mxm_wmi i915 crypto_simd cryptd i2c_smbus i2c_algo_bit ttm
[85931.524646]  drm_kms_helper syscopyarea sysfillrect atlantic sysimgblt fb_sys_fops cec rc_core lpc_ich macsec ahci drm nvme xhci_pci xhci_pci_renesas libahci nvme_core wmi video
[85931.524664] CR2: ffffb316815fbce4
[85931.524666] ---[ end trace e68085b9bcb5b3aa ]---

@mihalicyn

mihalicyn commented Nov 12, 2024

@TheOnlyMarkus you have a relatively old kernel (5.15.0-124-generic) and the call stack is completely different. IMHO it doesn't look related to the problem being discussed in this issue; I would file a separate issue with all the info you've provided. Also, it's important to know whether this is a fresh ZFS installation, or whether the system has been up and running for a while and this issue only started happening recently after some upgrade.

@snajpa thanks for your work on the fix! I believe that vpsfreecz@12e235d should fix the crash in zfs_prune. At the same time, this fix relates to a shrinker-related change introduced in v6.7. So I believe we can try to split out the different issues by testing your changes on v6.7 first and only then move on to v6.10 (because there were a lot of changes in the VFS code between 6.7 and 6.10). As far as I understand, only @TheUbuntuGuy is lucky enough to have a reproducer. Mark, can you please try your reproducer on 6.7 with and without Pavel's fixes?

@TheUbuntuGuy

There is a more detailed back-and-forth in vpsfreecz#1. It appears easier to reproduce with certain kernel configs, but we aren't sure why. I don't want to speak for @snajpa, but it looks like it may be a memory-management behavioural change in the kernel. I can test 6.7.12 separately if that would be of value.

@snajpa
Contributor

snajpa commented Nov 13, 2024

@mihalicyn yup, that one should avoid the sb going away from underneath the reclaim; I also hunted down 7fd4d7f, which fixes #16608. But there's this thing with a seemingly ARC-loaned dbuf being overwritten; or, more generally, it's just something corrupting memory in low-memory conditions, UAF-style IMO, but I still can't quite pinpoint where that's happening.

@snajpa
Contributor

snajpa commented Nov 13, 2024

arc_write/arc_write_done callback vs. dbuf_evict is where the stacks lead me, but I don't see anything so far

@arcenik

arcenik commented Nov 15, 2024

Another stacktrace from kernel 6.11.7 / zfs 2.2.6

[  +0.000005] [   T1103] Call Trace:
[  +0.000006] [   T1103]  <TASK>
[   +0.000006] [       T1103] ? __die_body.cold (linux-6.11.7/arch/x86/kernel/dumpstack.c:478 (discriminator 1) linux-6.11.7/arch/x86/kernel/dumpstack.c:465 (discriminator 1) linux-6.11.7/arch/x86/kernel/dumpstack.c:420 (discriminator 1)) 
[   +0.000012] [       T1103] ? page_fault_oops (linux-6.11.7/arch/x86/mm/fault.c:711 (discriminator 1)) 
[   +0.000009] [       T1103] ? exc_page_fault (linux-6.11.7/./arch/x86/include/asm/paravirt.h:693 linux-6.11.7/arch/x86/mm/fault.c:1489 linux-6.11.7/arch/x86/mm/fault.c:1539) 
[   +0.000007] [       T1103] ? asm_exc_page_fault (linux-6.11.7/./arch/x86/include/asm/idtentry.h:623) 
[   +0.000011] [       T1103] ? zfs_prune (/var/lib/dkms/zfs/2.2.6/build/module/os/linux/zfs/zfs_vfsops.c:1281 (discriminator 1) /var/lib/dkms/zfs/2.2.6/build/module/os/linux/zfs/zfs_vfsops.c:1250 (discriminator 1)) zfs
[   +0.000776] [       T1103] ? finish_task_switch.isra.0 (linux-6.11.7/./arch/x86/include/asm/paravirt.h:698 linux-6.11.7/kernel/sched/sched.h:1427 linux-6.11.7/kernel/sched/core.c:4942 linux-6.11.7/kernel/sched/core.c:5060) 
[   +0.000013] [       T1103] ? zpl_prune_sb (/var/lib/dkms/zfs/2.2.6/build/module/os/linux/zfs/zpl_super.c:384) zfs
[   +0.000734] [       T1103] ? arc_prune_task (/usr/src/linux-headers-6.11.7-test/./arch/x86/include/asm/atomic64_64.h:79 (discriminator 5) /usr/src/linux-headers-6.11.7-test/./include/linux/atomic/atomic-arch-fallback.h:2913 (discriminator 5) /usr/src/linux-headers-6.11.7-test/./include/linux/atomic/atomic-arch-fallback.h:3364 (discriminator 5) /usr/src/linux-headers-6.11.7-test/./include/linux/atomic/atomic-instrumented.h:2189 (discriminator 5) /var/lib/dkms/zfs/2.2.6/build/module/zfs/arc.c:6080 (discriminator 5)) zfs
[   +0.000800] [       T1103] ? taskq_thread (/var/lib/dkms/zfs/2.2.6/build/module/os/linux/spl/spl-taskq.c:963) spl
[   +0.000038] [       T1103] ? __pfx_default_wake_function (linux-6.11.7/kernel/sched/core.c:6868) 
[   +0.000014] [       T1103] ? __pfx_taskq_thread (/var/lib/dkms/zfs/2.2.6/build/module/os/linux/spl/spl-taskq.c:870) spl
[   +0.000022] [       T1103] ? kthread (linux-6.11.7/kernel/kthread.c:389) 
[   +0.000007] [       T1103] ? __pfx_kthread (linux-6.11.7/kernel/kthread.c:342) 
[   +0.000006] [       T1103] ? ret_from_fork (linux-6.11.7/arch/x86/kernel/process.c:153) 
[   +0.000006] [       T1103] ? __pfx_kthread (linux-6.11.7/kernel/kthread.c:342) 
[   +0.000006] [       T1103] ? ret_from_fork_asm (linux-6.11.7/arch/x86/entry/entry_64.S:257) 
[  +0.000007] [   T1103]  </TASK>

The exact line from zfs_prune:

zfs_prune(struct super_block *sb, unsigned long nr_to_scan, int *objects)
{
...
#if defined(HAVE_SPLIT_SHRINKER_CALLBACK) && \
        defined(SHRINK_CONTROL_HAS_NID) && \
        defined(SHRINKER_NUMA_AWARE)
        if (shrinker->flags & SHRINKER_NUMA_AWARE) {
                *objects = 0; 
                for_each_online_node(sc.nid) {
                        *objects += (*shrinker->scan_objects)(shrinker, &sc);
                        /*
                         * reset sc.nr_to_scan, modified by
                         * scan_objects == super_cache_scan
                         */
                        sc.nr_to_scan = nr_to_scan;
                }
        } else {
                        *objects = (*shrinker->scan_objects)(shrinker, &sc); <---- crashed here
        }
...

It looks like the function pointer shrinker->scan_objects is invalid; the faulting RIP (0x200000002) in the register dump below is the value that was called.

[  +1.156876] [   T1103] RIP: 0010:0x200000002
[  +0.000048] [   T1103] Code: Unable to access opcode bytes at 0x1ffffffd8.
[  +0.000005] [   T1103] RSP: 0018:ffffab8e8084bce8 EFLAGS: 00010246
[  +0.000007] [   T1103] RAX: 0000000200000002 RBX: ffffab8e8084bd9c RCX: 0000000000000000
[  +0.000005] [   T1103] RDX: 0000000000000000 RSI: ffffab8e8084bd38 RDI: ffffa06de10e7800
[  +0.000004] [   T1103] RBP: 0000000000000257 R08: ffffa06fb423c000 R09: ffffa06d06328228
[  +0.000004] [   T1103] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa06dc950cd80
[  +0.000003] [   T1103] R13: ffffa06de10e7800 R14: ffffa06d06328290 R15: ffffa06fb423c0f8
[  +0.000004] [   T1103] FS:  0000000000000000(0000) GS:ffffa06fef500000(0000) knlGS:0000000000000000
[  +0.000005] [   T1103] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000004] [   T1103] CR2: 0000000200000002 CR3: 0000000197b14000 CR4: 00000000001506f0

@snajpa
Contributor

snajpa commented Nov 15, 2024

yes and that is already solved by ^ my patch but it's no good until another problem is solved

@snajpa
Contributor

snajpa commented Nov 15, 2024

I'll try KCSAN on it, but I'm exhausted and need to take a few days off. We need to run current kernel versions because of all the development they're getting to make container workloads run more smoothly; it's a lot of changes, it picked up steam somewhere around 5.7 and isn't slowing down...

@satmandu
Contributor

I assume there are unit tests that can be set up to exercise behavior in low-memory conditions. (And if not, should there be some?)

@snajpa
Contributor

snajpa commented Nov 15, 2024

currently there aren't any low-memory test scenarios, but there should be. I'd be willing to set them up and look after the needed infra (as I don't think it'll make sense to do it in the cloud), but I also have to make a living somehow, so if any org making actual money with OpenZFS would like to hire me to work on these things, I'm all in :))

even with RHEL 7 behind us, the landscape of supported kernels isn't getting any easier; the low-memory tests would have to run for all of them

while on the topic of tests, to the best of my knowledge there also isn't any up-to-date alternative to @behlendorf's xfstests fork; that's another area that could use some love

@snajpa
Contributor

snajpa commented Nov 15, 2024

meh, KCSAN unhelpful so far, I guess I need new perspectives so a break for me it really is

@snajpa
Contributor

snajpa commented Nov 16, 2024

so that UAF seems to be fixed in master truly :) on that PR now, thanks everyone, especially @TheUbuntuGuy

@TheUbuntuGuy

@snajpa what is the commit/PR that fixed the UAF? Also big thanks to you for putting in a huge amount of time into this issue. I am truly grateful

@snajpa
Contributor

snajpa commented Nov 16, 2024

@TheUbuntuGuy it seems to be a bug in the 6.10 series in the end; can you verify it's OK with #16770 and 6.11?

looks to me as if those KCSAN reports I saw with 6.10 weren't false alerts and were actually a bug in the kernel :) it works with 6.11 (tried 6.11.6 and higher) with init_on_alloc=0 init_on_free=0 for me

@TheUbuntuGuy

Sorry for my absence, I was moving house.

I patched 2.2.6 with just #16770 and #16788. Ran it on 6.11.9 and for the first time I ran out of disk space without it crashing! I have tried my reproducer multiple times to completion without error, so this is confirmed fixed.

@snajpa
Contributor

snajpa commented Nov 22, 2024

@TheUbuntuGuy awesome, thank you for all your help with this :)

luckily 6.10 isn't LTS, but perhaps we could still use some mechanism to blacklist known problematic kernel versions... @robn what do you think?

@robn
Member

robn commented Nov 24, 2024

@snajpa sorry, can you give me a summary? Maybe a link to the upstream bug?

Until then, I can guess at how I'll respond:

  • bug in old 6.10: sorry, get latest
  • bug in current/all 6.10: work around it
  • bug with us: fix it

@snajpa
Contributor

snajpa commented Nov 27, 2024

@robn see vpsfreecz#1; there we were trying to debug some really weirdly manifesting problems with 6.10.x... your generic answer is good enough for me as is; it works out to the fact that I have to return to 6.10 and retest. But now I have yet another fancy crash of 2.3.0-rc3 (plus a bit of rc4, plus my patches) to attend to :D

@snajpa
Contributor

snajpa commented Nov 28, 2024

@TheUbuntuGuy @satmandu @maxpoulin64 @AllKind @IvanVolosyuk please excuse my screw up for which I offer a fix here #16817

There's a possibility that dirty data might not get written back from the pagecache to the ZFS upper layers in time when a container shuts down...

It might be somewhat mitigated in situations that don't involve cleanup_mnt (the shutdown of a mount namespace where ZFS was used as the backing FS for the apps); so, sadly, data written out from Docker containers to a ZFS pool with the ZFS storage driver might be affected the most. If you could somehow postpone cleanup_mnt until after the actually useful app stops... maybe a SIGSTOP to the supervising process could work?

The mitigation would be to set vm.dirty_writeback_centisecs=1 (via sysctl) and wait more than 10s after the apps running on top of a filesystem have stopped before unmounting it.

You could also set, during shutdowns, vm.dirty_bytes=8192 and vm.dirty_background_bytes=1, which are the minimal values the kernel will let you set; that should limit the blast radius of this to the bare minimum.
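Put together, the shutdown-time mitigation described above would look roughly like this (these settings tank write performance, so apply them only right before stopping the apps and unmounting):

```shell
# Flush dirty pages back to the filesystem almost immediately...
sysctl -w vm.dirty_writeback_centisecs=1
# ...and keep the amount of outstanding dirty data to the allowed minimum.
sysctl -w vm.dirty_bytes=8192           # minimum the kernel accepts (2 pages)
sysctl -w vm.dirty_background_bytes=1
# Stop the apps, then wait well past the writeback interval before unmounting.
sleep 11
zfs unmount -a
```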

@satmandu
Contributor

Is this still an issue with the Docker overlay2 storage driver? I haven't used the zfs storage driver since the underlying OpenZFS support for the overlay2 driver settled over the last few years.

@snajpa
Contributor

snajpa commented Nov 29, 2024

@satmandu when overlayfs is used, the container shutdown itself shouldn't be a concern, but zfs umount still might be; so if you can wait dirty_writeback_centisecs, and then some, after your containers are off before you umount, that should work around the issue in full

@satmandu
Contributor

It does look like some of those values can be smaller? These are my current settings:

sysctl vm.dirty_writeback_centisecs
vm.dirty_writeback_centisecs = 500
sysctl vm.dirty_bytes
vm.dirty_bytes = 0
sysctl vm.dirty_background_bytes
vm.dirty_background_bytes = 0

@snajpa
Contributor

snajpa commented Nov 29, 2024

vm.dirty_bytes has a minimum of 2 pages, i.e. 8192 on most systems; for the other knobs it's OK to put 1 in there. But do note that'll tank performance, so I'd use it only when shutting down the system to reboot into a patched version
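To double-check that two-page floor on a given box (the 8192 figure assumes the common 4 KiB page size):

```shell
# The kernel clamps vm.dirty_bytes to a minimum of two pages.
page_size=$(getconf PAGESIZE)
min_dirty_bytes=$(( page_size * 2 ))
echo "$min_dirty_bytes"   # 8192 on 4 KiB-page systems
```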

@satmandu
Contributor

satmandu commented Dec 7, 2024

mdadm ships a file in the systemd hierarchy, /usr/lib/systemd/system-shutdown/mdadm.shutdown, as follows:

#!/bin/sh
# We need to ensure all md arrays with external metadata
# (e.g. IMSM, DDF) are clean before completing the shutdown.
/usr/sbin/mdadm --wait-clean --scan

Would it make sense to add something like this as an implementation of your suggested workaround @snajpa?

#!/bin/sh
# /usr/lib/systemd/system-shutdown/zfs.shutdown
# Mitigation for dirty data possibly not getting written back from the
# pagecache to ZFS upper layers in time when a container shuts down, in
# situations that don't involve cleanup_mnt (the teardown of a mount
# namespace where ZFS was used as the backing FS for the apps).
# See https://github.com/openzfs/zfs/issues/16324#issuecomment-2506816817
sysctl -w vm.dirty_writeback_centisecs=1 && sleep 11
zfs unmount -a

I'm not sure of the timing of the zfs unmounts in relation to the rest of the system shutdown process, so not sure if that would be too late in the shutdown process to be useful.

(Also, zfs umount -f -a might be useful there? However, per the docs, -f isn't supported on Linux.)

@satmandu
Contributor

satmandu commented Dec 7, 2024

Or maybe it makes sense to add a zfs-unmount.service with a Before=umount.target?
i.e.,
zfs/etc/systemd/system/zfs-unmount.service.in:

[Unit]
Description=Unmount ZFS filesystems
DefaultDependencies=no
Conflicts=umount.target
Before=umount.target
StopWhenUnneeded=true
ConditionPathIsDirectory=/sys/module/zfs

[Service]
Type=oneshot
EnvironmentFile=-@initconfdir@/zfs
ExecStartPre=@sbindir@/sysctl -w vm.dirty_writeback_centisecs=1
ExecStart=@sbindir@/zfs unmount -a

[Install]
WantedBy=zfs.target
