GPF for non-canonical address in dmu_zfetch_fini #16895

Closed
cfallin opened this issue Dec 21, 2024 · 4 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments


cfallin commented Dec 21, 2024

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 40
Kernel Version 6.12.4 (6.12.4-100.fc40.x86_64)
Architecture x86-64
OpenZFS Version 2.2.7

Describe the problem you're observing

After a few hours to a few days of light operation (file server in home network), I see a kernel oops as shown in the logs. Subsequently, the following symptoms persist until reboot:

  • Load average is pinned at 48 (on a 24-core system);
  • sync hangs forever;
  • some memory seems to be leaked or lost permanently and stats get weird: system gets swappy, htop reports memory usage of "-4219161K/62.7G" (!);
  • the system sometimes becomes completely unresponsive and I need to power-cycle.

Describe how to reproduce the problem

I can't seem to find a reliable reproducer, but this crash does happen consistently (I'm power-cycling every few days). The machine's workload is a combination of some development work over ssh; Samba serving a network Time Machine volume that a macOS machine continuously backs up to; some zvols for a few VMs; and very occasional accesses to ~5TiB of data that is mostly at rest. Storage is one ZFS pool on a mirror of two large spinning disks and another pool on NVMe for the home directory and zvols.

The system has generally been very stable for the 4.5 years I've had it. I migrated its volumes to ZFS 6 months ago and all was well until recently -- I suspect either a Fedora kernel upgrade or ZFS upgrade, but I can't correlate exactly. I'm running latest or close-to-latest versions of both (6.12.4 and 2.2.7 respectively) now.

Sorry I don't have more to go on here -- happy to try settings or collect other info as needed. Thanks!

Include any warning/errors/backtraces from the system logs

The ultimate "Oops" is:

kernel oops log
[50806.427187] Oops: general protection fault, probably for non-canonical address 0xbfff93fcd9822a28: 0000 [#1] PREEMPT SMP NOPTI
[50806.427202] CPU: 6 UID: 0 PID: 687 Comm: dbu_evict Tainted: P S      W  OE      6.12.4-100.fc40.x86_64 #1
[50806.427213] Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[50806.427218] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS MAX (MS-7B79), BIOS H.40 11/06/2019
[50806.427223] RIP: 0010:__list_del_entry_valid_or_report+0x43/0x80
[50806.427233] Code: ce 15 8b 00 48 b8 00 01 00 00 00 00 ad de 48 39 c2 0f 84 aa 15 8b 00 48 b8 22 01 00 00 00 00 ad de 48 39 c1 0f 84 83 15 8b 00 <48> 8b 31 48 39 fe 0f 85 63 15 8b 00 48 8b 42 08 48 39 c6 0f 85 42
[50806.427240] RSP: 0018:ffffa06241c4fd30 EFLAGS: 00010287
[50806.427248] RAX: dead000000000122 RBX: ffff93fcd98229e8 RCX: bfff93fcd9822a28
[50806.427254] RDX: ffff93fcd9822a28 RSI: fffffffffffffe88 RDI: ffff93fcd9822a28
[50806.427259] RBP: ffff93fcd9822a28 R08: 000000003fffffff R09: ffffffffe39024a0
[50806.427264] R10: 00000000002b001e R11: ffff93ff9e9217c0 R12: ffff93fcd9822a08
[50806.427269] R13: ffff93f0d11ea9c0 R14: ffff93f0cc084448 R15: ffff93f0cc084428
[50806.427274] FS:  0000000000000000(0000) GS:ffff93ff9e900000(0000) knlGS:0000000000000000
[50806.427280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50806.427286] CR2: 00007fa072037698 CR3: 000000013e822000 CR4: 0000000000350ef0
[50806.427291] Call Trace:
[50806.427297]  
[50806.427302]  ? __die_body.cold+0x19/0x27
[50806.427312]  ? die_addr+0x3c/0x60
[50806.427321]  ? exc_general_protection+0x17d/0x400
[50806.427336]  ? asm_exc_general_protection+0x26/0x30
[50806.427351]  ? __list_del_entry_valid_or_report+0x43/0x80
[50806.427360]  dmu_zfetch_fini+0x75/0xf0 [zfs]
[50806.427654]  dnode_destroy+0x183/0x250 [zfs]
[50806.427910]  dnode_buf_evict_async+0x7d/0xf0 [zfs]
[50806.428159]  taskq_thread+0x2c7/0x500 [spl]
[50806.428182]  ? __pfx_default_wake_function+0x10/0x10
[50806.428194]  ? __pfx_dnode_buf_evict_async+0x10/0x10 [zfs]
[50806.428446]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[50806.428462]  kthread+0xd2/0x100
[50806.428469]  ? __pfx_kthread+0x10/0x10
[50806.428475]  ret_from_fork+0x34/0x50
[50806.428482]  ? __pfx_kthread+0x10/0x10
[50806.428487]  ret_from_fork_asm+0x1a/0x30
[50806.428501]  
[50806.428504] Modules linked in: vhost_net vhost vhost_iotlb tap xt_conntrack xt_MASQUERADE xt_mark snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core tun nf_tables ip6table_nat ip6table_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter rfkill vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) qrtr nct6775 nct6775_core hwmon_vid binfmt_misc vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi amd_atl intel_rapl_msr snd_hda_intel intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi edac_mce_amd ses snd_hda_codec enclosure scsi_transport_sas joydev snd_hda_core kvm_amd snd_hwdep ppdev ee1004 snd_seq kvm snd_seq_device snd_pcm snd_timer r8169 wmi_bmof rapl snd pcspkr acpi_cpufreq i2c_piix4 soundcore i2c_smbus realtek zenpower(OE) parport_pc parport gpio_amdpt gpio_generic nfsd auth_rpcgss nfs_acl lockd grace nfs_localio sunrpc loop dm_multipath nfnetlink zram nouveau drm_ttm_helper ttm video gpu_sched crct10dif_pclmul i2c_algo_bit crc32_pclmul
[50806.428679]  crc32c_intel drm_gpuvm polyval_clmulni drm_exec polyval_generic mxm_wmi ghash_clmulni_intel nvme uas drm_display_helper sha512_ssse3 nvme_core usb_storage sha256_ssse3 sha1_ssse3 cec sp5100_tco zfs(POE) nvme_auth wmi spl(OE) scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables br_netfilter bridge stp llc fuse
[50806.428764] ---[ end trace 0000000000000000 ]---
[50806.428769] RIP: 0010:__list_del_entry_valid_or_report+0x43/0x80
[50806.428775] Code: ce 15 8b 00 48 b8 00 01 00 00 00 00 ad de 48 39 c2 0f 84 aa 15 8b 00 48 b8 22 01 00 00 00 00 ad de 48 39 c1 0f 84 83 15 8b 00 <48> 8b 31 48 39 fe 0f 85 63 15 8b 00 48 8b 42 08 48 39 c6 0f 85 42
[50806.428780] RSP: 0018:ffffa06241c4fd30 EFLAGS: 00010287
[50806.428785] RAX: dead000000000122 RBX: ffff93fcd98229e8 RCX: bfff93fcd9822a28
[50806.428790] RDX: ffff93fcd9822a28 RSI: fffffffffffffe88 RDI: ffff93fcd9822a28
[50806.428794] RBP: ffff93fcd9822a28 R08: 000000003fffffff R09: ffffffffe39024a0
[50806.428797] R10: 00000000002b001e R11: ffff93ff9e9217c0 R12: ffff93fcd9822a08
[50806.428801] R13: ffff93f0d11ea9c0 R14: ffff93f0cc084448 R15: ffff93f0cc084428
[50806.428806] FS:  0000000000000000(0000) GS:ffff93ff9e900000(0000) knlGS:0000000000000000
[50806.428811] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50806.428815] CR2: 00007fa072037698 CR3: 000000013e822000 CR4: 0000000000350ef0
[53791.822323] BUG: unable to handle page fault for address: 000000d8203e2062
[53791.822332] #PF: supervisor read access in kernel mode
[53791.822334] #PF: error_code(0x0000) - not-present page
[53791.822337] PGD 129025067 P4D 129025067 PUD 0 
[53791.822342] Oops: Oops: 0000 [#2] PREEMPT SMP NOPTI
[53791.822347] CPU: 16 UID: 0 PID: 684 Comm: arc_prune Tainted: P S    D W  OE      6.12.4-100.fc40.x86_64 #1
[53791.822352] Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [D]=DIE, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[53791.822354] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS MAX (MS-7B79), BIOS H.40 11/06/2019
[53791.822356] RIP: 0010:arc_released+0x15/0x30 [zfs]
[53791.822505] Code: 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 31 c0 48 83 7f 10 00 74 11 48 8b 07 <48> 81 78 60 c0 ab 6e c0 0f 94 c0 0f b6 c0 e9 78 90 da cd 0f 1f 84
[53791.822507] RSP: 0018:ffffa06241b7bb10 EFLAGS: 00010206
[53791.822511] RAX: 000000d8203e2002 RBX: ffff93f39f9d0000 RCX: 0000000000000001
[53791.822513] RDX: 0000000000000000 RSI: ffff93f67a7dad60 RDI: ffff93fef8801c40
[53791.822515] RBP: 0000000000000000 R08: 0000000000000030 R09: ffff93f0fee17700
[53791.822518] R10: ffff93f0fc025108 R11: 0000000000000002 R12: 0000000000000000
[53791.822520] R13: ffff93fb64f9a000 R14: ffff93f1b939e618 R15: 00000000001c6d1a
[53791.822522] FS:  0000000000000000(0000) GS:ffff93ff9ee00000(0000) knlGS:0000000000000000
[53791.822524] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[53791.822527] CR2: 000000d8203e2062 CR3: 0000000157e5a000 CR4: 0000000000350ef0
[53791.822529] Call Trace:
[53791.822533]  
[53791.822537]  ? __die_body.cold+0x19/0x27
[53791.822543]  ? page_fault_oops+0x15a/0x2f0
[53791.822550]  ? exc_page_fault+0x7e/0x180
[53791.822554]  ? asm_exc_page_fault+0x26/0x30
[53791.822561]  ? arc_released+0x15/0x30 [zfs]
[53791.822678]  dbuf_rele_and_unlock+0x79/0x5d0 [zfs]
[53791.822799]  ? srso_return_thunk+0x5/0x5f
[53791.822804]  sa_handle_destroy+0x7e/0xd0 [zfs]
[53791.822940]  zfs_zinactive+0x92/0xf0 [zfs]
[53791.823049]  zfs_inactive+0x93/0x210 [zfs]
[53791.823153]  ? unmap_mapping_range+0x85/0x140
[53791.823159]  zpl_evict_inode+0x45/0x60 [zfs]
[53791.823259]  evict+0x118/0x2a0
[53791.823266]  prune_icache_sb+0x92/0xd0
[53791.823271]  super_cache_scan+0x152/0x1e0
[53791.823276]  zfs_prune+0x177/0x220 [zfs]
[53791.823378]  zpl_prune_sb+0x4e/0x80 [zfs]
[53791.823475]  arc_prune_task+0x22/0x40 [zfs]
[53791.823595]  taskq_thread+0x2c7/0x500 [spl]
[53791.823607]  ? __pfx_default_wake_function+0x10/0x10
[53791.823615]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[53791.823623]  kthread+0xd2/0x100
[53791.823627]  ? __pfx_kthread+0x10/0x10
[53791.823630]  ret_from_fork+0x34/0x50
[53791.823634]  ? __pfx_kthread+0x10/0x10
[53791.823637]  ret_from_fork_asm+0x1a/0x30
[53791.823644]  
[53791.823646] Modules linked in: vhost_net vhost vhost_iotlb tap xt_conntrack xt_MASQUERADE xt_mark snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core tun nf_tables ip6table_nat ip6table_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter rfkill vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) qrtr nct6775 nct6775_core hwmon_vid binfmt_misc vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi amd_atl intel_rapl_msr snd_hda_intel intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi edac_mce_amd ses snd_hda_codec enclosure scsi_transport_sas joydev snd_hda_core kvm_amd snd_hwdep ppdev ee1004 snd_seq kvm snd_seq_device snd_pcm snd_timer r8169 wmi_bmof rapl snd pcspkr acpi_cpufreq i2c_piix4 soundcore i2c_smbus realtek zenpower(OE) parport_pc parport gpio_amdpt gpio_generic nfsd auth_rpcgss nfs_acl lockd grace nfs_localio sunrpc loop dm_multipath nfnetlink zram nouveau drm_ttm_helper ttm video gpu_sched crct10dif_pclmul i2c_algo_bit crc32_pclmul
[53791.823744]  crc32c_intel drm_gpuvm polyval_clmulni drm_exec polyval_generic mxm_wmi ghash_clmulni_intel nvme uas drm_display_helper sha512_ssse3 nvme_core usb_storage sha256_ssse3 sha1_ssse3 cec sp5100_tco zfs(POE) nvme_auth wmi spl(OE) scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables br_netfilter bridge stp llc fuse
[53791.823779] CR2: 000000d8203e2062
[53791.823783] ---[ end trace 0000000000000000 ]---
[53791.823785] RIP: 0010:__list_del_entry_valid_or_report+0x43/0x80
[53791.823789] Code: ce 15 8b 00 48 b8 00 01 00 00 00 00 ad de 48 39 c2 0f 84 aa 15 8b 00 48 b8 22 01 00 00 00 00 ad de 48 39 c1 0f 84 83 15 8b 00 <48> 8b 31 48 39 fe 0f 85 63 15 8b 00 48 8b 42 08 48 39 c6 0f 85 42
[53791.823791] RSP: 0018:ffffa06241c4fd30 EFLAGS: 00010287
[53791.823794] RAX: dead000000000122 RBX: ffff93fcd98229e8 RCX: bfff93fcd9822a28
[53791.823796] RDX: ffff93fcd9822a28 RSI: fffffffffffffe88 RDI: ffff93fcd9822a28
[53791.823798] RBP: ffff93fcd9822a28 R08: 000000003fffffff R09: ffffffffe39024a0
[53791.823800] R10: 00000000002b001e R11: ffff93ff9e9217c0 R12: ffff93fcd9822a08
[53791.823802] R13: ffff93f0d11ea9c0 R14: ffff93f0cc084448 R15: ffff93f0cc084428
[53791.823804] FS:  0000000000000000(0000) GS:ffff93ff9ee00000(0000) knlGS:0000000000000000
[53791.823807] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[53791.823809] CR2: 000000d8203e2062 CR3: 0000000157e5a000 CR4: 0000000000350ef0
[53791.823812] note: arc_prune[684] exited with irqs disabled

The full dmesg since boot is in this gist.
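
For readers unfamiliar with the "non-canonical address" wording in the first oops, here is a minimal Python sketch of the check. It assumes 48-bit virtual addresses (4-level paging), which is an assumption about this machine rather than something stated above; the helper name `is_canonical` is hypothetical. On x86-64 an address is canonical when bits 63 down to 47 are all equal, and a non-canonical access raises a general protection fault instead of a page fault.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of a 64-bit address are all equal.

    va_bits=48 is an assumption (4-level paging).
    """
    top = addr >> (va_bits - 1)                 # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits when va_bits == 48
    return top == 0 or top == all_ones


# Addresses copied from the log above.
print(is_canonical(0xbfff93fcd9822a28))  # faulting address / RCX in the first oops
print(is_canonical(0xffff93fcd9822a28))  # RDX, RDI and RBP in the first oops
print(is_canonical(0x000000d8203e2062))  # CR2 of the later arc_prune page fault
```

Under that assumption the first address fails the check while the other two pass, which matches the log: the first oops is a general protection fault, and the later one is an ordinary not-present page fault.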

@cfallin cfallin added the Type: Defect Incorrect behavior (e.g. crash, hang) label Dec 21, 2024
Member

amotin commented Dec 21, 2024

It seems to be a desktop-grade system, I guess with non-ECC RAM. Have you run thorough memory tests since this started happening? With such random symptoms I tend to suspect memory corruption, either hardware or software.

Author

cfallin commented Dec 21, 2024

I can run memtest86 later. However a few things lead me to believe the core hardware itself is still reliable:

  • I do development on it, including lots of compiles and a lot of fuzzing as well. I've never had any sort of miscompile or unexplainable fuzzing discovery, and these are generally CPU+memory torture tests where bitflips surface quickly.
  • (The main one) the crash stacktrace itself is deterministic. The point at which it happens is not, but it is always the same crash. Random bitflips or a bad CPU would definitely not manifest in that way: I would instead be seeing a bunch of kernel panics everywhere with different stacks (and usermode crashes/segfaults too, which I have not seen). What I see instead leads me to believe there is a race condition in the ZFS implementation.

(EDIT: I realized I didn't answer your direct question: yes, it's desktop-class hardware -- a Ryzen 9 3900X with 64GiB of non-ECC DDR4 RAM, and two NAS-grade SATA drives for one pool plus an NVMe drive for the other.)

Author

cfallin commented Dec 22, 2024

I built a local version of ZFS with #16788 cherry-picked on top of master from yesterday (commit 1acd24696) and am running that to see if it improves anything, since it seems to fix some related issues. Unfortunately I'm still seeing oopses -- now the following:

new kernel oops
[38296.175499] Oops: general protection fault, probably for non-canonical address 0xbfff9866c47fa9a0: 0000 [#1] PREEMPT SMP NOPTI
[38296.175509] CPU: 13 UID: 0 PID: 694 Comm: dbu_evict Tainted: P S      W  OE      6.12.4-100.fc40.x86_64 #1
[38296.175514] Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[38296.175516] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS MAX (MS-7B79), BIOS H.40 11/06/2019
[38296.175519] RIP: 0010:zrl_is_locked+0x9/0x20 [zfs]
[38296.175662] Code: 9e c0 0f b6 c0 e9 37 0d 8d f9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <8b> 47 78 83 f8 ff 0f 94 c0 0f b6 c0 e9 06 0d 8d f9 66 0f 1f 44 00
[38296.175665] RSP: 0018:ffffae3601a8fd40 EFLAGS: 00010286
[38296.175668] RAX: ffff985e58d70000 RBX: ffff986a59822ca0 RCX: ffff986ca9aa1a38
[38296.175670] RDX: 0000000000000000 RSI: ffff986a59822cd0 RDI: bfff9866c47fa928
[38296.175672] RBP: ffff985fbaf6a000 R08: 0000000000000000 R09: ffffffffe74c7ec8
[38296.175674] R10: 0000000000220018 R11: 0000000000000000 R12: 0000000000000000
[38296.175676] R13: ffff985e58d70000 R14: ffff985fbaf6a448 R15: ffff985fbaf6a428
[38296.175679] FS:  0000000000000000(0000) GS:ffff986d1ec80000(0000) knlGS:0000000000000000
[38296.175681] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38296.175683] CR2: 00007fc02b8ee4b0 CR3: 0000000158dea000 CR4: 0000000000350ef0
[38296.175686] Call Trace:
[38296.175690]  
[38296.175694]  ? __die_body.cold+0x19/0x27
[38296.175699]  ? die_addr+0x3c/0x60
[38296.175704]  ? exc_general_protection+0x17d/0x400
[38296.175711]  ? asm_exc_general_protection+0x26/0x30
[38296.175718]  ? zrl_is_locked+0x9/0x20 [zfs]
[38296.175830]  dnode_destroy+0x7a/0x250 [zfs]
[38296.175964]  dnode_buf_evict_async+0x7d/0xf0 [zfs]
[38296.176093]  taskq_thread+0x355/0x6f0 [spl]
[38296.176106]  ? __pfx_default_wake_function+0x10/0x10
[38296.176112]  ? __pfx_dnode_buf_evict_async+0x10/0x10 [zfs]
[38296.176244]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[38296.176251]  kthread+0xd2/0x100
[38296.176255]  ? __pfx_kthread+0x10/0x10
[38296.176259]  ret_from_fork+0x34/0x50
[38296.176262]  ? __pfx_kthread+0x10/0x10
[38296.176265]  ret_from_fork_asm+0x1a/0x30
[38296.176272]  
[38296.176274] Modules linked in: xt_conntrack xt_MASQUERADE xt_mark snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core tun nf_tables ip6table_nat ip6table_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter rfkill vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) qrtr nct6775 nct6775_core hwmon_vid binfmt_misc vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg amd_atl intel_rapl_msr snd_intel_sdw_acpi intel_rapl_common snd_hda_codec edac_mce_amd snd_hda_core snd_hwdep ses kvm_amd snd_seq enclosure ppdev joydev ee1004 scsi_transport_sas snd_seq_device snd_pcm kvm snd_timer rapl wmi_bmof pcspkr snd acpi_cpufreq r8169 i2c_piix4 soundcore zenpower(OE) i2c_smbus parport_pc realtek gpio_amdpt parport gpio_generic nfsd auth_rpcgss nfs_acl lockd grace nfs_localio sunrpc loop dm_multipath nfnetlink zram nouveau drm_ttm_helper ttm video gpu_sched nvme i2c_algo_bit drm_gpuvm crct10dif_pclmul crc32_pclmul drm_exec
[38296.176370]  crc32c_intel mxm_wmi polyval_clmulni nvme_core polyval_generic drm_display_helper ghash_clmulni_intel uas sha512_ssse3 sha256_ssse3 usb_storage sha1_ssse3 cec sp5100_tco nvme_auth wmi zfs(POE) spl(OE) scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables br_netfilter bridge stp llc fuse
[38296.176408] ---[ end trace 0000000000000000 ]---
[38296.176410] RIP: 0010:zrl_is_locked+0x9/0x20 [zfs]
[38296.176523] Code: 9e c0 0f b6 c0 e9 37 0d 8d f9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <8b> 47 78 83 f8 ff 0f 94 c0 0f b6 c0 e9 06 0d 8d f9 66 0f 1f 44 00
[38296.176526] RSP: 0018:ffffae3601a8fd40 EFLAGS: 00010286
[38296.176529] RAX: ffff985e58d70000 RBX: ffff986a59822ca0 RCX: ffff986ca9aa1a38
[38296.176531] RDX: 0000000000000000 RSI: ffff986a59822cd0 RDI: bfff9866c47fa928
[38296.176533] RBP: ffff985fbaf6a000 R08: 0000000000000000 R09: ffffffffe74c7ec8
[38296.176535] R10: 0000000000220018 R11: 0000000000000000 R12: 0000000000000000
[38296.176537] R13: ffff985e58d70000 R14: ffff985fbaf6a448 R15: ffff985fbaf6a428
[38296.176539] FS:  0000000000000000(0000) GS:ffff986d1ec80000(0000) knlGS:0000000000000000
[38296.176541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38296.176543] CR2: 00007fc02b8ee4b0 CR3: 0000000158dea000 CR4: 0000000000350ef0

The crashes are always in a stack below taskq_thread (i.e. on an async work queue) and here in dnode_buf_evict_async; together with the corrupted pointer, that makes me suspect a use-after-free around async completion.

Author

cfallin commented Dec 24, 2024

> I can run memtest86 later

So I did this just now and found a bunch of bitflips in bits 60 and 62 in a particular memory range of one of my DIMMs. That would explain the weird almost-but-not-quite-canonical kernel addresses (ffff9... -> bfff9...). Apologies for suspecting a race condition here; I suppose ZFS was uniquely good at finding this (perhaps by filling most of my memory when other workloads didn't?). I'm now upgrading this machine and going for ECC!
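
As a rough illustration of how the register dumps line up with a bit-62 flip, the sketch below XORs values copied from the two oops logs above. The helper name `flipped_bits` is hypothetical, and the sketch only shows that the logged values are consistent with a single flipped bit; it does not prove where the flip happened.

```python
def flipped_bits(a: int, b: int) -> list[int]:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(64) if (diff >> i) & 1]


# First oops: RDX/RDI/RBP hold the intact pointer, RCX the corrupted one.
good = 0xffff93fcd9822a28
bad = 0xbfff93fcd9822a28
print(flipped_bits(good, bad))  # [62]

# Later zrl_is_locked oops: only the corrupted value (RDI) is visible,
# so set bit 62 again to see what the pointer would have been.
second = 0xbfff9866c47fa928
print(hex(second | (1 << 62)))  # 0xffff9866c47fa928
```

Both corrupted values differ from an ordinary ffff9... kernel pointer only in bit 62, one of the two bit positions memtest86 flagged.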

@cfallin cfallin closed this as completed Dec 24, 2024