GPF for non-canonical address in dmu_zfetch_fini
#16895
It seems to be a desktop-grade system, I guess with non-ECC RAM. Have you run thorough memory tests since this started happening? With such random symptoms I tend to suspect memory corruption, either hardware or software.
I can run memtest86 later. However, a few things lead me to believe the core hardware itself is still reliable:
(EDIT: I realized I didn't answer your direct question: yes, it's desktop-class hardware -- a Ryzen 9 3900X with 64GiB of non-ECC DDR4 RAM, and two NAS-grade SATA drives for one pool plus an NVMe drive for the other.)
I tried building a local version of ZFS with #16788 cherrypicked on top. The crashes are always in a stack like the one below.

new kernel oops:

```
[38296.175499] Oops: general protection fault, probably for non-canonical address 0xbfff9866c47fa9a0: 0000 [#1] PREEMPT SMP NOPTI
[38296.175509] CPU: 13 UID: 0 PID: 694 Comm: dbu_evict Tainted: P S W OE 6.12.4-100.fc40.x86_64 #1
[38296.175514] Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[38296.175516] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS MAX (MS-7B79), BIOS H.40 11/06/2019
[38296.175519] RIP: 0010:zrl_is_locked+0x9/0x20 [zfs]
[38296.175662] Code: 9e c0 0f b6 c0 e9 37 0d 8d f9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <8b> 47 78 83 f8 ff 0f 94 c0 0f b6 c0 e9 06 0d 8d f9 66 0f 1f 44 00
[38296.175665] RSP: 0018:ffffae3601a8fd40 EFLAGS: 00010286
[38296.175668] RAX: ffff985e58d70000 RBX: ffff986a59822ca0 RCX: ffff986ca9aa1a38
[38296.175670] RDX: 0000000000000000 RSI: ffff986a59822cd0 RDI: bfff9866c47fa928
[38296.175672] RBP: ffff985fbaf6a000 R08: 0000000000000000 R09: ffffffffe74c7ec8
[38296.175674] R10: 0000000000220018 R11: 0000000000000000 R12: 0000000000000000
[38296.175676] R13: ffff985e58d70000 R14: ffff985fbaf6a448 R15: ffff985fbaf6a428
[38296.175679] FS:  0000000000000000(0000) GS:ffff986d1ec80000(0000) knlGS:0000000000000000
[38296.175681] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38296.175683] CR2: 00007fc02b8ee4b0 CR3: 0000000158dea000 CR4: 0000000000350ef0
[38296.175686] Call Trace:
[38296.175690]  <TASK>
[38296.175694]  ? __die_body.cold+0x19/0x27
[38296.175699]  ? die_addr+0x3c/0x60
[38296.175704]  ? exc_general_protection+0x17d/0x400
[38296.175711]  ? asm_exc_general_protection+0x26/0x30
[38296.175718]  ? zrl_is_locked+0x9/0x20 [zfs]
[38296.175830]  dnode_destroy+0x7a/0x250 [zfs]
[38296.175964]  dnode_buf_evict_async+0x7d/0xf0 [zfs]
[38296.176093]  taskq_thread+0x355/0x6f0 [spl]
[38296.176106]  ? __pfx_default_wake_function+0x10/0x10
[38296.176112]  ? __pfx_dnode_buf_evict_async+0x10/0x10 [zfs]
[38296.176244]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[38296.176251]  kthread+0xd2/0x100
[38296.176255]  ? __pfx_kthread+0x10/0x10
[38296.176259]  ret_from_fork+0x34/0x50
[38296.176262]  ? __pfx_kthread+0x10/0x10
[38296.176265]  ret_from_fork_asm+0x1a/0x30
[38296.176272]  </TASK>
[38296.176274] Modules linked in: xt_conntrack xt_MASQUERADE xt_mark snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core tun nf_tables ip6table_nat ip6table_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter rfkill vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) qrtr nct6775 nct6775_core hwmon_vid binfmt_misc vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg amd_atl intel_rapl_msr snd_intel_sdw_acpi intel_rapl_common snd_hda_codec edac_mce_amd snd_hda_core snd_hwdep ses kvm_amd snd_seq enclosure ppdev joydev ee1004 scsi_transport_sas snd_seq_device snd_pcm kvm snd_timer rapl wmi_bmof pcspkr snd acpi_cpufreq r8169 i2c_piix4 soundcore zenpower(OE) i2c_smbus parport_pc realtek gpio_amdpt parport gpio_generic nfsd auth_rpcgss nfs_acl lockd grace nfs_localio sunrpc loop dm_multipath nfnetlink zram nouveau drm_ttm_helper ttm video gpu_sched nvme i2c_algo_bit drm_gpuvm crct10dif_pclmul crc32_pclmul drm_exec
[38296.176370]  crc32c_intel mxm_wmi polyval_clmulni nvme_core polyval_generic drm_display_helper ghash_clmulni_intel uas sha512_ssse3 sha256_ssse3 usb_storage sha1_ssse3 cec sp5100_tco nvme_auth wmi zfs(POE) spl(OE) scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables br_netfilter bridge stp llc fuse
[38296.176408] ---[ end trace 0000000000000000 ]---
[38296.176410] RIP: 0010:zrl_is_locked+0x9/0x20 [zfs]
[38296.176523] Code: 9e c0 0f b6 c0 e9 37 0d 8d f9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <8b> 47 78 83 f8 ff 0f 94 c0 0f b6 c0 e9 06 0d 8d f9 66 0f 1f 44 00
[38296.176526] RSP: 0018:ffffae3601a8fd40 EFLAGS: 00010286
[38296.176529] RAX: ffff985e58d70000 RBX: ffff986a59822ca0 RCX: ffff986ca9aa1a38
[38296.176531] RDX: 0000000000000000 RSI: ffff986a59822cd0 RDI: bfff9866c47fa928
[38296.176533] RBP: ffff985fbaf6a000 R08: 0000000000000000 R09: ffffffffe74c7ec8
[38296.176535] R10: 0000000000220018 R11: 0000000000000000 R12: 0000000000000000
[38296.176537] R13: ffff985e58d70000 R14: ffff985fbaf6a448 R15: ffff985fbaf6a428
[38296.176539] FS:  0000000000000000(0000) GS:ffff986d1ec80000(0000) knlGS:0000000000000000
[38296.176541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38296.176543] CR2: 00007fc02b8ee4b0 CR3: 0000000158dea000 CR4: 0000000000350ef0
```
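For context, the "non-canonical address" in this oops means the pointer violates the x86-64 rule that, with 48-bit virtual addresses, bits 63..47 must all be copies of bit 47. A minimal sketch of that check (the helper name is mine, not kernel code), applied to the faulting `RDI` value from the trace:

```python
def is_canonical(addr: int, vaddr_bits: int = 48) -> bool:
    """x86-64 canonical check: bits 63..(vaddr_bits-1) must all be equal."""
    top = addr >> (vaddr_bits - 1)           # bit 47 and everything above it
    all_ones = (1 << (64 - vaddr_bits + 1)) - 1
    return top == 0 or top == all_ones

# Faulting pointer from the oops (RDI) -> dereferencing it raises a GPF:
print(is_canonical(0xBFFF9866C47FA928))  # False
# The same value with the usual 0xffff... kernel top bits would be fine:
print(is_canonical(0xFFFF9866C47FA928))  # True
```

This is why the CPU raises a general protection fault rather than a page fault: the address is rejected before any page-table walk happens.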
So I did this just now and found a bunch of bitflips in bits 60 and 62 in a particular memory range of one of my DIMMs. That would explain the weird almost-but-not-quite-canonical kernel addresses.
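The reported bitflips line up neatly with the oops: assuming the original pointer had the usual 0xffff... kernel-space top bits (an assumption, reconstructed by un-flipping the bit), flipping bit 62 alone turns it into exactly the faulting RDI value. A quick sanity check:

```python
bad = 0xBFFF9866C47FA928           # faulting RDI from the oops
good = bad | (1 << 62)             # assumed original pointer: set bit 62 back
print(hex(good))                   # 0xffff9866c47fa928 -- canonical kernel address
print(hex(good ^ bad))             # 0x4000000000000000 -- exactly bit 62 differs
```

A single flipped high bit like this corrupts the pointer without touching the low-order offset bits, which is why the address looks "almost-but-not-quite" canonical.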
System information

Kernel: 6.12.4-100.fc40.x86_64

Describe the problem you're observing
After a few hours to a few days of light operation (file server in home network), I see a kernel oops as shown in the logs. Subsequently, the following symptoms persist until reboot:
- `sync` hangs forever;
- `htop` reports memory usage of "`-4219161K/62.7G`" (!).

Describe how to reproduce the problem
I can't seem to find a reliable reproducer, but this crash does happen consistently (I'm power-cycling every few days). The machine's workload is a combination of some development work over ssh; Samba serving a network Time Machine volume that a macOS machine continuously backs up to; some zvols for a few VMs; and very occasional accesses to ~5TiB of data that is mostly at rest. One ZFS pool sits on a mirror of two large spinning disks, and another pool on NVMe holds the home directory and zvols.
The system has generally been very stable for the 4.5 years I've had it. I migrated its volumes to ZFS 6 months ago and all was well until recently -- I suspect either a Fedora kernel upgrade or ZFS upgrade, but I can't correlate exactly. I'm running latest or close-to-latest versions of both (6.12.4 and 2.2.7 respectively) now.
Sorry I don't have more to go on here -- happy to try settings or collect other info as needed. Thanks!
Include any warning/errors/backtraces from the system logs
The ultimate "Oops" is:
kernel oops log
The full `dmesg` since boot is in this gist.