To be clear, are the N_crypt drives on top of LUKS, or not? The CKSUM errors being reported for all of a raidz are suspicious to me, especially since all the errors in that zed excerpt you showed are at the same location across multiple disks. I could be mistaken (I don't run raidz in my home env, so I have much less recent experience with it), but I'd expect that to happen when 1) the data in a stripe fails its checksum, 2) none of the parity reconstructions result in something that matches the checksum, and 3) none of the disks spat up an IO error to show who to blame (a quick way to re-check the offsets from the event log is sketched below). Problems with this theory include:
Could you share the hardware+software configs both before and after your migration (motherboard/CPU/HBA/HBA firmware rev + distro/distro release/kernel ver should be sufficient), the models and firmware revs of the hard drives used in each of the two pools you referenced, and the output of …

Presuming you have a copy elsewhere (and zdb doesn't error on the checksum issue, though it'd kind of moot the feature I'm thinking of if it did), would you be willing to share the on-disk representation of one of the affected files, both when it's intact on the same filesystem and when it's not? (I'm specifically thinking of requesting some output from …)
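If you want to double-check that same-offset observation yourself, the event log has what you need. A minimal sketch, assuming a pool named tank (hypothetical) and a recent OpenZFS where zpool events -v prints the checksum ereports with their vdev and offset fields:

```sh
# Hypothetical pool name "tank"; adjust to match your setup.
# Pull the checksum ereports out of the event log and show which vdev
# and which offset each one was reported against.
zpool events -v tank \
  | grep -E 'ereport\.fs\.zfs\.checksum|vdev_path|zio_offset'
```

If the zio_offset values really are identical across every disk in the raidz, that points at a single damaged stripe rather than at any one drive.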
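For the drive details and the on-disk dump, here's a sketch of how I'd collect both; the device paths, the dataset name tank/data, and the file path are all placeholders for whatever matches your setup:

```sh
# Model and firmware revision for each pool disk (device paths hypothetical).
for d in /dev/sd{a..l}; do
    smartctl -i "$d" | grep -E 'Device Model|Firmware Version'
done

# On-disk representation of one file: in ZFS the object number is the
# file's inode number, so look it up with ls -i, then dump the object
# (including its block pointers) with zdb.
obj=$(ls -i /tank/data/affected-file | awk '{print $1}')
zdb -ddddd tank/data "$obj"
```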
System information
Describe the problem you're observing
I am running a 12-disk raidz2 and I am getting checksum errors on all disks during scrubbing. This happens during my monthly scrubs, but it doesn't throw an error at every scrub: sometimes all goes well for a couple of months and then it happens again. A hardware malfunction/defect can almost be excluded, since after first getting this error in early 2020 I have upgraded my system's HBA, power supply, motherboard, CPU, and RAM (switched to ECC). So with basically everything but the disks replaced, it still happens. Sadly, the same thing has now started to happen, on the most recent scrub, on a second system that I was planning to use as a backup in exactly such a case. I have also been using a UPS since July 2020.
The pools on both machines were above 90% capacity when this happened. I should mention that I have never had a CKSUM or any other error during normal operation of either pool, so I can only assume this is some edge case. Has anyone else experienced something like this, or does anyone have suggestions on what to do about it?
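For reference, a quick way to read the fill level (and fragmentation, which also climbs on full pools) straight from the pool properties:

```sh
# Capacity and fragmentation for every imported pool.
zpool list -o name,size,allocated,free,capacity,fragmentation
```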
Describe how to reproduce the problem
I have no idea how to reproduce it, since it appears completely randomly, but always during a scrub.
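The closest thing to a reproducer I can offer is scrubbing in a loop until the error shows up. A sketch, with the pool name tank as a placeholder, and assuming an OpenZFS recent enough to have zpool wait:

```sh
# Scrub repeatedly until the pool stops reporting a clean bill of health.
# Pool name "tank" is a placeholder; zpool wait needs a recent OpenZFS.
while zpool status tank | grep -q 'errors: No known data errors'; do
    zpool scrub tank
    zpool wait -t scrub tank    # block until this scrub completes
done
zpool status -v tank            # show what finally tripped the loop
```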
Include any warning/errors/backtraces from the system logs
Output from zpool status -v
zfs list
Output from syslog since dmesg reports no errors whatsoever: