To be clear, are the N_crypt drives on top of LUKS, or not? The CKSUM errors being reported for all of a raidz are suspicious to me, especially since all the errors in that zed excerpt you showed are at the same location across multiple disks. I could be mistaken (I don't run raidz in my home env, so I have much less recent experience with it), but I'd expect that to happen when 1) the data in a stripe fails its checksum, 2) none of the parity reconstructions result in something that matches the checksum, and 3) none of the disks spat up an IO error to show who to blame (a quick way to re-check the offsets from the event log is sketched below). Problems with this theory include:
Could you share the hardware+software configs both before and after your migration (motherboard/CPU/HBA/HBA firmware rev + distro/distro release/kernel ver should be sufficient), the models and firmware revs of the hard drives used in each of the two pools you referenced, and the output of …

Presuming you have a copy elsewhere (and zdb doesn't error on the checksum issue, though it'd kind of moot the feature I'm thinking of if it did), would you be willing to share the on-disk representation of one of the affected files, both when it's intact on the same filesystem and when it's not? (I'm specifically thinking of requesting some output from …)
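If you want to double-check that same-offset observation yourself, the event log has what you need. A minimal sketch, assuming a pool named tank (hypothetical) and a recent OpenZFS where zpool events -v prints the checksum ereports with their vdev and offset fields:

```sh
# Hypothetical pool name "tank"; adjust to match your setup.
# Pull the checksum ereports out of the event log and show which vdev
# and which offset each one was reported against.
zpool events -v tank \
  | grep -E 'ereport\.fs\.zfs\.checksum|vdev_path|zio_offset'
```

If the zio_offset values really are identical across every disk in the raidz, that points at a single damaged stripe rather than at any one drive.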
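For the drive details and the on-disk dump, here's a sketch of how I'd collect both; the device paths, the dataset name tank/data, and the file path are all placeholders for whatever matches your setup:

```sh
# Model and firmware revision for each pool disk (device paths hypothetical).
for d in /dev/sd{a..l}; do
    smartctl -i "$d" | grep -E 'Device Model|Firmware Version'
done

# On-disk representation of one file: in ZFS the object number is the
# file's inode number, so look it up with ls -i, then dump the object
# (including its block pointers) with zdb.
obj=$(ls -i /tank/data/affected-file | awk '{print $1}')
zdb -ddddd tank/data "$obj"
```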
System information
Describe the problem you're observing
I am running a 12-disk raidz2 and I am getting checksum errors on all disks during scrubbing. This happens during my monthly scrubs, but it doesn't throw an error at every scrub: sometimes all goes well for a couple of months and then it happens again. A hardware malfunction/defect can almost be excluded, since after first getting this error in early 2020 I have upgraded my system's HBA, power supply, motherboard, CPU, and RAM (switched to ECC). So with basically everything but the disks replaced, it still happens. Sadly, the same thing has now started to happen, on the most recent scrub, on a second system that I was planning to use as a backup in exactly such a case. I have also been using a UPS since July 2020.
The pools on both machines were above 90% capacity when this happened. I should mention that I have never had a CKSUM or any other error during normal operation of either pool, so I can only assume this is some edge case. Has anyone else experienced something like this, or does anyone have suggestions on what to do about it?
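For reference, a quick way to read the fill level (and fragmentation, which also climbs on full pools) straight from the pool properties:

```sh
# Capacity and fragmentation for every imported pool.
zpool list -o name,size,allocated,free,capacity,fragmentation
```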
Describe how to reproduce the problem
I have no idea how to reproduce it, since it appears completely randomly, but always during a scrub.
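The closest thing to a reproducer I can offer is scrubbing in a loop until the error shows up. A sketch, with the pool name tank as a placeholder, and assuming an OpenZFS recent enough to have zpool wait:

```sh
# Scrub repeatedly until the pool stops reporting a clean bill of health.
# Pool name "tank" is a placeholder; zpool wait needs a recent OpenZFS.
while zpool status tank | grep -q 'errors: No known data errors'; do
    zpool scrub tank
    zpool wait -t scrub tank    # block until this scrub completes
done
zpool status -v tank            # show what finally tripped the loop
```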
Include any warning/errors/backtraces from the system logs
Output from zpool status -v
zfs list
Output from syslog since dmesg reports no errors whatsoever: