Replies: 6 comments 1 reply
-
PS. I just described my actual situation, from a slightly different perspective, in #15932.
-
@rincebrain, regarding this answer of yours:
Would you please clarify a bit:
Also, given that you seem to suggest rebooting without waiting for the "first resilver" to finish: are there any specific risks in just waiting for it to finish?
-
A few of my recent findings:
-
OK, here's an update on my personal situation (just in case anyone runs into a similar situation and is interested). Good news:
Pitfalls on the way to getting the complete list of the damaged files (a basic sketch of the extraction step follows after this list):
Pitfalls on the way to a "corrective resilver":
More "dd vs live-resilver" opinions:
Here are a couple more posts that I found interesting:
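Regarding the "complete list of the damaged files" item above, a minimal sketch of the extraction step (pool name and output path are placeholders; some entries may be printed as `<dataset>:<0x...>` object references rather than file paths):

```sh
# Capture the list that "zpool status -v" prints after the line
# "errors: Permanent errors have been detected in the following files:".
# Pool name and output path are placeholders.
zpool status -v tank1 \
  | sed -n '/Permanent errors have been detected/,$p' \
  | tail -n +2 \
  | sed 's/^[[:space:]]*//' \
  | grep -v '^$' \
  > /root/damaged-files.txt

wc -l /root/damaged-files.txt    # rough count of affected entries
```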
-
Some news again:
My rereading script processed the first ~409k files from the (4M-file-long) list of the damaged files with just The "files per hour", "read errors per file" and "read errors per hour" rates seem to be more or less flat (which is somewhat encouraging, in that the disks are not degrading too fast). The number of unreadable sectors reported by SMART is somewhat steadily growing, though (42k -> 49k for
To make sure that my "reread" strategy actually works, I reread the recovered files once again (with the dying disks offline). I had to do With Thus, after all, it looks like that "resilver throttling" does not affect the "corrective writes". Great.
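For reference, a minimal sketch of the kind of reread loop described above (not my actual script; the list and log paths are placeholders). Reading a damaged block makes ZFS reconstruct it from the remaining redundancy and, where it can, write the repaired copy back ("self-healing"), which is what the "corrective writes" refer to:

```sh
#!/bin/sh
# Reread every file from the damaged-files list so that ZFS hits the bad
# blocks, reconstructs them from redundancy and (if the disk accepts the
# write) repairs them in place. Paths below are placeholders.
LIST=/root/damaged-files.txt
LOG=/root/reread.log

while IFS= read -r f; do
    if cat -- "$f" > /dev/null 2>>"$LOG"; then
        echo "OK   $f" >> "$LOG"
    else
        echo "FAIL $f" >> "$LOG"
    fi
done < "$LIST"
```

While this runs, `zpool iostat -v <pool> 60` and `iostat -x 60` show how much is actually being read from, and written back to, each disk.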
I had to add "do
For now, with a linear approximation, such a "manual resilver" would take ~60 days in my case. The delay itself isn't a real problem, but planning to keep those dying disks spinning for that long is scary.
This is probably the worst aspect of this "reread individual files" strategy: ZFS still attempts to write something to the dying disks. Not much, just ~20 kbps, but it's still scary. (Which is bad because, in my personal experience, dying disks degrade much faster when you try to write to them than when you only read from them.) I did
I think I have more or less figured out how to get the physical location of a file on disk (here's a better-formatted version of the code from this post). The idea is that once I have a list of the damaged files that I want to recover, it would be useful to produce a list of LBAs that I want the data recovery company to read from each of the disks I send them. Making a whole-disk sector-by-sector copy is usually infeasible, and it would be great to prioritize reading the sectors you actually need. The caveat here is that I have those mysterious
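A sketch of the first step, as far as I understand it (dataset and file names are placeholders): the file's inode number is its ZFS object number, and `zdb -ddddd` then dumps its block pointers with their DVAs (`vdev:offset:asize`, in hex).

```sh
# Dump the block pointers (DVAs) of one damaged file.
# Dataset and file names are placeholders.
F=/tank1/home/user1/file1
DS=tank1/home                   # dataset the file lives in
OBJ=$(stat -c %i "$F")          # ZFS object number == inode number

# The "Indirect blocks" section lists DVAs such as 0:3a1f400000:20000
# (vdev index, byte offset within that top-level vdev, allocated size).
zdb -ddddd "$DS" "$OBJ"
```

For a single-disk or mirror top-level vdev, the on-disk byte offset should be roughly the DVA offset plus the 4 MiB label/boot reserve at the start of the partition, i.e. LBA ≈ (offset + 0x400000) / 512 relative to the partition start; for raidz the offset still has to be split across the child disks according to the raidz layout, which is the non-trivial part that the linked post deals with.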
-
As I've reported in the original issue, this particular situation of mine got fixed. I'm still not sure whether my "Resilver with one HDD missing and then resilver again?" idea was good or bad. Most importantly: once you start to suspect you might have negative redundancy, you should go and find a good data recovery company, one with all the necessary hardware (like the popular PC-3000), firmware patches, software and skills. The strategy of putting the dying disk offline when trying to resilver the array seems to be a good idea (at least for this old zfs), just to minimize the damage (and to allow subsequent data recovery). Actually, zfs tries to do exactly that.
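For anyone in a similar spot, the "put the dying disk offline" part is just the following (pool and device names are placeholders):

```sh
# Stop ZFS from reading from or writing to the suspect disk while the
# rest of the pool resilvers; the device name is a placeholder.
zpool offline tank1 ata-EXAMPLE_DISK_SERIAL

# Confirm its state and see which files are already flagged as damaged:
zpool status -v tank1
```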
-
Consider the following scenario: in a raidz1 pool, one disk of four (HDD1) suddenly becomes UNAVAIL (maybe a complete HDD electronics failure, maybe just a SATA cable failure - this is not known yet). Resilvering onto a replacement disk (HDD1r) starts.
During this resilvering/replacing it turns out that some sectors on HDD2 are bad (not a complete disk failure). Thus, some files now have "permanent errors", and there is a high probability that more such files will be detected (`zpool iostat -v tank30 60` and `iostat -x 60` show the read speed sometimes dropping to maybe ~1 MBps for HDD2), but relatively few sectors are actually unreadable yet. However, resilvering is 50% done already. Thus, it might look like it's best not to touch anything, so as not to make things even worse.
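For concreteness, the "permanent errors" in this scenario would show up in `zpool status -v` roughly like this (output abridged and illustrative; the exact wording differs between versions):

```
$ zpool status -v tank1
  ...
  scan: resilver in progress since ...
  ...
errors: Permanent errors have been detected in the following files:

        /tank1/home/user1/file1
        /tank1/home/user1/file2
```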
So the question is:
What would happen if the user first waited for the resilvering process to finish (and let it report the damaged files by just comparing checksums) and then (if any of the damaged files turned out to be valuable) tried to reconnect HDD1? Assuming HDD1 came ONLINE (or, at least, DEGRADED) after the resilvering process finished, would ZFS try to read the missing data that led to the permanent errors in `/tank1/home/user1/file1` and `/tank1/home/user1/file2` from it? Or would it just ignore HDD1, even if it were completely normal with all data intact?
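In command form, the sequence being asked about would look roughly like this (names are placeholders; whether ZFS actually goes back to HDD1 for the damaged blocks is exactly the open question):

```sh
# After the resilver onto HDD1r has finished, reattach the original disk
# and tell ZFS to use it again (assuming it is still known to the pool):
zpool online tank1 HDD1

# Then try to read the flagged files and re-check the error list:
cat /tank1/home/user1/file1 > /dev/null
cat /tank1/home/user1/file2 > /dev/null
zpool status -v tank1
```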
One more question: if HDD1 became AVAILABLE and the user tried to read `/tank1/home/user1/file1`, would zfs make an attempt to read from HDD1 (not HDD1r)? This seems to make sense, because if only a couple of files got "permanent errors", it might make sense to read and "fix" those particular files rather than resilver the entire disk (at least because this would put less stress on HDD1).
Finally:
Is there any difference between zfs version 0.7.9 and the modern 2.x versions with regard to these questions?