Replies: 6 comments 1 reply
-
PS. I just described my actual situation, from a slightly different perspective, in #15932.
-
@rincebrain, regarding this answer of yours:
Would you please clarify a bit:
Also, given that you seem to suggest rebooting without waiting for the "first resilver" to finish: are there any specific risks in just waiting for it to finish?
-
A few of my recent findings:
-
OK, here's an update on my personal situation (just in case anyone runs into a similar situation and is interested). Good news:
Pitfalls on the way to getting the complete list of the damaged files (a basic sketch of the extraction step follows after this list):
Pitfalls on the way to a "corrective resilver":
More "dd vs live-resilver" opinions:
Here are a couple more posts that I found interesting:
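Regarding the "complete list of the damaged files" item above, a minimal sketch of the extraction step (pool name and output path are placeholders; some entries may be printed as `<dataset>:<0x...>` object references rather than file paths):

```sh
# Capture the list that "zpool status -v" prints after the line
# "errors: Permanent errors have been detected in the following files:".
# Pool name and output path are placeholders.
zpool status -v tank1 \
  | sed -n '/Permanent errors have been detected/,$p' \
  | tail -n +2 \
  | sed 's/^[[:space:]]*//' \
  | grep -v '^$' \
  > /root/damaged-files.txt

wc -l /root/damaged-files.txt    # rough count of affected entries
```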
-
Some news again:
My rereading script processed the first ~409k files from the (4M-file-long) list of the damaged files with just The "files per hour", "read errors per file" and "read errors per hour" rates seem to be more or less flat (which is somewhat encouraging, in that the disks are not degrading too fast). The number of unreadable sectors reported by SMART is somewhat steadily growing, though (42k -> 49k for
To make sure that my "reread" strategy actually works, I reread the recovered files once again (with the dying disks offline). I had to do With Thus, after all, it looks like that "resilver throttling" does not affect the "corrective writes". Great.
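For reference, a minimal sketch of the kind of reread loop described above (not my actual script; the list and log paths are placeholders). Reading a damaged block makes ZFS reconstruct it from the remaining redundancy and, where it can, write the repaired copy back ("self-healing"), which is what the "corrective writes" refer to:

```sh
#!/bin/sh
# Reread every file from the damaged-files list so that ZFS hits the bad
# blocks, reconstructs them from redundancy and (if the disk accepts the
# write) repairs them in place. Paths below are placeholders.
LIST=/root/damaged-files.txt
LOG=/root/reread.log

while IFS= read -r f; do
    if cat -- "$f" > /dev/null 2>>"$LOG"; then
        echo "OK   $f" >> "$LOG"
    else
        echo "FAIL $f" >> "$LOG"
    fi
done < "$LIST"
```

While this runs, `zpool iostat -v <pool> 60` and `iostat -x 60` show how much is actually being read from, and written back to, each disk.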
I had to add "do
For now, with a linear approximation, such a "manual resilver" would take ~60 days in my case. The delay itself isn't a real problem, but planning to keep those dying disks spinning for that long is scary.
This is probably the worst aspect of this "reread individual files" strategy: ZFS still attempts to write something to the dying disks. Not much, just ~20 kbps, but it's still scary. (Which is bad because, in my personal experience, dying disks degrade much faster when you try to write to them than when you only read from them.) I did
I think I have more or less figured out how to get the physical location of a file on disk (here's a better-formatted version of the code from this post). The idea is that once I have a list of the damaged files that I want to recover, it would be useful to produce a list of LBAs that I want the data recovery company to read from each of the disks I send them. Making a whole-disk sector-by-sector copy is usually infeasible, and it would be great to prioritize reading the sectors you actually need. The caveat here is that I have those mysterious
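A sketch of the first step, as far as I understand it (dataset and file names are placeholders): the file's inode number is its ZFS object number, and `zdb -ddddd` then dumps its block pointers with their DVAs (`vdev:offset:asize`, in hex).

```sh
# Dump the block pointers (DVAs) of one damaged file.
# Dataset and file names are placeholders.
F=/tank1/home/user1/file1
DS=tank1/home                   # dataset the file lives in
OBJ=$(stat -c %i "$F")          # ZFS object number == inode number

# The "Indirect blocks" section lists DVAs such as 0:3a1f400000:20000
# (vdev index, byte offset within that top-level vdev, allocated size).
zdb -ddddd "$DS" "$OBJ"
```

For a single-disk or mirror top-level vdev, the on-disk byte offset should be roughly the DVA offset plus the 4 MiB label/boot reserve at the start of the partition, i.e. LBA ≈ (offset + 0x400000) / 512 relative to the partition start; for raidz the offset still has to be split across the child disks according to the raidz layout, which is the non-trivial part that the linked post deals with.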
-
As I've reported in the original issue, this particular situation of mine got fixed. I'm still not sure whether my "Resilver with one HDD missing and then resilver again?" idea was good or bad. Most importantly: once you start to suspect you might have negative redundancy, you should go and find a good data recovery company, one with all the necessary hardware (like the popular PC-3000), firmware patches, software and skills. The strategy of putting the dying disk offline when trying to resilver the array seems to be a good idea (at least for this old zfs), just to minimize the damage (and to allow subsequent data recovery). Actually, zfs tries to do exactly that.
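For anyone in a similar spot, the "put the dying disk offline" part is just the following (pool and device names are placeholders):

```sh
# Stop ZFS from reading from or writing to the suspect disk while the
# rest of the pool resilvers; the device name is a placeholder.
zpool offline tank1 ata-EXAMPLE_DISK_SERIAL

# Confirm its state and see which files are already flagged as damaged:
zpool status -v tank1
```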
-
Consider the following scenario: in a raidz1 pool, one disk of four (HDD1) suddenly becomes UNAVAIL (maybe a complete HDD electronics failure, maybe just a SATA cable failure - this is not known yet). Resilvering onto a replacement disk (HDD1r) starts.
During this resilvering/replacing it turns out that some sectors on HDD2 are bad (not a complete disk failure). Thus, some files now have "permanent errors", and there is a high probability that more such files will be detected (`zpool iostat -v tank30 60` and `iostat -x 60` show the read speed sometimes dropping to maybe ~1 MBps for HDD2), but relatively few sectors are actually unreadable yet. However, resilvering is 50% done already. Thus, it might look like it's best not to touch anything, so as not to make things even worse.
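For concreteness, the "permanent errors" in this scenario would show up in `zpool status -v` roughly like this (output abridged and illustrative; the exact wording differs between versions):

```
$ zpool status -v tank1
  ...
  scan: resilver in progress since ...
  ...
errors: Permanent errors have been detected in the following files:

        /tank1/home/user1/file1
        /tank1/home/user1/file2
```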
So the question is:
What would happen if the user first waited for the resilvering process to finish (and let it report the damaged files by just comparing checksums) and then (if any of the damaged files turned out to be valuable) tried to reconnect HDD1? Assuming HDD1 came ONLINE (or, at least, DEGRADED) after the resilvering process finished, would ZFS try to read the missing data that led to the permanent errors in `/tank1/home/user1/file1` and `/tank1/home/user1/file2` from it? Or would it just ignore HDD1, even if it were completely normal with all data intact?
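In command form, the sequence being asked about would look roughly like this (names are placeholders; whether ZFS actually goes back to HDD1 for the damaged blocks is exactly the open question):

```sh
# After the resilver onto HDD1r has finished, reattach the original disk
# and tell ZFS to use it again (assuming it is still known to the pool):
zpool online tank1 HDD1

# Then try to read the flagged files and re-check the error list:
cat /tank1/home/user1/file1 > /dev/null
cat /tank1/home/user1/file2 > /dev/null
zpool status -v tank1
```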
One more question: if HDD1 became AVAILABLE and the user tried to read `/tank1/home/user1/file1`, would zfs make an attempt to read from HDD1 (not HDD1r)? This seems to make sense, because if only a couple of files got "permanent errors", it might make sense to read and "fix" those particular files rather than resilver the entire disk (at least because this would put less stress on HDD1).
Finally:
Is there any difference between zfs version 0.7.9 and the modern 2.x versions with regard to these questions?