Preface: This pool and its data are owned and operated by @dncinematics. Their request for help started here on reddit. I provided some suggestions via reddit and then offered to do a screenshare to diagnose the situation and see whether the pool could be salvaged. I'm just some random internet dude trying to help out and have no affiliation with @dncinematics.
David from @dncinematics seems like a chill dude and has been patient and attentive in trying to get his pool salvaged. He is in a tough spot and I'm sure he'd appreciate any help the community can offer.
At this stage I've exhausted my abilities. Having run diagnostics, one gets the feeling that something (metadata?) is corrupted, but the majority of the data feels like it's still there, just out of reach. I understand the pool owner already tried an import with the -X switch. I'm reluctant to try this again, or with zdb, because of the health warnings this option comes with...
Paging @rincebrain @robn @ryao @behlendorf and anyone else who might be able to provide further guidance. Any assistance would be much appreciated, especially on interpreting why zdb -B fails with dump backup: dmu_send_obj: Invalid argument.
I was not able to find anything useful via online research, and a quick scan of the zdb code didn't uncover anything obvious, so I assume the error is being bubbled up from elsewhere in the codebase? [repo search]
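For anyone who wants to retrace that search, a quick grep over an OpenZFS checkout should show where the message is assembled; the paths below are my assumption of the current source layout:

```sh
# Where zdb prints the "dump backup" prefix (expected to live in cmd/zdb/zdb.c)
grep -rn "dump backup" cmd/zdb/

# dmu_send_obj() itself, whose EINVAL is presumably what gets reported
grep -rn "dmu_send_obj" module/zfs/ include/
```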
The pool and related info
OS: FreeBSD system running FreeNAS 11.3
Pool: raidz2 - 12 * 14 TB disks - 1 disk is UNAVAIL (dead, but still connected to the system)
zfs version? I was not able to accurately determine this (version-check sketch below). zdb -C says version: 5000. The healthy root pool on the system reported zpool get version as - and the root dataset zfs get version as 5. The available zdb options were limited/outdated.
Where are the backups? The pool owner was migrating between backup solutions (Backblaze and something else) and has a data gap.
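A minimal sketch of the version probing, in case anyone wants to reproduce it (the pool name is a placeholder; zfs version only exists on newer OpenZFS userlands, which is part of why pinning this down was hard):

```sh
# Pool/feature version from the on-disk config (5000 means "feature flags")
zdb -C <pool> | grep -w version

# Versions as reported by the running tools
zpool get version <pool>
zfs get version <pool>

# OS release and (on newer systems) the OpenZFS userland/module version
freebsd-version
zfs version
```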
What happened leading up to the incident?
To the best of my knowledge:
Pool is raidz2 on FreeNAS 11.3 (FreeBSD)
A business-as-usual I/O workload hung and the pool reported being OFFLINE
A graceful reboot was performed
Pool won't import and one disk is UNAVAIL:
cannot import <pool>: I/O error
Destroy and recreate the pool from a backup source
zpool import shows:
status: One or more devices are missing ... The pool can be imported despite missing ...
What things have been done/tried?
The pool owner tried a few things from online research, including zpool import -f -FX <pool> via a TrueNAS forum thread [link], and also tried https://www.klennet.com/zfs-recovery/ and https://www.ufsexplorer.com/ufs-explorer-raid-recovery/. The recovery tools were able to see some older pool data, but the majority of the recent data was not found/recovered, and some of the recent files that were recovered turned out to be corrupt.
When I offered to take a look via screenshare, I realised two things: "OH!" - it's BSD, and "OH!" - the tooling/versions are old. So we did some basic diagnostics on the FreeNAS console, and then I suggested booting the server from SystemRescue+ZFS 10.02+2.2.2 to see if the fresher versions could help, certainly providing modern zdb options like the -B dump dataset backup option.
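The rough plan for -B, assuming it worked, was something like this (the objset ID, paths, and receiving pool are placeholders):

```sh
# Generate a zfs send stream from the exported, unimportable pool
zdb -B -e <pool>/<objset-id> > /mnt/rescue/dataset.zsend

# Replay it into a healthy pool on another machine
zfs receive backup_pool/restored < /mnt/rescue/dataset.zsend
```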
The usual zpool import invocations and switches, sans -X, yield the same I/O Error / Destroy and recreate pool from a backup source error. Variations with -o readonly=on and -N didn't help, nor did -F variations.
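For completeness, the import attempts were roughly of this shape (pool name and altroot are placeholders; none of them got past the I/O error):

```sh
# Forced import, then read-only / no-mount variations
zpool import -f <pool>
zpool import -f -o readonly=on -N <pool>

# Rewind-style recovery import, still avoiding the extreme -X rewind
zpool import -f -F -o readonly=on -N <pool>

# Explicit device scan with an altroot
zpool import -f -d /dev -o readonly=on -R /mnt <pool>
```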
zdb -B -e <pool>/<objset> fails with the error dump backup: dmu_send_obj: Invalid argument. I was not able to find anything useful on this error via online research, and a quick scan of the OpenZFS code didn't uncover anything obvious. I need help understanding what could be causing this.
zdb -e -d <pool> basic dataset info:
Using zdb to get a list of objects in the pool's root dataset seems to output an empty list?
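If anyone wants to double-check the empty object listing, the probing was along these lines (repeating -d increases verbosity; the dataset path is a placeholder):

```sh
# Dataset summary for the exported pool
zdb -e -d <pool>

# More verbosity: per-dataset object listings and per-object detail
zdb -e -dd <pool>/<dataset>
zdb -e -dddd <pool>/<dataset>
```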
zdb -C revealed at least two child disks had aux_state: err_exceeded
A block leak test did reveal leaks; this was performed on the pool's original FreeBSD system:
Here is some zdb output for the pool's root dataset:
Head of the zdb -C output
What is the pool state right now?
The hardware has been booted on SystemRescue+ZFS 10.02+2.2.2 and is running a block checksum pass to see if it uncovers any clues as to where things are broken:
zdb -e -b -cc <pool> 2>&1 | tee blockcheck.out
This looks like it will take some days to complete. Yes, it puts the disks under read stress, but what is there to lose at this point?
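A small aside: since the rescue environment is reached over ssh and the pass runs for days, it may be worth keeping it alive independently of the session, for example (tmux or screen would work equally well):

```sh
# Run the traverse detached so a dropped ssh session doesn't kill it
nohup sh -c 'zdb -e -b -cc <pool> > blockcheck.out 2>&1' &

# Follow progress from any other session
tail -f blockcheck.out
```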
Open questions
Q1: Is it worth sending the UNAVAIL (and dead) drive to a data recovery firm to see if they can restore it, so it could be reintroduced to the pool in case that brings anything back?
Q2: Would trying master or the latest release of OpenZFS yield any benefits? The latest version used on the pool so far is 2.2.2. For example, install the latest Proxmox to a USB drive and try the diagnostics again? The Proxmox recovery/install console doesn't easily offer networking and/or ssh without significant effort.
Q3: Why is zdb -B -e <pool>/<objset> failing, where <objset> is the root dataset of the pool?
I wonder if this issue can be resolved? If so, maybe there is a chance to salvage the pool data?
Q4: Is there anything I/we have overlooked? Any tips, tricks, or further advice? Any module flags that could help? For example, could zfs_send_corrupt_data help zdb -B to run?
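On the zfs_send_corrupt_data idea specifically: zdb is a userland tool built on libzpool, so the kernel module parameter probably doesn't reach it, but zdb's -o option can set libzpool globals. Whether this particular tunable is reachable that way is an assumption on my part; if it is, it might look like:

```sh
# Ask the send path to emit fill patterns for unreadable blocks instead of
# erroring out (assumes zfs_send_corrupt_data is settable via zdb -o)
zdb -e -o zfs_send_corrupt_data=1 -B <pool>/<objset-id> > /mnt/rescue/dataset.zsend
```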
Q5: Anyone care to speculate why this has occurred and what is stopping ZFS from recovering?
Q6: Is there any merit in trying to upgrade the zpool or would that require importing it first?
Thanks for reading!