Preface: This pool and its data are owned and operated by @dncinematics. Their request for help started here on reddit. I provided some suggestions via reddit and then offered to do a screenshare to diagnose the situation and see whether the pool could be salvaged. I'm just some random internet dude trying to help out and have no affiliation with @dncinematics.
David from @dncinematics seems like a chill dude and has been patient and attentive in trying to get his pool salvaged. He is in a tough spot and I'm sure he'd appreciate any help the community can offer.
At this stage I've exhausted my abilities. Having run diagnostics, one gets the feeling that something (metadata?) is corrupted, but the majority of the data feels like it's still there, just out of reach. I understand the pool owner already tried an import with the -X switch. I'm reluctant to try this again, or with zdb, because of the health warnings this option comes with...
Paging @rincebrain @robn @ryao @behlendorf and anyone else who might be able to provide further guidance. Any assistance would be much appreciated, especially on interpreting why zdb -B fails with dump backup: dmu_send_obj: Invalid argument.
I was not able to find anything useful via online research, and a quick scan of the zdb code didn't uncover anything obvious, so I assume the error is being bubbled up from elsewhere in the codebase? [repo search]
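For anyone who wants to retrace that search, a quick grep over an OpenZFS checkout should show where the message is assembled; the paths below are my assumption of the current source layout:

```sh
# Where zdb prints the "dump backup" prefix (expected to live in cmd/zdb/zdb.c)
grep -rn "dump backup" cmd/zdb/

# dmu_send_obj() itself, whose EINVAL is presumably what gets reported
grep -rn "dmu_send_obj" module/zfs/ include/
```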
The pool and related info
OS: FreeBSD system running FreeNAS 11.3
Pool: raidz2 - 12 * 14 TB disks - 1 disk is UNAVAIL (dead, but still connected to the system)
zfs version? I was not able to accurately determine this (version-check sketch below). zdb -C says version: 5000. The healthy root pool on the system reported zpool get version as - and the root dataset zfs get version as 5. The available zdb options were limited/outdated.
Where are the backups? The pool owner was migrating between backup solutions (Backblaze and something else) and has a data gap.
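A minimal sketch of the version probing, in case anyone wants to reproduce it (the pool name is a placeholder; zfs version only exists on newer OpenZFS userlands, which is part of why pinning this down was hard):

```sh
# Pool/feature version from the on-disk config (5000 means "feature flags")
zdb -C <pool> | grep -w version

# Versions as reported by the running tools
zpool get version <pool>
zfs get version <pool>

# OS release and (on newer systems) the OpenZFS userland/module version
freebsd-version
zfs version
```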
What happened leading up to the incident?
To the best of my knowledge:
Pool is raidz2 on FreeNAS 11.3 (FreeBSD)
A business-as-usual I/O workload hung and the pool reported being OFFLINE
A graceful reboot was performed
Pool won't import and one disk is UNAVAIL:
cannot import <pool>: I/O error
Destroy and recreate the pool from a backup source
zpool import shows:
status: One or more devices are missing ... The pool can be imported despite missing ...
What things have been done/tried?
The pool owner tried a few things from online research, including zpool import -f -FX <pool> via a TrueNAS forum thread [link], and also tried https://www.klennet.com/zfs-recovery/ and https://www.ufsexplorer.com/ufs-explorer-raid-recovery/. The recovery tools were able to see some older pool data, but the majority of the recent data was not found/recovered, and some of the recent files that were recovered turned out to be corrupt.
When I offered to take a look via screenshare, I realised two things: "OH!" - it's BSD, and "OH!" - the tooling/versions are old. So we did some basic diagnostics on the FreeNAS console, and then I suggested booting the server from SystemRescue+ZFS 10.02+2.2.2 to see if the fresher versions could help, certainly providing modern zdb options like the -B dump dataset backup option.
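The rough plan for -B, assuming it worked, was something like this (the objset ID, paths, and receiving pool are placeholders):

```sh
# Generate a zfs send stream from the exported, unimportable pool
zdb -B -e <pool>/<objset-id> > /mnt/rescue/dataset.zsend

# Replay it into a healthy pool on another machine
zfs receive backup_pool/restored < /mnt/rescue/dataset.zsend
```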
The usual zpool import invocations and switches, sans -X, yield the same I/O Error / Destroy and recreate pool from a backup source error. Variations with -o readonly=on and -N didn't help, nor did -F variations.
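For completeness, the import attempts were roughly of this shape (pool name and altroot are placeholders; none of them got past the I/O error):

```sh
# Forced import, then read-only / no-mount variations
zpool import -f <pool>
zpool import -f -o readonly=on -N <pool>

# Rewind-style recovery import, still avoiding the extreme -X rewind
zpool import -f -F -o readonly=on -N <pool>

# Explicit device scan with an altroot
zpool import -f -d /dev -o readonly=on -R /mnt <pool>
```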
zdb -B -e <pool>/<objset> fails with the error dump backup: dmu_send_obj: Invalid argument. I was not able to find anything useful on this error via online research, and a quick scan of the OpenZFS code didn't uncover anything obvious. I need help understanding what could be causing this.
zdb -e -d <pool> basic dataset info:
Using zdb to get a list of objects in the pool's root dataset seems to output an empty list?
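If anyone wants to double-check the empty object listing, the probing was along these lines (repeating -d increases verbosity; the dataset path is a placeholder):

```sh
# Dataset summary for the exported pool
zdb -e -d <pool>

# More verbosity: per-dataset object listings and per-object detail
zdb -e -dd <pool>/<dataset>
zdb -e -dddd <pool>/<dataset>
```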
zdb -C revealed at least two child disks had aux_state: err_exceeded
A block leak test did reveal leaks; this was performed on the pool's original FreeBSD system:
Here is some zdb output for the pool's root dataset:
Head of the zdb -C output
What is the pool state right now?
The hardware has been booted on SystemRescue+ZFS 10.02+2.2.2 and is running a block checksum pass to see if it uncovers any clues as to where things are broken:
zdb -e -b -cc <pool> 2>&1 | tee blockcheck.out
This looks like it will take some days to complete. Yes, it puts the disks under read stress, but what is there to lose at this point?
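A small aside: since the rescue environment is reached over ssh and the pass runs for days, it may be worth keeping it alive independently of the session, for example (tmux or screen would work equally well):

```sh
# Run the traverse detached so a dropped ssh session doesn't kill it
nohup sh -c 'zdb -e -b -cc <pool> > blockcheck.out 2>&1' &

# Follow progress from any other session
tail -f blockcheck.out
```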
Open questions
Q1: Is it worth sending the UNAVAIL (and dead) drive to a data recovery firm to see if they can restore it, so it could be reintroduced to the pool in case that brings anything back?
Q2: Would trying master or the latest release of OpenZFS yield any benefits? The latest version used on the pool so far is 2.2.2. For example, install the latest Proxmox to a USB drive and try the diagnostics again? The Proxmox recovery/install console doesn't easily offer networking and/or ssh without significant effort.
Q3: Why is zdb -B -e <pool>/<objset> failing, where <objset> is the root dataset of the pool?
I wonder if this issue can be resolved? If so, maybe there is a chance to salvage the pool data?
Q4: Is there anything I/we have overlooked? Any tips, tricks, or further advice? Any module flags that could help? For example, could zfs_send_corrupt_data help zdb -B to run?
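On the zfs_send_corrupt_data idea specifically: zdb is a userland tool built on libzpool, so the kernel module parameter probably doesn't reach it, but zdb's -o option can set libzpool globals. Whether this particular tunable is reachable that way is an assumption on my part; if it is, it might look like:

```sh
# Ask the send path to emit fill patterns for unreadable blocks instead of
# erroring out (assumes zfs_send_corrupt_data is settable via zdb -o)
zdb -e -o zfs_send_corrupt_data=1 -B <pool>/<objset-id> > /mnt/rescue/dataset.zsend
```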
Q5: Anyone care to speculate why this has occurred and what is stopping ZFS from recovering?
Q6: Is there any merit in trying to upgrade the zpool or would that require importing it first?
Thanks for reading!