
zpool import hangs forever. Edit: trim under heavy IO might damage the pool #356

Open
Anankke opened this issue Feb 25, 2024 · 17 comments

@Anankke

Anankke commented Feb 25, 2024

System information

Type Version/Name
Distribution Name Windows 10 22H2 19045
Architecture AMD64
OpenZFS Version zfs-2.2.99-5-ga6951e43bf

Describe the problem you're observing

When running 'zpool import', the command hangs while setting physpath for the last drive and freezes. The pool cannot be imported, and 'zpool status' hangs with no output, so no data can be retrieved.

Describe how to reproduce the problem

Run a scrub and a trim after some heavy IO. During the trim, everything freezes. I have no exact idea how it goes wrong: it was running a scrub, but the whole pool seemed to freeze, and a reboot ruined everything.

Include any warning/errors/backtraces from the system logs

zpool import -f

 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 0000000a`35bfc308 00007ffc`3bcc591b : 00000000`00026040 00007ff6`dd4a4efd 00000000`00000020 00000000`00000001 : ntdll!NtDeviceIoControlFile+0x14
01 0000000a`35bfc310 00007ffc`3e435921 : 00000000`9c40200b 0000000a`00000000 0000000a`35bfc490 00007ff6`dd4509ce : KERNELBASE!DeviceIoControl+0x6b
02 0000000a`35bfc380 00007ff6`dcb57d47 : 00000124`1477b190 00000000`00000000 00000000`00000000 00000000`00000000 : KERNEL32!DeviceIoControlImplementation+0x81
03 0000000a`35bfc3d0 00007ff6`dcb57c11 : 00000124`00000000 00007ff6`dcb2b99d 00000000`00000001 00000000`00026040 : zpool!zcmd_ioctl_compat+0xe7 [C:\src\openzfs\lib\libzfs_core\os\windows\libzfs_core_ioctl.c @ 54] 
04 0000000a`35bfc460 00007ff6`dcb2fbc8 : 00000002`00000040 11f75162`23794a00 00000000`00000000 00000124`169c1070 : zpool!lzc_ioctl_fd+0x41 [C:\src\openzfs\lib\libzfs_core\os\windows\libzfs_core_ioctl.c @ 110] 
05 0000000a`35bfc4b0 00007ff6`dcb1cf5a : 00000000`0000001d 00000000`06213cdd 00000124`169c2028 00000124`169b1d80 : zpool!zfs_ioctl+0x28 [C:\src\openzfs\lib\libzfs\os\windows\libzfs_util_os.c @ 55] 
06 0000000a`35bfc4f0 00007ff6`dcae9b89 : 00000000`00074835 00000001`00000000 00000124`14842e90 00000000`00000012 : zpool!zpool_import_props+0x2aa [C:\src\openzfs\lib\libzfs\libzfs_pool.c @ 2150] 
07 0000000a`35bfe0d0 00007ff6`dcae964c : 00000000`00000011 00000124`147886c0 0000000a`35bfe490 0000000a`35bfe448 : zpool!do_import+0x4d9 [C:\src\openzfs\cmd\zpool\zpool_main.c @ 3240] 
08 0000000a`35bfe1e0 00007ff6`dcad8f29 : 00007ff6`dd51d71d 0000000a`35bfe438 00000124`0000000a 00000124`14780050 : zpool!import_pools+0x4fc [C:\src\openzfs\cmd\zpool\zpool_main.c @ 3373] 
09 0000000a`35bfe2a0 00007ff6`dcad07c9 : 00000124`147600c0 00007ff6`dd5162d2 0000000a`35bfec10 0000000a`35bfeb10 : zpool!zpool_do_import+0xf99 [C:\src\openzfs\cmd\zpool\zpool_main.c @ 3864] 
0a 0000000a`35bfe920 00007ff6`dd3cfd49 : 00007ff6`00000000 00007ff6`dd49b953 00000000`00000000 00007ff6`dd3d0b2d : zpool!main+0x2b9 [C:\src\openzfs\cmd\zpool\zpool_main.c @ 11308] 
0b 0000000a`35bffb50 00007ff6`dd3cfbee : 00007ff6`dd515000 00007ff6`dd515330 00000000`00000000 00000000`00000000 : zpool!invoke_main+0x39 [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 79] 
0c 0000000a`35bffba0 00007ff6`dd3cfaae : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : zpool!__scrt_common_main_seh+0x12e [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288] 
0d 0000000a`35bffc10 00007ff6`dd3cfdde : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : zpool!__scrt_common_main+0xe [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 331] 
0e 0000000a`35bffc40 00007ffc`3e437344 : 0000000a`35c8d000 00000000`00000000 00000000`00000000 00000000`00000000 : zpool!mainCRTStartup+0xe [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 17] 
0f 0000000a`35bffc70 00007ffc`3e5a26b1 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : KERNEL32!BaseThreadInitThunk+0x14
10 0000000a`35bffca0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21

I tried importing on Linux as well, with the same result; the pool seems to be broken.

If you are lucky, try importing read-only with recovery on.
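For reference, a read-only recovery import along those lines typically looks like the following sketch. The pool name `tank` is illustrative; substitute your own pool. These are standard `zpool import` flags, not a guaranteed fix for this particular corruption.

```shell
# Hypothetical recovery attempt (pool name "tank" is a placeholder):
#   -f              force import (pool may appear in use by another system)
#   -F              rewind recovery: discard the last few transactions if needed
#   -o readonly=on  import without writing anything to the pool
zpool import -f -F -o readonly=on tank

# To check whether the rewind *would* succeed without actually performing it:
zpool import -F -n tank
```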

@Anankke
Author

Anankke commented Feb 27, 2024

Updated.

@lundman

lundman commented Feb 27, 2024

Oh hey, new ticket. OK, those stacks are a good start, but they only show that it called into the kernel. We need to dump the kernel stacks to see what is happening there.

If you know how to attach windbg/VS debugger, the command is
.logopen c:\src\stacks.txt ; !stacks 2; .logclose

@Anankke
Author

Anankke commented Feb 29, 2024

I can reproduce this by issuing a trim under heavy IO. After a while, all the vdevs in the pool start to accumulate checksum errors that grow quickly, showing the same count across all of them. Then the pool freezes; when you reboot, Windows takes forever to shut down and you have to power off. After the reboot, you can no longer import this pool read-write on either Windows or Linux.
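The reproduction above can be sketched roughly as follows. This is an illustrative outline, not the reporter's exact workload: the pool name `tank` and the write workload are placeholders (on Windows you would generate the load with a different tool, e.g. a large file copy), while `zpool trim` and `zpool status -t` are standard commands.

```shell
# Hypothetical reproduction sketch: sustain a heavy write workload on the
# pool, then issue a manual TRIM while the writes are in flight.
dd if=/dev/zero of=/tank/loadfile bs=1M count=100000 &  # heavy sequential writes (placeholder workload)
zpool trim tank                                         # TRIM under heavy IO
zpool status -t tank                                    # watch trim progress and checksum error counts
```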

@lundman

lundman commented Feb 29, 2024

OK so you are saying potentially trim might be veerrryy bad, and I should disable it for now?

@Anankke
Author

Anankke commented Feb 29, 2024

No sir, I am still trying to work out exactly what is happening. Trim seems very likely to be involved, since in my experience it has some chance of ruining the whole pool. I was just trying to solve the problem. In my case I was writing at 1.2 GB/s when this happened.

lundman added a commit that referenced this issue Feb 29, 2024
Appears to corrupt pool, take safe option until it can be
investigated.

Signed-off-by: Jorgen Lundman <[email protected]>
@Anankke Anankke changed the title zpool import hangs forever zpool import hangs forever. Edit: trim under heavy IO might damage the pool Feb 29, 2024
@lundman

lundman commented Feb 29, 2024

Yeah, I will disable trim to be safe. I cannot test trim in a VM, as it isn't supported by VMware. I will plug in a real device and double-check that the sector math is correct.

@Anankke
Author

Anankke commented Feb 29, 2024

More context: I got two pool failures under similar circumstances, on two pools: one raidz and the other raidz2. All vdevs are healthy, responsive, and can be trimmed individually when not under ZFS.

The following commands were used to create the pools.

zpool create -O casesensitivity=insensitive -o ashift=12 -O atime=off -O relatime=off -O recordsize=1M -O normalization=formD -O xattr=sa -O dnodesize=auto -O compression=zstd-fast -O prefetch=none ZFS raidz PHYSICALDRIVE0 PHYSICALDRIVE1 PHYSICALDRIVE2 PHYSICALDRIVE3 PHYSICALDRIVE4 PHYSICALDRIVE5

zpool create -O casesensitivity=insensitive -o ashift=12 -O acltype=posixacl -O xattr=sa -O atime=off -O relatime=off -O recordsize=16M -O dnodesize=auto -O normalization=formD -O compression=zstd-9 Z2 raidz2 PHYSICALDRIVE6 PHYSICALDRIVE7 PHYSICALDRIVE8 PHYSICALDRIVE9 PHYSICALDRIVE10 PHYSICALDRIVE11

@lundman

lundman commented Feb 29, 2024

Hmm, starting to sweat here. I think I may need to add the partition offset to the offset being trimmed. Compare the first and last lines in the selected block:

https://github.com/openzfsonwindows/openzfs/blob/windows/module/os/windows/zfs/vdev_disk.c#L727-L755 😬

@lundman

lundman commented Feb 29, 2024

OK, pushed out a new release, OpenZFSOnWindows-debug-2.2.99-13-gfddfb6aeb5.exe. If you have the energy, check that it's OK by default, then try enabling trim and check whether it is fixed now.

@Anankke
Author

Anankke commented Mar 10, 2024

Well, as I don't have a test pool currently... I don't dare to try it now 😭

@v1ckxy

v1ckxy commented Apr 18, 2024

I think I suffered the same issue, but without any kind of trim enabled (the registry option is set to 0, and the disks are mechanical, so there is nothing to trim).

Yesterday, before shutting down the system, I executed the usual commands:

zfs unmount -a
zpool export -a

However, zpool took forever, so I left the system powered on and went to bed.
And... this morning, the zpool command was still "running", so I forced a restart (shutdown /t 0 /r /f).

After that, the pool does not import and zpool hangs forever:
[screenshot]

Yesterday I copied ~1.x TB of small files (~1800 files) into the pool, with no issues whatsoever.

Any advice?

@lundman

lundman commented Apr 18, 2024

This shouldn't be related to the trim issue; that is now disabled.

It would be interesting to get a dump of the process while the import is running to see what it is doing. I wonder if Windows has a way to do that without attaching a remote debugger.

@v1ckxy

v1ckxy commented Apr 18, 2024

Actually, I forced a restart again (shutdown /t 0 /r /f), tried importing the pool again, and...
it loaded 🧐🤔😵‍💫:

[screenshot]

Shall I open a new issue then? It looks like Windows is doing something here.

@lundman

lundman commented Apr 18, 2024

Open one if you come across it again. I was wondering if it was unlinked-drain being slow, but afaik that is async these days.

@lundman

lundman commented Apr 18, 2024

Also, if you are just rebooting or shutting down, you don't need to "unmount" and "export". You should generally only export if you are moving the storage to different hardware, like plugging it into another machine or booting a different OS.

@v1ckxy

v1ckxy commented Apr 18, 2024

Oh, I thought it was mandatory, and I was doing it before every shutdown 😅

Right now I'm scrubbing the pool. Let's see if everything's fine.

  scan: scrub in progress since Thu Apr 18 09:44:53 2024
        9.66T / 16.3T scanned at 13.1G/s, 650G / 16.3T issued at 884M/s
        0B repaired, 3.91% done, 05:08:37 to go

I'll open an issue if something strange happens again (TBH, I'm missing some kind of log file :D).
Thank you for your hard work.

@lundman

lundman commented Apr 18, 2024

kstat will dump a bunch of logs if you set verbose=1, AFAIK.
We also have the internal cbuf, but I don't know if we kept the "save to disk" feature. Perhaps we should resurrect it.
