Kernel oops page fault triggered by Docker in arc_prune #16324
I've seen this several times on several systems running ZFS 2.2.4 and Linux 6.8. Reverting to Linux 6.6 is stable. I have a trace which looks similar to yours.
I have the same problem on the latest Unraid 7.0.0-beta-1 prerelease.
I'm a moderator at the Unraid forums and we have seen multiple users hit this issue with Docker on ZFS since kernel 6.8 (OpenZFS 2.2.4-1); there was also one report with kernel 6.7 during beta testing. The call traces all look very similar; sharing some examples in case it helps.
We've been hitting what looks to be this issue ever since we launched our new infrastructure, all running on the latest Ubuntu kernel and ZFS. Once we had migrated some hundreds of container workloads, we started experiencing crashes. We've been very unfortunate in our crash dump collection, but we've ascertained that the crashes we are seeing are very similar to this; please see the LXCFS issue linked above. Symptoms as we see them:
These workloads were all stable for years on older kernels. This is a real issue and I would not be surprised to learn that a lot of ZFS users out there are being affected by it now. It took us a long time to track down the source of our crashes, and I expect others may be in the same situation. I believe this issue warrants immediate attention, especially since upgrading to the latest mainline-ish kernel and ZFS does not seem to resolve it.
Taking into account that this crash happens in shrinker-related code, I can make a wild guess that this issue should be provokable by something that forces the shrinkers to run.
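The exact command suggested above isn't shown here; a common way to force the kernel's shrinkers (and therefore the pruning callbacks ZFS registers) to run is the drop_caches knob. A minimal sketch, assuming root and assuming this is roughly what was meant:

# Hypothetical: force reclaim of slab objects, which invokes every registered
# shrinker, including the ones ZFS registers for ARC/dnode pruning.
echo 2 > /proc/sys/vm/drop_caches   # reclaimable slab objects only
echo 3 > /proc/sys/vm/drop_caches   # pagecache plus slab objects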
I tried that and it did not crash my system or trigger the issue. AFAIK the ZFS ARC is separate. My guess would be that it's hitting a race condition of sorts under heavy IO, where it has to evict a lot out of the ARC.
I have a similar issue with kernel 6.8.12, Docker 27.0.3, and ZFS 2.2.4 (zfs-2.2.4-r0-gentoo) on Gentoo. My trace does not include arc_prune, but zfs_prune instead.
@1JorgeB any chance switching storage drivers might be a reliable workaround until this gets resolved, versus using a btrfs image?
Same issue here, on 6.8.9. Crashes after an AI training workload has run for a few hours. Will revert to 6.6. EDIT: 6.6 is stable (I am not using Docker, just standard filesystem reading and writing).
FYI, OpenZFS has supported Docker's overlay2 storage driver since 2.2; see moby/moby#46337 (comment). I gave up on the Docker zfs storage driver some time ago, as it was pretty buggy. If you are starting Docker with systemd you can modify the startup line to be:
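The modified startup line itself isn't shown above; a sketch of one way to do it with a systemd drop-in, with the paths assumed rather than taken from the original comment:

# Hypothetical drop-in overriding ExecStart; note that images and containers
# created under the zfs driver will not be visible to overlay2.
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/storage-driver.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --storage-driver overlay2
EOF
systemctl daemon-reload && systemctl restart docker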
Using overlay2 I don't have any issues loading it.
Similar situation here, this time with 6.10.4 and ZFS 2.2.5-1 on Debian. This occurred during a pull. I was forced to power cycle after this, which incidentally upgraded my kernel to 6.10.6, and the same pull succeeded fine afterward. But the ARC conditions would also be entirely different after a fresh boot, so I doubt the kernel upgrade mattered. Just some info. FWIW, my kernel is in lockdown mode due to secure boot.
Still reproducing in ZFS 2.2.6 and kernel 6.10.8-zen1-1-zen.
I was playing with this one today and trying to reproduce it:
I've also enabled SLUB debugging:
and KFENCE:
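The exact options used aren't shown above; a sketch of typical settings for this kind of hunt, with the flags and values assumed rather than taken from the comment:

# Hypothetical kernel command-line additions (e.g. appended via GRUB):
#   slub_debug=FZPU             sanity checks, red zoning, poisoning, alloc/free tracking
#   kfence.sample_interval=100  enable KFENCE, sampling an allocation every 100 ms
# Verify after reboot:
cat /proc/cmdline
dmesg | grep -i kfence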
With no results, unfortunately. I also tried limiting the amount of physical memory from 256 GiB to 128 G, 64 G and 32 G, with the same results: no crashes. Likely, it's a tricky race condition.
@mihalicyn maybe a silly question, but did you use Docker's ZFS storage driver?
Not a silly question at all ;-) When debugging stuff, everything must be checked twice! Yeah, I do have the ZFS storage driver enabled manually:
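The configuration itself isn't shown above; a minimal sketch of enabling the zfs storage driver explicitly via /etc/docker/daemon.json, assuming that is how it was done here:

# Hypothetical: pin the storage driver to zfs, then restart Docker and confirm.
cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "zfs"
}
EOF
systemctl restart docker
docker info --format '{{.Driver}}'   # should print: zfs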
That is very odd, it crashes every single time for me. It's a guaranteed crash whenever I run that command. I'll see if I can make a VM that reproduces. In the meantime, I can provide any debug log asked. |
Please can you try to enable SLAB debugging with
That would be awesome too!
Alright, I've been working on this for a bit and I haven't been able to reproduce on my laptop, nor in a VM on said laptop. I then did a quick sanity check and yup, it still crashes first time on my desktop. So I decided to instrument the desktop, and then... nothing. The trace I got this time, however, is different: it tried to execute an NX-protected page? The only thing of note I can think of that might be a contributor is that this is a Threadripper 1950X system, 16C/32T with 32 GB of RAM, so it's a NUMA system with a relatively high core count, which leaves a lot of room for a race condition. Maybe if others in here can share their specs we can correlate some things.
Additional information that I think could help narrow it down: I think it's possibly related to the creation and destruction of datasets and snapshots. I never see it die during the extraction; I see it die at the very end, when it commits the Docker layer, whatever it's doing.
That smells like a use-after-free race condition triggered by changes in datasets and snapshots. Is there an existing stress test for that I could try? I'll see if I can write one later this weekend when I have more time.
Thanks for doing that!
Yeah, it can make things way too slow. You can play with the parameters to tune that.
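The concrete parameter suggestion is cut off above; presumably this is about KFENCE's sampling, which comes up again a few comments below. A sketch with assumed values, not the original advice:

# Hypothetical: make KFENCE sample far more often (lower interval = better
# coverage, more overhead). Set at boot, or at runtime on kernels that allow it:
echo 10 > /sys/module/kfence/parameters/sample_interval
# boot-time equivalent: kfence.sample_interval=10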
I've observed the same error. Context:
I've reproduced on every system I have in production, which are all "above average" core count systems: AMD Threadripper 3990X (64c/128t), Ryzen 3900X and 5900X (12c/24t). I can't try slub_debug or KFENCE in production, but I could try within KVM on the same systems.
We also observed this across many high core count systems, in our case dual Xeon Platinum systems with 56C/112T.
Hi @TheUbuntuGuy,
actually, you can enable KFENCE in production. It is designed to be enabled in production to debug issues like this one.
I reverted the kernel on my production systems to v6.6 as they were unusable due to this bug. I meant that I couldn't test using those options due to it being a production system and I can't crash it. I will try in a VM using the same CPU layout when I get the chance, probably with some automation to try the crash over and over, since KFENCE is sampled and may not catch the issue quickly.
@TheUbuntuGuy thanks a lot, can you please try with
Either I'm just hilariously wandering in the dark and creating more problems along the way than I'm solving, or we're really uncovering bug after bug in low-memory conditions... Update:
This bug occurred to me today on my backup server, which is not running any Docker containers. The backup server is based on Ubuntu 22.04.5 LTS.
The Intel Core i7-4770 is rather old, featuring only 4C/8T and 32 GB DDR3 RAM. I installed ZFS using
The crash happened 24 hours (±15 minutes according to dmesg) after I started to move very big directories with lots of hard-linked files from my base pool to freshly created datasets. How the bug occurred:
Then I moved the directories to the newly created datasets:
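The exact commands aren't shown above; a sketch of the shape of the operation described, with pool and dataset names invented for illustration:

# Hypothetical example with made-up names: one fresh dataset per server, then
# move the existing, heavily hard-linked directory tree into it.
zfs create backup/server01
mv /backup/old/server01 /backup/server01/
# ...repeated for all 11 servers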
I did this for all 11 servers. This, of course, caused very heavy iowait of up to 70%. Currently, there are numerous processes stalled in D state waiting for disk IO, and the whole server is idling with no disk activity. Besides the mv processes, there are 3 kernel threads stalled in that same D state. I've managed to dump the entire system memory to my /root directory on ext4 using
I will leave the system running in its bugged state for a few days and I'm happy to provide any additional information if needed. FYI: the stack trace shows some
@TheOnlyMarkus you have a relatively old kernel.
@snajpa thanks for your work on the fix! I believe that vpsfreecz@12e235d should fix a crash in
There is a more detailed back and forth in vpsfreecz#1. It appears easier to reproduce with certain kernel configs, but we aren't sure why. I don't want to speak for @snajpa, but it looks like it may be a memory management behavioural change in the kernel. I can test 6.7.12 separately if that is of value.
@mihalicyn yup, that one should avoid the sb going away from underneath the reclaim; I also hunted down 7fd4d7f, which fixes #16608. But there's this thing with a seemingly ARC-loaned dbuf being overwritten, or, more generally, it's just some crap that's corrupting memory under low-memory conditions, UAF style IMO, but I still can't quite pinpoint where that is happening.
arc_write/arc_write_done callback vs. dbuf_evict is where the stacks lead me, but I don't see anything so far.
Another stacktrace from kernel 6.11.7 / zfs 2.2.6
The exact line from zfs_prune:
It looks like the function pointer shrinker->scan_objects is invalid
Yes, and that is already solved by my patch above (^), but it's no good until another problem is solved.
I'll try KCSAN on it, but I'm exhausted and need to take a few days off. We need to run current kernel versions because of all the development going into making container workloads run more smoothly; it's a lot of changes. It picked up steam somewhere around 5.7 and isn't slowing down...
I assume there are unit tests that can be set to test behavior in low-memory conditions. (And if not, should there be some?) |
Currently there aren't any lowmem test scenarios, but there should be. I'd be willing to set it up and look after the needed infra (as I don't think it'll make sense to do it in the cloud), but I also have to make a living somehow, so if any org making actual money with OpenZFS would like to hire me to work on these things, I'm all in :)) Even with RHEL 7 behind us, the landscape of supported kernels isn't getting any easier, and the lowmem tests would have to run for all of them. While on the topic of tests, to the best of my knowledge there also isn't any up-to-date alternative to @behlendorf's xfstests fork; that's another area that could use some love.
Meh, KCSAN has been unhelpful so far. I guess I need new perspectives, so a break for me it really is.
So that UAF really does seem to be fixed in master :) On that PR now; thanks everyone, especially @TheUbuntuGuy.
@snajpa what is the commit/PR that fixed the UAF? Also, big thanks to you for putting a huge amount of time into this issue. I am truly grateful.
@TheUbuntuGuy it seems to be a bug in the 6.10 series in the end. Can you verify it's OK with #16770 and 6.11? It looks to me as if those KCSAN things I saw with 6.10 weren't false alerts and were actually a bug in the kernel :) It works with 6.11 (tried 6.11.6 and higher) with
@TheUbuntuGuy awesome, thank you for all your help with this :) Luckily 6.10 isn't LTS, but perhaps we could still use some mechanism to blacklist known problematic kernel versions... @robn what do you think?
@snajpa sorry, can you give me a summary? Maybe a link to the upstream bug? Until then, I can guess at how I'll respond:
@robn vpsfreecz#1 is where we were trying to debug some really weirdly manifesting problems with 6.10.x... Your generic answer is good enough for me as is; it works out to the fact that I have to return to 6.10 and retest. But now I have yet another fancy crash of 2.3.0-rc3 + a bit of rc4 + my patches to attend to :D
@TheUbuntuGuy @satmandu @maxpoulin64 @AllKind @IvanVolosyuk please excuse my screw-up, for which I offer a fix here: #16817. There's a possibility that dirty data might not get written back from the pagecache to the ZFS upper layers in time when a container shuts down... It might be somewhat mitigated in situations that don't involve cleanup_mnt (shutdown of a mount namespace where ZFS was used as the backing FS for the apps). The mitigation would be to set the dirty-writeback knobs lower; a sketch follows. You could also set, during the shutdowns,
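The concrete values are cut off above; the follow-up comments and the shutdown-script suggestion further down point at the kernel's dirty-page writeback sysctls. A sketch with assumed values, not taken verbatim from the comment above:

# Hypothetical mitigation: flush dirty pagecache data sooner and keep less of it
# outstanding, so little is left unwritten when a container shuts down.
sysctl -w vm.dirty_writeback_centisecs=100     # wake the flusher threads every 1 s
sysctl -w vm.dirty_background_bytes=268435456  # start background writeback at 256 MiB dirty
sysctl -w vm.dirty_bytes=1073741824            # hard-throttle writers at 1 GiB dirty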
Is this still an issue with the docker overlay2 storage driver?
@satmandu when ovl is used the container shutdown itself shouldn't be a concern, but
It does look like some of those values can be smaller? These are my current settings:

sysctl vm.dirty_writeback_centisecs
vm.dirty_writeback_centisecs = 500
sysctl vm.dirty_bytes
vm.dirty_bytes = 0
sysctl vm.dirty_background_bytes
vm.dirty_background_bytes = 0
mdadm has a file in the systemd hierarchy:

#!/bin/sh
# We need to ensure all md arrays with external metadata
# (e.g. IMSM, DDF) are clean before completing the shutdown.
/usr/sbin/mdadm --wait-clean --scan

Would it make sense to add something like this as an implementation of your suggested workaround, @snajpa?

#!/bin/sh
# /usr/lib/systemd/system-shutdown/zfs.shutdown
# Mitigation for dirty data possibly not get written-back from pagecache to ZFS
# upper layers on time when a container shuts down in situations that don't
# involve cleanup_mnt (shutdown of a mount namespace where ZFS was used
# as the backing FS for the apps).
# See https://github.com/openzfs/zfs/issues/16324#issuecomment-2506816817
sysctl -w vm.dirty_writeback_centisecs=1 && sleep 11
zfs unmount -a

I'm not sure of the timing of the zfs unmounts in relation to the rest of the system shutdown process, so I'm not sure whether that would be too late in the shutdown process to be useful. (Also,
Or maybe it makes sense to add a
System information
I'm holding to 6.8.9 specifically to stay within official supported kernel versions.
Describe the problem you're observing
Extracting large container images in Docker causes ZFS to trigger an unhandled page fault, and permanently locks up the filesystem until reboot. Sync will never complete, and normal shutdown also doesn't complete.
Describe how to reproduce the problem
Running this particular container reliably hangs ZFS on my system during extraction, using Docker's ZFS storage driver.
It gets stuck on a line such as this one and never completes; killing the Docker daemon makes it a zombie, and IO is completely hosed.
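The specific image is not named here; a generic sketch of the shape of the reproduction, with the image left as a placeholder:

# Hypothetical repro shape: Docker configured with the zfs storage driver,
# then pull an image with large layers; the hang shows up during/after
# layer extraction and the filesystem never recovers until reboot.
docker info --format '{{.Driver}}'   # expect: zfs
docker pull <large-image>            # placeholder; the original report names a specific image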
Include any warning/errors/backtraces from the system logs
The stack trace is always the same. Disk passes scrub with 0 errors after rebooting.