dsl_dataset: put IO-inducing frees on the pool deadlist #16722

Open
robn wants to merge 1 commit into master

Conversation

@robn (Member) commented Nov 5, 2024

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

Closes: #16697
Closes: #16708
Closes: #6783

Description

dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline.

Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times in a single transaction (e.g. a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so just one 2T object requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.

This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.
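
For illustration only, here's a minimal sketch of the routing decision described above. This is not the patch itself, and free_needs_io() is a made-up name:

/*
 * Sketch only, not the actual change. A free needs real IO (and hence a
 * zio_t) for gang, dedup and cloned blocks; everything else goes straight
 * to metaslab_free(). The idea is to divert the IO-inducing cases onto
 * the pool's deadlist for the async destroy machinery to process later.
 */
static boolean_t
free_needs_io(spa_t *spa, const blkptr_t *bp)
{
	return (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) ||
	    brt_maybe_exists(spa, bp));
}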

Implementation notes

I thought about a separate task queue or throttle for this, but using the pool deadlist and the async destroy machinery was much nicer. It's already there, and quiet most of the time. We have talked in the past about using it for unlink() for performance, and this does actually help with that: deleting enormous test files returned much faster, as you'd expect. However, performance is not the purpose of this PR, which is why I've kept it narrow. I'd like to think about it more before we do it for everything, not least how it interacts with a real dataset destroy, and also scrubs etc.

Still, it's cool that you can watch this with zpool wait -t free.

Discussion

There are a few different problems that lead to this happening. This patch isn't really addressing any of them, but rather detecting a known cause and routing around it. I'll write down all the trouble spots that lead to this so we at least have them.

zio_t is too big

1280 bytes is a lot for something we create by the thousands. I have a heap of ideas for how to reduce it and/or change the overheads of the IO pipeline and queues as a whole. They're all hugely invasive and not something I want to do in a hurry.

Bonus: there are many similarly-shaped bugs in the issue tracker that I believe boil down to the same kind of thing: it's easy to load up "too much" IO on the IO taskqs that then sits there, pinning memory, until the workers at the other end can clear it, which can be an age if the thing at the end is slow. Unfortunately it's not all just zio_t size.

"free" is not really an IO

I suppose it was probably useful to think about it that way because allocations are requested from within the IO pipeline. Most of the time though it's just metaslab_free(). Obviously the code has to go somewhere, so this is mostly felt in the API: zio_free() doesn't even take a pio, and zio_free_sync() does but might not issue IO, and also has other rules like "same txg". So I think the idea that free is "magic IO" is too magic, and its unpredictability is what gets us here.
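
For reference, the rough shape of the two entry points being talked about, paraphrased from memory rather than copied from the tree, so the exact types and flags may differ:

/* "Free this sometime, don't care how": no parent zio, no return value. */
void zio_free(spa_t *spa, uint64_t txg, const blkptr_t *bp);

/*
 * "Free it now, as IO if needed": takes a parent zio and may hand back a
 * child FREE zio, but comes with extra rules (e.g. the bp must belong to
 * the currently syncing txg).
 */
zio_t *zio_free_sync(zio_t *pio, spa_t *spa, uint64_t txg,
    const blkptr_t *bp, zio_flag_t flags);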

I did try to rework just these functions to be "free this sometime, don't care how" or "free it now or error" and then use the error to decide what to do. It didn't work out and I abandoned it.

"frees that cause IO" isn't obvious

The old gang and dedup tests are fine, for now. brt_maybe_exists() is a little troubling, as it's a feature in use everywhere now, and it might mean that a lot more frees are going onto the pipeline than used to.

But I also suspect it's not the full story. #16037 shows similar behaviour but doesn't appear to involve gang, dedup or cloned blocks. I couldn't find anything else that obviously might generate a ton of IO from a free, but there's still a lot of ZFS I'm not familiar with, and it also doesn't cover the future.

dmu_tx assignment isn't granular enough

Deleting a file gets a single call to dmu_tx_hold_free(tx, obj, 0, DMU_OBJECT_END). In that respect it's doing nothing wrong, but that's clearly too much all at once, and is ultimately what led to this, as the write throttle can only push back on single holds.
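
To make the point concrete, here's a simplified sketch of what that looks like. The function name is hypothetical, error handling is trimmed, and the assign flag may not match the current tree exactly:

/*
 * Simplified sketch: one hold spans the entire object, no matter how many
 * blocks sit behind it. The write throttle can only delay or reject this
 * single hold; it has no way to split it into smaller pieces.
 */
static int
delete_whole_object(objset_t *os, uint64_t obj)
{
	dmu_tx_t *tx = dmu_tx_create(os);
	dmu_tx_hold_free(tx, obj, 0, DMU_OBJECT_END);

	int err = dmu_tx_assign(tx, TXG_WAIT);
	if (err != 0) {
		dmu_tx_abort(tx);
		return (err);
	}

	/* all of the object's blocks are freed under this one hold */
	err = dmu_object_free(os, obj, tx);
	dmu_tx_commit(tx);
	return (err);
}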

write throttle doesn't know about resource overheads

Similar vibes to #15511. Most things don't account for CPU/memory/IO overheads, and so can't push back when those resources are scarce. There's no single answer of course, but it seems we're going to run into problems like this more and more as everything gets bigger and bigger.

How Has This Been Tested?

The goal is to create a single file with millions of L0 blocks, and then drop them all at once. This worked well enough for me. Needs a spare 2T+ disk.

zpool create -O recordsize=16K -O dedup=on tank vdf
dd if=/dev/random of=/tank/file bs=1M count=1M
rm /tank/file

Watching the zio_cache slab through the process is instructive. Without this patch, even if you haven't gone large enough to trigger an OOM, you will see the count blow out a long way. You may see substantial pauses while reclaim kicks in.

With this patch, it runs at a much more leisurely rate. Still gets big, but never ridiculous.

A ZTS run is going locally, though I'm not expecting to see anything much. At least, it should make sure I didn't screw up the accounting.

Update: @serjponomarev has tested this, see #16708 (comment). Seems plausible?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@satmandu (Contributor) commented Nov 5, 2024

I wonder if this might have any impact on the OOMs I was having when building and tearing down several large docker containers in sequence, as mentioned in #10255 (comment).

@amotin (Member) left a comment

Thanks. This looks like a good workaround for cases where we are unable to chunk a large delete across several TXGs. But I've got a couple of thoughts about it:

  • We usually consider deletes as netfree transactions, since they free more space than they allocate. This though makes no promise about when exactly the space is freed, which I guess in theory may allow pool overflow. Sure, things are already far from predictable in the case of dedup or cloning, but as I said, just a thought.
  • As I can see, zfs_max_async_dedup_frees does not apply to cloned blocks, since that is not easily visible from the block pointer, so I wonder if the issue may still be reproducible when the sync thread tries to free all the cloned blocks at once, or at least as many as it can within the few seconds of TXG timeout. I wonder if we could somehow add a limit for cloned blocks there, or whether we should re-enable zfs_async_block_max_blocks.

@amotin (Member) commented Nov 5, 2024

@satmandu ZFS ZIOs, the massive allocation of which is addressed here, are not accounted towards the ARC, so they may increase memory pressure on the system and make the kernel request ARC eviction; that request being ignored due to the evil zfs_arc_shrinker_limit may cause your OOMs. But this PR mostly addresses cases where either dedup or block cloning is in use. You may check if you have those active.

@satmandu (Contributor) commented Nov 5, 2024

@amotin I do have block cloning enabled.

zpool get feature@block_cloning rpool
NAME   PROPERTY               VALUE                  SOURCE
rpool  feature@block_cloning  active                 local

@amotin (Member) commented Nov 5, 2024

I do have block cloning enabled.

rpool  feature@block_cloning  active                 local

Enabled would not be a sign, but active may be. You may also look at how much you have cloned via zpool get all | grep bclone. It may not mean that the specific files you delete are cloned, but it is possible.

@satmandu (Contributor) commented Nov 5, 2024

I also have it active, I believe as per

cat /sys/module/zfs/parameters/zfs_bclone_enabled
1

Looking further:

zpool get all rpool | grep bclone
rpool  bcloneused                     1.60G                          -
rpool  bclonesaved                    2.24G                          -
rpool  bcloneratio                    2.39x                          -

(Maybe that's just not enough to warrant this PR having any effect on my system.)

Also, as an aside, looking at https://github.com/moby/moby/blob/master/vendor/golang.org/x/sys/unix/ioctl_linux.go, I see the use of FICLONE and FICLONERANGE, and also FIDEDUPERANGE if OpenZFS at some point supports that. So docker can use block cloning functionality on Linux if available.

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Nov 5, 2024

@robn (Member, Author) commented Nov 7, 2024

We usually consider deletes as netfree transactions, since they free more space than they allocate. This though makes no promise about when exactly the space is freed, which I guess in theory may allow pool overflow. Sure, things are already far from predictable in the case of dedup or cloning, but as I said, just a thought.

Yeah, it's a good thought. I agree it's not for this PR (at least, I think that's what "just a thought" means, heh). I wonder if the answer will be something like: if an allocation (or transaction?) comes in that we can't fulfill, we check the freeing list, and if it's running, we wait, or we force it to free enough right now, or... something.

I was also thinking today that maybe if dnode_sync_free_range_impl() is freeing indirects, and they're higher than say L2 or L3, we could flag that through to dsl_dataset_block_kill() so that it only puts "very large" frees on the deadlist.

I think that's an interesting idea in its own right, but maybe it could be wired in even higher up: like, past a certain size it's not marked netfree, and we use the netfree mark as a "now or later" flag. I guess then it might not be assignable, so idk, maybe we have to break it up instead of nuking the entire range in one go?

Maybe the whole answer will be in the ZIO layer though. Maybe we want a single "batch free" op that can take a list of BPs instead. Maybe that allows some batching down at the BRT and DDT layers too. Hmm hmm.
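
Purely to illustrate that idea, nothing like this exists today and the name and signature are invented:

/*
 * Hypothetical "batch free" entry point: hand the pipeline a whole list of
 * block pointers at once, so BRT/DDT lookups and the frees themselves can
 * be batched, instead of spawning one zio_t per block. Not a real API.
 */
zio_t *zio_free_batch(zio_t *pio, spa_t *spa, uint64_t txg,
    const blkptr_t **bps, uint64_t nbps, zio_flag_t flags);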

Ehh, this is all complicated, it's just deciding where you put the complexity.

As I can see, zfs_max_async_dedup_frees does not apply to cloned blocks, since that is not easily visible from the block pointer, so I wonder if the issue may still be reproducible when the sync thread tries to free all the cloned blocks at once, or at least as many as it can within the few seconds of TXG timeout. I wonder if we could somehow add a limit for cloned blocks there, or whether we should re-enable zfs_async_block_max_blocks.

Untested, but the clone block check might be just:

diff --git module/zfs/dsl_scan.c module/zfs/dsl_scan.c
index 6cd0dbdea..0de1df948 100644
--- module/zfs/dsl_scan.c
+++ module/zfs/dsl_scan.c
@@ -3578,7 +3578,7 @@ dsl_scan_free_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
 	    -bp_get_dsize_sync(scn->scn_dp->dp_spa, bp),
 	    -BP_GET_PSIZE(bp), -BP_GET_UCSIZE(bp), tx);
 	scn->scn_visited_this_txg++;
-	if (BP_GET_DEDUP(bp))
+	if (BP_GET_DEDUP(bp) || brt_maybe_exists(scn->scn_dp->dp_spa, bp))
 		scn->scn_dedup_frees_this_txg++;
 	return (0);
 }

It's a bit tedious, since we did that test to get into this path, and again moments before in zio_free_sync(). It's not super expensive, but there's still a read lock and a couple of AVL lookups. I have no clear sense of what that means for contention elsewhere, nor for scan free performance.

@behlendorf (Contributor) commented Nov 7, 2024

Looking at the CI results, it looks like with this change we're running afoul of some of the accounting.

https://github.com/openzfs/zfs/actions/runs/11714317314/job/32635388639?pr=16722#step:11:2332

  [ 3876.039204] ZTS run /usr/share/zfs/zfs-tests/tests/functional/mmp/mmp_hostid
  [ 3876.996904] VERIFY3(vd->vdev_stat.vs_alloc >= -alloc_delta) failed (0 >= 1024)
  [ 3876.999814] PANIC at vdev.c:5133:vdev_space_update()
  [ 3877.002032] Showing stack for process 225423
  [ 3877.004900] CPU: 1 PID: 225423 Comm: dp_sync_taskq Tainted: P           OE     5.10.0-33-amd64 #1 Debian 5.10.226-1
  [ 3877.010650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [ 3877.014083] Call Trace:
  [ 3877.015473]  dump_stack+0x6b/0x83
  [ 3877.017042]  spl_panic+0xd4/0xfc [spl]
  [ 3877.018735]  ? srso_alias_return_thunk+0x5/0x7f
  [ 3877.020574]  ? srso_alias_return_thunk+0x5/0x7f
  [ 3877.024945]  ? zio_write+0x65/0x110 [zfs]
  [ 3877.026679]  ? remove_reference+0x1b0/0x1b0 [zfs]
  [ 3877.028553]  ? srso_alias_return_thunk+0x5/0x7f
  [ 3877.030445]  ? zfs_btree_find+0x174/0x320 [zfs]
  [ 3877.032328]  ? arc_write+0x38c/0x730 [zfs]
  [ 3877.034069]  ? arc_hdr_alloc_abd+0x250/0x250 [zfs]
  [ 3877.036564]  vdev_space_update+0x1e3/0x2d0 [zfs]
  [ 3877.038642]  arc_hdr_l2hdr_destroy+0x67/0xe0 [zfs]
  [ 3877.040771]  arc_hdr_destroy+0x95/0x3b0 [zfs]
  [ 3877.042658]  arc_freed+0x64/0x170 [zfs]
  [ 3877.044562]  zio_free_sync+0x66/0x190 [zfs]
  [ 3877.046567]  zio_free+0xbc/0x100 [zfs]
  [ 3877.048418]  dsl_dataset_block_kill+0x745/0xb50 [zfs]
  [ 3877.050855]  free_blocks+0x19b/0x340 [zfs]
  [ 3877.055043]  dnode_sync_free_range_impl+0x1c4/0x360 [zfs]
  [ 3877.057633]  dnode_sync_free_range+0x73/0xf0 [zfs]
  [ 3877.059958]  ? dnode_sync_free_range_impl+0x360/0x360 [zfs]
  [ 3877.062647]  range_tree_walk+0x6f/0xb0 [zfs]
  [ 3877.064665]  dnode_sync+0x42e/0xf70 [zfs]
  [ 3877.066538]  dmu_objset_sync_dnodes+0x88/0x130 [zfs]
  [ 3877.068768]  sync_dnodes_task+0x42/0x1e0 [zfs]
  [ 3877.070695]  taskq_thread+0x285/0x5c0 [spl]
  [ 3877.072901]  ? wake_up_q+0xa0/0xa0
  [ 3877.074402]  ? dnode_rele_task+0x70/0x70 [zfs]
  [ 3877.076179]  ? taskq_lowest_id+0xc0/0xc0 [spl]
  [ 3877.078053]  kthread+0x11b/0x140
  [ 3877.079357]  ? __kthread_bind_mask+0x60/0x60
  [ 3877.081172]  ret_from_fork+0x22/0x30
