During pool export flush the ARC asynchronously #16215
base: master
Conversation
@gamanakis -- if you have time could you look at the L2 part of this change? Thanks
Thanks for including me on this; on a first pass it looks good.
Looks interesting, but what happens if the pool (obviously with the same GUID) is re-imported while the async flush is still running?
```c
if ((dev = l2arc_dev_get_next()) == NULL ||
    dev->l2ad_spa == NULL) {
```
Could you please explain the `l2ad_spa` and `l2ad_vdev` locking semantics, and as part of that, how `l2ad_spa` can be NULL here if we assume the locking is correct, or how places like `arc_hdr_l2hdr_destroy()` avoid a NULL dereference due to a race?
Note the above `dev->l2ad_spa == NULL` check was defensive (part of development) but is not necessary, since `l2arc_dev_invalid()` already checks under locks whether the spa was removed for a device. `l2arc_dev_get_next()` will never return with a NULL `dev->l2ad_spa`.

As far as locking is concerned:
- The `spa_namespace_lock` covers a transition of `spa->spa_is_exporting` and the removal of a `spa_t`.
- The spa config `SCL_L2ARC` lock protects a vdev/spa from being removed while in use.
- The `l2arc_dev_mtx` protects the L2 device list and an L2 device's `l2ad_spa` and `l2ad_vdev` fields.

`l2arc_dev_get_next()` hands out L2 devices and returns with the spa config `SCL_L2ARC` lock held. There are two possible spa exceptions that `l2arc_dev_get_next()` checks for:
- spa is being removed (`dev->l2ad_spa->spa_is_exporting == B_TRUE`) -- protected by `spa_namespace_lock`
- spa was removed (`dev->l2ad_spa == NULL`) -- protected by `l2arc_dev_mtx`

This means that when an L2 device is being removed, both the `l2arc_dev_mtx` and a writer spa config `SCL_L2ARC` lock should be held to prevent any races. Note that the latter is currently not true and will be remedied.

Also, `arc_hdr_l2hdr_destroy()` is protected by the `dev->l2ad_mtx` lock. The `l2ad_vdev` can be NULL after a pool is exported while an async ARC removal is still in progress.

As indicated above, there is a race that I now see in `l2arc_remove_vdev()`. It was holding the `l2arc_dev_mtx` to transition `dev->l2ad_spa` to NULL, but it wasn't taking the spa config `SCL_L2ARC` lock (as writer) to let any in-flight devices from a past `l2arc_dev_get_next()` drain.
> Note the above `dev->l2ad_spa == NULL` check was defensive (part of development) but is not necessary
I don't like unnecessary defensive checks like that. They complicate code reading. Impossible things should be asserted.
Good question, and I tested this case. The async flush will continue its best-effort pass to flush buffers associated with the exported spa's guid. Any ARC buffers that the import uses will have a positive ref count and be skipped by the async flush task. You can think of it as an alternate ARC evict thread that targets a specific guid with a zero ref count, rather than age. I suppose we could have the task periodically check whether there is an active spa with that guid and exit the task. I'm not sure how common it is to export and then re-import right away on the same host. Before this change, you would have to wait until the export finished flushing the ARC buffers. The ARC teardown for a spa can take multiple minutes, so you could even have a second pool export+import while the first […]
Aside from it being weird, I worry that, even if unlikely, it is not impossible for the pool to be changed while being exported, while the ARC still holds the old data.
I share this concern. Would it make sense to block importing the same pool again until eviction is complete?
Per @amotin -- add a zpool wait for ARC teardown |
To address concerns of re-importing after the pool was changed on another host: Save txg at export and compare to imported txg
Thinking more about it: since ARC is indexed on DVA+birth, in case of pool export/import, if some blocks appear stale they should just never be used, unless we actually import some other pool somehow having the same GUID, or import the pool at an earlier TXG. We already have a somewhat similar problem with persistent L2ARC, when we load into ARC headers blocks that could be long freed from the pool, and they stay in ARC until L2ARC rotates and evicts them. But in the case of L2ARC we at least know that those stale blocks are from this pool and just won't be used again. I am not sure whether multiple pools with the same GUID is a realistic scenario, but importing the pool at an earlier TXG I think may be more realistic, and dangerous at the same time, since the same TXG numbers may be reused.
Force-pushed from f8d5651 to 03db4c3
Update on re-import while ARC tear-down is in progress: ever since the commit that added `spa_load_guid`, a re-imported pool gets a fresh load guid. So we don't need to worry about ARC blocks that are still around when the pool is re-imported, since it will identify its blocks using a different `spa_load_guid`.
@don-brady I was wondering when you were going to remember the behaviour of […]
Have you looked at having `arc_evict` prioritize evicting from flushed pools vs. trying to evict buffers from active pools?
I hadn't considered it, other than whether it was safe. Both […] We can have multiple threads […]
(a) One option would be to have […]
(b) Another option would be to keep a list of active flush spa guids and have the […]
Force-pushed from f1c4323 to 3d835dd
Rebased to fix merge conflicts. The eviction thread now considers active spa flushes and targets those spas (if any).
Force-pushed from 3d835dd to 1b4e0ff
Removed last commit and rebased to latest master branch.
This also includes removing L2 vdevs asynchronously.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
Force-pushed from 1b4e0ff to 17ea221
Rebased to latest master and switched to use […]
Force-pushed from 17ea221 to 12bd860
Fixed a leak in […]
The zpool reguid feature introduced the spa_load_guid, which is a transient value used for runtime identification purposes in the ARC. This value is not the same as the spa's persistent pool guid. However, the value is seeded from spa_generate_load_guid(), which does not check for uniqueness against the spa_load_guid from other pools. Although extremely rare, you can end up with two different pools sharing the same spa_load_guid value! This change guarantees that the value is always unique and, additionally, not still in use by an async arc flush task.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
Force-pushed from 12bd860 to 9805730
Thank you for this PR, the L2ARC part looks good to me in a first pass.
```c
/*
 * Note that arc_flush_task() needs arc_async_flush_lock to remove af
 * list node. So by holding the lock we avoid a race for af removal
 * with our use here.
 */
mutex_enter(&arc_async_flush_lock);
taskq_dispatch_ent(arc_flush_taskq, arc_flush_task,
    af, TQ_SLEEP, &af->af_tqent);
mutex_exit(&arc_async_flush_lock);
```
I am not getting why we would need this lock here and in `l2arc_remove_vdev()`. IMO it does not protect anything. `arc_flush_task()` takes it by itself as part of `arc_async_flush_remove()`, and anyway it runs asynchronously, so locking here should do nothing to it. I wonder if it is supposed to protect from `arc_async_flush_remove()` being called by something else, but in that case it is just insufficient.
```c
arc_flush_taskq = taskq_create("arc_flush", 75, defclsyspri,
    1, INT_MAX, TASKQ_DYNAMIC | TASKQ_THREADS_CPU_PCT);
```
Do we really need 75% of CPU cores here? The system may have more important things to do, and even idle threads may have a cost, if only an administrative one, by polluting the list of threads in case of debugging.
```c
/* The asize in the header is only used by L2 cache */
#define	HDR_SET_ASIZE(hdr, x) do { \
	ASSERT(IS_P2ALIGNED((x), 1U << SPA_MINBLOCKSHIFT)); \
	(hdr)->b_asize = ((x) >> SPA_MINBLOCKSHIFT); \
} while (0)

#define	HDR_GET_LSIZE(hdr)	((hdr)->b_lsize << SPA_MINBLOCKSHIFT)
#define	HDR_GET_PSIZE(hdr)	((hdr)->b_psize << SPA_MINBLOCKSHIFT)
#define	HDR_GET_ASIZE(hdr)	((hdr)->b_asize << SPA_MINBLOCKSHIFT)
```
I find naming it `ASIZE` confusing (both here and in the structure), since it is specific only to L2ARC. We cannot have a non-L2ARC asize here, since multiple block copies on different vdevs might have different asizes; it only makes sense here because we do not allow multiple L2ARC copies. I'm thinking about something more specific, like `L2SIZE` and `b_l2size`. It would be nice for it to reside in `l2arc_buf_hdr_t`, but it is not worth the memory waste.
```c
hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP);
hdr->b_birth = birth;
hdr->b_type = type;
hdr->b_flags = 0;
arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
HDR_SET_LSIZE(hdr, size);
HDR_SET_PSIZE(hdr, psize);
HDR_SET_ASIZE(hdr, vdev_psize_to_asize(dev->l2ad_vdev, psize));
```
Not critical, but I wonder if it would be cleaner to pass l2size as an argument from caller among the other sizes. We already have it there and you even assert that they are equal.
Motivation and Context
When a pool is exported, the ARC buffers in use by that pool are flushed (evicted) as part of the export. In addition, any L2 VDEVs are removed from the L2 ARC. Both of these operations are performed sequentially and inline to the export. For larger ARC footprints, this can represent a significant amount of time. In an HA scenario, this can cause a planned failover to take longer than needed and risk timeouts on the services backed by the pool data.
Description
The teardown of the ARC data used by the pool can be done asynchronously during a pool export. For the main ARC data, the spa load GUID is used to associate a buffer with the spa, so we can safely free the `spa_t` while the teardown proceeds in the background. For the L2 ARC VDEV, the device `l2arc_dev_t` has a copy of the `vdev_t` pointer, which was being used to calculate the asize of the buffer from the psize when updating the L2 ARC stats during the teardown. This asize value can be captured when the buffer is created, thereby eliminating the need for a late-binding asize calculation using the VDEV during teardown.

Added an `arc_flush_taskq` for these background teardown tasks. In `arc_fini()` (e.g., during module unload) it now waits for any teardown tasks to complete.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
How Has This Been Tested?
Ran the ZTS test groups `l2arc`, `zpool_import`, and `zpool_export` to exercise the changed code paths, as well as `ztest`.