During pool export flush the ARC asynchronously #16215

Open
wants to merge 2 commits into master

Conversation

@don-brady (Contributor) commented May 21, 2024

Motivation and Context

When a pool is exported, the ARC buffers in use by that pool are flushed (evicted) as part of the export. In addition, any L2ARC vdevs are removed from the L2ARC. Both of these operations are performed sequentially and synchronously as part of the export. For larger ARC footprints this can represent a significant amount of time. In an HA scenario, it can make a planned failover take longer than necessary and risk timeouts on the services backed by the pool's data.

Description

The teardown of the ARC data used by a pool can be done asynchronously during pool export. For the main ARC data, the spa load GUID is used to associate a buffer with its spa, so the spa_t can be safely freed while the teardown proceeds in the background. For the L2ARC, the l2arc_dev_t holds a copy of the vdev_t pointer, which was used to calculate a buffer's asize from its psize when updating the L2ARC stats during teardown. This asize value can instead be captured when the buffer is created, eliminating the need for a late-binding asize calculation through the vdev during teardown.

Added an arc_flush_taskq for these background teardown tasks. arc_fini() (e.g., during module unload) now waits for any outstanding teardown tasks to complete.
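
A minimal sketch of that arc_fini() ordering, assuming the taskq is the arc_flush_taskq added by this PR; the surrounding teardown is elided:

/* Sketch only: simplified view of the arc_fini() ordering described above. */
void
arc_fini(void)
{
	/* ... existing ARC teardown work elided ... */

	/*
	 * Wait for any asynchronous flush tasks dispatched by a prior
	 * pool export to finish before tearing down the ARC state they
	 * may still be walking.
	 */
	taskq_wait(arc_flush_taskq);
	taskq_destroy(arc_flush_taskq);

	/* ... remaining teardown (kstats, hash table, etc.) elided ... */
}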

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

  1. Manual testing with ARC and multiple L2ARC devices, up to about 24 GB of ARC data and ~100 GB of L2ARC capacity. The pool export went from about 45 seconds down to 5 seconds with the asynchronous teardown in place.
  2. Manually tested exporting while an L2ARC rebuild was still in progress. The L2 vdev waits for the rebuild to be canceled before proceeding with the teardown.
  3. Ran various ZTS test suites, such as l2arc, zpool_import, and zpool_export, to exercise the changed code paths.
  4. Ran ztest.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@don-brady added the Status: Code Review Needed (Ready for review and testing) label May 22, 2024
@behlendorf self-assigned this May 25, 2024
@don-brady (Contributor, Author):

@gamanakis -- if you have time could you look at the L2 part of this change? Thanks

@gamanakis (Contributor):

Thanks for including me on this; on a first pass it looks good.

@amotin (Member) left a comment

Looks interesting, but what happens if the pool (obviously with the same GUID) is re-imported while the async flush is still running?

Comment on lines +9438 to +9611
if ((dev = l2arc_dev_get_next()) == NULL ||
    dev->l2ad_spa == NULL) {
@amotin (Member):

Could you please explain the l2ad_spa and l2ad_vdev locking semantics, and, as part of that, how l2ad_spa can be NULL here if we assume the locking is correct, or how places like arc_hdr_l2hdr_destroy() won't hit a NULL dereference due to a race?

@don-brady (Contributor, Author):

Note that the above dev->l2ad_spa == NULL check was defensive (a leftover from development) but is not necessary, since l2arc_dev_invalid() already checks under locks whether the spa was removed for a device. l2arc_dev_get_next() will never return with a NULL dev->l2ad_spa.

As far as locking is concerned:

  • The spa_namespace_lock covers a transition of spa->spa_is_exporting and the removal of a spa_t.
  • The spa config SCL_L2ARC lock protects a vdev/spa from being removed while in use.
  • The l2arc_dev_mtx protects the L2 device list and an L2 device's l2ad_spa and l2ad_vdev fields.

l2arc_dev_get_next() hands out L2 devices and returns with the spa config SCL_L2ARC lock held. There are two possible spa exceptions that it checks for:

  • spa is being removed (dev->l2ad_spa->spa_is_exporting == B_TRUE) -- protected by spa_namespace_lock
  • spa was removed (dev->l2ad_spa == NULL) -- protected by l2arc_dev_mtx

This means that when an L2 device is being removed, both the l2arc_dev_mtx and a writer spa config SCL_L2ARC lock should be held to prevent any races. Note that the latter is currently not true and will be remedied.

Also, arc_hdr_l2hdr_destroy() is protected by the dev->l2ad_mtx lock. The l2ad_vdev can be NULL after a pool is exported while an async ARC removal is still in progress.

As indicated above, there is a race that I now see in l2arc_remove_vdev(). It was holding the l2arc_dev_mtx to transition dev->l2ad_spa to NULL, but it wasn't taking the spa config SCL_L2ARC lock (as writer) to let any in-flight users of devices handed out by a past l2arc_dev_get_next() drain. A sketch of the fix follows.
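
A rough sketch of that remedy, assuming the existing l2arc_dev_mtx, l2arc_dev_list, and l2arc_vdev_get() helpers from arc.c; the ordering and error handling are heavily simplified and this is not the exact code in the PR:

/*
 * Sketch only: take SCL_L2ARC as writer before tearing the device off its spa.
 */
static void
l2arc_remove_vdev(vdev_t *vd)
{
	l2arc_dev_t *remdev = l2arc_vdev_get(vd);
	spa_t *spa = remdev->l2ad_spa;

	/*
	 * Drain any in-flight user handed out by a past l2arc_dev_get_next(),
	 * which returns holding SCL_L2ARC as reader.
	 */
	spa_config_enter(spa, SCL_L2ARC, FTAG, RW_WRITER);

	mutex_enter(&l2arc_dev_mtx);
	list_remove(l2arc_dev_list, remdev);
	remdev->l2ad_spa = NULL;	/* device no longer belongs to a spa */
	remdev->l2ad_vdev = NULL;
	mutex_exit(&l2arc_dev_mtx);

	spa_config_exit(spa, SCL_L2ARC, FTAG);

	/* The actual buffer teardown can now proceed asynchronously. */
}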

@amotin (Member):

Note the above dev->l2ad_spa == NULL check was defensive (part of development) but is not necessary

I don't like unnecessary defensive checks like that. They complicate code reading. Impossible things should be asserted.

@don-brady (Contributor, Author):

Looks interesting, but what happens if the pool (obviously with the same GUID) is re-imported while the async flush is still running?

Good question, and I tested this case. The async flush will continue its best-effort pass to flush buffers associated with the exported spa's guid. Any ARC buffers that the import uses will have a positive ref count and be skipped by the async flush task. You can think of it as an alternate ARC evict thread that targets buffers with a specific guid and a zero ref count, rather than evicting by age.

I suppose we could have the task periodically check if there is an active spa with that guid and exit the task. I'm not sure how common it is to export and then re-import right away on the same host. Before this change, you would have to wait until the export finished flushing the ARC buffers.

The ARC teardown for a spa can take multiple minutes, so you could even have a second pool export+import while the first arc_flush_task() is still running and end up with two arc_flush_task() instances both looking to evict candidates for the same guid. This is not fatal, but it is a little weird.
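
For illustration, a rough sketch of what such a guid-targeted flush task looks like; the structure fields and the arc_flush_by_guid() helper are illustrative names, not necessarily those used in the PR:

/*
 * Sketch only: the async flush is essentially a targeted eviction pass.
 */
typedef struct arc_async_flush {
	uint64_t	af_spa_guid;	/* spa_load_guid captured at export */
	taskq_ent_t	af_tqent;	/* pre-allocated taskq entry */
	list_node_t	af_node;	/* entry in the in-flight flush list */
} arc_async_flush_t;

static void
arc_flush_task(void *arg)
{
	arc_async_flush_t *af = arg;

	/*
	 * Evict every unreferenced buffer tagged with this load guid.
	 * Buffers re-referenced by a concurrent import hold a positive
	 * refcount and are simply skipped, as described above.
	 */
	arc_flush_by_guid(af->af_spa_guid);	/* illustrative helper */

	/* Remove the tracking entry under arc_async_flush_lock (elided). */
	kmem_free(af, sizeof (*af));
}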

@amotin (Member) commented Jun 12, 2024

Aside from it being weird, I worry that, even if unlikely, it is not impossible for the pool to be changed while exported, while the ARC still holds the old data.

@allanjude (Contributor):

Aside from it being weird, I worry that, even if unlikely, it is not impossible for the pool to be changed while exported, while the ARC still holds the old data.

I share this concern. Would it make sense to block importing the same pool again until eviction is complete?

@don-brady (Contributor, Author):

Per @amotin -- add a zpool wait for ARC teardown

@don-brady (Contributor, Author):

To address concerns of re-importing after the pool was changed on another host:

Save the txg at export and compare it to the txg at import:

  • If the same, cancel the teardown (the ARC data is still valid)
  • If different, force the import to wait for the teardown to complete (the ARC data can be stale)

@amotin (Member) commented Jun 18, 2024

Thinking more about it: since the ARC is indexed by DVA+birth, in the case of pool export/import, if some blocks appear stale they should simply never be used, unless we actually import some other pool that somehow has the same GUID, or import the pool at an earlier TXG. We already have a somewhat similar problem with persistent L2ARC, where we load ARC headers for blocks that may have long since been freed from the pool, and they stay in the ARC until the L2ARC rotates and evicts them. But in the case of L2ARC we at least know that those stale blocks are from this pool and just won't be used again. I am not sure whether multiple pools with the same GUID is a realistic scenario, but importing the pool at an earlier TXG may be more realistic, and dangerous at the same time, since the same TXG numbers may be reused.

@don-brady (Contributor, Author):

Update on re-import while an ARC teardown is in progress:

Ever since the commit that added the zpool reguid feature, the ARC has used the spa_load_guid, not the spa's actual guid, for identification. This load guid is transient, not persistent, and will change at each import. So after an import, any blocks left in the ARC with the old load guid are orphaned and not associated with any spa.

So we don't need to worry about ARC blocks that are still around when the pool is re-imported, since the re-imported pool will identify its blocks using a different spa_load_guid.

@richardelling (Contributor):

@don-brady I was wondering when you were going to remember the behaviour of spa_load_guid

@grwilson (Member) left a comment

Have you looked at having arc_evict prioritize evicting from flushed pools vs trying to evict buffers from active pools?

@don-brady (Contributor, Author):

Have you looked at having arc_evict prioritize evicting from flushed pools vs trying to evict buffers from active pools?
@grwilson

I hadn't considered it beyond whether it would be safe. Both arc_evict() and arc_flush() sit on top of arc_evict_state(). In the flush case it targets a specific spa (guid), and in the evict case targeting is ignored (i.e., guid == 0). So arc_evict_state() is either targeting a specific guid or none at all.

We can have multiple threads (arc_evict() and multiple arc_flush_async()) all running at the same time. And the underlying arc_evict_state() uses a randomly selected sublist, so they will likely be working on different buffers.

(a) One option would be to have the arc_evict() thread back off when there are any active ARC flushes, giving those flush-initiated evictions priority, and maybe have the last ARC flush wake up the arc_evict thread.

(b) Another option would be to keep a list of active flush spa guids and have the arc_evict() thread only match on guids from the list if it is not empty -- so it ends up only targeting buffers that need to be flushed. A rough sketch of this option follows.
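
Reusing the illustrative arc_async_flush_t from the earlier sketch, option (b) might look roughly like this; arc_async_flushes is an illustrative name for the in-flight list, while arc_async_flush_lock does appear in the PR:

/*
 * Sketch only: if any async flushes are in flight, restrict eviction
 * to buffers tagged with one of their load guids; otherwise evict
 * with no guid targeting (guid == 0).
 */
static uint64_t
arc_evict_pick_guid(void)
{
	arc_async_flush_t *af;
	uint64_t guid = 0;		/* 0 means no targeting */

	mutex_enter(&arc_async_flush_lock);
	af = list_head(&arc_async_flushes);
	if (af != NULL)
		guid = af->af_spa_guid;	/* prefer a pool being flushed */
	mutex_exit(&arc_async_flush_lock);

	return (guid);
}

The returned guid would then be handed down to arc_evict_state() in place of the usual 0.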

@don-brady (Contributor, Author):

Rebased to fix merge conflicts.

The eviction thread now considers active spa flushes and targets those spas (if any).

@don-brady (Contributor, Author):

Removed last commit and rebased to latest master branch.

This also includes removing L2 vdevs asynchronously

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Don Brady <[email protected]>
@don-brady (Contributor, Author):

Rebased to latest master and switched to use taskq_dispatch_ent().

@don-brady (Contributor, Author):

Fixed a leak in arc_async_flush_list() found during module unload by ztest.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value!

This change guarantees that the value is always unique and
additionally not still in use by an async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Don Brady <[email protected]>
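
A rough sketch of the uniqueness guarantee that commit message describes; spa_generate_load_guid() is the seed function it names, while the two collision checks are illustrative helper names:

/*
 * Sketch only: regenerate the candidate load guid until it collides
 * neither with an imported pool nor with a pending async ARC flush.
 */
static uint64_t
spa_load_guid_unique(void)
{
	uint64_t guid;

	do {
		guid = spa_generate_load_guid();
	} while (spa_load_guid_in_use(guid) ||		/* another imported pool */
	    arc_async_flush_guid_inuse(guid));		/* pending ARC teardown */

	return (guid);
}
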
@gamanakis (Contributor):

Thank you for this PR; the L2ARC part looks good to me on a first pass.

Comment on lines +4508 to +4516
/*
 * Note that arc_flush_task() needs arc_async_flush_lock to remove af
 * list node. So by holding the lock we avoid a race for af removal
 * with our use here.
 */
mutex_enter(&arc_async_flush_lock);
taskq_dispatch_ent(arc_flush_taskq, arc_flush_task,
    af, TQ_SLEEP, &af->af_tqent);
mutex_exit(&arc_async_flush_lock);
@amotin (Member):

I am not getting why we would need this lock here and in l2arc_remove_vdev(). IMO it does not protect anything. arc_flush_task() takes it by itself as part of arc_async_flush_remove(), and anyway it runs asynchronously, so locking here should do nothing for it. I wonder if it is supposed to protect from arc_async_flush_remove() being called by something else, but in that case it is just insufficient.

Comment on lines +7878 to +7879
arc_flush_taskq = taskq_create("arc_flush", 75, defclsyspri,
    1, INT_MAX, TASKQ_DYNAMIC | TASKQ_THREADS_CPU_PCT);
@amotin (Member):

Do we really need 75% of the CPU cores here? The system may have more important things to do, and even idle threads have a cost, if only an administrative one, by polluting the list of threads when debugging.
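
For reference, with TASKQ_THREADS_CPU_PCT the 75 above means 75% of the CPUs rather than 75 threads. A more conservative creation along the lines the reviewer seems to suggest might look like the following; the fixed thread count of 2 is purely an assumption for illustration:

/* Sketch only: a small fixed-size taskq instead of 75% of the CPUs. */
arc_flush_taskq = taskq_create("arc_flush", 2, defclsyspri,
    1, INT_MAX, TASKQ_DYNAMIC);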

Comment on lines +66 to +74
/* The asize in the header is only used by L2 cache */
#define	HDR_SET_ASIZE(hdr, x) do { \
	ASSERT(IS_P2ALIGNED((x), 1U << SPA_MINBLOCKSHIFT)); \
	(hdr)->b_asize = ((x) >> SPA_MINBLOCKSHIFT); \
} while (0)

#define HDR_GET_LSIZE(hdr) ((hdr)->b_lsize << SPA_MINBLOCKSHIFT)
#define HDR_GET_PSIZE(hdr) ((hdr)->b_psize << SPA_MINBLOCKSHIFT)
#define HDR_GET_ASIZE(hdr) ((hdr)->b_asize << SPA_MINBLOCKSHIFT)
@amotin (Member):

I find it confusing to name it ASIZE (both here and in the structure), since it is specific to L2ARC only. We cannot have a non-L2ARC asize here, since with multiple block copies on different vdevs they might have different asizes. It only makes sense here because we do not allow multiple L2ARC copies. I'm thinking of something more specific, like L2SIZE and b_l2size. It would be nice for it to reside in l2arc_buf_hdr_t, but it is not worth the memory waste.

hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP);
hdr->b_birth = birth;
hdr->b_type = type;
hdr->b_flags = 0;
arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
HDR_SET_LSIZE(hdr, size);
HDR_SET_PSIZE(hdr, psize);
HDR_SET_ASIZE(hdr, vdev_psize_to_asize(dev->l2ad_vdev, psize));
@amotin (Member):

Not critical, but I wonder if it would be cleaner to pass the l2size as an argument from the caller, along with the other sizes. We already have it there, and you even assert that they are equal.

@amotin added the Status: Revision Needed (Changes are required for the PR to be accepted) label Nov 6, 2024