During pool export flush the ARC asynchronously #16215
base: master
Conversation
@gamanakis -- if you have time could you look at the L2 part of this change? Thanks
Thanks for including me on this; on a first pass it looks good.
Looks interesting, but what happens if the pool (obviously with the same GUID) is re-imported while the async flush is still running?
```c
if ((dev = l2arc_dev_get_next()) == NULL ||
    dev->l2ad_spa == NULL) {
```
Could you please explain the `l2ad_spa` and `l2ad_vdev` locking semantics, and as part of that, how `l2ad_spa` can be NULL here if we assume the locking is correct, or how places like `arc_hdr_l2hdr_destroy()` avoid a NULL dereference due to a race?
Note the above `dev->l2ad_spa == NULL` check was defensive (part of development) but is not necessary, since `l2arc_dev_invalid()` already checks under locks whether the spa was removed for a device. `l2arc_dev_get_next()` will never return with a NULL `dev->l2ad_spa`.

As far as locking is concerned:
- The `spa_namespace_lock` covers a transition of `spa->spa_is_exporting` and the removal of a `spa_t`.
- The spa config `SCL_L2ARC` lock protects a vdev/spa from being removed while in use.
- The `l2arc_dev_mtx` protects the L2 device list and an L2 device's `l2ad_spa` and `l2ad_vdev` fields.

`l2arc_dev_get_next()` hands out L2 devices and returns with the spa config `SCL_L2ARC` lock held. There are two possible spa exceptions that `l2arc_dev_get_next()` checks for:
- spa is being removed (`dev->l2ad_spa->spa_is_exporting == B_TRUE`) -- protected by `spa_namespace_lock`
- spa was removed (`dev->l2ad_spa == NULL`) -- protected by `l2arc_dev_mtx`

This means that when an L2 device is being removed, both the `l2arc_dev_mtx` and a writer spa config `SCL_L2ARC` lock should be held to prevent any races. Note that the latter is currently not true and will be remedied.

Also, `arc_hdr_l2hdr_destroy()` is protected by the `dev->l2ad_mtx` lock. The `l2ad_vdev` can be NULL after a pool is exported while an async ARC removal is still in progress.

As indicated above, there is a race that I now see in `l2arc_remove_vdev()`. It was holding the `l2arc_dev_mtx` to transition `dev->l2ad_spa` to NULL, but it wasn't taking the spa config `SCL_L2ARC` lock (as writer) to let any in-flight devices from a past `l2arc_dev_get_next()` drain.
> Note the above `dev->l2ad_spa == NULL` check was defensive (part of development) but is not necessary
I don't like unnecessary defensive checks like that. They complicate code reading. Impossible things should be asserted.
Good question, and I tested this case. The async flush will continue its best-effort pass to flush buffers associated with the exported spa's guid. Any ARC buffers that the import uses will have a positive ref count and be skipped by the async flush task. You can think of it as an alternate ARC evict thread that targets a specific guid with a zero ref count, rather than age. I suppose we could have the task periodically check whether there is an active spa with that guid and exit the task. I'm not sure how common it is to export and then re-import right away on the same host. Before this change, you would have to wait until the export finished flushing the ARC buffers. The ARC teardown for a spa can take multiple minutes, so you could even have a second pool export+import while the first […]
Aside from it being weird, I worry that, even if unlikely, it is not impossible for the pool to be changed while being exported, while the ARC still holds the old data.
I share this concern. Would it make sense to block importing the same pool again until eviction is complete?
Per @amotin -- add a zpool wait for ARC teardown |
To address concerns of re-importing after the pool was changed on another host: Save txg at export and compare to imported txg
Thinking more about it: since ARC is indexed on DVA+birth, in case of pool export/import, if some blocks appear stale they should just never be used, unless we actually import some other pool somehow having the same GUID, or import the pool at an earlier TXG. We already have a somewhat similar problem with persistent L2ARC, when we load into ARC headers blocks that could be long freed from the pool, and they stay in ARC until L2ARC rotates and evicts them. But in the case of L2ARC we at least know that those stale blocks are from this pool and just won't be used again. I am not sure whether multiple pools with the same GUID is a realistic scenario, but importing the pool at an earlier TXG I think may be more realistic, and dangerous at the same time, since the same TXG numbers may be reused.
Force-pushed from f8d5651 to 03db4c3
Update on re-import while ARC tear-down is in progress: ever since the commit that added `spa_load_guid`, a re-imported pool gets a fresh load guid. So we don't need to worry about ARC blocks that are still around when the pool is re-imported, since it will identify its blocks using a different `spa_load_guid`.
@don-brady I was wondering when you were going to remember the behaviour of […]
Have you looked at having `arc_evict` prioritize evicting from flushed pools vs. trying to evict buffers from active pools?
I hadn't considered it, other than whether it was safe. Both […] We can have multiple threads […]
(a) One option would be to have […]
(b) Another option would be to keep a list of active flush spa guids and have the […]
Force-pushed from f1c4323 to 3d835dd
Rebased to fix merge conflicts. The eviction thread now considers active spa flushes and targets those spas (if any).
Force-pushed from 3d835dd to 1b4e0ff
Removed last commit and rebased to latest master branch.
This also includes removing L2 vdevs asynchronously.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
Force-pushed from 1b4e0ff to 17ea221
Rebased to latest master and switched to use […]
Force-pushed from 17ea221 to 12bd860
Fixed a leak in […]
The zpool reguid feature introduced the spa_load_guid, which is a transient value used for runtime identification purposes in the ARC. This value is not the same as the spa's persistent pool guid. However, the value is seeded from spa_generate_load_guid(), which does not check for uniqueness against the spa_load_guid from other pools. Although extremely rare, you can end up with two different pools sharing the same spa_load_guid value! This change guarantees that the value is always unique and, additionally, not still in use by an async arc flush task.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
Force-pushed from 12bd860 to 9805730
Thank you for this PR, the L2ARC part looks good to me in a first pass.
```c
/*
 * Note that arc_flush_task() needs arc_async_flush_lock to remove af
 * list node. So by holding the lock we avoid a race for af removal
 * with our use here.
 */
mutex_enter(&arc_async_flush_lock);
taskq_dispatch_ent(arc_flush_taskq, arc_flush_task,
    af, TQ_SLEEP, &af->af_tqent);
mutex_exit(&arc_async_flush_lock);
```
I am not getting why we would need this lock here and in `l2arc_remove_vdev()`. IMO it does not protect anything. `arc_flush_task()` takes it by itself as part of `arc_async_flush_remove()`, and anyway it runs asynchronously, so locking here should do nothing to it. I wonder if it is supposed to protect from `arc_async_flush_remove()` being called by something else, but in that case it is just insufficient.
```c
arc_flush_taskq = taskq_create("arc_flush", 75, defclsyspri,
    1, INT_MAX, TASKQ_DYNAMIC | TASKQ_THREADS_CPU_PCT);
```
Do we really need 75% of CPU cores here? The system may have more important things to do, and even idle threads may have a cost, if only an administrative one, by polluting the list of threads in case of debugging.
```c
/* The asize in the header is only used by L2 cache */
#define	HDR_SET_ASIZE(hdr, x) do { \
	ASSERT(IS_P2ALIGNED((x), 1U << SPA_MINBLOCKSHIFT)); \
	(hdr)->b_asize = ((x) >> SPA_MINBLOCKSHIFT); \
} while (0)

#define	HDR_GET_LSIZE(hdr)	((hdr)->b_lsize << SPA_MINBLOCKSHIFT)
#define	HDR_GET_PSIZE(hdr)	((hdr)->b_psize << SPA_MINBLOCKSHIFT)
#define	HDR_GET_ASIZE(hdr)	((hdr)->b_asize << SPA_MINBLOCKSHIFT)
```
I find naming it `ASIZE` confusing (both here and in the structure), since it is specific only to L2ARC. We cannot have a non-L2ARC asize here, since multiple block copies on different vdevs might have different asizes; it only makes sense here because we do not allow multiple L2ARC copies. I'm thinking about something more specific, like `L2SIZE` and `b_l2size`. It would be nice for it to reside in `l2arc_buf_hdr_t`, but it is not worth the memory waste.
```c
hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP);
hdr->b_birth = birth;
hdr->b_type = type;
hdr->b_flags = 0;
arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
HDR_SET_LSIZE(hdr, size);
HDR_SET_PSIZE(hdr, psize);
HDR_SET_ASIZE(hdr, vdev_psize_to_asize(dev->l2ad_vdev, psize));
```
Not critical, but I wonder if it would be cleaner to pass l2size as an argument from caller among the other sizes. We already have it there and you even assert that they are equal.
Motivation and Context
When a pool is exported, the ARC buffers in use by that pool are flushed (evicted) as part of the export. In addition, any L2 VDEVs are removed from the L2 ARC. Both of these operations are performed sequentially and inline to the export. For larger ARC footprints, this can represent a significant amount of time. In an HA scenario, this can cause a planned failover to take longer than needed and risk timeouts on the services backed by the pool data.
Description
The teardown of the ARC data used by the pool can be done asynchronously during a pool export. For the main ARC data, the spa load GUID is used to associate a buffer with the spa, so we can safely free the `spa_t` while the teardown proceeds in the background. For the L2 ARC VDEV, the device `l2arc_dev_t` has a copy of the `vdev_t` pointer, which was being used to calculate the asize of the buffer from the psize when updating the L2 ARC stats during the teardown. This asize value can be captured when the buffer is created, thereby eliminating the need for a late-binding asize calculation using the VDEV during teardown.

Added an `arc_flush_taskq` for these background teardown tasks. In `arc_fini()` (e.g., during module unload) it now waits for any teardown tasks to complete.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
How Has This Been Tested?
Ran the ZTS test groups `l2arc`, `zpool_import`, and `zpool_export` to exercise the changed code paths, as well as `ztest`.