Shared L2ARC - Proof of Concept #14060
base: master
Conversation
(I will give a talk on this PoC at the OpenZFS Developer Summit 2022.)

The ARC dynamically shares DRAM capacity among all currently imported zpools. However, the L2ARC does not do the same for block capacity: the L2ARC vdevs of one zpool only cache buffers of that zpool. This can be undesirable on systems that host multiple zpools because it prevents the cache device capacity from being shared dynamically. Such sharing is desirable if the need for L2ARC varies among zpools over time, or if the set of zpools imported on the system varies over time.

Shared L2ARC addresses this need by decoupling the L2ARC vdevs from the zpools that store the actual data. The mechanism is to place the L2ARC vdevs into a special zpool, and to adjust the L2ARC feed thread logic to use that special zpool's L2ARC vdevs for all zpools' buffers.

High-level changes:

* Reserve "NTNX-fsvm-local-l2arc" as a magic zpool name. We call this "the l2arc pool". All other pools are called "primary pools".
* Make the l2arc feed thread feed ARC buffers from any zpool to the l2arc zpool. (Before this patch, the l2arc feed thread would only feed ARC buffers to L2ARC devices that belong to the same spa_t.)
* Change the locking to ensure that the l2arc zpool cannot be removed while there are ongoing reads initiated by arc_read on one of the primary pools.

This is sufficient and retains correctness of the ARC because nothing about the fundamental operation of L2ARC changes. The only thing that changes is that the L2ARC data is stored on vdevs outside the primary pool.

Proof Of Concept => Production
==============================

This commit is a proof of concept. It works, it results in the desired performance improvement, and it is stable. But to make it production ready, more work needs to be done.

(1) The design is based on a version of ZFS that supports neither encryption nor Persistent L2ARC. I'm no expert in either of these features. Encryption might work just fine as long as the l2arc feed thread can access the encryption keys for l2arc_apply_transforms. But Persistent L2ARC definitely needs more design work (multiple L2ARC headers?).

(2) Remove the hard-coded magic name; use a property instead. Make it opt-in so that existing setups are not affected. Example:

    zpool create -o share_l2arc_vdevs=on my-l2arc-pool

(3) Coexistence with non-shared L2ARC; also via a property. Make it opt-in so that existing setups are not affected. Example:

    zpool set use_shared_l2arc=on my-data-pool

Signed-off-by: Christian Schwarz <[email protected]>
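To make the two core ideas in the description concrete, here is a minimal, self-contained model. It is not the actual patch: `model_spa_t`, `model_arc_buf_t`, `model_l2arc_feed()`, and the mutex/counter pair are hypothetical, simplified stand-ins for `spa_t`, the ARC buffer headers, the L2ARC feed/write path, and the locking change the description refers to. The sketch only illustrates that (a) the feed path always targets the cache vdevs of the designated l2arc pool regardless of which pool a buffer belongs to, and (b) a read-side count keeps the l2arc pool from being removed while a primary pool's `arc_read()` is still using it.

```c
/* Simplified standalone model of the shared-L2ARC idea; NOT OpenZFS code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define	SHARED_L2ARC_POOL_NAME	"NTNX-fsvm-local-l2arc"	/* magic name from the PoC */

typedef struct model_spa {
	const char	*spa_name;
	bool		 spa_has_cache_vdevs;	/* the l2arc pool owns the cache vdevs */
} model_spa_t;

typedef struct model_arc_buf {
	model_spa_t	*b_origin;	/* primary pool the buffer belongs to */
	const char	*b_data;
} model_arc_buf_t;

/* The single shared l2arc pool, guarded by a lock and an in-flight read count. */
static model_spa_t	*shared_l2arc_pool;
static int		 shared_l2arc_readers;
static pthread_mutex_t	 shared_l2arc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Feed step: the target is always the shared l2arc pool, not the buffer's own pool. */
static void
model_l2arc_feed(const model_arc_buf_t *buf)
{
	pthread_mutex_lock(&shared_l2arc_lock);
	if (shared_l2arc_pool != NULL) {
		printf("feed: caching buffer of pool '%s' on cache vdevs of '%s'\n",
		    buf->b_origin->spa_name, shared_l2arc_pool->spa_name);
	}
	pthread_mutex_unlock(&shared_l2arc_lock);
}

/* Read side: take a reference so the l2arc pool cannot go away mid-read. */
static bool
model_l2arc_read_enter(void)
{
	pthread_mutex_lock(&shared_l2arc_lock);
	bool ok = (shared_l2arc_pool != NULL);
	if (ok)
		shared_l2arc_readers++;
	pthread_mutex_unlock(&shared_l2arc_lock);
	return (ok);
}

static void
model_l2arc_read_exit(void)
{
	pthread_mutex_lock(&shared_l2arc_lock);
	shared_l2arc_readers--;
	pthread_mutex_unlock(&shared_l2arc_lock);
}

/* Removal/export of the l2arc pool must wait until no reads are in flight. */
static bool
model_l2arc_pool_try_remove(void)
{
	pthread_mutex_lock(&shared_l2arc_lock);
	bool ok = (shared_l2arc_readers == 0);
	if (ok)
		shared_l2arc_pool = NULL;
	pthread_mutex_unlock(&shared_l2arc_lock);
	return (ok);
}

int
main(void)
{
	model_spa_t l2pool = { SHARED_L2ARC_POOL_NAME, true };
	model_spa_t tank = { "tank", false };
	model_arc_buf_t buf = { &tank, "hello" };

	shared_l2arc_pool = &l2pool;
	model_l2arc_feed(&buf);		/* buffer from 'tank' lands on the l2arc pool */

	if (model_l2arc_read_enter()) {
		/* A concurrent removal attempt must fail while a read is in flight. */
		printf("removal during read allowed? %s\n",
		    model_l2arc_pool_try_remove() ? "yes" : "no");
		model_l2arc_read_exit();
	}
	printf("removal after read completed allowed? %s\n",
	    model_l2arc_pool_try_remove() ? "yes" : "no");
	return (0);
}
```

In the real patch the equivalent of this guard is done with the existing SPA locking machinery rather than a plain mutex; the model only shows the invariant being protected.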
Nice idea. Maybe @gamanakis or @Ornias1993 want to take a look at the high-level design, and especially at the persistent L2ARC problem?! Thanks in advance to all participants.
include/libzfs.h (outdated)
@@ -419,6 +419,11 @@ typedef enum {
 	ZPOOL_STATUS_NON_NATIVE_ASHIFT,	/* (e.g. 512e dev with ashift of 9) */
 	ZPOOL_STATUS_COMPATIBILITY_ERR,	/* bad 'compatibility' property */
 	ZPOOL_STATUS_INCOMPATIBLE_FEAT,	/* feature set outside compatibility */
+	/*
+	 * Pool won't use the given L2ARC because this software version uses
+	 * the Nutanix shared L2ARC.
yeet branding:

-	 * the Nutanix shared L2ARC.
+	 * the shared L2ARC.
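For context on how a new `zpool_status_t` entry like the one in this hunk eventually reaches users, here is a hedged userland sketch using public libzfs calls (`libzfs_init`, `zpool_open`, `zpool_get_status`). The enum constant this hunk introduces is cut off above, so the sketch does not name it and only prints the raw status value; the new entry would simply be one more value to match on here. Link against libzfs to build.

```c
/* Sketch: querying a pool's status through libzfs; assumes the public API only. */
#include <stdio.h>
#include <libzfs.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pool>\n", argv[0]);
		return (1);
	}

	libzfs_handle_t *hdl = libzfs_init();
	if (hdl == NULL)
		return (1);

	zpool_handle_t *zhp = zpool_open(hdl, argv[1]);
	if (zhp == NULL) {
		libzfs_fini(hdl);
		return (1);
	}

	char *msgid = NULL;
	zpool_errata_t errata;
	zpool_status_t status = zpool_get_status(zhp, &msgid, &errata);

	/* ZPOOL_STATUS_OK means "no noteworthy condition" to report. */
	printf("%s: status=%d (%s)\n", argv[1], (int)status,
	    status == ZPOOL_STATUS_OK ? "ok" : "see 'zpool status'");

	zpool_close(zhp);
	libzfs_fini(hdl);
	return (0);
}
```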
L2ARC being per-pool has been plaguing the viability of multi-pool deployments (for example: a fast and a slow pool) for a while. Even when using multiple SSDs for L2ARC, it would make more sense to have them striped instead of each serving a different pool.
In the abstract: I like the simplicity of the design.
Though we do need to add/adapt a BUNCH LOAD of tests, because we need to be 300% sure that all edge cases are tested against. But at <300 lines of code currently, this would be an amazing benefit to the project :)
It's also important to thoroughly test this with weirder setups like dedup, metadata vdevs, L2ARC being defined as "metadata" only, etc. Though I do not expect big issues with this.
While at it, though I think it's an extremely niche case, it might be prudent to allow multiple shared-L2ARC groups as well.
Though I do want to highlight that we should get rid of all the brand references. For the following review and discussion, it might be nice to do so sooner rather than later ;-)
Now the only reference left is the special pool name. That whole concept is going to be replaced by zpool properties in the future.
Is this PR dead?
Sorry for the late reply. I currently have no plans to pursue this PR any further. That being said, I think the idea still stands, and it's inevitable for the type of cloud ZFS setups illustrated in my dev summit talk and also in @pcd1193182's talk on the shared log pool: EBS-like network disks for bulk storage, local NVMe for acceleration. Note that similar efforts are underway for the ZIL (shared log pool).
I gave a talk on this PoC at the OpenZFS Developer Summit 2022: Wiki, Slides, Recording