Implement parallel ARC eviction #16486

Open
wants to merge 6 commits into master
Conversation

allanjude (Contributor)

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.

Motivation and Context

Read and write performance can become limited by the arc_evict process being single threaded.
Additional data cannot be added to the ARC until sufficient existing data is evicted.

On many-core systems with TBs of RAM, a single thread becomes a significant bottleneck.

With this change, we see a 25% increase in read and write throughput.

Description

Use a new taskq to run multiple arc_evict() threads at once, each given a fraction of the desired amount of memory to reclaim.
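
A rough sketch of the dispatch pattern this describes (hypothetical names such as evarg and arc_evict_task, with a simple equal split; the PR's actual chunking logic appears in the review comments below):

	/*
	 * Split the eviction target evenly across ntasks workers,
	 * dispatch them on the new taskq, and wait for them all.
	 */
	uint64_t bytes_pertask = bytes / ntasks;

	for (unsigned i = 0; i < ntasks; i++) {
		evarg[i].eva_bytes = bytes_pertask +
		    (i == 0 ? bytes % ntasks : 0);	/* task 0 takes the remainder */
		(void) taskq_dispatch(arc_evict_taskq, arc_evict_task,
		    &evarg[i], TQ_SLEEP);
	}
	taskq_wait(arc_evict_taskq);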

How Has This Been Tested?

Benchmarking with a full ARC to measure the performance difference.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


Comment on lines +4172 to +4185
	/* Number of (1 << MIN_EVICT_PERTASK_SHIFT)-byte chunks left to evict, rounded up. */
	uint64_t nchunks = ((left - 1) >> MIN_EVICT_PERTASK_SHIFT) + 1;
	/* Number of tasks, capped by the number of sublists. */
	unsigned n = nchunks < num_sublists ? nchunks : num_sublists;
	/*
	 * Spread the chunks over the tasks: the first k tasks each
	 * evict bytes_pertask; any remaining tasks evict one chunk
	 * less (bytes_pertask_low).
	 */
	uint64_t fullrows = nchunks / n;
	unsigned lastrowcols = nchunks % n;
	unsigned k = (lastrowcols ? lastrowcols : n);

	uint64_t bytes_pertask_low =
	    fullrows << MIN_EVICT_PERTASK_SHIFT;
	uint64_t bytes_pertask = bytes_pertask_low + (lastrowcols ?
	    (1 << MIN_EVICT_PERTASK_SHIFT) : 0);
Member:

I think you are over-engineering here. I don't think the eviction per task should really be a multiple of 1 << MIN_EVICT_PERTASK_SHIFT to complicate the logic; it merely needs to be bigger than that. So you could just use MIN_EVICT_PERTASK_SHIFT to decide the number of tasks, and then split the eviction amount equally between them.

And I wonder if it would make sense to scale the number of tasks with the eviction amount not linearly but in some logarithmic fashion, to avoid spinning up too many threads at once and stressing the system more for diminishing returns.
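
A minimal sketch of the simpler scheme suggested above (hypothetical names; not code from the PR):

	/*
	 * Derive the task count from the eviction target alone,
	 * then give every task an equal share of it.
	 */
	uint64_t nchunks = (bytes >> MIN_EVICT_PERTASK_SHIFT) + 1;
	unsigned ntasks = MIN(nchunks, num_sublists);
	uint64_t bytes_pertask = bytes / ntasks;	/* one task also takes bytes % ntasks */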

@adamdmoss (Contributor) commented Sep 12, 2024

I've been casually testing this out (combined with the parallel_dbuf_evict PR) over the last couple of weeks (most recently, 5b070d1 ).

I've not been hammering it hard or specifically, just letting it do its thing with my messing-around desktop system.

Hit a probable regression today, though: while mv'ing a meager 8GB of files from one pool to another, all my ZFS I/O became really high-latency, and iotop showed that the copy part of the move (this being a mv across pools, so in reality it's a copy-and-remove) was running at a painful few hundred KB/sec, and the zfs arc_evict thread was taking a whole core... but just one core.

In time it all cleared up, and of course I can't conclusively blame this PR's changes, but I'm left with two fuzzy observations:

  • In many years of mucking around with ZFS I've never(?) seemed to get the 'arc_evict is pegging CPU badly' issue until I started testing this PR's changes (though I'm aware the issue occurs in the wild for folks on master/release ZFSes)
  • arc_evict was only using one core as far as I can tell, so I guess the parallelism which is the point of this PR just wasn't kicking-in for some reason anyway and/or the spinning was happening outside of the parallelized part

@0mp (Contributor) commented Sep 12, 2024

I have updated the patch with a different logic for picking the default maximum number of ARC eviction threads. The new logic aims to pick the number that is one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.

@amotin (Member) commented Sep 12, 2024

one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.

Why would we need two evict threads on a single-core system? In that case I would probably prefer to disable taskqs completely. If that is a way to make it more logarithmic, then I would think about highbit(), though then it would grow pretty slowly for very large systems, so the limit of 16 would never be reached. But I am not exactly sure faster growth would make sense, since it may cause more lock contention in the memory allocator, etc.
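
A sketch of the highbit() idea (hypothetical formula, not from the PR). Since highbit64(x) is floor(log2(x)) + 1, growth is purely logarithmic and the cap of 16 would only be reached at 2^15 CPUs:

	nthreads = MIN(highbit64(max_ncpus), 16);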

@behlendorf added the Status: Code Review Needed label (ready for review and testing) on Sep 13, 2024.
@allanjude (Contributor, Author)

one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.

Why would we need two evict threads on a single-core system? In that case I would probably prefer to disable taskqs completely. If that is a way to make it more logarithmic, then I would think about highbit(), though then it would grow pretty slowly for very large systems, so the limit of 16 would never be reached. But I am not exactly sure faster growth would make sense, since it may cause more lock contention in the memory allocator, etc.

Right now, this is only enabled by a separate tunable that turns on multiple threads. So for the single-CPU case, we don't expect it to be enabled. But for something like 4-12 core systems, we would want it to use at least 2 threads, and then grow from there, reaching 16 threads at 128 cores.

@amotin (Member) commented Sep 16, 2024

Right now, this is only enabled by a separate tunable that turns on multiple threads. So for the single-CPU case, we don't expect it to be enabled.

Now that you mention it, I've noticed it's been disabled by default. I don't like the idea of tuning it manually in production depending on system size. I would prefer to have reasonable automatic defaults.

Read and write performance can become limited by the arc_evict
process being single threaded. Additional data cannot be added
to the ARC until sufficient existing data is evicted.

On many-core systems with TBs of RAM, a single thread becomes
a significant bottleneck.

With this change, we see a 25% increase in read and write throughput.

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Co-authored-by: Allan Jude <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Signed-off-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
@0mp (Contributor) commented Oct 23, 2024

Hey! So, here's what changed in the patch:

Formula

There is now a different formula for automatically scaling the number of evict threads when the parameter is set to 0. The formula is:

MIN(MAX(max_ncpus > 6 ? 2 : 1, ilog2(max_ncpus) + (max_ncpus >> 6)), 16);

It looks like this (the x axis is the CPU count and the y axis is the evict thread count):

[Plot omitted: evict thread count as a function of CPU count]

Here's also a table:

	CPUs                   zfs_arc_evict_threads   Evict threads count   Using taskq?
	1                      0                       1 (autoscaled)        No
	2                      0                       1 (autoscaled)        No
	5                      0                       1 (autoscaled)        No
	6                      0                       2 (autoscaled)        Yes
	1024                   0                       16 (autoscaled)       Yes
	any (no autoscaling)   1                       1                     No
	any (no autoscaling)   32                      32                    Yes
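
As a sketch, the posted formula could be implemented as follows (assuming ilog2() is floor(log2(x)); note that, as the discussion below points out, this already yields 2 threads at 4 CPUs, slightly differing from the table above):

	static uint_t
	arc_evict_threads_autoscale(uint_t max_ncpus)
	{
		uint_t n = ilog2(max_ncpus) + (max_ncpus >> 6);
		uint_t lo = (max_ncpus > 6) ? 2 : 1;

		return (MIN(MAX(lo, n), 16));
	}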

Fewer parameters

zfs_arc_evict_threads is now the only parameter exposed to control the evict thread count. The zfs_arc_evict_threads_parallel parameter has been removed in favor of enabling the use of taskqs whenever there are two or more evict threads.

This approach was suggested by @tonyhutter in another PR (#16487).

Stability improvements

It is no longer possible to modify the actual evict thread count at runtime. Since the evict taskqs are only created during arc_init(), the module saves the number of evict threads it is going to use and ignores subsequent changes to zfs_arc_evict_threads. This behavior has been documented in the manual page.

@amotin (Member) left a comment:

Thanks for automating it. A few comments on that part, and please take a look at my earlier comments.

Comment on lines +7899 to +7900
	    MIN(zfs_arc_evict_threads_live, max_ncpus), defclsyspri,
	    MIN(zfs_arc_evict_threads_live, max_ncpus), max_ncpus,
Member:

Do we need another set of limitations?

Contributor:

Have you got anything in particular in mind?

What I see here is that perhaps it is not right to cap the thread count at the CPU count, which might go against what the user intended when setting the tunable manually.

Member:

Yea. When we calculate zfs_arc_evict_threads_live above -- it should be sane already. If it is submitted by the user -- should we really prevent foot-shooting? And in any case I would move it out of the taskq_create() call, limiting the zfs_arc_evict_threads_live value, or zfs_arc_evict_threads if you remove the duplication as I have proposed below.

Contributor:

I'd agree that the foot-shooting prevention seems unnecessary.

From what I understand:

  • nthreads controls the number of threads in the queue, so we can just set it to zfs_arc_evict_threads_live
  • minalloc controls the minimal size of the task entry cache. I guess it makes sense to set it to at least zfs_arc_evict_threads_live.
  • maxalloc controls the maximum size of the task entries cache. I've grepped the tree for other examples of taskq_create() invocations and it seems like INT_MAX is a fairly popular value here. Do you have any thoughts about it?

The resulting code would look like this:

	arc_evict_taskq = taskq_create("arc_evict",
	    zfs_arc_evict_threads_live, defclsyspri,
	    zfs_arc_evict_threads_live, INT_MAX,
	    TASKQ_PREPOPULATE);

Member:

About nthreads I agree -- zfs_arc_evict_threads_live. minalloc you may generally set to the number of tasks you expect to be scheduled at the same time often enough; that may avoid some memory allocations, especially when we may be low on memory. maxalloc I would set to INT_MAX unless you want the code to actually limit that, which I don't think you do. But since you are pre-allocating the taskq entries as part of evict_arg_t and using taskq_dispatch_ent(), it is actually irrelevant. You may just set the min and max to 0 and INT_MAX respectively.
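
A sketch of the pre-allocation pattern described above (hypothetical field names): each dispatch supplies its own embedded taskq_ent_t, so the taskq never allocates entries itself and minalloc/maxalloc stop mattering:

	typedef struct evict_arg {
		taskq_ent_t	eva_tqent;	/* pre-allocated taskq entry */
		/* ... per-task eviction parameters ... */
	} evict_arg_t;

	taskq_init_ent(&evarg[i].eva_tqent);
	taskq_dispatch_ent(arc_evict_taskq, arc_evict_task,
	    &evarg[i], 0, &evarg[i].eva_tqent);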

module/zfs/arc.c
Comment on lines 7887 to 7888
	zfs_arc_evict_threads_live = MIN(MAX(max_ncpus > 6 ? 2 : 1,
	    arc_ilog2(max_ncpus) + (max_ncpus >> 6)), 16);
Member:

Do we really need the MAX(max_ncpus > 6 ? 2 : 1, ...) part? ilog2(6) should already be 2.

@0mp (Contributor) commented Nov 6, 2024

Good catch! Thanks!

Currently, we get the following thread counts:

	CPU count   Resulting thread count
	1           1
	2           1
	3           1
	4           2
	...         ...

So you are right that the MAX(max_ncpus > 6 ? 2 : 1, ...) part is not working as expected currently. If we want to stick to 1 thread for 4 and 5 CPUs, then we'd need to use the following formula:

// Version 2a
MIN(max_ncpus < 6 ? 1 : arc_ilog2(max_ncpus) + (max_ncpus >> 6), 16);

If we decide to simplify that further and go for 2 threads on systems with 4 or 5 CPUs, then we can just use:

// Version 2b
MAX(1, MIN(arc_ilog2(max_ncpus) + (max_ncpus >> 6), 16));

Version 2b looks good to me and is certainly easier to reason about.

@amotin (Member) commented Nov 6, 2024

2b looks cleaner, but 2 threads out of 4 sounds a bit overkill to me. Maybe the curve could be thought through some more. BTW, speaking about cleaner (more readable) code: this is not a performance-critical part and we are not in the 1990's, so there is no point in using bit shifts for division. You may just use / 64 and compilers will do it right. And then you would not need the parentheses around it.
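
Version 2b with that suggestion applied (a sketch, not code from the PR):

	zfs_arc_evict_threads_live =
	    MAX(1, MIN(arc_ilog2(max_ncpus) + max_ncpus / 64, 16));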

- Improve the description of the scaling algorithm in the manual page.

Signed-off-by: Mateusz Piotrowski <[email protected]>
This parameter cannot be changed during runtime anyway in any meaningful
way. Make it explicitly read-only.

The manual page does not need to be updated. It already mentions that
the thread count cannot be changed during runtime.
@@ -4071,25 +4117,108 @@ arc_evict_state(arc_state_t *state, arc_buf_contents_t type, uint64_t spa,
multilist_sublist_unlock(mls);
}

	evict_arg_t *evarg = kmem_alloc(sizeof (*evarg) * num_sublists,
	    KM_SLEEP);
Member:

Sleepable memory allocation in the eviction path is asking for potential trouble.
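
One possible shape of a fix, as a sketch (assuming serial eviction remains available as a fallback; the use_evcttq flag is hypothetical):

	/* Do not block on allocation while trying to free memory. */
	evict_arg_t *evarg = kmem_alloc(sizeof (*evarg) * num_sublists,
	    KM_NOSLEEP);
	if (evarg == NULL) {
		/* Low on memory: fall back to evicting from this thread. */
		use_evcttq = B_FALSE;
	}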

@amotin added the Status: Revision Needed label (changes are required for the PR to be accepted) on Nov 6, 2024.