
Commit 8639ece

Committed Aug 8, 2023
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one of the L3 cache domains.

While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's say
there is a thread pinned to a CPU issuing enough work items to saturate the
whole machine. With the machine segmented into two pods, no matter how many
work items it issues, it can only use half of the CPUs on the system.

While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.

This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set to
the whole machine, thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue code
asks the scheduler to bring back the task within the pod if the worker is
outside, i.e. work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.

After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:

* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
  ->__pod_cpumask so that the workers are allowed to run on any CPU that the
  associated workqueues allow.

* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
  the field to a CPU within the pod.

This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.

There is also a race window where setting ->wake_cpu wouldn't be effective as
the target task is still on CPU. However, the window is pretty small and this
being a best-effort optimization, it doesn't seem to warrant more complexity
at the moment.

While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs, and the next patch will add
documentation to discuss the performance implications.

v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
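As an illustration only (not part of this patch): an in-kernel caller with
access to the workqueue core could flip the new attribute through the same
workqueue_attrs path that the sysfs knob uses. The sketch below assumes an
already-created unbound workqueue and, unlike the sysfs store path, does not
preserve the workqueue's current cpumask or nice settings; the function name
is hypothetical.

#include <linux/workqueue.h>

/*
 * Hypothetical sketch: switch an existing unbound workqueue to strict
 * affinity through the regular workqueue_attrs machinery, which is what
 * writing 1 to the affinity_strict sysfs file does internally.
 * apply_workqueue_attrs() is internal to the workqueue core and not
 * exported to modules.
 */
static int example_make_wq_strict(struct workqueue_struct *wq)
{
	struct workqueue_attrs *attrs;
	int ret;

	attrs = alloc_workqueue_attrs();
	if (!attrs)
		return -ENOMEM;

	/* keep the defaults (all possible CPUs, nice 0); only flip strictness */
	attrs->affn_strict = true;
	ret = apply_workqueue_attrs(wq, attrs);

	free_workqueue_attrs(attrs);
	return ret;
}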

5 files changed: +132, -20 lines
 

Documentation/core-api/workqueue.rst (+23 -7)
@@ -353,9 +353,10 @@ Affinity Scopes
 An unbound workqueue groups CPUs according to its affinity scope to improve
 cache locality. For example, if a workqueue is using the default affinity
 scope of "cache", it will group CPUs according to last level cache
-boundaries. A work item queued on the workqueue will be processed by a
-worker running on one of the CPUs which share the last level cache with the
-issuing CPU.
+boundaries. A work item queued on the workqueue will be assigned to a worker
+on one of the CPUs which share the last level cache with the issuing CPU.
+Once started, the worker may or may not be allowed to move outside the scope
+depending on the ``affinity_strict`` setting of the scope.

 Workqueue currently supports the following five affinity scopes.
@@ -391,6 +392,21 @@ directory.
 ``affinity_scope``
 	Read to see the current affinity scope. Write to change.

+``affinity_strict``
+	0 by default indicating that affinity scopes are not strict. When a work
+	item starts execution, workqueue makes a best-effort attempt to ensure
+	that the worker is inside its affinity scope, which is called
+	repatriation. Once started, the scheduler is free to move the worker
+	anywhere in the system as it sees fit. This enables benefiting from scope
+	locality while still being able to utilize other CPUs if necessary and
+	available.
+
+	If set to 1, all workers of the scope are guaranteed always to be in the
+	scope. This may be useful when crossing affinity scopes has other
+	implications, for example, in terms of power consumption or workload
+	isolation. Strict NUMA scope can also be used to match the workqueue
+	behavior of older kernels.
+

 Examining Configuration
 =======================
@@ -475,21 +491,21 @@ Monitoring
 Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::

   $ tools/workqueue/wq_monitor.py events
-                             total   infl  CPUtime  CPUhog   CMwake  mayday  rescued
+                             total   infl  CPUtime  CPUhog  CMW/RPR  mayday  rescued
   events                     18545      0      6.1       0        5       -        -
   events_highpri                 8      0      0.0       0        0       -        -
   events_long                    3      0      0.0       0        0       -        -
-  events_unbound             38306      0      0.1       -        -       -        -
+  events_unbound             38306      0      0.1       -        7       -        -
   events_freezable               0      0      0.0       0        0       -        -
   events_power_efficient     29598      0      0.2       0        0       -        -
   events_freezable_power_       10      0      0.0       0        0       -        -
   sock_diag_events               0      0      0.0       0        0       -        -

-                             total   infl  CPUtime  CPUhog   CMwake  mayday  rescued
+                             total   infl  CPUtime  CPUhog  CMW/RPR  mayday  rescued
   events                     18548      0      6.1       0        5       -        -
   events_highpri                 8      0      0.0       0        0       -        -
   events_long                    3      0      0.0       0        0       -        -
-  events_unbound             38322      0      0.1       -        -       -        -
+  events_unbound             38322      0      0.1       -        7       -        -
   events_freezable               0      0      0.0       0        0       -        -
   events_power_efficient     29603      0      0.2       0        0       -        -
   events_freezable_power_       10      0      0.0       0        0       -        -
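For completeness, a minimal userspace sketch of exercising the documented
knob from C. The workqueue name below is hypothetical; the sysfs directory
only exists for workqueues created with WQ_SYSFS (or registered via
workqueue_sysfs_register()).

#include <stdio.h>

int main(void)
{
	/* hypothetical WQ_SYSFS workqueue name */
	const char *path =
		"/sys/devices/virtual/workqueue/my_unbound_wq/affinity_strict";
	int strict = -1;
	FILE *f;

	/* read the current setting: 0 = non-strict (default), 1 = strict */
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &strict) == 1)
		printf("affinity_strict is currently %d\n", strict);
	fclose(f);

	/* make the scope strict; writing "0" restores the non-strict default */
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "1\n");
	fclose(f);
	return 0;
}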

include/linux/workqueue.h (+11)
@@ -169,6 +169,17 @@ struct workqueue_attrs {
 	 */
 	cpumask_var_t __pod_cpumask;

+	/**
+	 * @affn_strict: affinity scope is strict
+	 *
+	 * If clear, workqueue will make a best-effort attempt at starting the
+	 * worker inside @__pod_cpumask but the scheduler is free to migrate it
+	 * outside.
+	 *
+	 * If set, workers are only allowed to run inside @__pod_cpumask.
+	 */
+	bool affn_strict;
+
 	/*
 	 * Below fields aren't properties of a worker_pool. They only modify how
 	 * :c:func:`apply_workqueue_attrs` select pools and thus don't

kernel/workqueue.c (+72 -2)
@@ -211,6 +211,7 @@ enum pool_workqueue_stats {
 	PWQ_STAT_CPU_TIME,	/* total CPU time consumed */
 	PWQ_STAT_CPU_INTENSIVE,	/* wq_cpu_intensive_thresh_us violations */
 	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
+	PWQ_STAT_REPATRIATED,	/* unbound workers brought back into scope */
 	PWQ_STAT_MAYDAY,	/* maydays to rescuer */
 	PWQ_STAT_RESCUED,	/* linked work items executed by rescuer */

@@ -1103,13 +1104,41 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
 static bool kick_pool(struct worker_pool *pool)
 {
 	struct worker *worker = first_idle_worker(pool);
+	struct task_struct *p;

 	lockdep_assert_held(&pool->lock);

 	if (!need_more_worker(pool) || !worker)
 		return false;

-	wake_up_process(worker->task);
+	p = worker->task;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Idle @worker is about to execute @work and waking up provides an
+	 * opportunity to migrate @worker at a lower cost by setting the task's
+	 * wake_cpu field. Let's see if we want to move @worker to improve
+	 * execution locality.
+	 *
+	 * We're waking the worker that went idle the latest and there's some
+	 * chance that @worker is marked idle but hasn't gone off CPU yet. If
+	 * so, setting the wake_cpu won't do anything. As this is a best-effort
+	 * optimization and the race window is narrow, let's leave as-is for
+	 * now. If this becomes pronounced, we can skip over workers which are
+	 * still on cpu when picking an idle worker.
+	 *
+	 * If @pool has non-strict affinity, @worker might have ended up outside
+	 * its affinity scope. Repatriate.
+	 */
+	if (!pool->attrs->affn_strict &&
+	    !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
+		struct work_struct *work = list_first_entry(&pool->worklist,
+						struct work_struct, entry);
+		p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
+		get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
+	}
+#endif
+	wake_up_process(p);
 	return true;
 }

@@ -2051,7 +2080,10 @@ static struct worker *alloc_worker(int node)

 static cpumask_t *pool_allowed_cpus(struct worker_pool *pool)
 {
-	return pool->attrs->__pod_cpumask;
+	if (pool->cpu < 0 && pool->attrs->affn_strict)
+		return pool->attrs->__pod_cpumask;
+	else
+		return pool->attrs->cpumask;
 }

 /**
@@ -3715,6 +3747,7 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
 	to->nice = from->nice;
 	cpumask_copy(to->cpumask, from->cpumask);
 	cpumask_copy(to->__pod_cpumask, from->__pod_cpumask);
+	to->affn_strict = from->affn_strict;

 	/*
 	 * Unlike hash and equality test, copying shouldn't ignore wq-only
@@ -3745,6 +3778,7 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
 	hash = jhash(cpumask_bits(attrs->__pod_cpumask),
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
+	hash = jhash_1word(attrs->affn_strict, hash);
 	return hash;
 }

@@ -3758,6 +3792,8 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
 		return false;
 	if (!cpumask_equal(a->__pod_cpumask, b->__pod_cpumask))
 		return false;
+	if (a->affn_strict != b->affn_strict)
+		return false;
 	return true;
 }

@@ -5847,6 +5883,7 @@ module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
  *  nice              RW int  : nice value of the workers
  *  cpumask           RW mask : bitmask of allowed CPUs for the workers
  *  affinity_scope    RW str  : worker CPU affinity scope (cache, numa, none)
+ *  affinity_strict   RW bool : worker CPU affinity is strict
  */
 struct wq_device {
 	struct workqueue_struct *wq;
@@ -6026,10 +6063,42 @@ static ssize_t wq_affn_scope_store(struct device *dev,
 	return ret ?: count;
 }

+static ssize_t wq_affinity_strict_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n",
+			 wq->unbound_attrs->affn_strict);
+}
+
+static ssize_t wq_affinity_strict_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	struct workqueue_attrs *attrs;
+	int v, ret = -ENOMEM;
+
+	if (sscanf(buf, "%d", &v) != 1)
+		return -EINVAL;
+
+	apply_wqattrs_lock();
+	attrs = wq_sysfs_prep_attrs(wq);
+	if (attrs) {
+		attrs->affn_strict = (bool)v;
+		ret = apply_workqueue_attrs_locked(wq, attrs);
+	}
+	apply_wqattrs_unlock();
+	free_workqueue_attrs(attrs);
+	return ret ?: count;
+}
+
 static struct device_attribute wq_sysfs_unbound_attrs[] = {
 	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
 	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
 	__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
+	__ATTR(affinity_strict, 0644, wq_affinity_strict_show, wq_affinity_strict_store),
 	__ATTR_NULL,
 };

@@ -6452,6 +6521,7 @@ void __init workqueue_init_early(void)
 			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
 			cpumask_copy(pool->attrs->__pod_cpumask, cpumask_of(cpu));
 			pool->attrs->nice = std_nice[i++];
+			pool->attrs->affn_strict = true;
 			pool->node = cpu_to_node(cpu);

 			/* alloc pool ID */
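Taken together, pool_allowed_cpus() and the kick_pool() hunk above implement
the non-strict policy: workers of a non-strict unbound pool may run anywhere
the workqueue's cpumask permits, but are steered back into their pod when
woken. The stand-alone sketch below only illustrates that decision logic with
plain C and hypothetical 8-CPU bitmasks, and picks the lowest pod CPU instead
of cpumask_any_distribute()'s distributed choice; it is not kernel code.

#include <stdint.h>
#include <stdio.h>

struct fake_pool {
	int cpu;		/* >= 0 for per-cpu pools, -1 for unbound */
	int affn_strict;	/* strict affinity scope? */
	uint64_t cpumask;	/* CPUs the associated workqueues allow */
	uint64_t pod_cpumask;	/* CPUs in this pool's affinity pod */
};

/* Which CPUs may the workers of this pool run on? */
static uint64_t pool_allowed_cpus(const struct fake_pool *pool)
{
	if (pool->cpu < 0 && pool->affn_strict)
		return pool->pod_cpumask;	/* strict: confined to the pod */
	return pool->cpumask;			/* non-strict: whole wq cpumask */
}

/* Where should an idle worker be woken? (repatriation for non-strict pools) */
static int pick_wake_cpu(const struct fake_pool *pool, int cur_wake_cpu)
{
	if (!pool->affn_strict && !((pool->pod_cpumask >> cur_wake_cpu) & 1)) {
		/* repatriate: pick the lowest CPU in the pod */
		for (int cpu = 0; cpu < 64; cpu++)
			if ((pool->pod_cpumask >> cpu) & 1)
				return cpu;
	}
	return cur_wake_cpu;
}

int main(void)
{
	/* 8 CPUs, two L3 pods (0-3 and 4-7); the workqueue allows all CPUs */
	struct fake_pool pool = { .cpu = -1, .affn_strict = 0,
				  .cpumask = 0xff, .pod_cpumask = 0x0f };

	printf("allowed mask: %#llx\n",
	       (unsigned long long)pool_allowed_cpus(&pool));	/* 0xff */
	printf("wake_cpu for worker last on CPU 6: %d\n",
	       pick_wake_cpu(&pool, 6));	/* repatriated to CPU 0 */
	return 0;
}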

tools/workqueue/wq_dump.py (+12 -4)
@@ -36,10 +36,11 @@
 Lists all workqueues along with their type and worker pool association. For
 each workqueue:

-  NAME TYPE POOL_ID...
+  NAME TYPE[,FLAGS] POOL_ID...

   NAME     name of the workqueue
   TYPE     percpu, unbound or ordered
+  FLAGS    S: strict affinity scope
   POOL_ID  worker pool ID associated with each possible CPU
 """

@@ -138,13 +139,16 @@ def print_pod_type(pt):
         print(f'cpu={pool.cpu.value_():3}', end='')
     else:
         print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
+        print(f' pod_cpus={cpumask_str(pool.attrs.__pod_cpumask)}', end='')
+        if pool.attrs.affn_strict:
+            print(' strict', end='')
     print('')

 print('')
 print('Workqueue CPU -> pool')
 print('=====================')

-print('[ workqueue \ CPU ', end='')
+print('[ workqueue \ type CPU', end='')
 for cpu in for_each_possible_cpu(prog):
     print(f' {cpu:{max_pool_id_len}}', end='')
 print(' dfl]')

@@ -153,11 +157,15 @@ def print_pod_type(pt):
     print(f'{wq.name.string_().decode()[-24:]:24}', end='')
     if wq.flags & WQ_UNBOUND:
         if wq.flags & WQ_ORDERED:
-            print(' ordered', end='')
+            print(' ordered ', end='')
         else:
             print(' unbound', end='')
+            if wq.unbound_attrs.affn_strict:
+                print(',S ', end='')
+            else:
+                print(' ', end='')
     else:
-        print(' percpu ', end='')
+        print(' percpu ', end='')

     for cpu in for_each_possible_cpu(prog):
         pool_id = per_cpu_ptr(wq.cpu_pwq, cpu)[0].pool.id.value_()

tools/workqueue/wq_monitor.py (+14 -7)
@@ -20,8 +20,11 @@
             and got excluded from concurrency management to avoid stalling
             other work items.

-  CMwake    The number of concurrency-management wake-ups while executing a
-            work item of the workqueue.
+  CMW/RPR   For per-cpu workqueues, the number of concurrency-management
+            wake-ups while executing a work item of the workqueue. For
+            unbound workqueues, the number of times a worker was repatriated
+            to its affinity scope after being migrated to an off-scope CPU by
+            the scheduler.

   mayday    The number of times the rescuer was requested while waiting for
             new worker creation.

@@ -65,6 +68,7 @@ def err(s):
 PWQ_STAT_CPU_TIME      = prog['PWQ_STAT_CPU_TIME']      # total CPU time consumed
 PWQ_STAT_CPU_INTENSIVE = prog['PWQ_STAT_CPU_INTENSIVE'] # wq_cpu_intensive_thresh_us violations
 PWQ_STAT_CM_WAKEUP     = prog['PWQ_STAT_CM_WAKEUP']     # concurrency-management worker wakeups
+PWQ_STAT_REPATRIATED   = prog['PWQ_STAT_REPATRIATED']   # unbound workers brought back into scope
 PWQ_STAT_MAYDAY        = prog['PWQ_STAT_MAYDAY']        # maydays to rescuer
 PWQ_STAT_RESCUED       = prog['PWQ_STAT_RESCUED']       # linked work items executed by rescuer
 PWQ_NR_STATS           = prog['PWQ_NR_STATS']

@@ -89,22 +93,25 @@ def dict(self, now):
                 'cpu_time'      : self.stats[PWQ_STAT_CPU_TIME],
                 'cpu_intensive' : self.stats[PWQ_STAT_CPU_INTENSIVE],
                 'cm_wakeup'     : self.stats[PWQ_STAT_CM_WAKEUP],
+                'repatriated'   : self.stats[PWQ_STAT_REPATRIATED],
                 'mayday'        : self.stats[PWQ_STAT_MAYDAY],
                 'rescued'       : self.stats[PWQ_STAT_RESCUED], }

     def table_header_str():
         return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} '\
-               f'{"CPUitsv":>7} {"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+               f'{"CPUitsv":>7} {"CMW/RPR":>7} {"mayday":>7} {"rescued":>7}'

     def table_row_str(self):
         cpu_intensive = '-'
-        cm_wakeup = '-'
+        cmw_rpr = '-'
         mayday = '-'
         rescued = '-'

-        if not self.unbound:
+        if self.unbound:
+            cmw_rpr = str(self.stats[PWQ_STAT_REPATRIATED]);
+        else:
             cpu_intensive = str(self.stats[PWQ_STAT_CPU_INTENSIVE])
-            cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
+            cmw_rpr = str(self.stats[PWQ_STAT_CM_WAKEUP])

         if self.mem_reclaim:
             mayday = str(self.stats[PWQ_STAT_MAYDAY])

@@ -115,7 +122,7 @@ def table_row_str(self):
               f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
               f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000:8.1f} ' \
               f'{cpu_intensive:>7} ' \
-              f'{cm_wakeup:>7} ' \
+              f'{cmw_rpr:>7} ' \
               f'{mayday:>7} ' \
               f'{rescued:>7} '
