This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

scx: Implement cpufreq support #180

Merged
merged 8 commits into sched_ext from htejun/cpufreq
Apr 12, 2024

Conversation

@htejun htejun (Collaborator) commented Apr 12, 2024

Add scx_bpf_cpuperf_cap(), scx_bpf_cpuperf_cur() and scx_bpf_cpuperf_set(), which allow BPF schedulers to monitor and set each CPU's performance level. `ops.tick()` is added too.
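
A minimal sketch, not code from this PR, of how a BPF scheduler might combine the three kfuncs from `ops.tick()`; the `monitor_tick` name and the half-capacity policy are hypothetical, and the usual scx BPF scaffolding is assumed:

```c
/* Hypothetical tick callback: sample the CPU's capacity and current
 * performance level, then pin the target at half capacity. */
void BPF_STRUCT_OPS(monitor_tick, struct task_struct *p)
{
	s32 cpu = bpf_get_smp_processor_id();
	u32 cap = scx_bpf_cpuperf_cap(cpu);	/* max relative capacity */
	u32 cur = scx_bpf_cpuperf_cur(cpu);	/* current perf level */

	if (cur != cap / 2)
		scx_bpf_cpuperf_set(cpu, cap / 2);
}
```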

@htejun htejun requested a review from Byte-Lab April 12, 2024 00:39
@Byte-Lab Byte-Lab (Collaborator) left a comment


Great work, this looks good to me. I left a few comments, but overall this looks correct and like the right approach.

One thing I did want to ask -- now that we have support for dynamically setting performance levels for individual CPUs, do you think we should consider adding support for uclamp? Given that it's already an API that's available to users, and fits the schedutil CPU performance APIs quite well, it might be a natural API to expose and build on. Wdyt?

Comment on lines +496 to +503
if (!online)
return -ENOMEM;
Collaborator:

It shouldn't be necessary to check for nullity here -- the kfunc doesn't advertise that it can return NULL via KF_RET_NULL

Collaborator Author:

Ah, will drop. Thanks.
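
For context, whether a kfunc may return NULL is advertised via the `KF_RET_NULL` flag when its BTF ID is registered; a generic sketch of that pattern, not this PR's registration code, with hypothetical function names:

```c
BTF_SET8_START(example_kfunc_ids)
BTF_ID_FLAGS(func, example_lookup, KF_RET_NULL)	/* verifier forces a NULL check */
BTF_ID_FLAGS(func, example_get)			/* never NULL; no check needed */
BTF_SET8_END(example_kfunc_ids)
```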

@@ -16,6 +16,45 @@ enum scx_consts {
SCX_EXIT_BT_LEN = 64,
SCX_EXIT_MSG_LEN = 1024,
SCX_EXIT_DUMP_DFL_LEN = 32768,

SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
Collaborator:

Could you please clarify the idea behind the name of this constant? Why wouldn't we want to call it SCX_CPUPERF_MAX? Is the idea that these relative capacity and frequency (or performance?) values we're querying are conceptually on a scale of [0, 1] relative to other cores on the system, even though in practice they're [0, 1024]?

Collaborator Author:

Yeah, it's treating it as a fixed point scale where the actual number range is [0, 1]. In practice, it's the same as MAX but I tend to like ONE better because it's conceptually more versatile. e.g. the current interface doesn't do it, but if we ever want the interface to express turbo, doing it with ONE is more natural than with MAX.
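
Concretely, treating `SCX_CPUPERF_ONE` as a fixed-point 1.0 makes fractions, and a hypothetical turbo range above 1.0, read naturally; the values below are illustrative only:

```c
u32 half  = SCX_CPUPERF_ONE / 2;	/* 0.5  -> 512  */
u32 most  = SCX_CPUPERF_ONE * 3 / 4;	/* 0.75 -> 768  */
/* not supported by the current interface, but turbo would compose: */
u32 turbo = SCX_CPUPERF_ONE * 5 / 4;	/* 1.25 -> 1280 (hypothetical) */
```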

* executing an SCX task. Setting @p->scx.slice to 0 will trigger an
* immediate dispatch cycle on the CPU.
*/
void (*tick)(struct task_struct *p);
Collaborator:

Do you think it would be useful to also pass the s32 cpu? Certainly not strictly necessary, but I'd imagine it will be pretty common for callers to use scx_bpf_task_cpu() and bpf_get_smp_processor_id(), so it might be nice for ergonomics.

Collaborator Author:

I think the benefit of keeping the interface params minimal likely outweighs the convenience factor. For internal functions, having extra args to avoid repeated lookups is fine, but here I think it's better to keep things tighter.
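
For illustration, the lookup callers would do themselves is a one-liner either way; a sketch with a hypothetical callback name:

```c
void BPF_STRUCT_OPS(example_tick, struct task_struct *p)
{
	/* ops.tick() runs on the CPU executing @p, so both forms agree */
	s32 cpu = scx_bpf_task_cpu(p);
	/* or: s32 cpu = bpf_get_smp_processor_id(); */

	scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE);
}
```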

* relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
* schedutil cpufreq governor chooses the target frequency. The actual
* performance level chosen is dependent on the hardware and cpufreq driver in
* use and can be monitored using scx_bpf_cpuperf_cur().
Collaborator:

Do you think it's worth mentioning that the latency for frequency changes to be observed is hardware dependent? Or should that already be pretty obvious to anyone who'd be playing with these knobs in the first place?

Collaborator Author:

Will fortify the comment a bit.

@htejun htejun (Collaborator Author) commented Apr 12, 2024

> One thing I did want to ask -- now that we have support for dynamically setting performance levels for individual CPUs, do you think we should consider adding support for uclamp? Given that it's already an API that's available to users, and fits the schedutil CPU performance APIs quite well, it might be a natural API to expose and build on. Wdyt?

uclamp overlays on top of schedutil and should work the same whether the underlying scheduler is fair or scx. i.e. if the user clamps the cpufreq range for a task, the clamping already works the same for scx.

Use a struct instead of u64[2]. This will ease future changes. No functional
changes.
Inlining these two functions ends up pulling a bunch of other stuff into
kernel/sched/ext.h. Let's uninline them.

- Uninline both and rename scx_notify_sched_tick() to scx_tick() and
  scx_notify_pick_next_task() to scx_next_task_picked(). The "notify" term is
  a bit unusual and is more often used with notifier chains, which isn't the
  case here.

- Call scx_tick() while holding the rq lock. This doesn't make a difference
  now but will ease future changes.

- Move the stuff which was in kernel/sched/ext.h to support the inline
  functions into kernel/sched/ext.c. After this, both ext header files are
  really lean containing only what's needed to integrate with the rest of
  the kernel.

- Some other cosmetic changes.

Other than scx_tick() being called under the rq lock, no functional changes.
…rs()

RT, DL, thermal and irq load and utilization metrics need to be decayed and
updated periodically and before consumption to keep the numbers reasonable.
This is currently done from __update_blocked_others() as a part of the fair
class load balance path. Let's factor it out to update_other_load_avgs().
Pure refactor. No functional changes.

This will be used by the new BPF extensible scheduling class to ensure that
the above metrics are properly maintained.

Signed-off-by: Tejun Heo <[email protected]>
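
A sketch of the factored-out helper's likely shape, mirroring the per-class update calls that __update_blocked_others() makes; details may differ from the actual commit:

```c
/* Decay and update the RT, DL, thermal and irq load/util metrics;
 * returns true if any of them decayed. */
bool update_other_load_avgs(struct rq *rq)
{
	u64 now = rq_clock_pelt(rq);
	const struct sched_class *curr_class = rq->curr->sched_class;
	unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));

	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
		update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
		update_irq_load_avg(rq, 0);
}
```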
…chanisms

Without this, e.g. the RT util metric gets stuck high, which can mislead the
schedutil cpufreq governor.
sugov_cpu_is_busy() is used to avoid decreasing the performance level while
the CPU is busy, and is called by sugov_update_single_freq() and
sugov_update_single_perf(). Both callers repeat the same pattern: first
test for uclamp and then for busyness. Let's refactor so that the tests
aren't repeated.

The new helper is named sugov_hold_freq() and tests both the uclamp
exception and CPU busyness. No functional changes. This will make adding
more exception conditions easier.

Signed-off-by: Tejun Heo <[email protected]>
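
A minimal sketch of the refactored shape described above; the real sugov internals track busyness with more state, so this only shows the combined test:

```c
static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
{
	/* uclamp limits must always be honored, so never hold the freq */
	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
		return false;

	/* otherwise hold the current frequency while the CPU is busy */
	return sugov_cpu_is_busy(sg_cpu);
}
```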
ops.tick() gets called on every tick if the CPU is executing an SCX task.
e.g. it can be used to terminate the slice of the current task early.
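
For example, a sketch of early slice termination from `ops.tick()`; the `should_preempt()` predicate is hypothetical:

```c
void BPF_STRUCT_OPS(preempt_tick, struct task_struct *p)
{
	if (should_preempt(p))
		p->scx.slice = 0;	/* triggers an immediate dispatch cycle */
}
```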
Add scx_bpf_cpuperf_cur() to monitor the current performance state of each CPU.
This allows the BPF scheduler to request a specific performance level for
each CPU. SCX defaults to max perf if scx_bpf_cpuperf_set() is not called.
@htejun htejun merged commit 41442f4 into sched_ext Apr 12, 2024
1 check passed
@htejun htejun deleted the htejun/cpufreq branch April 12, 2024 08:07
hodgesds added a commit to hodgesds/scx that referenced this pull request Apr 24, 2024

This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and
`scx_bpf_cpuperf_set` definitions that were recently introduced into
[`sched_ext`](sched-ext/sched_ext#180). It adds
a `perf` field to `scx_layered` to allow for controlling performance per
layer.

Signed-off-by: Daniel Hodges <[email protected]>