This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

scx: Implement cpufreq support #180

Merged
merged 8 commits into sched_ext from htejun/cpufreq
Apr 12, 2024

Conversation

@htejun htejun (Collaborator) commented Apr 12, 2024

Add scx_bpf_cpuperf_cap(), scx_bpf_cpuperf_cur() and scx_bpf_cpuperf_set(), which allow BPF schedulers to monitor and set each CPU's performance level. `ops.tick()` is added too.
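
A minimal sketch, not code from this PR, of how a BPF scheduler might combine the three kfuncs from `ops.tick()`; the `monitor_tick` name and the half-capacity policy are hypothetical, and the usual scx BPF scaffolding is assumed:

```c
/* Hypothetical tick callback: sample the CPU's capacity and current
 * performance level, then pin the target at half capacity. */
void BPF_STRUCT_OPS(monitor_tick, struct task_struct *p)
{
	s32 cpu = bpf_get_smp_processor_id();
	u32 cap = scx_bpf_cpuperf_cap(cpu);	/* max relative capacity */
	u32 cur = scx_bpf_cpuperf_cur(cpu);	/* current perf level */

	if (cur != cap / 2)
		scx_bpf_cpuperf_set(cpu, cap / 2);
}
```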

@htejun htejun requested a review from Byte-Lab April 12, 2024 00:39
@Byte-Lab Byte-Lab (Collaborator) left a comment


Great work, this looks good to me. I left a few comments, but overall this looks correct and like the right approach.

One thing I did want to ask -- now that we have support for dynamically setting performance levels for individual CPUs, do you think we should consider adding support for uclamp? Given that it's already an API that's available to users, and fits the schedutil CPU performance APIs quite well, it might be a natural API to expose and build on. Wdyt?

Comment on lines +496 to +503
if (!online)
return -ENOMEM;
Collaborator:

It shouldn't be necessary to check for nullity here -- the kfunc doesn't advertise that it can return NULL via KF_RET_NULL

Collaborator Author:

Ah, will drop. Thanks.
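
For context, whether a kfunc may return NULL is advertised via the `KF_RET_NULL` flag when its BTF ID is registered; a generic sketch of that pattern, not this PR's registration code, with hypothetical function names:

```c
BTF_SET8_START(example_kfunc_ids)
BTF_ID_FLAGS(func, example_lookup, KF_RET_NULL)	/* verifier forces a NULL check */
BTF_ID_FLAGS(func, example_get)			/* never NULL; no check needed */
BTF_SET8_END(example_kfunc_ids)
```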

@@ -16,6 +16,45 @@ enum scx_consts {
SCX_EXIT_BT_LEN = 64,
SCX_EXIT_MSG_LEN = 1024,
SCX_EXIT_DUMP_DFL_LEN = 32768,

SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
Collaborator:

Could you please clarify the idea behind the name of this constant? Why wouldn't we want to call it SCX_CPUPERF_MAX? Is the idea that these relative capacity and frequency (or performance?) values we're querying are conceptually on a scale of [0, 1] relative to other cores on the system, even though in practice they're [0, 1024]?

Collaborator Author:

Yeah, it's treating it as a fixed point scale where the actual number range is [0, 1]. In practice, it's the same as MAX but I tend to like ONE better because it's conceptually more versatile. e.g. the current interface doesn't do it, but if we ever want the interface to express turbo, doing it with ONE is more natural than with MAX.
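
Concretely, treating `SCX_CPUPERF_ONE` as a fixed-point 1.0 makes fractions, and a hypothetical turbo range above 1.0, read naturally; the values below are illustrative only:

```c
u32 half  = SCX_CPUPERF_ONE / 2;	/* 0.5  -> 512  */
u32 most  = SCX_CPUPERF_ONE * 3 / 4;	/* 0.75 -> 768  */
/* not supported by the current interface, but turbo would compose: */
u32 turbo = SCX_CPUPERF_ONE * 5 / 4;	/* 1.25 -> 1280 (hypothetical) */
```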

* executing an SCX task. Setting @p->scx.slice to 0 will trigger an
* immediate dispatch cycle on the CPU.
*/
void (*tick)(struct task_struct *p);
Collaborator:

Do you think it would be useful to also pass the s32 cpu? Certainly not strictly necessary, but I'd imagine it will be pretty common for callers to use scx_bpf_task_cpu() and bpf_get_smp_processor_id(), so it might be nice for ergonomics.

Collaborator Author:

I think the benefit of keeping the interface params minimal likely outweighs the convenience factor. For internal functions, having extra args to avoid repeated lookups is fine, but here I think it's better to keep things tighter.
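
For illustration, the lookup callers would do themselves is a one-liner either way; a sketch with a hypothetical callback name:

```c
void BPF_STRUCT_OPS(example_tick, struct task_struct *p)
{
	/* ops.tick() runs on the CPU executing @p, so both forms agree */
	s32 cpu = scx_bpf_task_cpu(p);
	/* or: s32 cpu = bpf_get_smp_processor_id(); */

	scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE);
}
```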

* relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
* schedutil cpufreq governor chooses the target frequency. The actual
* performance level chosen is dependent on the hardware and cpufreq driver in
* use and can be monitored using scx_bpf_cpuperf_cur().
Collaborator:

Do you think it's worth mentioning that the latency for frequency changes to be observed is hardware dependent? Or should that already be pretty obvious to anyone who'd be playing with these knobs in the first place?

Collaborator Author:

Will fortify the comment a bit.

@htejun htejun (Collaborator Author) commented Apr 12, 2024

> One thing I did want to ask -- now that we have support for dynamically setting performance levels for individual CPUs, do you think we should consider adding support for uclamp? Given that it's already an API that's available to users, and fits the schedutil CPU performance APIs quite well, it might be a natural API to expose and build on. Wdyt?

uclamp overlays on top of schedutil and should work the same whether the underlying scheduler is fair or scx. i.e. if the user clamps the cpufreq range for a task, the clamping already works the same for scx.

Use a struct instead of u64[2]. This will ease future changes. No functional
changes.
Inlining these two functions ends up pulling a bunch of other stuff into
kernel/sched/ext.h. Let's uninline them.

- Uninline both and rename scx_notify_sched_tick() to scx_tick() and
  scx_notify_pick_next_task() to scx_next_task_picked(). The "notify" term is
  a bit unusual and is more often used with notifier chains, which isn't the
  case here.

- Call scx_tick() while holding the rq lock. This doesn't make a difference
  now but will ease future changes.

- Move the stuff which was in kernel/sched/ext.h to support the inline
  functions into kernel/sched/ext.c. After this, both ext header files are
  really lean containing only what's needed to integrate with the rest of
  the kernel.

- Some other cosmetic changes.

Other than scx_tick() being called under the rq lock, no functional changes.
…rs()

RT, DL, thermal and irq load and utilization metrics need to be decayed and
updated periodically and before consumption to keep the numbers reasonable.
This is currently done from __update_blocked_others() as a part of the fair
class load balance path. Let's factor it out to update_other_load_avgs().
Pure refactor. No functional changes.

This will be used by the new BPF extensible scheduling class to ensure that
the above metrics are properly maintained.

Signed-off-by: Tejun Heo <[email protected]>
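
A sketch of the factored-out helper's likely shape, mirroring the per-class update calls that __update_blocked_others() makes; details may differ from the actual commit:

```c
/* Decay and update the RT, DL, thermal and irq load/util metrics;
 * returns true if any of them decayed. */
bool update_other_load_avgs(struct rq *rq)
{
	u64 now = rq_clock_pelt(rq);
	const struct sched_class *curr_class = rq->curr->sched_class;
	unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));

	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
		update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
		update_irq_load_avg(rq, 0);
}
```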
…chanisms

Without this, e.g. the RT util metric gets stuck high, which can mislead the
schedutil cpufreq governor.
sugov_cpu_is_busy() is used to avoid decreasing the performance level while
the CPU is busy, and is called by sugov_update_single_freq() and
sugov_update_single_perf(). Both callers repeat the same pattern: first
test for uclamp and then for busyness. Let's refactor so that the tests
aren't repeated.

The new helper is named sugov_hold_freq() and tests both the uclamp
exception and CPU busyness. No functional changes. This will make adding
more exception conditions easier.

Signed-off-by: Tejun Heo <[email protected]>
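
A minimal sketch of the refactored shape described above; the real sugov internals track busyness with more state, so this only shows the combined test:

```c
static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
{
	/* uclamp limits must always be honored, so never hold the freq */
	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
		return false;

	/* otherwise hold the current frequency while the CPU is busy */
	return sugov_cpu_is_busy(sg_cpu);
}
```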
ops.tick() gets called on every tick if the CPU is executing an SCX task.
e.g. it can be used to terminate the slice of the current task early.
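
For example, a sketch of early slice termination from `ops.tick()`; the `should_preempt()` predicate is hypothetical:

```c
void BPF_STRUCT_OPS(preempt_tick, struct task_struct *p)
{
	if (should_preempt(p))
		p->scx.slice = 0;	/* triggers an immediate dispatch cycle */
}
```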
Add scx_bpf_cpuperf_cur() to monitor the current performance state of each CPU.
This allows the BPF scheduler to request a specific performance level for
each CPU. SCX defaults to max perf if scx_bpf_cpuperf_set() is not called.
@htejun htejun merged commit 41442f4 into sched_ext Apr 12, 2024
1 check passed
@htejun htejun deleted the htejun/cpufreq branch April 12, 2024 08:07
hodgesds added a commit to hodgesds/scx that referenced this pull request Apr 24, 2024

This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and
`scx_bpf_cpuperf_set` definitions that were recently introduced into
[`sched_ext`](sched-ext/sched_ext#180). It adds
a `perf` field to `scx_layered` to allow for controlling performance per
layer.

Signed-off-by: Daniel Hodges <[email protected]>