Great work, this looks good to me. I left a few comments, but overall this looks correct and like the right approach.
One thing I did want to ask -- now that we have support for dynamically setting performance levels for individual CPUs, do you think we should consider adding support for uclamp? Given that it's already an API that's available to users, and fits the schedutil CPU performance APIs quite well, it might be a natural API to expose and build on. Wdyt?
if (!online)
	return -ENOMEM;
It shouldn't be necessary to check for nullity here -- the kfunc doesn't advertise that it can return NULL via KF_RET_NULL.
Ah, will drop. Thanks.
@@ -16,6 +16,45 @@ enum scx_consts {
	SCX_EXIT_BT_LEN		= 64,
	SCX_EXIT_MSG_LEN	= 1024,
	SCX_EXIT_DUMP_DFL_LEN	= 32768,

	SCX_CPUPERF_ONE		= SCHED_CAPACITY_SCALE,
Could you please clarify the idea behind the name of this constant? Why wouldn't we want to call it SCX_CPUPERF_MAX? Is the idea that these relative capacity and frequency (or performance?) values we're querying are conceptually on a scale of [0, 1] relative to other cores on the system, even though in practice they're [0, 1024]?
Yeah, it's treating it as a fixed-point scale where the actual number range is [0, 1]. In practice, it's the same as MAX, but I tend to like ONEs better because it's conceptually more versatile. e.g. the current interface doesn't do it, but if we want the interface to express turbo, doing it with ONE is more natural than with MAX.
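To illustrate (a hypothetical snippet; `cpu` is assumed to be in scope in a BPF scheduler context):

```c
/* SCX_CPUPERF_ONE is 1.0 on the fixed-point scale, so fractions of it
 * express fractions of the CPU's performance range. A hypothetical
 * turbo extension could then use values above ONE, which reads more
 * naturally than values above a "MAX".
 */
scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE / 2);		/* 0.5: half perf */
scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE * 4 / 5);	/* 0.8 */
```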
 * executing an SCX task. Setting @p->scx.slice to 0 will trigger an
 * immediate dispatch cycle on the CPU.
 */
void (*tick)(struct task_struct *p);
Do you think it would be useful to also pass the s32 cpu? Certainly not strictly necessary, but I'd imagine it will be pretty common for callers to use scx_bpf_task_cpu() and bpf_get_smp_processor_id(), so it might be nice for ergonomics.
I think the benefit of keeping the interface params minimal likely outweighs the convenience factor. For internal functions, having extra args to avoid repeated lookups is fine, but here I think it's better to keep things tighter.
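For reference, a sketch of the caller-side lookup under this design (hypothetical op name, assuming the usual scx BPF helpers):

```c
/* ops.tick() doesn't receive the CPU; recover it from @p, or use
 * bpf_get_smp_processor_id() since ops.tick() runs on the CPU that is
 * executing @p.
 */
void BPF_STRUCT_OPS(example_tick, struct task_struct *p)
{
	s32 cpu = scx_bpf_task_cpu(p);

	/* ... per-CPU bookkeeping keyed on @cpu ... */
}
```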
kernel/sched/ext.c
 * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
 * schedutil cpufreq governor chooses the target frequency. The actual
 * performance level chosen is dependent on the hardware and cpufreq driver in
 * use and can be monitored using scx_bpf_cpuperf_cur().
Do you think it's worth mentioning that the latency for frequency changes to be observed is hardware dependent? Or should that already be pretty obvious to anyone who'd be playing with these knobs in the first place?
Will fortify the comment a bit.
uclamp overlays on top of schedutil and should work the same whether the underlying scheduler is fair or scx. i.e. if the user clamps the cpufreq range for a task, the clamping already works the same for scx.
Use a struct instead of u64[2]. This will ease future changes. No functional changes.
These two functions being inlined ends up bringing out a bunch of other stuff into kernel/sched/ext.h. Let's uninline them.

- Uninline both and rename scx_notify_sched_tick() to scx_tick() and scx_notify_pick_next_task() to scx_next_task_picked(). The notify term is a bit unusual and more often used with notifier chains, which isn't the case here.
- Call scx_tick() while holding the rq lock. This doesn't make a difference now but will ease future changes.
- Move the stuff which was in kernel/sched/ext.h to support the inline functions into kernel/sched/ext.c. After this, both ext header files are really lean, containing only what's needed to integrate with the rest of the kernel.
- Some other cosmetic changes.

Other than scx_tick() being called under the rq lock, no functional changes.
sched: Factor out update_other_load_avgs() from __update_blocked_others()

RT, DL, thermal and irq load and utilization metrics need to be decayed and updated periodically and before consumption to keep the numbers reasonable. This is currently done from __update_blocked_others() as a part of the fair class load balance path. Let's factor it out to update_other_load_avgs(). Pure refactor. No functional changes.

This will be used by the new BPF extensible scheduling class to ensure that the above metrics are properly maintained.

Signed-off-by: Tejun Heo <[email protected]>
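Roughly, the factored-out helper collects the updates __update_blocked_others() already performs; a sketch (helper names and signatures as in recent kernels, which may differ by tree):

```c
/* Decay/update the non-fair (RT, DL, thermal, irq) load averages for
 * @rq and return whether anything decayed.
 */
static bool update_other_load_avgs(struct rq *rq)
{
	const struct sched_class *curr_class = rq->curr->sched_class;
	u64 now = rq_clock_pelt(rq);
	unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));

	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
	       update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
	       update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
	       update_irq_load_avg(rq, 0);
}
```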
…chanisms

Without this, e.g., the RT util metric gets stuck high, which can mislead the schedutil cpufreq governor.
sugov_cpu_is_busy() is used to avoid decreasing the performance level while the CPU is busy and is called by sugov_update_single_freq() and sugov_update_single_perf(). Both callers repeat the same pattern: first test for uclamp and then for busyness. Let's refactor so that the tests aren't repeated.

The new helper is named sugov_hold_freq() and tests both the uclamp exception and CPU busyness. No functional changes. This will make adding more exception conditions easier.

Signed-off-by: Tejun Heo <[email protected]>
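A sketch of the described helper (not the verbatim patch; field and helper names follow the existing schedutil code):

```c
/* Return true if the old frequency should be held. Holding is skipped
 * when the rq is capped by uclamp_max so the clamp takes effect
 * immediately; otherwise hold while the CPU hasn't been idle since the
 * last update.
 */
static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
{
	unsigned long idle_calls;
	bool ret;

	/* if capped by uclamp_max, always update to stay in compliance */
	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
		return false;

	/* hold if the CPU hasn't been idle since the previous update */
	idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
	ret = idle_calls == sg_cpu->saved_idle_calls;
	sg_cpu->saved_idle_calls = idle_calls;
	return ret;
}
```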
This gets called on every tick if the CPU is executing an SCX task. It can be used, e.g., to terminate the slice of the current task early.
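For example, a minimal sketch (the `preempt_wanted()` predicate is made up for illustration):

```c
/* Hypothetical ops.tick() that ends the current slice early when some
 * scheduler-specific condition holds. Zeroing the slice triggers an
 * immediate dispatch cycle on the CPU.
 */
void BPF_STRUCT_OPS(demo_tick, struct task_struct *p)
{
	if (preempt_wanted(p))
		p->scx.slice = 0;
}
```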
To monitor the current performance state of each CPU.
This allows the BPF scheduler to request a specific performance level for each CPU. SCX defaults to max perf if scx_bpf_cpuperf_set() is not called.
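As a rough sketch of how the three kfuncs compose (hypothetical periodic-callback context; `nr_cpu_ids` stands in for however the scheduler tracks the CPU count):

```c
s32 cpu;

bpf_for(cpu, 0, nr_cpu_ids) {
	/* relative capacity of @cpu vs. the most performant CPU */
	u32 cap = scx_bpf_cpuperf_cap(cpu);

	/* ask schedutil for half of this CPU's performance range */
	scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE / 2);

	/* what's actually delivered depends on the hw/cpufreq driver */
	bpf_printk("cpu%d cap=%u cur=%u", cpu, cap, scx_bpf_cpuperf_cur(cpu));
}
```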
This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and `scx_bpf_cpuperf_set` definitions that were recently introduced into [`sched_ext`](sched-ext/sched_ext#180). It adds a `perf` field to `scx_layered` to allow for controlling performance per layer.

Signed-off-by: Daniel Hodges <[email protected]>
Add scx_bpf_cpuperf_cap(), scx_bpf_cpuperf_cur() and scx_bpf_cpuperf_set() which allow BPF schedulers to monitor and set each CPU's performance level. ops.tick() is added too.