This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

scx: Pick an idle cpu when select runqueue with WF_EXEC #221

Closed · wants to merge 1 commit
6 changes: 4 additions & 2 deletions kernel/sched/ext.c
@@ -3107,8 +3107,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 	 * which can decide to preempt self to force it through the regular
 	 * scheduling path.
 	 */
-	if (unlikely(wake_flags & WF_EXEC))
-		return prev_cpu;
+	if (unlikely(wake_flags & WF_EXEC)) {
+		int idle_cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
Collaborator:
Hmm, if we're going to do this we should probably use scx_select_cpu_dfl(). But we have to be a bit careful here: not every scheduler will use built-in idle tracking, so that may actually cause a scheduler that implements ops.update_idle() (and doesn't specify SCX_OPS_KEEP_BUILTIN_IDLE) to be evicted. Maybe we want something like this:

if (unlikely(wake_flags & WF_EXEC)) {
        if (static_branch_likely(&scx_builtin_idle_enabled)) {
                bool found;

                return scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
        } else {
                return prev_cpu;
        }
}

A couple more thoughts:

The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling TTWU_QUEUE when sched_ext is enabled, despite there being some performance implications so that we never skip running ops.select_cpu() on the wakeup path.

So if we decide to do this, we might also want to introduce a new op flag for this which controls whether it's enabled (e.g. SCX_OPS_AUTO_IDLE_EXEC) so that implementations can choose to opt in or out of letting the core scheduler reserve idle CPUs on the exec path.

@htejun wdyt?
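As a rough illustration of that opt-in idea, the WF_EXEC branch could be gated on such a flag. This is only a sketch: SCX_OPS_AUTO_IDLE_EXEC is the hypothetical flag name suggested above, and the scx_ops.flags test is illustrative rather than existing code.

if (unlikely(wake_flags & WF_EXEC)) {
        /* hypothetical opt-in: the scheduler asked for core-side idle-CPU picking at exec */
        if ((scx_ops.flags & SCX_OPS_AUTO_IDLE_EXEC) &&
            static_branch_likely(&scx_builtin_idle_enabled)) {
                bool found;

                return scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
        }
        return prev_cpu;
}

A scheduler that prefers to keep tasks where they are would simply not set the flag and retain the current return prev_cpu behavior.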

Collaborator:

Yes, at WF_EXEC, the cache footprint is low; however, unlike the kernel scheduler, migration is often frequent and easy for SCX scheds. There is no built-in resistance to migration in SCX, so the benefit of the extra migration opportunity is questionable. So, introducing unconditional migration to an idle CPU as the default behavior doesn't make much sense. It's likely to just add unnecessary migration overhead in the exec path.

However, it could be that if an SCX scheduler implements high CPU stickiness like the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying - if such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.
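To make that concrete, such an operation might look roughly like the sketch below. The name and signature are made up purely for illustration; nothing like it exists today, and a real version would be invoked through sched_ext's usual BPF call wrappers rather than a bare function pointer.

struct sched_ext_ops {
        /* ... existing operations ... */

        /*
         * Hypothetical: called on the exec path, where the cache footprint
         * is at its lowest. Return true if the scheduler wants the task to
         * preempt itself so that it goes back through the scheduling path
         * and a new CPU can be picked for it.
         */
        bool (*exec_migrate)(struct task_struct *p, s32 prev_cpu);
};

Whether the core code then triggers the preemption itself or leaves that to the BPF callback would be part of that design.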

Collaborator:

Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU; there's no scenario where that's always a win. The scheduler must evaluate whether each relatively low-cost migration opportunity is worth taking, which is what the kernel scheduler is doing too. Imagine a scheduler with high CPU stickiness and two CPUs. One CPU has lower load overall and a task is exec'ing there. The other, more loaded CPU happened to be idle at that exact moment. If the task gets migrated to that more loaded CPU, it will just create more future need for migrations in the other direction, and situations like this are quite likely in schedulers with high CPU stickiness.

Collaborator:

> However, it could be that if an SCX scheduler implements high CPU stickiness like the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying - if such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.

Yeah fair enough, if we want to support migrations on the exec path this is the way. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.

Contributor Author:

> A couple more thoughts:
>
> The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling TTWU_QUEUE when sched_ext is enabled, despite there being some performance implications so that we never skip running ops.select_cpu() on the wakeup path.

Interesting, so the EXT scheduling class is isolated from the rest of the scheduling classes like fair, deadline, and so on; I suppose it should at least have the same priority level as the normal class? If that's the case, then when we load a BPF scheduler into the kernel, are sched entities in the kernel still going to be scheduled by the default scheduler like CFS/EEVDF?

Contributor Author:

> Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU; there's no scenario where that's always a win. The scheduler must evaluate whether each relatively low-cost migration opportunity is worth taking, which is what the kernel scheduler is doing too. Imagine a scheduler with high CPU stickiness and two CPUs. One CPU has lower load overall and a task is exec'ing there. The other, more loaded CPU happened to be idle at that exact moment. If the task gets migrated to that more loaded CPU, it will just create more future need for migrations in the other direction, and situations like this are quite likely in schedulers with high CPU stickiness.

Then that's why the BPF scheduler has an advantage, since it can exercise more control over migrations under different scheduling policies. We should leave this part for the per-scheduler implementation to decide, rather than perform a blind migration whenever the migration cost is cheap. Thanks for providing this great insight.

Contributor Author:

> Yeah fair enough, if we want to support migrations on the exec path this is the way. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.

No problem! Thanks for sharing these insights and feedback. I'll close the PR later.

+		return idle_cpu;
+	}
 
 	if (SCX_HAS_OP(select_cpu)) {
 		s32 cpu;