scx: Pick an idle cpu when select runqueue with WF_EXEC #221
Conversation
If the caller of select_task_rq_scx() calls with WF_EXEC, it's a valuable balancing opportunity to perform task migration. Although the runqueue selection can't dictate where the task is going to run, we can still take advantage of this opportunity by selecting an idle core rather than doing nothing. Signed-off-by: I Hsin Cheng <[email protected]>
Thanks for the patch! Left a comment for some discussion.
-	if (unlikely(wake_flags & WF_EXEC))
-		return prev_cpu;
+	if (unlikely(wake_flags & WF_EXEC)) {
+		int idle_cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
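The hunk above is cut off in this view. As a hedged sketch only (the actual patch may differ), the proposed change presumably falls back to prev_cpu when scx_pick_idle_cpu() finds no idle core:

	/*
	 * Sketch only: guesses at the rest of the truncated hunk above.
	 * The actual patch may differ.
	 */
	if (unlikely(wake_flags & WF_EXEC)) {
		int idle_cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);

		if (idle_cpu >= 0)
			return idle_cpu;

		return prev_cpu;
	}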
Hmm, if we're going to do this we should probably use scx_select_cpu_dfl(). But we have to be a bit careful here: not every scheduler will use built-in idle tracking, so that may actually cause a scheduler that implements ops.update_idle() (and doesn't specify SCX_OPS_KEEP_BUILTIN_IDLE) to be evicted. Maybe we want something like this:
	if (unlikely(wake_flags & WF_EXEC)) {
		if (static_branch_likely(&scx_builtin_idle_enabled)) {
			bool found;

			return scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
		} else {
			return prev_cpu;
		}
	}
A couple more thoughts:
The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling TTWU_QUEUE when sched_ext is enabled, despite there being some performance implications, so that we never skip running ops.select_cpu() on the wakeup path.
So if we decide to do this, we might also want to introduce a new op flag that controls whether this behavior is enabled (e.g. SCX_OPS_AUTO_IDLE_EXEC), so that implementations can choose to opt in or out of letting the core scheduler reserve idle CPUs on the exec path.
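For illustration only, a rough sketch of what such an opt-in gate could look like; SCX_OPS_AUTO_IDLE_EXEC is the hypothetical flag proposed in this comment, not an existing sched_ext flag, and the scx_ops.flags check is written from memory of how ext.c tests ops flags and may not match the tree exactly:

	/*
	 * Hypothetical sketch: SCX_OPS_AUTO_IDLE_EXEC does not exist today.
	 * Only reserve an idle CPU on the exec path when the BPF scheduler
	 * opted in and built-in idle tracking is in use.
	 */
	if (unlikely(wake_flags & WF_EXEC)) {
		if ((scx_ops.flags & SCX_OPS_AUTO_IDLE_EXEC) &&
		    static_branch_likely(&scx_builtin_idle_enabled)) {
			bool found;

			return scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
		}
		return prev_cpu;
	}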
@htejun wdyt?
Yes, at WF_EXEC, the cache footprint is low; however, unlike in the kernel scheduler, migration is often frequent and easy for SCX scheds. There is no built-in resistance to migration in SCX, so the benefit of the extra migration opportunity is questionable. Introducing unconditional migration to an idle CPU as the default behavior therefore doesn't make much sense; it's likely to just add unnecessary migration overhead in the exec path.
However, it could be that if an SCX scheduler implements high CPU stickiness like the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying. If such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.
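Purely as a sketch of the kind of operation being suggested here (no such callback exists in sched_ext; the name and signature below are made up for illustration):

	struct sched_ext_ops {
		/* ... existing callbacks ... */

		/*
		 * Hypothetical op: queried on the exec path. Returning true
		 * means the BPF scheduler wants the task to go through the
		 * scheduling path again so it can reconsider CPU placement
		 * while the cache footprint is low.
		 */
		bool (*exec_requeue)(struct task_struct *p, s32 prev_cpu);
	};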
Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU; there's no scenario where that's always a win. The scheduler must evaluate whether each relatively low-cost migration opportunity is worth taking, which is what the kernel scheduler is doing too. Imagine a scheduler with high CPU stickiness and two CPUs. One CPU carries less load overall and a task is exec'ing there. The other, more loaded CPU happened to be idle at that exact moment. If the task gets migrated to that more loaded CPU just because it was momentarily idle, it will only create more future need for migrations in the other direction, and situations like this are quite likely in schedulers with high CPU stickiness.
However, it could be that if an SCX scheduler implements high CPU stickiness like the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying. If such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.
Yeah fair enough, if we want to support migrations on the exec path this is the way. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.
A couple more thoughts:
The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling TTWU_QUEUE when sched_ext is enabled, despite there being some performance implications, so that we never skip running ops.select_cpu() on the wakeup path.
Interesting, so the EXT scheduling class is isolated from the rest of the scheduling classes like fair, deadline and so on. I suppose it should at least have the same priority level as the normal class? If that's the case, then when we load a BPF scheduler into the kernel, is a sched_entity in the kernel still going to be scheduled by the default scheduler like CFS/EEVDF?
Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU; there's no scenario where that's always a win. The scheduler must evaluate whether each relatively low-cost migration opportunity is worth taking, which is what the kernel scheduler is doing too. Imagine a scheduler with high CPU stickiness and two CPUs. One CPU carries less load overall and a task is exec'ing there. The other, more loaded CPU happened to be idle at that exact moment. If the task gets migrated to that more loaded CPU just because it was momentarily idle, it will only create more future need for migrations in the other direction, and situations like this are quite likely in schedulers with high CPU stickiness.
Then that's where the BPF scheduler has an advantage, since it can take more control over migrations according to its scheduling policy. We should leave this part for the per-scheduler implementation to decide, rather than performing a blind migration whenever the migration cost is cheap. Thanks for providing this great insight.
Yeah fair enough, if we want to support migrations on the exec path this is the way. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.
No problem! Thanks for sharing these insights and feedback. I'll close the PR later.
Summary
If the caller of select_task_rq_scx() calls with WF_EXEC, it's a valuable balancing opportunity to perform task migration. Although the runqueue selection can't dictate where the task is going to run, we can still take advantage of this opportunity by selecting an idle core rather than doing nothing.
Note
This might be an overly cautious concern for the current implementation of sched_ext, since there's no caller like sched_exec() under /kernel/sched/core.c which would call select_task_rq() with WF_EXEC. However, it's still worth considering the change and maybe adding a proper caller with WF_EXEC.