Hmm, if we're going to do this we should probably use scx_select_cpu_dfl(). But we have to be a bit careful here: not every scheduler will use built-in idle tracking, so that may actually cause a scheduler that implements ops.update_idle() (and doesn't specify SCX_OPS_KEEP_BUILTIN_IDLE) to be evicted. Maybe we want something like this:
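One possible shape for that guard, as a rough sketch (this assumes a scx_builtin_idle_enabled static key gates built-in idle tracking; the scx_exec_select_cpu() wrapper name is made up for illustration):

```c
/*
 * Sketch only: pick an idle CPU on the exec path, but only when the
 * loaded scheduler actually uses built-in idle tracking. A scheduler
 * that implements ops.update_idle() without SCX_OPS_KEEP_BUILTIN_IDLE
 * never updates the built-in idle masks, so scx_select_cpu_dfl() must
 * not be consulted for it.
 */
static s32 scx_exec_select_cpu(struct task_struct *p, s32 prev_cpu)
{
	bool found = false;

	if (!static_branch_likely(&scx_builtin_idle_enabled))
		return prev_cpu;

	return scx_select_cpu_dfl(p, prev_cpu, 0, &found);
}
```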
A couple more thoughts:

The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling TTWU_QUEUE when sched_ext is enabled, despite there being some performance implications, so that we never skip running ops.select_cpu() on the wakeup path.

So if we decide to do this, we might also want to introduce a new op flag which controls whether it's enabled (e.g. SCX_OPS_AUTO_IDLE_EXEC) so that implementations can choose to opt in or out of letting the core scheduler reserve idle CPUs on the exec path.

@htejun wdyt?
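For concreteness, the opt-in could be a small gate on the exec path, roughly like this (a sketch only; SCX_OPS_AUTO_IDLE_EXEC is just the suggested name, not an existing flag):

```c
/*
 * Sketch: in sched_exec(), only look for an idle CPU when the loaded
 * scheduler opted in via the proposed SCX_OPS_AUTO_IDLE_EXEC flag and
 * built-in idle tracking is in use.
 */
if (scx_enabled() &&
    (scx_ops.flags & SCX_OPS_AUTO_IDLE_EXEC) &&
    static_branch_likely(&scx_builtin_idle_enabled)) {
	bool found = false;

	dest_cpu = scx_select_cpu_dfl(p, task_cpu(p), 0, &found);
}
```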
Yes, at WF_EXEC the cache footprint is low; however, unlike the kernel scheduler, migration is often frequent and easy for SCX scheds. There is no built-in resistance to migration in SCX, so the benefit of the extra migration opportunity is questionable. So, introducing unconditional migration to an idle CPU as the default behavior doesn't make much sense. It's likely to just add unnecessary migration overhead in the exec path.

However, if an SCX scheduler implements high CPU stickiness like the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying - if such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.
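If such a case ever came up, the extra operation might look roughly like the following (purely hypothetical; exec_migrate is not an existing sched_ext callback, just a name for illustration):

```c
/*
 * Hypothetical addition to struct sched_ext_ops, not part of sched_ext:
 * ask the BPF scheduler whether the cheap exec-path migration to
 * @idle_cpu is worth taking for @p. Returning true would make @p go
 * through the scheduling path again so it can be placed accordingly.
 */
bool (*exec_migrate)(struct task_struct *p, s32 idle_cpu);
```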
Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU; there's no scenario where that's always a win. The scheduler must evaluate whether each relatively low-cost migration opportunity is worth taking, which is what the kernel scheduler is doing too. Imagine a scheduler with high CPU stickiness and two CPUs. One CPU has lower load overall and a task is exec'ing there; the other CPU is more loaded overall but happened to be idle at that exact moment. If the task gets migrated to the more loaded CPU just because it was momentarily idle, it will only create more future need for migrations in the other direction, and situations like this are quite likely in schedulers with high CPU stickiness.
Yeah, fair enough; if we want to support migrations on the exec path, that's the way to do it. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.
Interesting, so now the EXT scheduling class is isolated from the rest of the scheduling classes like fair, deadline and so on. I suppose it should at least have the same priority level as the normal class? If that's the case, then when we load a BPF scheduler into the kernel, the sched_entity tasks in the kernel are still going to be scheduled by the default scheduler like CFS/EEVDF?
Then that's why the BPF scheduler has more of an advantage, since it can take more control over migrations under different scheduling policies. We should leave this part for the per-scheduler implementation to decide, rather than performing an unconditional migration whenever the migration cost is cheap. Thanks for providing this great insight.
No problem! Thanks for sharing these insights and feedback. I'll close the PR later.