-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-2678] Improve FAIR queue sort algorithm #935
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial revisions to work on.
@@ -65,8 +65,10 @@ func sortQueuesByPriorityAndFairness(queues []*Queue) { | |||
if lPriority < rPriority { | |||
return false | |||
} | |||
comp := resources.CompUsageRatioSeparately(l.GetAllocatedResource(), l.GetGuaranteedResource(), | |||
r.GetAllocatedResource(), r.GetGuaranteedResource()) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function needs some work as long as we're rewriting it. We are doing the expensive recursive computation of resources on every comparison. Since we already have a slice of queues, it would be better to precompute a slice of resources computed on each one, and then use that in the comparison function so that we only create the resource maps once per sort run instead of once per comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wilfred also pointed out in the community meeting that we may also be able to optimize the parent hierarchy separately as all the compared queues will have the same parent. If we do that, we do need to be sure that the parent is not the root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently on vacation, but I will add this optimization when I return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pushed. This should compute GetFairMaxResource O(n) times instead of NLog(N).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better, but to Wilfred's point from the community meeting, check out the callers of queue.sortQueues()
-- the are in queue.go under GetQueueOutstandingRequests()
, TryAllocate()
, TryPlaceholderAllocate()
and TryReservedAllocate()
. Each computes queue.getHeadRoom()
at the start which is used for further processing. Since these methods are all recursive from the root queue down, you can compute the parent values at each step and then pass the results into the sortQueues()
function. This avoids having to recompute all the parent queue values during sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with deferring that refactor until a follow-up PR. @wilfred-s what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can defer that to the next release. It is not just a question of performance but also a question about correctness. Since the queue values can change it might be the case that a parents' setting changes. If that happens between the child checks, specially when you have a larger number of children, you might not have the exact same inputs for each child computation.
The chance is low but we should try an prevent that.
The change would not be in the sorter but in the queue code when we build the fairMaxResources
slice
@psantacl please run |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #935 +/- ##
==========================================
+ Coverage 80.33% 80.39% +0.05%
==========================================
Files 97 97
Lines 12384 12431 +47
==========================================
+ Hits 9949 9994 +45
- Misses 2164 2165 +1
- Partials 271 272 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just have a final comment, I think it's worth waiting for #943.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@psantacl please resolve merge conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some clone-safety and code tweaks. Also the big optimization work yet to-do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, other than the big refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@wilfred-s thoughts on committing this version and working on the performance refactor in a followup? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Follow up for the performance and correctness change YUNIKORN-2840
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM.
What is this PR for?
Changes to the Fair scheduling algorithm to always produce ratio's only when sorting queues for 'fair' usage.
Jira: https://issues.apache.org/jira/browse/YUNIKORN-2678
What type of PR is it?
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2678
How should this be tested?
Screenshots (if appropriate)
Please see Jira.