slo-controller: fix mid resource calculate formula #2291

lijunxin559 · 2024-12-09T12:27:25Z

Ⅰ. Describe what this PR does

When calculating Allocatable[mid] resources, due to possible oversold, ProdReclaimableMetric will be greater than NodeAllocatable * thresholdRatio, so the calculated Allocatable[mid] value accidentally includes the oversold part.
In particular, when thresholdRatio = 1, Allocatable[mid] = NodeAllocatable, which is not the original intention of the koordinator resource model design.

I suggest taking the current free resources of this node into account when calculating Allocatable[mid], that is, Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio, FreeResourceOnNode) + Unallocated[Mid] * midUnallocatedRatio

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

I have written necessary docs and comments
I have added necessary unit tests and integration tests
All checks passed in make test

codecov · 2024-12-09T12:32:57Z

Codecov Report

Attention: Patch coverage is 46.51163% with 23 lines in your changes missing coverage. Please review.

Project coverage is 66.03%. Comparing base (7c94119) to head (ef6b1f9).
Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
...g/slo-controller/noderesource/plugins/util/util.go	0.00%	23 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2291      +/-   ##
==========================================
- Coverage   66.04%   66.03%   -0.02%     
==========================================
  Files         457      457              
  Lines       53737    53785      +48     
==========================================
+ Hits        35490    35516      +26     
- Misses      15697    15719      +22     
  Partials     2550     2550

Flag	Coverage Δ
unittests	`66.03% <46.51%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

saintube · 2024-12-10T01:55:20Z

@lijunxin559 Thanks for your contribution.

Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio, FreeResourceOnNode) + Unallocated[Mid] * midUnallocatedRatio

I suppose the FreeResourceOnNode equals Unallocated[Mid], so what does the formula mean? If the node has no unallocated Prod resources, the allocatable Mid in the new formula will be zero, which obeys the expectation of the Prod overcommitment.
To avoid the Prod allocatable being overcommitted twice (i.e. overcommit the unused overcommitted Prod requests) in the node resource controller, we need to minimize it with the ProdAllocated and Node capacity. And we'd better revise the new peak predictor in the koordlet, e.g. the PriorityPeakPredictor should minimize the node capacity as the limit of the peak. WDYT?

lijunxin559 · 2024-12-10T02:27:39Z

@lijunxin559 Thanks for your contribution.

Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio, FreeResourceOnNode) + Unallocated[Mid] * midUnallocatedRatio

I suppose the FreeResourceOnNode equals Unallocated[Mid], so what does the formula mean? If the node has no unallocated Prod resources, the allocatable Mid in the new formula will be zero, which obeys the expectation of the Prod overcommitment. To avoid the Prod allocatable being overcommitted twice (i.e. overcommit the unused overcommitted Prod requests) in the node resource controller, we need to minimize it with the ProdAllocated and Node capacity. And we'd better revise the new peak predictor in the koordlet, e.g. the PriorityPeakPredictor should minimize the node capacity as the limit of the peak. WDYT?

Morning, thanks for your reply ~ (^^)

Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid] * midUnallocatedRatio is the original code logic. I think Unallocated[Mid] is intended to express the part of resources that are not allocated to prod. I just added a limit on the actual idleness of a node, which does not conflict with the original formula.

Yes, the predictor improvement you mentioned can also avoid secondary overselling to a certain extent, but I think the logic of dividing the predictor and the calculation of mid resources into two parts may be clearer. The predictor can focus on modeling and predicting the idle resources of the pod and ensure the reliability of the prediction, regardless of whether this part of the allocated resources actually exists or not. And the secondary utilization of this part of the idle resources is separated separately (mid and batch were also done before), so that the logic is decoupled and the subsequent use of the predictor function is more scalable. Right?

saintube · 2024-12-10T05:49:04Z

Morning, thanks for your reply ~ (^^)

Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid] * midUnallocatedRatio is the original code logic. I think Unallocated[Mid] is intended to express the part of resources that are not allocated to prod. I just added a limit on the actual idleness of a node, which does not conflict with the original formula.

Yes, the predictor improvement you mentioned can also avoid secondary overselling to a certain extent, but I think the logic of dividing the predictor and the calculation of mid resources into two parts may be clearer. The predictor can focus on modeling and predicting the idle resources of the pod and ensure the reliability of the prediction, regardless of whether this part of the allocated resources actually exists or not. And the secondary utilization of this part of the idle resources is separated separately (mid and batch were also done before), so that the logic is decoupled and the subsequent use of the predictor function is more scalable. Right?

@lijunxin559 OK. So the FreeResourceOnNode means the NodeCapacity - NodeUsage. It seems reasonable.

pkg/slo-controller/noderesource/plugins/midresource/plugin.go

pkg/slo-controller/noderesource/plugins/util/util.go

Signed-off-by: lijunxin <[email protected]>

saintube

/lgtm

saintube · 2024-12-13T06:27:05Z

PTAL /cc @zwzhang0107 @hormes

lijunxin559 changed the title ~~Fix mid resource calculate formula~~ slo-controller: fix mid resource calculate formula Dec 9, 2024

lijunxin559 force-pushed the fix-midResource-calculate-formula branch 3 times, most recently from c310647 to 51cae0a Compare December 10, 2024 01:50

saintube reviewed Dec 10, 2024

View reviewed changes

pkg/slo-controller/noderesource/plugins/midresource/plugin.go Outdated Show resolved Hide resolved

pkg/slo-controller/noderesource/plugins/midresource/plugin.go Show resolved Hide resolved

lijunxin559 force-pushed the fix-midResource-calculate-formula branch 3 times, most recently from 6746896 to 65514b4 Compare December 10, 2024 09:03

saintube reviewed Dec 10, 2024

View reviewed changes

pkg/slo-controller/noderesource/plugins/util/util.go Outdated Show resolved Hide resolved

lijunxin559 force-pushed the fix-midResource-calculate-formula branch 3 times, most recently from 1dd0e44 to 2924f9f Compare December 10, 2024 12:06

slo-controller: fix midResource calculate formula

ef6b1f9

Signed-off-by: lijunxin <[email protected]>

lijunxin559 force-pushed the fix-midResource-calculate-formula branch from 2924f9f to ef6b1f9 Compare December 10, 2024 12:32

saintube approved these changes Dec 11, 2024

View reviewed changes

saintube added the lgtm label Dec 11, 2024

jasonliu747 approved these changes Dec 18, 2024

View reviewed changes

saintube added the approved label Dec 19, 2024

koordinator-bot bot merged commit 9bb39a2 into koordinator-sh:main Dec 19, 2024
22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slo-controller: fix mid resource calculate formula #2291

slo-controller: fix mid resource calculate formula #2291

lijunxin559 commented Dec 9, 2024

codecov bot commented Dec 9, 2024 •

edited

Loading

saintube commented Dec 10, 2024 •

edited

Loading

lijunxin559 commented Dec 10, 2024 •

edited

Loading

saintube commented Dec 10, 2024

saintube left a comment

saintube commented Dec 13, 2024

slo-controller: fix mid resource calculate formula #2291

slo-controller: fix mid resource calculate formula #2291

Conversation

lijunxin559 commented Dec 9, 2024

Ⅰ. Describe what this PR does

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

codecov bot commented Dec 9, 2024 • edited Loading

Codecov Report

saintube commented Dec 10, 2024 • edited Loading

lijunxin559 commented Dec 10, 2024 • edited Loading

saintube commented Dec 10, 2024

saintube left a comment

Choose a reason for hiding this comment

saintube commented Dec 13, 2024

codecov bot commented Dec 9, 2024 •

edited

Loading

saintube commented Dec 10, 2024 •

edited

Loading

lijunxin559 commented Dec 10, 2024 •

edited

Loading