fix: accurately track allocatable resources for nodes #1
Fixes aws/karpenter-provider-aws#5161
Description
The current method of estimating allocatable memory, which simply discards a fixed percentage of usable memory via the `VM_MEMORY_OVERHEAD_PERCENT` global setting, is suboptimal: no single value avoids both over- and underestimating allocatable memory. Cluster-autoscaler addresses this issue by learning the true allocatable memory from actual nodes and retaining that information. In this pull request, I apply the same concept.
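For context, the static approach being replaced can be sketched as follows. This is a minimal illustration, not Karpenter's actual code; the function name is hypothetical, and it assumes the overhead percentage is applied uniformly to the instance type's advertised memory:

```go
package main

import "fmt"

// estimateAllocatableMiB applies the static VM_MEMORY_OVERHEAD_PERCENT
// discount: a fixed fraction of advertised memory is assumed lost to the
// hypervisor/OS, regardless of the instance type's real overhead.
func estimateAllocatableMiB(capacityMiB, overheadPercent float64) float64 {
	return capacityMiB * (1 - overheadPercent)
}

func main() {
	capacity := 16384.0 // advertised memory of a hypothetical instance type, in MiB

	// With any fixed overheadPercent, the estimate may overshoot the node's
	// real allocatable (scheduling fails, node is relaunched) or undershoot
	// it (memory is wasted); no single value avoids both.
	fmt.Println(estimateAllocatableMiB(capacity, 0.075)) // static guess
	fmt.Println(estimateAllocatableMiB(capacity, 0))     // upper bound: full capacity
}
```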
To demonstrate the issue, set `VM_MEMORY_OVERHEAD_PERCENT` to 0 and observe the resulting behaviors.
This change covers:
- Resolving Resource Overestimation
- Addressing Resource Underestimation
- Avoiding Extra Churn
The above improvements are implemented using a shared cache that can be accessed from the relevant packages; for example, `itFits` reads allocatable decisions from the cache, if available. I tried to avoid introducing a global-like package, but placing the cache in any of the above packages (or others) introduces more coupling between those packages. If there is a definitive place for such a cache, please let me know.
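A minimal sketch of such a shared cache, assuming it is keyed by instance type name and populated from nodes that have actually registered. The type and method names here are illustrative, not the PR's actual identifiers, and the real cache may key on more dimensions (e.g. nodepool, zone, or AMI):

```go
package main

import (
	"fmt"
	"sync"
)

// AllocatableCache remembers the real allocatable memory observed on
// registered nodes, keyed by instance type. Hypothetical sketch only.
type AllocatableCache struct {
	mu   sync.RWMutex
	data map[string]int64 // instance type -> allocatable memory in bytes
}

func NewAllocatableCache() *AllocatableCache {
	return &AllocatableCache{data: map[string]int64{}}
}

// Observe records the allocatable memory reported by a real node's kubelet.
func (c *AllocatableCache) Observe(instanceType string, allocatableBytes int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[instanceType] = allocatableBytes
}

// Allocatable returns the learned value if a node of this type has been
// seen; otherwise it falls back to the static estimate, mirroring how a
// fit check like itFits could prefer cached values.
func (c *AllocatableCache) Allocatable(instanceType string, staticEstimate int64) int64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if v, ok := c.data[instanceType]; ok {
		return v
	}
	return staticEstimate
}

func main() {
	cache := NewAllocatableCache()
	static := int64(15_155_200_000) // static discount-based guess, in bytes

	fmt.Println(cache.Allocatable("m5.xlarge", static)) // no node seen yet: static guess
	cache.Observe("m5.xlarge", 15_000_000_000)          // real node registered
	fmt.Println(cache.Allocatable("m5.xlarge", static)) // learned value wins
}
```

The `RWMutex` keeps the cache safe for concurrent readers (scheduling) and writers (node registration) without serializing lookups.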
How was this change tested?
For overestimation: I ran this in one of our preprod EKS clusters with `vmMemoryOverheadPercent=0`, and it correctly stops re-launching nodes of a given nodepool-instancetype combination after the first attempt fails. It also uses the correct allocatable memory for scheduling.

For underestimation: the test was to use a high `VM_MEMORY_OVERHEAD_PERCENT` value (like 0.2).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.