
Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent #716

Closed
jonathan-innis opened this issue Mar 16, 2023 · 6 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@jonathan-innis
Member

Tell us about your request

We could consider a few options to discover the expected capacity overhead for a given instance type:

  1. We could store the instance type capacity in memory once a version of that type has been launched, and use that stored value as the capacity after the initial launch rather than basing our calculations on a heuristic.
  2. We could launch instance types, check their capacity, and record the difference between the reported capacity and the actual capacity in a generated file shipped with each Karpenter release, so that we always have accurate measurements of the per-instance-type overhead.
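As a rough illustration of option 1, a minimal sketch (the cache below is hypothetical, not Karpenter's actual types or API): once a node of a given type has registered, the kubelet-reported capacity could be remembered and preferred over the heuristic for subsequent launches.

# Hypothetical sketch of option 1 (not Karpenter's actual code): remember the
# kubelet-reported memory capacity the first time an instance type launches,
# and prefer it over the vmMemoryOverheadPercent heuristic afterwards.
from typing import Dict, Optional

class ObservedCapacityCache:
    def __init__(self) -> None:
        # instance type name -> memory capacity in KiB reported by kubelet
        self._observed: Dict[str, int] = {}

    def record(self, instance_type: str, kubelet_capacity_kib: int) -> None:
        # Called once a node of this type has registered with the cluster.
        self._observed[instance_type] = kubelet_capacity_kib

    def memory_capacity_kib(self, instance_type: str, ec2_reported_kib: int,
                            vm_memory_overhead_percent: float) -> int:
        observed: Optional[int] = self._observed.get(instance_type)
        if observed is not None:
            return observed
        # Fall back to the heuristic until a real measurement exists.
        return int(ec2_reported_kib * (1 - vm_memory_overhead_percent))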

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Calculating the difference between the EC2-reported memory capacity and the actual capacity of the instance as reported by kubelet.

Are you currently working around this issue?

We are currently using a heuristic vmMemoryOverheadPercent value, which is tunable by users and passed through karpenter-global-settings.
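For context, the heuristic amounts to shaving a flat percentage off the EC2-reported memory before scheduling decisions are made; roughly (an approximation of the behavior, not the actual implementation):

def estimated_memory_capacity_kib(ec2_reported_kib: int, vm_memory_overhead_percent: float) -> int:
    # Approximation of the current heuristic: discount the EC2-reported memory
    # by a flat, user-tunable percentage regardless of instance type.
    return int(ec2_reported_kib * (1 - vm_memory_overhead_percent))

# e.g. an instance rated at 8 GiB (8388608 KiB) with a 4% overhead setting
print(estimated_memory_capacity_kib(8388608, 0.04))  # 8053063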

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 16, 2023
@jonathan-innis jonathan-innis changed the title Discover Instance Type Capacity Overhead Instead of a Heuristic Discover Instance Type Capacity Memory Overhead Instead of a Heuristic Mar 16, 2023
@jonathan-innis jonathan-innis changed the title Discover Instance Type Capacity Memory Overhead Instead of a Heuristic Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent Mar 16, 2023
@jonathan-innis
Member Author

Linking one of the analyses that were performed: aws/karpenter-provider-aws#3568 (comment)

@stevehipwell

@jonathan-innis do you have a bit more information on why EC2 reports one value and the kubelet reports another?

Would I be right in assuming, based on the current description, that this is a Karpenter-specific issue and not directly related to system-reserved & kube-reserved, since the memory value that matters for those is the one reported by kubelet?

@alex-hunt-materialize

The vmMemoryOverheadPercent should not be a global value, as the amount reserved varies drastically by instance type. This issue is exacerbated by the default kubelet configuration in EKS AMIs being well off from what the node actually needs, causing OOM events for kube/system daemons on some instance types.

The following kubelet configuration works well for us. We put most of the reserved memory into kubeReserved, since we don't currently have separate cgroups for system daemons vs kube daemons, which keeps the calculation simpler.

def calc_mem_reservation_mib(total_memory_mib: int) -> int:
    # Calculation used by GKE as defined in
    # https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#memory_cpu
    # 4G     8G    16G   128G   512G   512G_total
    # 1024 + 819 + 819 + 6881 + 7864 = 17407
    # This seems safer than the AWS rules of 255MiB + 11MiB * MAX_PODS_PER_INSTANCE
    # which empirically doesn't reserve enough for some instance types.
    if total_memory_mib <= 4096:
        return int(total_memory_mib * 0.25)
    elif total_memory_mib <= 8192:
        return calc_mem_reservation_mib(4096) + int((total_memory_mib - 4096) * 0.2)
    elif total_memory_mib <= 16384:
        return calc_mem_reservation_mib(8192) + int((total_memory_mib - 8192) * 0.1)
    elif total_memory_mib <= 131072:
        return calc_mem_reservation_mib(16384) + int((total_memory_mib - 16384) * 0.06)
    else:
        return calc_mem_reservation_mib(131072) + int(
            (total_memory_mib - 131072) * 0.02
        )

An example kubelet configuration, based on the above calculation, that we specify in our provisioner for 512Gi instances:

  kubeletConfiguration:
    clusterDNS:
    - 10.200.32.10
    containerRuntime: containerd
    evictionHard:
      memory.available: 100Mi
      nodefs.available: 10%
      nodefs.inodesFree: 10%
    evictionSoft:
      memory.available: 200Mi
    evictionSoftGracePeriod:
      memory.available: 1m0s
    kubeReserved:
      cpu: 230m
      memory: 17407Mi
    maxPods: 685
    systemReserved:
      memory: 100Mi

Some example capacities and allocatable fractions:

r6i.16xlarge
rated: 536870912Ki (512Gi)
capacity: 519890632Ki
allocatable: 501861064Ki
capacity fraction of rated: 0.968371763825


r5a.16xlarge
rated: 536870912Ki (512Gi)
capacity: 523715884Ki
allocatable: 505686316Ki
capacity fraction of rated: 0.975496850908


r5a.xlarge
rated: 33554432Ki (32Gi)
capacity: 32487924Ki
allocatable: 28550644Ki
capacity fraction of rated: 0.968215584755


r5a.large
rated: 16777216Ki (16Gi)
capacity: 16126960Ki
allocatable: 13196272Ki
capacity fraction of rated: 0.961241722107


m5.4xlarge
rated: 67108864Ki (64Gi)
capacity: 65033788Ki
allocatable: 62394940Ki
capacity fraction of rated: 0.969078958035


c6gn.xlarge
rated: 8388608Ki (8Gi)
capacity: 7956848Ki
allocatable: 5864816Ki
capacity fraction of rated: 0.948530197144

We've been running with a vmMemoryOverheadPercent of 0.04, but this is too small for the c6gn.xlarge while being much larger than needed for the larger instances. We have hit issues when launching large pods where Karpenter refused to spawn new nodes because it thought there wouldn't be enough space, when there actually would have been.
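Working those samples through (a quick check using only the numbers above), the per-type overhead fraction ranges from about 2.5% to 5.2%, so no single percentage can be right for all of them:

# Overhead fraction (1 - capacity/rated) computed from the samples above.
samples = {
    "r6i.16xlarge": (536870912, 519890632),
    "r5a.16xlarge": (536870912, 523715884),
    "r5a.xlarge":   (33554432, 32487924),
    "r5a.large":    (16777216, 16126960),
    "m5.4xlarge":   (67108864, 65033788),
    "c6gn.xlarge":  (8388608, 7956848),
}
for name, (rated_kib, capacity_kib) in samples.items():
    print(f"{name}: {1 - capacity_kib / rated_kib:.4f}")
# ~0.0316, 0.0245, 0.0318, 0.0388, 0.0309, 0.0515 -- a 0.04 setting is too
# small for c6gn.xlarge while larger than needed for r5a.16xlarge.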

We could store the instance type capacity in memory once a version of that type has been launched, and use that stored value as the capacity after the initial launch rather than basing our calculations on a heuristic.

This seems problematic, since it would require launching a node before knowing if it would fit the desired pod. This new node might not fit the desired pod, but may fit some other pod that might otherwise have gone somewhere else. The default K8S scheduler configuration spreads pods onto the least full nodes, which would likely hit this behavior.

We could launch instance types, check their capacity, and record the difference between the reported capacity and the actual capacity in a generated file shipped with each Karpenter release, so that we always have accurate measurements of the per-instance-type overhead.

This would be great!

Alternatively, at least making the value a part of the Provisioner, not a global value, would go a long way toward unblocking users who want to use the full capacity of their instances.

@jonathan-innis
Member Author

Would I be right in assuming, based on the current description, that this is a Karpenter-specific issue and not directly...

Yes, that's correct. This has to do with the .status.capacity value, which is used before any subtraction is done for kube-reserved or system-reserved.
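As an aside, the figures posted above are consistent with the usual allocatable arithmetic (allocatable = capacity - kubeReserved - systemReserved - evictionHard), which is a separate subtraction from the rated-vs-capacity gap this issue is about; a quick check with the r6i.16xlarge numbers and the kubelet configuration shared earlier:

# allocatable = capacity - kubeReserved - systemReserved - evictionHard(memory.available)
capacity_kib = 519890632          # r6i.16xlarge capacity from the data above
kube_reserved_kib = 17407 * 1024  # kubeReserved memory: 17407Mi
system_reserved_kib = 100 * 1024  # systemReserved memory: 100Mi
eviction_hard_kib = 100 * 1024    # evictionHard memory.available: 100Mi
print(capacity_kib - kube_reserved_kib - system_reserved_kib - eviction_hard_kib)
# 501861064 -- matches the allocatable reported for r6i.16xlarge above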

@jonathan-innis
Member Author

This one was erroneously transferred. Because this repo now exists in a new organization, I'm going to re-create and link this original issue over in aws/karpenter. It's unfortunate we lose some of the conversation context, but that's the best we can do right now.

@jonathan-innis
Member Author

Closing this one in favor of aws/karpenter-provider-aws#5161. We can move the majority of the conversation over there if there's anything to add.
