
Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent #716

Closed
jonathan-innis opened this issue Mar 16, 2023 · 6 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@jonathan-innis
Member

Tell us about your request

We could consider a few options to discover the expected capacity overhead for a given instance type:

  1. We could store the instance type capacity in memory once a version of that type has been launched, and use that stored value as the capacity after the initial launch rather than basing our calculations on a heuristic.
  2. We could launch instance types, check their capacity, and record the difference between the reported capacity and the actual capacity in a generated file shipped with each Karpenter release, so that we always have accurate measurements of the per-instance-type overhead.
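As a rough illustration of option 1, a minimal sketch (the cache below is hypothetical, not Karpenter's actual types or API): once a node of a given type has registered, the kubelet-reported capacity could be remembered and preferred over the heuristic for subsequent launches.

# Hypothetical sketch of option 1 (not Karpenter's actual code): remember the
# kubelet-reported memory capacity the first time an instance type launches,
# and prefer it over the vmMemoryOverheadPercent heuristic afterwards.
from typing import Dict, Optional

class ObservedCapacityCache:
    def __init__(self) -> None:
        # instance type name -> memory capacity in KiB reported by kubelet
        self._observed: Dict[str, int] = {}

    def record(self, instance_type: str, kubelet_capacity_kib: int) -> None:
        # Called once a node of this type has registered with the cluster.
        self._observed[instance_type] = kubelet_capacity_kib

    def memory_capacity_kib(self, instance_type: str, ec2_reported_kib: int,
                            vm_memory_overhead_percent: float) -> int:
        observed: Optional[int] = self._observed.get(instance_type)
        if observed is not None:
            return observed
        # Fall back to the heuristic until a real measurement exists.
        return int(ec2_reported_kib * (1 - vm_memory_overhead_percent))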

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Calculating the difference between the EC2-reported memory capacity and the actual capacity of the instance as reported by kubelet.

Are you currently working around this issue?

We are currently using a heuristic vmMemoryOverheadPercent value, which is tunable by users and passed through karpenter-global-settings.
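For context, the heuristic amounts to shaving a flat percentage off the EC2-reported memory before scheduling decisions are made; roughly (an approximation of the behavior, not the actual implementation):

def estimated_memory_capacity_kib(ec2_reported_kib: int, vm_memory_overhead_percent: float) -> int:
    # Approximation of the current heuristic: discount the EC2-reported memory
    # by a flat, user-tunable percentage regardless of instance type.
    return int(ec2_reported_kib * (1 - vm_memory_overhead_percent))

# e.g. an instance rated at 8 GiB (8388608 KiB) with a 4% overhead setting
print(estimated_memory_capacity_kib(8388608, 0.04))  # 8053063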

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 16, 2023
@jonathan-innis jonathan-innis changed the title Discover Instance Type Capacity Overhead Instead of a Heuristic Discover Instance Type Capacity Memory Overhead Instead of a Heuristic Mar 16, 2023
@jonathan-innis jonathan-innis changed the title Discover Instance Type Capacity Memory Overhead Instead of a Heuristic Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent Mar 16, 2023
@jonathan-innis
Member Author

Linking one of the analyses that were performed: aws/karpenter-provider-aws#3568 (comment)

@stevehipwell

@jonathan-innis do you have a bit more information on why EC2 reports one value and the kubelet reports another?

Would I be right in assuming, based on the current description, that this is a Karpenter-specific issue and not directly related to system-reserved & kube-reserved, since the memory value that matters for those is the one reported by kubelet?

@alex-hunt-materialize

The vmMemoryOverheadPercent should not be a global value, as the amount reserved varies drastically by instance type. This issue is exacerbated by the default kubelet configuration in EKS AMIs being well off from what the node actually needs, causing OOM events for kube/system daemons on some instance types.

The following kubelet configuration works well for us. We put most of the reserved memory into kubeReserved, since we don't currently have separate cgroups for system daemons vs kube daemons, which keeps the calculation simpler.

def calc_mem_reservation_mib(total_memory_mib: int) -> int:
    # Calculation used by GKE as defined in
    # https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#memory_cpu
    # 4G     8G    16G   128G   512G   512G_total
    # 1024 + 819 + 819 + 6881 + 7864 = 17407
    # This seems safer than the AWS rules of 255MiB + 11MiB * MAX_PODS_PER_INSTANCE
    # which empirically doesn't reserve enough for some instance types.
    if total_memory_mib <= 4096:
        return int(total_memory_mib * 0.25)
    elif total_memory_mib <= 8192:
        return calc_mem_reservation_mib(4096) + int((total_memory_mib - 4096) * 0.2)
    elif total_memory_mib <= 16384:
        return calc_mem_reservation_mib(8192) + int((total_memory_mib - 8192) * 0.1)
    elif total_memory_mib <= 131072:
        return calc_mem_reservation_mib(16384) + int((total_memory_mib - 16384) * 0.06)
    else:
        return calc_mem_reservation_mib(131072) + int(
            (total_memory_mib - 131072) * 0.02
        )

An example kubelet configuration, based on the above calculation, that we specify in our provisioner for 512Gi instances:

  kubeletConfiguration:
    clusterDNS:
    - 10.200.32.10
    containerRuntime: containerd
    evictionHard:
      memory.available: 100Mi
      nodefs.available: 10%
      nodefs.inodesFree: 10%
    evictionSoft:
      memory.available: 200Mi
    evictionSoftGracePeriod:
      memory.available: 1m0s
    kubeReserved:
      cpu: 230m
      memory: 17407Mi
    maxPods: 685
    systemReserved:
      memory: 100Mi

Some example capacities and allocatable fractions:

r6i.16xlarge
rated: 536870912Ki (512Gi)
capacity: 519890632Ki
allocatable: 501861064Ki
capacity fraction of rated: 0.968371763825


r5a.16xlarge
rated: 536870912Ki (512Gi)
capacity: 523715884Ki
allocatable: 505686316Ki
capacity fraction of rated: 0.975496850908


r5a.xlarge
rated: 33554432Ki (32Gi)
capacity: 32487924Ki
allocatable: 28550644Ki
capacity fraction of rated: 0.968215584755


r5a.large
rated: 16777216Ki (16Gi)
capacity: 16126960Ki
allocatable: 13196272Ki
capacity fraction of rated: 0.961241722107


m5.4xlarge
rated: 67108864Ki (64Gi)
capacity: 65033788Ki
allocatable: 62394940Ki
capacity fraction of rated: 0.969078958035


c6gn.xlarge
rated: 8388608Ki (8Gi)
capacity: 7956848Ki
allocatable: 5864816Ki
capacity fraction of rated: 0.948530197144

We've been running with a vmMemoryOverheadPercent of 0.04, but this is too small for the c6gn.xlarge while being much larger than needed for the larger instances. We have hit issues when launching large pods where Karpenter refused to spawn new nodes because it thought there wouldn't be enough space, when there actually would have been.
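Working those samples through (a quick check using only the numbers above), the per-type overhead fraction ranges from about 2.5% to 5.2%, so no single percentage can be right for all of them:

# Overhead fraction (1 - capacity/rated) computed from the samples above.
samples = {
    "r6i.16xlarge": (536870912, 519890632),
    "r5a.16xlarge": (536870912, 523715884),
    "r5a.xlarge":   (33554432, 32487924),
    "r5a.large":    (16777216, 16126960),
    "m5.4xlarge":   (67108864, 65033788),
    "c6gn.xlarge":  (8388608, 7956848),
}
for name, (rated_kib, capacity_kib) in samples.items():
    print(f"{name}: {1 - capacity_kib / rated_kib:.4f}")
# ~0.0316, 0.0245, 0.0318, 0.0388, 0.0309, 0.0515 -- a 0.04 setting is too
# small for c6gn.xlarge while larger than needed for r5a.16xlarge.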

We could store the instance type capacity in memory once a version of that type has been launched, and use that stored value as the capacity after the initial launch rather than basing our calculations on a heuristic.

This seems problematic, since it would require launching a node before knowing if it would fit the desired pod. This new node might not fit the desired pod, but may fit some other pod that might otherwise have gone somewhere else. The default K8S scheduler configuration spreads pods onto the least full nodes, which would likely hit this behavior.

We could launch instance types, check their capacity, and record the difference between the reported capacity and the actual capacity in a generated file shipped with each Karpenter release, so that we always have accurate measurements of the per-instance-type overhead.

This would be great!

Alternatively, at least making the value a part of the Provisioner, not a global value, would go a long way toward unblocking users who want to use the full capacity of their instances.

@jonathan-innis
Member Author

Would I be right in assuming, based on the current description, that this is a Karpenter-specific issue and not directly...

Yes, that's correct. This has to do with the .status.capacity value, which is used before any subtraction is done for kube-reserved or system-reserved.
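As an aside, the figures posted above are consistent with the usual allocatable arithmetic (allocatable = capacity - kubeReserved - systemReserved - evictionHard), which is a separate subtraction from the rated-vs-capacity gap this issue is about; a quick check with the r6i.16xlarge numbers and the kubelet configuration shared earlier:

# allocatable = capacity - kubeReserved - systemReserved - evictionHard(memory.available)
capacity_kib = 519890632          # r6i.16xlarge capacity from the data above
kube_reserved_kib = 17407 * 1024  # kubeReserved memory: 17407Mi
system_reserved_kib = 100 * 1024  # systemReserved memory: 100Mi
eviction_hard_kib = 100 * 1024    # evictionHard memory.available: 100Mi
print(capacity_kib - kube_reserved_kib - system_reserved_kib - eviction_hard_kib)
# 501861064 -- matches the allocatable reported for r6i.16xlarge above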

@jonathan-innis
Member Author

This one was erroneously transferred. Because this repo now exists in a new organization, I'm going to re-create and link this original issue over in aws/karpenter. It's unfortunate we lose some of the conversation context, but that's the best we can do right now.

@jonathan-innis
Member Author

Closing this one in favor of aws/karpenter-provider-aws#5161. We can move the majority of the conversation over there if there's anything to add.
