
fix: accurately track allocatable resources for nodes #1

Closed
wants to merge 9 commits

Conversation


@BEvgeniyS (Owner) commented Jul 9, 2024

Fixes aws/karpenter-provider-aws#5161

Description
The current method of estimating allocatable memory, which simply discards a fixed percentage of each instance type's advertised memory via the VM_MEMORY_OVERHEAD_PERCENT global variable, is suboptimal: there is no single value that avoids both overestimating and underestimating allocatable memory.
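For context, the estimate this PR moves away from behaves roughly as follows (a minimal sketch in Go; the function and parameter names are mine, not Karpenter's actual code):

package estimate

// EstimatedAllocatableMemory sketches the percentage-based approach:
// discount the advertised instance memory by a fixed fraction, then
// subtract the configured kube/system/eviction reservations. Because the
// true overhead varies by instance type, any fixed fraction overestimates
// some types and underestimates others.
func EstimatedAllocatableMemory(advertisedBytes int64, overheadPercent float64, reservedBytes int64) int64 {
	usable := int64(float64(advertisedBytes) * (1 - overheadPercent))
	return usable - reservedBytes
}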

Cluster-autoscaler addresses this issue by learning about the true allocatable memory from actual nodes and retaining that information. In this pull request, I'm applying the same concept.

To demonstrate the issue:

  1. Set VM_MEMORY_OVERHEAD_PERCENT to 0
  2. Create a nodepool with a single instance type:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: approaching-allocatable-nodepool-0
spec:
  limits:
    cpu: "18"
    memory: 36Gi
  template:
    metadata:
      labels:
        approaching-allocatable: nodepool-0
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: approaching-allocatable-nodeclass-0
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t4g.medium
      taints:
      - effect: NoExecute
        key: approaching-allocatable
        value: "nodepool-0"
      kubelet:
        systemReserved:
          memory: "1Ki"
        kubeReserved:
          memory: "1Ki"
        evictionHard:
          memory.available: "1Ki"
  3. Create a workload with a request close to the node's allocatable:
apiVersion: v1
kind: Pod
metadata:
  name: approaching-allocatable-pod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: approaching-allocatable
            operator: In
            values:
            - nodepool-0
  containers:
  - image: public.ecr.aws/eks-distro/kubernetes/pause@sha256:c2518f6d82392ba799d551398805aaa7af70548015263d962afe9710c0eaa1b2
    name: trigger-pod
    resources:
      requests:
        cpu: 10m
        memory: 3686Mi
  tolerations:
  - effect: NoExecute
    key: approaching-allocatable
    operator: Equal
    value: nodepool-0
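To spell out the arithmetic behind this reproduction: a t4g.medium advertises 4 GiB (4096 MiB) of memory, so with VM_MEMORY_OVERHEAD_PERCENT=0 and only 3 KiB of total reservations, Karpenter predicts roughly 4096Mi allocatable and the 3686Mi request appears to fit. The real node reports noticeably less, since the hypervisor, kernel, and system daemons consume part of that memory before the kubelet ever sees it, so the pod can never actually schedule on the launched node.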

Observed behaviors

  1. Resolving Resource Overestimation:

    • v0.37.0 behavior: Karpenter continuously creates and consolidates nodes without realizing that the workload can never fit.
    • Patched behavior: Accurately tracks actual allocatable resources, preventing the endless loop of node creation and consolidation.
  2. Addressing Resource Underestimation:

    • v0.37.0 behavior: Karpenter leaves pods pending indefinitely or chooses an instance type larger than necessary, failing to learn actual node allocatables even when nodes are launched for other reasons.
    • Patched behavior: Remembers true allocatable resources if a node is ever launched, enabling correct node launches for previously pending pods.
  3. Avoiding Extra Churn:

    • v0.37.0 behavior: Incorrect predicted allocatable resources during consolidation lead to unnecessary churn.
    • Patched behavior: Scheduling simulations benefit from knowledge of true allocatable resources.

The above improvements are implemented using a shared cache that can be accessed from:

  • lifecycle package: to populate the cache as soon as a node is registered.
  • scheduling package: to use real allocatable resources from the cache, if available, when making itFits decisions.
  • hash package: to flush the cache for a nodepool after an update.

I tried to avoid introducing a global-like package, but placing the cache in any of the above packages (or others) introduces more coupling between those packages. If there is a definitive place for such a cache, please let me know.
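As a rough illustration of the shape of such a cache (a sketch with assumed names and package layout, not the code in this PR):

package allocatablecache

import (
	"strings"
	"sync"

	v1 "k8s.io/api/core/v1"
)

// Cache remembers the allocatable resources observed on real nodes, keyed
// by nodepool and instance type. All names here are illustrative.
type Cache struct {
	mu      sync.RWMutex
	entries map[string]v1.ResourceList
}

func New() *Cache {
	return &Cache{entries: map[string]v1.ResourceList{}}
}

func key(nodePool, instanceType string) string {
	return nodePool + "/" + instanceType
}

// Set would be called from the lifecycle package once a node registers and
// its true allocatable is known.
func (c *Cache) Set(nodePool, instanceType string, allocatable v1.ResourceList) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key(nodePool, instanceType)] = allocatable
}

// Get would be consulted by the scheduling package before falling back to
// the vmMemoryOverheadPercent estimate.
func (c *Cache) Get(nodePool, instanceType string) (v1.ResourceList, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	allocatable, ok := c.entries[key(nodePool, instanceType)]
	return allocatable, ok
}

// Flush would be called from the hash package when a nodepool's spec
// changes, since kubelet configuration affects allocatable.
func (c *Cache) Flush(nodePool string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if strings.HasPrefix(k, nodePool+"/") {
			delete(c.entries, k)
		}
	}
}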

How was this change tested?
For overestimation:
I ran this in one of our preprod EKS clusters with vmMemoryOverheadPercent=0; it correctly stops re-launching nodes for a given nodepool/instance-type combination after the first attempt fails, and it uses the correct allocatable memory for scheduling.

For underestimation:
The test was to:

  1. Set a high VM_MEMORY_OVERHEAD_PERCENT value (e.g. 0.2)
  2. Run a workload that previously fit and observe that it stays pending
  3. Add another workload for the same nodepool, but with a lower request; this launches a real node
  4. Another node then launches for the pod from step 2, and new pods with the same requests now correctly cause new nodes to be launched

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@BEvgeniyS changed the title from "feat: discover allocatable" to "fix: accurately track allocatable resources for nodes" on Jul 13, 2024
@BEvgeniyS closed this on Jul 15, 2024
Development

Successfully merging this pull request may close these issues.

Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent