
fix: accurately track allocatable resources for nodes #1

Closed
wants to merge 9 commits

Conversation


@BEvgeniyS (Owner) commented Jul 9, 2024

Fixes aws/karpenter-provider-aws#5161

Description
The current method of estimating allocatable memory, which simply discards a fixed percentage of each instance type's advertised memory via the VM_MEMORY_OVERHEAD_PERCENT global variable, is suboptimal: there is no single value that avoids both overestimating and underestimating allocatable memory.
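For context, the estimate this PR moves away from behaves roughly as follows (a minimal sketch in Go; the function and parameter names are mine, not Karpenter's actual code):

package estimate

// EstimatedAllocatableMemory sketches the percentage-based approach:
// discount the advertised instance memory by a fixed fraction, then
// subtract the configured kube/system/eviction reservations. Because the
// true overhead varies by instance type, any fixed fraction overestimates
// some types and underestimates others.
func EstimatedAllocatableMemory(advertisedBytes int64, overheadPercent float64, reservedBytes int64) int64 {
	usable := int64(float64(advertisedBytes) * (1 - overheadPercent))
	return usable - reservedBytes
}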

Cluster-autoscaler addresses this issue by learning about the true allocatable memory from actual nodes and retaining that information. In this pull request, I'm applying the same concept.

To demonstrate the issue:

  1. Set VM_MEMORY_OVERHEAD_PERCENT to 0
  2. Create a nodepool with a single instance type:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: approaching-allocatable-nodepool-0
spec:
  limits:
    cpu: "18"
    memory: 36Gi
  template:
    metadata:
      labels:
        approaching-allocatable: nodepool-0
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: approaching-allocatable-nodeclass-0
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t4g.medium
      taints:
      - effect: NoExecute
        key: approaching-allocatable
        value: "nodepool-0"
      kubelet:
        systemReserved:
          memory: "1Ki"
        kubeReserved:
          memory: "1Ki"
        evictionHard:
          memory.available: "1Ki"
  3. Create a workload with a request close to the node's allocatable:
apiVersion: v1
kind: Pod
metadata:
  name: approaching-allocatable-pod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: approaching-allocatable
            operator: In
            values:
            - nodepool-0
  containers:
  - image: public.ecr.aws/eks-distro/kubernetes/pause@sha256:c2518f6d82392ba799d551398805aaa7af70548015263d962afe9710c0eaa1b2
    name: trigger-pod
    resources:
      requests:
        cpu: 10m
        memory: 3686Mi
  tolerations:
  - effect: NoExecute
    key: approaching-allocatable
    operator: Equal
    value: nodepool-0
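To spell out the arithmetic behind this reproduction: a t4g.medium advertises 4 GiB (4096 MiB) of memory, so with VM_MEMORY_OVERHEAD_PERCENT=0 and only 3 KiB of total reservations, Karpenter predicts roughly 4096Mi allocatable and the 3686Mi request appears to fit. The real node reports noticeably less, since the hypervisor, kernel, and system daemons consume part of that memory before the kubelet ever sees it, so the pod can never actually schedule on the launched node.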

Observed behaviors

  1. Resolving Resource Overestimation:

    • v0.37.0 behavior: Karpenter continuously creates and consolidates nodes without realizing that the workload can never fit.
    • Patched behavior: Accurately tracks actual allocatable resources, preventing the endless loop of node creation and consolidation.
  2. Addressing Resource Underestimation:

    • v0.37.0 behavior: Karpenter leaves pods pending indefinitely or chooses an instance type larger than necessary, failing to learn actual node allocatables even when nodes are launched for other reasons.
    • Patched behavior: Remembers true allocatable resources if a node is ever launched, enabling correct node launches for previously pending pods.
  3. Avoiding Extra Churn:

    • v0.37.0 behavior: Incorrect predicted allocatable resources during consolidation lead to unnecessary churn.
    • Patched behavior: Scheduling simulations benefit from knowledge of true allocatable resources.

The above improvements are implemented using a shared cache that can be accessed from:

  • lifecycle package: to populate the cache as soon as a node is registered.
  • scheduling package: to use real allocatable resources from the cache, if available, when making itFits decisions.
  • hash package: to flush the cache for a nodepool after an update.

I tried to avoid introducing a global-like package, but placing the cache in any of the above packages (or others) introduces more coupling between those packages. If there is a definitive place for such a cache, please let me know.
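As a rough illustration of the shape of such a cache (a sketch with assumed names and package layout, not the code in this PR):

package allocatablecache

import (
	"strings"
	"sync"

	v1 "k8s.io/api/core/v1"
)

// Cache remembers the allocatable resources observed on real nodes, keyed
// by nodepool and instance type. All names here are illustrative.
type Cache struct {
	mu      sync.RWMutex
	entries map[string]v1.ResourceList
}

func New() *Cache {
	return &Cache{entries: map[string]v1.ResourceList{}}
}

func key(nodePool, instanceType string) string {
	return nodePool + "/" + instanceType
}

// Set would be called from the lifecycle package once a node registers and
// its true allocatable is known.
func (c *Cache) Set(nodePool, instanceType string, allocatable v1.ResourceList) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key(nodePool, instanceType)] = allocatable
}

// Get would be consulted by the scheduling package before falling back to
// the vmMemoryOverheadPercent estimate.
func (c *Cache) Get(nodePool, instanceType string) (v1.ResourceList, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	allocatable, ok := c.entries[key(nodePool, instanceType)]
	return allocatable, ok
}

// Flush would be called from the hash package when a nodepool's spec
// changes, since kubelet configuration affects allocatable.
func (c *Cache) Flush(nodePool string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if strings.HasPrefix(k, nodePool+"/") {
			delete(c.entries, k)
		}
	}
}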

How was this change tested?
For overestimation:
I ran this in one of our preprod EKS clusters with vmMemoryOverheadPercent=0; it correctly stops re-launching nodes for a given nodepool/instance-type combination after the first attempt fails, and it uses the correct allocatable memory for scheduling.

For underestimation:
The test was to:

  1. Set a high VM_MEMORY_OVERHEAD_PERCENT value (e.g. 0.2)
  2. Run a workload that previously fit and observe that it stays pending
  3. Add another workload for the same nodepool, but with a lower request; this launches a real node
  4. Another node then launches for the pod from step 2, and new pods with the same requests now correctly cause new nodes to be launched

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@BEvgeniyS changed the title from "feat: discover allocatable" to "fix: accurately track allocatable resources for nodes" on Jul 13, 2024
@BEvgeniyS closed this on Jul 15, 2024
Development

Successfully merging this pull request may close these issues.

Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent