Consolidation does not happen even when there is cheaper combination of instances available #1962

Open
codeeong opened this issue Feb 5, 2025 · 4 comments
Labels
  • consolidation
  • kind/bug: Categorizes issue or PR as related to a bug.
  • performance: Issues relating to performance (memory usage, cpu usage, timing)
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments


codeeong commented Feb 5, 2025

Description

Observed Behavior:
For context, we wanted to leave the cost-effectiveness decision to Karpenter, so we allowed a variety of instance types (c5a, c6a, m6a, m5a, c7a, r6a, r5a, r4, in large/xlarge/2xlarge sizes), thinking the different CPU:memory combinations would let Karpenter make the best usage-to-cost decisions on our behalf.

However, on the nodes Karpenter chose for us, memory utilization is good at around 90%, while CPU utilization is very low (around 50%).
For example, we have many c5a.xlarge instances (in the same AZ) using less than 50% CPU. Two of these could be consolidated into a single, cheaper m6a.xlarge, which has the same CPU and double the memory of one c5a.xlarge. But the event on the node says:
Normal Unconsolidatable 4m36s (x47 over 15h) karpenter Can't replace with a cheaper node

Our CPU usage by instance type looks like this:
[screenshot: CPU usage per instance type]

This ends up being more expensive than our original pre-provisioned node pool, which ran at around 60-65% utilization for both CPU and memory.
To alleviate the issue, we have removed certain instance types from the list in our NodePool configuration. However, we are curious whether this is the expected behavior, because if so, users still have to work out which specific subset of instance types fits their cluster's resource needs before they can rely on Karpenter to minimise costs.

Expected Behavior:
We expect to see multi-node consolidation, which is defined as:

Multi Node Consolidation - Try to delete two or more nodes in parallel, possibly launching a single replacement whose price is lower than that of all nodes being removed

For instance, we would expect to see 2 c5a.xlarge instances consolidated into 1 m6a.xlarge, as the CPU and memory would fit on that instance and it would cost less (see the sketch below).
[screenshot]
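To make the arithmetic concrete, here is a minimal Go sketch of the fit-and-price check we would expect such a replacement to pass. This is illustrative only, not Karpenter's code: the vCPU/memory figures are the published EC2 specs for these types, while the hourly prices are placeholder values and should be swapped for real ap-southeast-1 on-demand pricing.

```go
package main

import "fmt"

// instanceType holds the capacity and an hourly price for an EC2 instance type.
// The vCPU/memory values are the published EC2 specs; the prices are
// placeholders, NOT real ap-southeast-1 on-demand rates.
type instanceType struct {
	name       string
	vcpu       float64 // vCPUs
	memoryGiB  float64 // GiB
	pricePerHr float64 // USD per hour (placeholder)
}

func main() {
	c5aXL := instanceType{name: "c5a.xlarge", vcpu: 4, memoryGiB: 8, pricePerHr: 0.15}  // placeholder price
	m6aXL := instanceType{name: "m6a.xlarge", vcpu: 4, memoryGiB: 16, pricePerHr: 0.17} // placeholder price

	// Assume the pods on each c5a.xlarge request roughly 50% of its CPU and 90% of
	// its memory, matching the utilization described above (system-reserved and
	// DaemonSet overhead are ignored for simplicity).
	combinedCPU := 2 * (0.5 * c5aXL.vcpu)
	combinedMem := 2 * (0.9 * c5aXL.memoryGiB)

	fits := combinedCPU <= m6aXL.vcpu && combinedMem <= m6aXL.memoryGiB
	cheaper := m6aXL.pricePerHr < 2*c5aXL.pricePerHr

	fmt.Printf("combined requests: %.1f vCPU, %.1f GiB\n", combinedCPU, combinedMem)
	fmt.Printf("fits on one %s: %t\n", m6aXL.name, fits)
	fmt.Printf("one %s cheaper than two %s: %t\n", m6aXL.name, c5aXL.name, cheaper)
}
```

Under these assumptions both checks pass: the combined pod requests fit on a single m6a.xlarge, and that replacement costs less than the two c5a.xlarge nodes it would remove, which is the consolidation we expected to see.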

Reproduction Steps (Please include YAML):
nodepool config:

  "object": {
      "apiVersion": "karpenter.sh/v1",
      "kind": "NodePool",
      "metadata": {
        "annotations": {
          "karpenter.sh/nodepool-hash": "10589712261218411145",
          "karpenter.sh/nodepool-hash-version": "v3"
        },
        "creationTimestamp": null,
        "deletionGracePeriodSeconds": null,
        "deletionTimestamp": null,
        "finalizers": null,
        "generateName": null,
        "generation": null,
        "labels": null,
        "managedFields": null,
        "name": "node-pool-1",
        "namespace": null,
        "ownerReferences": null,
        "resourceVersion": null,
        "selfLink": null,
        "uid": null
      },
      "spec": {
        "disruption": {
          "budgets": [
            {
              "duration": null,
              "nodes": "5%",
              "reasons": null,
              "schedule": null
            }
          ],
          "consolidateAfter": "30s",
          "consolidationPolicy": "WhenEmptyOrUnderutilized"
        },
        "limits": {
          "cpu": "140",
          "memory": "1000Gi"
        },
        "template": {
          "metadata": {
            "annotations": null,
            "labels": null
          },
          "spec": {
            "expireAfter": "Never",
            "nodeClassRef": {
              "group": "karpenter.k8s.aws",
              "kind": "EC2NodeClass",
              "name": "node-pool-1"
            },
            "requirements": [
              {
                "key": "node.kubernetes.io/instance-type",
                "minValues": null,
                "operator": "In",
                "values": [
                  "c5a.xlarge",
                  "c5a.2xlarge",
                  "c6a.xlarge",
                  "c6a.2xlarge",
                  "c7a.xlarge",
                  "c7a.2xlarge",
                  "m5a.xlarge",
                  "m5a.2xlarge",
                  "m6a.xlarge",
                  "m6a.2xlarge",
                  "r4.xlarge",
                  "r4.2xlarge",
                  "r5a.xlarge",
                  "r5a.2xlarge",
                  "r6a.xlarge",
                  "r6a.2xlarge"
                ]
              },
              {
                "key": "karpenter.sh/capacity-type",
                "minValues": null,
                "operator": "NotIn",
                "values": [
                  "spot"
                ]
              },
              {
                "key": "eks.amazonaws.com/capacityType",
                "minValues": null,
                "operator": "In",
                "values": [
                  "ON_DEMAND"
                ]
              },
              {
                "key": "topology.kubernetes.io/zone",
                "minValues": null,
                "operator": "In",
                "values": [
                  "ap-southeast-1a",
                  "ap-southeast-1b",
                  "ap-southeast-1c"
                ]
              }
            ],
            "startupTaints": null,
            "taints": null,
            "terminationGracePeriod": null
          }
        },
        "weight": null
      }
    },
    "timeouts": [],
    "wait": [],
    "wait_for": null
  }

Versions:

  • Chart Version: v1.1.1
  • Kubernetes Version (kubectl version): v1.30
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
codeeong added the kind/bug label on Feb 5, 2025
k8s-ci-robot added the needs-triage label on Feb 5, 2025
jonathan-innis (Member) commented:

I'm imagining that this has to do with multi-node consolidation not being able to find the combination of two instances that could be consolidated. In general, it's tough for us to try all the combinations, though we could probably improve the overall heuristic that we use to consider which nodes we could combine.
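To illustrate the selection problem being described, below is a simplified Go sketch of a prefix-based candidate-selection heuristic (assumed purely for illustration; it is not the actual Karpenter implementation): candidates are ordered by a disruption-cost metric and only prefixes of that ordering are handed to the scheduling simulation, so a particular pair of nodes that would consolidate well together may never be evaluated as a pair.

```go
package consolidation

import "sort"

// candidate is a node eligible for consolidation.
type candidate struct {
	name           string
	disruptionCost float64 // lower means cheaper to disrupt
}

// simulateFunc stands in for the scheduling simulation: it reports whether
// all pods on the given candidates fit on at most one cheaper replacement.
type simulateFunc func(cands []candidate) bool

// firstNConsolidation sketches a prefix-based heuristic: sort candidates by
// disruption cost, then binary-search for the largest prefix that the
// simulation says can be consolidated in a single action. Node pairs that
// never appear together inside a simulated prefix are never tried together.
func firstNConsolidation(cands []candidate, simulate simulateFunc) []candidate {
	sort.Slice(cands, func(i, j int) bool {
		return cands[i].disruptionCost < cands[j].disruptionCost
	})
	lo, hi, best := 1, len(cands), 0
	for lo <= hi {
		mid := (lo + hi) / 2
		if simulate(cands[:mid]) {
			best = mid // this prefix works; try a bigger one
			lo = mid + 1
		} else {
			hi = mid - 1
		}
	}
	return cands[:best]
}
```

If candidate selection works roughly like this, the two half-empty c5a.xlarge nodes are only ever simulated together when they both land in the same prefix, which would explain why the cheaper m6a.xlarge replacement is never proposed.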

jonathan-innis (Member) commented:

This is effectively an issue about getting a better heuristic for multi-node consolidation selection before we actually perform the scheduling simulation.
cc: @rschalo

jonathan-innis added the consolidation and performance labels on Feb 6, 2025
jonathan-innis (Member) commented:

/triage accepted

k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Feb 6, 2025
jonathan-innis (Member) commented:

/priority important-longterm

k8s-ci-robot added the priority/important-longterm label on Feb 13, 2025