
Mega Issue: Karpenter doesn't support custom resource requests/limits #751

Open
prateekkhera opened this issue Jun 6, 2022 · 41 comments · May be fixed by navvis-dev/karpenter#3
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. v1.x Issues prioritized for post-1.0

Comments

@prateekkhera

Version

Karpenter: v0.10.1

Kubernetes: v1.20.15

Expected Behavior

Karpenter should be able to trigger an autoscale

Actual Behavior

Karpenter isn't able to trigger an autoscale

Steps to Reproduce the Problem

We're using Karpenter on EKS. We have pods that have custom resource requests/limits in their spec definition (smarter-devices/fuse: 1). Karpenter does not seem to respect this resource and fails to autoscale, so the pod remains in the Pending state.

Resource Specs and Logs

Provisioner spec

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "100"
  provider:
    launchTemplate: xxxxx
    subnetSelector:
      xxxxx: xxxxx
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m5.large
    - m5.2xlarge
    - m5.4xlarge
    - m5.8xlarge
    - m5.12xlarge
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30
status:
  resources:
    cpu: "32"
    memory: 128830948Ki

pod spec

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fuse-test
  labels:
    app: fuse-test
spec:
  replicas: 1
  selector:
    matchLabels:
      name: fuse-test
  template:
    metadata:
      labels:
        name: fuse-test
    spec:
      containers:
      - name: fuse-test
        image: ubuntu:latest
        ports:
          - containerPort: 8080
            name: web
            protocol: TCP
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN
        resources:
          limits:
            cpu: 32
            memory: 4Gi
            smarter-devices/fuse: 1  # Custom resource
          requests:
            cpu: 32
            memory: 2Gi
            smarter-devices/fuse: 1  # Custom resource
        env:
        - name: S3_BUCKET
          value: test-s3
        - name: S3_REGION
          value: eu-west-1

karpenter controller logs:

controller 2022-06-06T15:59:00.499Z ERROR controller no instance type satisfied resources {"cpu":"32","memory":"2Gi","pods":"1","smarter-devices/fuse":"1"} and requirements kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/hostname In [hostname-placeholder-3403], node.kubernetes.io/instance-type In [m5.12xlarge m5.2xlarge m5.4xlarge m5.8xlarge m5.large], karpenter.sh/provisioner-name In [default], topology.kubernetes.io/zone In [eu-west-1a eu-west-1b], kubernetes.io/arch In [amd64];

@prateekkhera prateekkhera added the kind/bug Categorizes issue or PR as related to a bug. label Jun 6, 2022
@njtran njtran added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 6, 2022
@njtran
Contributor

njtran commented Jun 6, 2022

Looks like you're running purely into the CPU resources here. I added the feature label as it looks like you're requesting to be able to add custom resources into the ProvisionerSpec.Limits?

@ellistarn
Contributor

@njtran , this is the bit:

smarter-devices/fuse: 1 # Custom resource

@ellistarn
Contributor

As discussed on slack:

@Todd Neal and I were recently discussing a mechanism to allow users to define extended resources that karpenter isn't aware of. Right now, we are aware of the extended resources on specific EC2 instance types, which is how we binpack them. One option would be to enable users to define a configmap of [{instancetype, provisioner, extendedresource}] that karpenter could use for binpacking.
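
For illustration, such a configmap might look roughly like the sketch below. The name, namespace, and schema are all hypothetical; Karpenter does not read any such configmap today.

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-extended-resources   # hypothetical name
  namespace: karpenter
data:
  # Each entry maps {instance type, provisioner} to the extended resources
  # Karpenter should assume the node will expose once its device plugins start.
  extended-resources.yaml: |
    - instanceType: m5.large
      provisioner: default
      extendedResources:
        smarter-devices/fuse: "5k"
    - instanceType: m5.2xlarge
      provisioner: default
      extendedResources:
        smarter-devices/fuse: "5k"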

@prateekkhera
Author

Thanks @ellistarn - the proposed solution looks good. Sorry for asking, but is there any ETA on this, as we're unable to use Karpenter because of it?

@CodeBooster97

I'm having the same issue with vGPU.

@suket22 suket22 removed the kind/bug Categorizes issue or PR as related to a bug. label Jun 9, 2022
@parmeet-kumar

@ellistarn Hope you are doing well!
I encountered the same issue while working with Karpenter, so I wanted to know whether this has been implemented via any existing PR?

@ellistarn
Contributor

This isn't currently being worked on -- we're prioritizing consolidation and test/release infrastructure at the moment. If you're interested in picking up this work, check out https://karpenter.sh/v0.13.1/contributing/

@universam1

universam1 commented Jul 19, 2022

For us this is a blocking issue with Karpenter. Our use case is fuse and snd devices that are created as custom device resources by the smarter-device-manager.

As a simpler workaround @ellistarn @tzneal, why not just ignore resources that Karpenter is unaware of? Instead of having to create a configMap as a whitelist, Karpenter could just filter down to well-known resources and act upon those, but ignore other resources it has no idea of. It can't do anything good about those anyway...

Taking this error message:

Failed to provision new node, incompatible with provisioner "default", no instance type satisfied resources {....smarter-devices/fuse":"2"} ...

it looks like Karpenter already has all the information it needs to distinguish "manageable" resources from those it cannot manage?
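
A minimal sketch of what "scope binpacking to well-known resources" could look like; this is not Karpenter's actual code, and the allow-list below is illustrative only (the real set Karpenter manages comes from the cloud provider's instance types).

package scheduling

import (
	corev1 "k8s.io/api/core/v1"
)

// wellKnownResources is an illustrative allow-list of resources the
// provisioner actively manages.
var wellKnownResources = map[corev1.ResourceName]struct{}{
	corev1.ResourceCPU:              {},
	corev1.ResourceMemory:           {},
	corev1.ResourceEphemeralStorage: {},
	corev1.ResourcePods:             {},
	"nvidia.com/gpu":                {},
}

// filterUnknown drops resource names Karpenter has no knowledge of, so that
// binpacking only considers requests it can actually influence.
func filterUnknown(requests corev1.ResourceList) corev1.ResourceList {
	out := corev1.ResourceList{}
	for name, quantity := range requests {
		if _, ok := wellKnownResources[name]; ok {
			out[name] = quantity
		}
	}
	return out
}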

universam1 referenced this issue in o11n/karpenter Jul 21, 2022
Karpenter is negative towards custom device requests it is unaware of, assuming those cannot be scheduled.

fixes #1900

This changes the request handling to be scoped only to resource requests that Karpenter is aware of and actively manages. The reasoning here is that it cannot influence those resource requests anyway; they come into existence by means of other concepts, such as the device-plugin manager, which might even be late-bound and is thus out of scope.
@zeppelinen zeppelinen linked a pull request Dec 14, 2022 that will close this issue
@ghost

ghost commented Feb 9, 2023

I'm having the same issue with hugepages


@james-callahan

We also need this, for nitro enclaves.

@jonathan-innis jonathan-innis added the v1 Issues requiring resolution by the v1 milestone label Apr 18, 2023
@jonathan-innis jonathan-innis changed the title Karpenter doesnt support custom resources requests/limit Mega Issue: Karpenter doesnt support custom resources requests/limit May 1, 2023
@lzjqsdd

lzjqsdd commented May 4, 2023

We also need this when using the "fuse" device plugin resource; here is what we hit and how we are currently working around this issue:
#308

@bryantbiggs
Member

bryantbiggs commented May 12, 2023

If Karpenter were able to support https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ in its compute scheduling, would that satisfy the different devices listed on this thread?

Note: This is only an alpha feature in 1.27 so still early days - but it does look like the "correct" avenue from a Kubernetes perspective

@uniemimu

uniemimu commented Mar 4, 2024

We also need this feature. Our use case is related to a controller which adds extended resources to nodes immediately when a new node is created. Karpenter will not create a node for pods using such extended resources, because it doesn't understand them.

In our case, using node affinity and node selectors together with existing node labels is sufficient to direct Karpenter to pick a good node. The only thing we need is for Karpenter to ignore a list of extended resources when finding the correct instance type. Having said that, I do have a forked workaround, but forked workarounds are not acceptable where I work, for good reason.

Having ignorable extended resources wouldn't be new in Kubernetes. They also exist in the scheduler.
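
For reference, the kube-scheduler precedent mentioned here is the NodeResourcesFit plugin's ignoredResources argument, which can be set in a scheduler configuration roughly like this (the resource name shown is just an example). Note that this only affects the kube-scheduler's fit filtering, not Karpenter's provisioning simulation.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      # Extended resources the fit filter should not check against node capacity.
      ignoredResources:
      - smarter-devices/fuse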

@garvinp-stripe
Contributor

garvinp-stripe commented Mar 13, 2024

How much appetite is there to simply have an override config map with per-instance-type overrides on resource capacity, just for Karpenter's scheduling simulation (support for huge pages and possibly other extended resources)?
https://github.com/aws/karpenter-provider-aws/blob/main/pkg/providers/instancetype/types.go#L179

The config map would list instance types plus any resource overrides; if a particular resource isn't overridden, take what is provided by the cloud provider.

Pin the configmap per NodeClass via a new NodeClass setting, instanceTypeResourceOverride. Note that changes to the configmap won't be reflected on current nodes; we would use drift to reconcile the changes.

This pushes the onus onto users to ensure that their overrides are correct. We won't provide any sophisticated pattern matching, and users can build their own generator for producing this map.

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-instance-type-resource-override-config
  namespace: karpenter
data:
  nodepoolexample.overrides: |
    {
      "m5.xlarge": {
        "memory": "4Gi",
        "hugepages-2Mi": "10Gi"
      }
    }

@johngmyers

Hopefully users wouldn't need to maintain their own list of acceptable instance types in order to handle the "fuse" use case, as fuse doesn't depend on particular instance types.

It's a bit frustrating that the fuse use case is being held up by hugepages. The fuse use case is probably common enough to justify being handled out of the box.

@GnatorX

GnatorX commented Mar 14, 2024

I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ?

@uniemimu

I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ?

In its current form DRA does not work with cluster autoscalers. Some future versions of DRA might work with cluster autoscalers, but such a version isn't available yet.

The current DRA relies on a node-level entity, namely the resource driver's kubelet plugin daemonset, which will not deploy before the node is created. Since cluster autoscalers don't know anything about DRA, they will not create a node for a pending pod that requires DRA resource claims. DRA users are in the same limbo as extended resource users: the cluster autoscaler can't know whether the new resources will pop up on the node as a result of some controller or daemonset. Maybe they will, maybe they won't.

I'm all for giving users the possibility to configure the resources for Karpenter in the form of a configmap, CRD, or similar. A nice bonus would be if one could also define extended resources which are applied to all instance types, covering the fuse case in a simple fashion.

@GnatorX

GnatorX commented Mar 14, 2024

A nice bonus would be if one could also define extended resources which are applied to all instance types, covering in a simple fashion the fuse-case.

That feels fine also. Let me try to bring this up during the working group meeting.

@Bourne-ID

Curious if this could be a configuration on the NodePool; we're able to add custom requirements to allow Karpenter to schedule when hard affinities or tolerations are defined. Would an entry that defines capacity hints, telling Karpenter "this node pool will satisfy requests/limits for [custom capacity]", be an option?

My use case is smarter-devices/kvm, which can be filtered to metal instances on a NodePool. I could imagine the same for hugepages or similar: we know which instances have these, so we can filter them using custom NodePools.

By using weighting we can define these after the main NodePools; in my example, I would have Spot for everything at weight 100, on-demand for everything at weight 90, and then our KVM pool with capacity hints at weight 80.

In the meantime, I'm using an overprovisioner pinned with a hard affinity to metal instances to ensure these pods can be scheduled; it's a tradeoff of extra cost for the ability to use Karpenter exclusively.
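
A rough sketch of that overprovisioning workaround, assuming a dedicated low-priority class and the AWS provider's karpenter.k8s.aws/instance-size label for metal instances (the label key/value and resource requests are assumptions to adapt to your cluster):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: Placeholder pods that hold metal capacity and are preempted by real workloads.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metal-overprovisioner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metal-overprovisioner
  template:
    metadata:
      labels:
        app: metal-overprovisioner
    spec:
      priorityClassName: overprovisioning
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.k8s.aws/instance-size   # assumed AWS-provider label
                operator: In
                values: ["metal"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"
            memory: 1Gi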

@ellistarn
Contributor

I wonder if this is something that might be useful to configure both at the node pool level and at the instance type level. Ultimately, we were leaning away from an InstanceTypeOverride CRD due to the level of effort to configure it, but perhaps with support for both, it provides an escape hatch as well as the ability to define a simple blanket policy.

We could choose any/all of the following (option 2 is sketched below):

  1. Cloudprovider automatically knows extended resource values (e.g. GPU)
  2. NodePool (or class) lets you specify a flat resource value per node pool
  3. NodePool (or class) lets you specify scalar resource values (e.g. hugePageMemoryPercent)
  4. InstanceType CRD (or config map) lets you define per-instance-type resource overrides.
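
Purely as a strawman for option 2, a per-NodePool flat value might read like the sketch below; the extendedResources field shown here does not exist in Karpenter today and is named only for illustration.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fuse-pool
spec:
  template:
    spec:
      extendedResources:             # hypothetical field, not part of the current API
        smarter-devices/fuse: "5k"   # assumed to be advertised on every node this pool creates
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]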

@fmuyassarov
Member

/cc

@Bryce-Soghigian
Member

/assign Bryce-Soghigian

@AaronFriel

I'm running into this as well, and I'd very much like a solution like this:

  nodepoolexample.overrides: |
    {
      "m5.xlarge": {
        "memory": "4Gi",
        "hugepages-2Mi": "10Gi"
      }
    }

Albeit as a NodePool configuration to specify manual node resources. My reasoning is a bit different from the fuse use case, but I think it explains why it would be important for NodePool to have this capability.

First, the field of accelerators is changing rapidly; e.g. NVIDIA Multi-Instance GPU resources are complex and not stable. I don't think cloud providers will keep up to date with what NVIDIA's drivers ship.

Second, as evidenced by the above, resources can be hierarchical, and Kubernetes may eventually adapt to support complex hierarchies like so:

[Image: NVIDIA Multi-Instance GPU partitioning profiles]

Users may wish to manually specify one NodePool which provides one 7g.40gb slice per A100, and another NodePool for smaller models that packs more densely with seven 1g.5gb slices per A100. Allowing manual overrides makes it possible to use cloud resources more efficiently, as the current accelerator resource labels are too coarse.
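
For concreteness, with the NVIDIA device plugin's mixed MIG strategy the slices surface as distinct extended resources that pods request directly; a sketch (the image name is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: mig-small-model
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/pytorch:24.01-py3   # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice; a whole-GPU pool would request nvidia.com/mig-7g.40gb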

@Bryce-Soghigian
Member

I'm looking for user stories from customers on their extended resources needs here: https://docs.google.com/document/d/1vEdd226PYlGmJqs6gWlC2pTyDKhZE8DyCU2SbNB35wM/edit. Please leave some comments there!

After we feel confident we have captured all the critical usecases, I will go through and propose some RFCs to solve the various dimensions of these problems.

@AaronFriel

@jmickey has captured my comments above in the doc.

@ellistarn
Contributor

For anyone watching this issue, I have a proof of concept to solve this problem here: #1305

@daverin

daverin commented Jul 18, 2024

Searching for a fix.

All I want is for Karpenter to ignore this custom resource.

My current workaround is absolutely hideous:

         resources:
            requests:
              cpu: 4000m
              memory: 16Gi
            limits:
              cpu: 11000m
              memory: 24Gi
              xilinx.com/fpga-xilinx_u30_gen3x4_base_2-0: 1
              # ---
              # Karpenter will not provision a node while this custom device is present.
              # To provision: comment it out, wait for the node to launch and the daemonset to register the device,
              # then uncomment, sync the kustomization, and kill the old pod.

Does anyone have a better workaround?

@poussa

poussa commented Sep 5, 2024

I guess this feature request did not make the v1.0 release. Can someone confirm?

@sergeykranga

Hi everyone,

We too faced this exact issue. Karpenter consolidation stopped working after a third party software started to add extended resources to our pods and nodes (we run Spark on EKS). This increased costs significantly and we couldn't wait until this PR is merged/released.

I would like to propose a temporary solution to those who, just like us, need a quick workaround and, importantly, don't care about the extended resources being part of Karpenter's scheduling simulation to find replacement nodes. If you are looking for the inverse and want Karpenter to take extended resources into account, this will not work for you.

This workaround builds a custom Karpenter controller image. I would like to thank @jonathan-innis for his PR, where the required code changes come from.

It is assumed that you can publish the image to your own repository and use it in your Karpenter deployment (in our environment we use ECR and EKS, Karpenter is deployed with Helm using the official chart).

The steps:

  1. Fork the https://github.com/kubernetes-sigs/karpenter repo and clone it
  2. To ensure Karpenter version consistency, select the commit of the version you are interested in from the Releases page (in our case we use 0.37.0, so the commit is 38b4c32). Check out this commit locally; you'll end up in a detached HEAD state
  3. Create a branch for the fix (i.e. ignore_extended_resources)
  4. Make the same changes as shown in the original pull request. Make sure to change smarter-devices to your own extended resource name that you would like to ignore
  5. Commit and push the branch to your fork
  6. Clone the https://github.com/aws/karpenter-provider-aws repo (no need to fork)
  7. Repeat step 2 for this repo too, to ensure the provider and Karpenter versions match
  8. Install ko and go
  9. Add this line to the end of go.mod file (change the branch name if it is different):
replace sigs.k8s.io/karpenter => github.com/<your-org-or-account-name>/karpenter ignore_extended_resources
  10. Run go mod tidy from the root of the repo; it should download the replacement dependency from your fork:
$ go mod tidy
go: downloading github.com/<your-org-or-account-name>/karpenter v0.0.0-20241003171933-4834e30c1e45
go: downloading k8s.io/cloud-provider v0.30.1
go: downloading k8s.io/component-base v0.30.1
go: downloading k8s.io/csi-translation-lib v0.30.1
go: downloading github.com/stretchr/testify v1.9.0
go: downloading go.uber.org/goleak v1.3.0
go: downloading gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c
go: downloading go.uber.org/automaxprocs v1.5.3
go: downloading github.com/jmespath/go-jmespath/internal/testify v1.5.1
go: downloading github.com/blang/semver/v4 v4.0.0
go: downloading golang.org/x/mod v0.17.0
go: downloading github.com/kr/pretty v0.3.1
go: downloading github.com/spf13/cobra v1.8.0
go: downloading github.com/pmezard/go-difflib v1.0.0
go: downloading github.com/golang/glog v1.1.0
go: downloading github.com/prashantv/gostub v1.1.0
go: downloading github.com/inconshreveable/mousetrap v1.1.0
go: downloading github.com/rogpeppe/go-internal v1.10.0
go: downloading github.com/kr/text v0.2.0
  11. We didn't want the build process to publish the image anywhere yet, so we modified the Makefile to only build it locally. If you want the same behavior, modify the Makefile like this:
-       $(eval CONTROLLER_IMG=$(shell $(WITH_GOFLAGS) KOCACHE=$(KOCACHE) KO_DOCKER_REPO="$(KO_DOCKER_REPO)" ko build --bare github.com/aws/karpenter-provider-aws/cmd/controller))
+       $(eval CONTROLLER_IMG=$(shell $(WITH_GOFLAGS) ko build --bare -L github.com/aws/karpenter-provider-aws/cmd/controller))
  12. Ensure Docker is running on your machine
  13. Build the image:
$ make image
2024/10/03 20:24:38 Using base public.ecr.aws/eks-distro-build-tooling/eks-distro-minimal-base@sha256:5c9f8d3da61beb5ed4d60344767c7efaf24e76f6e1355a86697b507b5488a4d6 for github.com/aws/karpenter-provider-aws/cmd/controller
2024/10/03 20:24:39 Building github.com/aws/karpenter-provider-aws/cmd/controller for linux/amd64
2024/10/03 20:24:44 Loading ko.local:63c03b831c126d0b62d9df69a4d0cfbda8a57092bf7f2db08db8719ec05f20fa
2024/10/03 20:24:46 Loaded ko.local:63c03b831c126d0b62d9df69a4d0cfbda8a57092bf7f2db08db8719ec05f20fa
2024/10/03 20:24:46 Adding tag latest
2024/10/03 20:24:46 Added tag latest
make: Nothing to be done for 'image'.
  14. The resulting ko.local image is the Karpenter controller image.

At this point, you can upload it to the repository of your choice, update the controller.image.repository, controller.image.tag and controller.image.digest Helm chart variables and deploy it.
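
For example, the deploy step might look like this, assuming an ECR repository and the official OCI chart (substitute your own account, region, chart version, tag, and digest):

$ helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
    --namespace karpenter \
    --version 0.37.0 \
    --set controller.image.repository=123456789012.dkr.ecr.eu-west-1.amazonaws.com/karpenter-controller \
    --set controller.image.tag=v0.37.0-custom \
    --set controller.image.digest=sha256:<your-image-digest>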

We instantly stopped seeing any "could not schedule pod" errors from Karpenter, and consolidation started working properly. Thanks again @jonathan-innis!

P.S.
If you run YuniKorn in a cloud environment, don't forget to update the node sorting policy to binpacking to achieve better node utilization and reduce costs!
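
For context, a minimal sketch of that YuniKorn setting, assuming the standard queues.yaml partition configuration:

partitions:
  - name: default
    # Pack pods onto fewer nodes so Karpenter can consolidate the rest.
    nodesortpolicy:
      type: binpacking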

@gillg

gillg commented Oct 22, 2024

For anyone watching this issue, I have a proof of concept to solve this problem here: #1305

This seems like a universal way to go. Probably a bit complex to configure and document the first time, but it extends the capabilities a lot. Are there any concrete plans for an implementation?

@ellistarn, to simplify things a little, don't you think it would make sense to be able to use a selector pointing directly to a NodePool name? Maybe I'm missing some logic in NodeOverlays, but as an example, some resources like xilinx.com/fpga-xilinx_u30_gen3x4_base_2-0: 1 require special hardware exposed by a device plugin (the question could be continued in the PR directly).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2025
@mikeage

mikeage commented Jan 22, 2025

/remove-lifecycle stale

This would be incredibly helpful for us.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2025