Mega Issue: Karpenter doesn't support custom resource requests/limits #751
Comments
Looks like you're running purely into the CPU resources here. I added the feature label as it looks like you're requesting to be able to add custom resources into the ProvisionerSpec.Limits? |
@njtran, this is the bit:
|
As discussed on Slack:
|
Thanks @ellistarn - the proposed solution looks good. Sorry for asking, but is there any ETA on this? We're unable to use Karpenter because of this. |
I'm having the same issue with vGPU. |
@ellistarn Hope you are doing well! |
This isn't currently being worked on -- we're prioritizing consolidation and test/release infrastructure at the moment. If you're interested in picking up this work, check out https://karpenter.sh/v0.13.1/contributing/ |
For us this is a blocking issue with Karpenter. Our use case is ... As a simpler workaround, @ellistarn @tzneal, why not just ignore resources that Karpenter is unaware of? Instead of having to create a ConfigMap as a whitelist, Karpenter could filter down to well-known resources and act upon those, but ignore any resource it has no idea of. It can't do anything useful about those anyway... Taking this error message:
it looks like Karpenter already has all the information about which resources are "manageable" and which are not? |
Karpenter is negative towards custom device requests it is unaware of, assuming those cannot be scheduled. Fixes #1900. This changes the request handling to be scoped only to resource requests that Karpenter is aware of and actively manages. The reasoning is that Karpenter cannot influence those resource requests anyway; they come into existence through other mechanisms such as the device-plugin manager, which may even be late-bound, and are therefore out of scope.
I'm having the same issue with hugepages |
We also need this, for nitro enclaves. |
We also need this when using the "fuse" device plugin resource; here is what we ran into and how we are currently working around this issue. |
If Karpenter were able to support https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ in its compute scheduling, would that satisfy the different devices listed on this thread? Note: This is only an alpha feature in 1.27 so still early days - but it does look like the "correct" avenue from a Kubernetes perspective |
We also need this feature. Our use case is related to a controller which adds extended resources to nodes immediately after a new node is created. Karpenter will not create a node for pods using such extended resources, because it doesn't understand them. In our case, using node affinity and node selectors together with existing node labels is sufficient to direct Karpenter to pick a good node. The only thing we need is for Karpenter to ignore a list of extended resources when finding the correct instance type. Having said that, I do have a forked workaround, but forked workarounds are not acceptable where I work, for good reason. Having ignorable extended resources wouldn't be new in Kubernetes; they exist also in the scheduler. |
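A minimal sketch of the kind of pod described in the comment above, under assumptions: example.com/custom-device is a hypothetical extended resource published by a node-level controller after the node joins, and the nodeSelector uses an existing well-known label so Karpenter can still pick a suitable instance type, while the extended resource itself is what Karpenter would need to ignore.

apiVersion: v1
kind: Pod
metadata:
  name: extended-resource-consumer   # hypothetical name
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: m5.2xlarge   # existing label Karpenter understands
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
          example.com/custom-device: "1"   # extended resource added by a controller after node creation
        limits:
          example.com/custom-device: "1"   # extended resources must be set equally in requests and limits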
How much appetite is there to simply have an override ConfigMap with per-instance-type resource capacity overrides, used only for Karpenter's scheduling simulation (support for huge pages and possibly other extended resources)? A ConfigMap that lists instance types plus any resource overrides; if a particular resource isn't overridden, take what is provided by the cloud provider. Pin the ConfigMap per NodeClass via a new setting on the NodeClass. This pushes the onus onto users to ensure that their overrides are correct. We won't provide any sophisticated pattern matching, and users can build their own generator for producing this map.

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-instance-type-resource-override-config
  namespace: karpenter
data:
  nodepoolexample.overrides: |
    {
      m5.xlarge: {
        memory: 4Gi,
        hugepages-2Mi: 10Gi,
      }
    }
|
Hopefully users wouldn't need to maintain their own list of acceptable instance types in order to handle the "fuse" use case, as fuse doesn't depend on particular instance types. It's a bit frustrating that the fuse use case is being held up by hugepages. The fuse use case is probably common enough to justify being handled out of the box. |
I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ? |
In its current form DRA does not work with cluster autoscalers. Some future versions of DRA might work with cluster autoscalers, but such a version isn't available yet. The current DRA relies on a node-level entity, namely the resource driver kubelet plugin daemonset, which will not deploy before the node is created. Since cluster autoscalers don't know anything about DRA, they will not create a node for a pending pod that requires DRA resource claims. DRA users are in the same limbo as are the extended resource users. The cluster autoscaler can't know whether the new resources will pop up in the node as a result of some controller or daemonset. Maybe they will, maybe they won't. I'm all for giving the users the possibility to configure the resources for Karpenter in a form of a configmap or CRD or similar. A nice bonus would be if one could also define extended resources which are applied to all instance types, covering in a simple fashion the fuse-case. |
That feels fine also. Let me try to bring this up during working group meeting |
Curious if this could be a configuration on the NodePool; we're already able to add custom Requirements to allow Karpenter to schedule when hard affinities or tolerations are defined. Would having an entry that hints to Karpenter "this node pool will satisfy requests/limits for [custom capacity]" be an option? My use case is smarter-devices/kvm, which can be filtered on a NodePool as metal instances. I could imagine the same for hugepages or similar: we know which instances have these, so we can filter them using custom NodePools. By using weighting we can define these after the main NodePools; in my example, I would have Spot for everything at weight 100, on-demand for everything at weight 90, and then our KVM pool with capacity hints at weight 80. In the meantime, I'm using an overprovisioner marked with a hard affinity for metal instances to ensure these pods can be scheduled; it's a tradeoff with extra cost, but it keeps the ability to use Karpenter exclusively. |
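A rough sketch, under assumptions, of the weighted metal-only NodePool described above, using Karpenter's v1 NodePool API and the AWS provider's karpenter.k8s.aws/instance-size label; the "capacity hint" itself is not an existing field and only appears as a comment.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: kvm-metal   # hypothetical name
spec:
  weight: 80   # evaluated after the spot (100) and on-demand (90) pools mentioned above
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["metal"]   # only bare-metal instances expose smarter-devices/kvm
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default   # assumed EC2NodeClass name
      # hypothetical: a capacity hint such as "smarter-devices/kvm: 110" would go
      # somewhere here if NodePool-level extended-resource hints were ever added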
I wonder if this is something that might be useful to configure both at the NodePool level and at the instance type level. Ultimately, we were leaning away from an InstanceTypeOverride CRD due to the level of effort to configure it, but perhaps with support for both, it provides an escape hatch as well as the ability to define a simple blanket policy. We could choose any/all of the following:
|
/cc |
/assign Bryce-Soghigian |
I'm running into this as well, and I'd very much like a solution like this:

  nodepoolexample.overrides: |
    {
      m5.xlarge: {
        memory: 4Gi,
        hugepages-2Mi: 10Gi,
      }
    }

albeit as a NodePool configuration to specify manual node resources. My reasoning is a bit different. First, the field of accelerators is changing rapidly; e.g., NVIDIA Multi-Instance GPU resources are complex and not stable, and I don't think cloud providers will keep up to date with what NVIDIA's drivers ship. Second, as evidenced by the above, resources can be hierarchical, and Kubernetes may eventually adapt to support complex hierarchies like so: ... Users may wish to manually specify one NodePool which provides 1 |
https://docs.google.com/document/d/1vEdd226PYlGmJqs6gWlC2pTyDKhZE8DyCU2SbNB35wM/edit Looking for user stories from customers on their extended-resources support needs here. Please leave some comments! After we feel confident we have captured all the critical use cases, I will go through and propose some RFCs to solve the various dimensions of these problems. |
@jmickey has captured my comments above in the doc. |
For anyone watching this issue, I have a proof of concept to solve this problem here: #1305 |
Searching for a fix. All I want is for Karpenter to ignore this custom resource. My current workaround is absolutely hideous:

  resources:
    requests:
      cpu: 4000m
      memory: 16Gi
    limits:
      cpu: 11000m
      memory: 24Gi
      xilinx.com/fpga-xilinx_u30_gen3x4_base_2-0: 1
      # ---
      # Karpenter will not provision the node if this custom device is here.
      # To provision: comment this out, wait for the node to launch and the daemon set to provision,
      # then uncomment, sync the kustomization, and kill the old pod.

Does anyone have a better workaround? |
I guess this feature request did not make the v1.0 release. Can someone confirm? |
Hi everyone, we too faced this exact issue. Karpenter consolidation stopped working after third-party software started to add extended resources to our pods and nodes (we run Spark on EKS). This increased costs significantly and we couldn't wait until this PR is merged/released. I would like to propose a temporary solution for those who, just like us, need a quick workaround and, importantly, don't care about the extended resources being part of Karpenter's scheduling simulation when finding replacement nodes. If you are looking for the inverse and want Karpenter to take extended resources into account, this will not work for you. This workaround builds a custom Karpenter controller image. I would like to thank @jonathan-innis for his PR, where the required code changes come from. It is assumed that you can publish the image to your own repository and use it in your Karpenter deployment (in our environment we use ECR and EKS, and Karpenter is deployed with Helm using the official chart). The steps:
At this point, you can upload it to the repository of your choice, update the ... We stopped seeing any ... P.S. |
This seems to be a universal way to go. Probably a bit complex to configure and document the first time, but it extends the capabilities a lot. Are there any plans for an implementation? @ellistarn, to simplify things a little, don't you think it would make sense to be able to use a selector pointing directly to a NodePool name? Maybe I miss some logic in NodeOverlays, but as an example some resources like ... |
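For context, a rough sketch of what a NodeOverlay along the lines of the proposal discussed above could look like; the API group/version and field names (spec.requirements, spec.capacity) are assumptions taken from the RFC and may differ in any released implementation, and matching on the NodePool label is exactly the per-NodePool selection being asked about.

apiVersion: karpenter.sh/v1alpha1   # assumed alpha group/version from the proposal
kind: NodeOverlay
metadata:
  name: fuse-overlay   # hypothetical name
spec:
  # choose which nodes the overlay applies to; matching on the karpenter.sh/nodepool
  # label is the per-NodePool selector suggested in the comment above
  requirements:
    - key: karpenter.sh/nodepool
      operator: In
      values: ["default"]
  # extended resources Karpenter should assume these nodes will eventually expose,
  # e.g. a device-plugin-provided resource it would otherwise refuse to schedule for
  capacity:
    smarter-devices/fuse: "1"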
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale This would be incredibly helpful for us. |
Version
Karpenter: v0.10.1
Kubernetes: v1.20.15
Expected Behavior
Karpenter should be able to trigger an autoscale
Actual Behavior
Karpenter isn't able to trigger an autoscale
Steps to Reproduce the Problem
We're using Karpenter on EKS. We have pods that have a custom resource request/limit in their spec definition: smarter-devices/fuse: 1. Karpenter does not seem to respect this resource; it fails to autoscale and the pod remains in a pending state.
Resource Specs and Logs
Provisioner spec
pod spec
karpenter controller logs:
controller 2022-06-06T15:59:00.499Z ERROR controller no instance type satisfied resources {"cpu":"32","memory":"2Gi","pods":"1","smarter-devices/fuse":"1"} and requirements kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/hostname In [hostname-placeholder-3403], node.kubernetes.io/instance-type In [m5.12xlarge m5.2xlarge m5.4xlarge m5.8xlarge m5.large], karpenter.sh/provisioner-name In [default], topology.kubernetes.io/zone In [eu-west-1a eu-west-1b], kubernetes.io/arch In [amd64];