Problem Statement
There are situations where Console users hit the GPU pricing page (on the website) or the providers page (in Console), see that there are enough "available" GPUs of the desired model, proceed to deploy via Console, only to not get any bids for their deployment. This can happen for the following primary reasons:
While there may be enough "available" GPUs in aggregate (across multiple providers), there may not be enough GPUs on a single provider.
While there may be enough GPUs on a single provider, there may not be enough on a single node of the provider to fulfill the GPU count in the user's SDL. This can happen if past small requests (1-2 GPUs per deployment) happened to get scheduled across different nodes of the provider, leaving the provider "fragmented" in terms of available GPUs.
While there may be enough GPUs on a single node to satisfy the GPU count, that node may not have enough other (non-GPU) resources available to satisfy all the requirements outlined in the compute profile. We have sometimes seen this happen when a provider's CPU count gets maxed out (90%) with workloads while GPU usage remains low.
Solution(s)
The solution requires some deeper thought and brainstorming, but here are some initial thoughts and approaches. Note that the ideal solution is one that prevents the issue from occurring in the first place; short of that, an improvement to the current experience is one that informs the user about the (apparent) discrepancy and/or prevents them from requesting bids for workloads that are not likely to get any.
1. Resources Per Node: Provide per-node GPU counts, or at least the maximum available on any single node of the provider, in https://console.akash.network/providers -- this could be a column called "Max requestable per deployment" or similar. Alternatively, it could be a filter on the table that lets the user specify the GPU count they intend to request and shows only the providers that have >= that count available on at least one node.
2. Quick Check before initiating deployment: Implement a "Quick Check" button on the SDL builder page that the user can click, which runs a query to report whether any providers can meet the requested resources, while recommending which resource should be reduced to increase the number of bids received (see the feasibility sketch after this list). The reason for doing this here, rather than at the bid stage, is that the user can still adjust resources at this point; doing so once the deployment is created requires closing the existing deployment and starting a new one.
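3. Better per-node "bin packing": Improve the way pods get scheduled so that, among nodes that can meet the request, those with the least amount of resources available are prioritized over ones with a greater amount of free resources -- essentially a "bin packing" strategy of sorts. This would be much more involved -- TBD whether it can be achieved by customizing the behavior of the default kube-scheduler (https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1/#kubescheduler-config-k8s-io-v1-KubeSchedulerConfiguration) or whether it would require implementing a custom scheduler. Another consideration (besides implementation complexity) is what this would do to the time it takes for the scheduler to find a node (since it may have to go through the entire cluster). One thing to look into is the ScoringStrategy (https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1/#kubescheduler-config-k8s-io-v1-ScoringStrategy); a configuration sketch follows this list.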
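The first two ideas boil down to the same per-node feasibility check: a provider can realistically bid only if at least one of its nodes can satisfy the GPU count and the other resources in the compute profile at the same time. A minimal sketch of that check follows; the interfaces and field names are hypothetical placeholders, not the actual Console or provider APIs.

```typescript
// Hypothetical availability shapes -- field names are illustrative,
// not the actual Console or provider API.
interface NodeAvailability {
  gpus: number;          // free GPUs of the requested model on this node
  cpuMillicores: number;
  memoryBytes: number;
  storageBytes: number;
}

interface ProviderAvailability {
  provider: string;
  nodes: NodeAvailability[];
}

// A request is shaped like a node's availability: what the SDL asks for per service.
type ResourceRequest = NodeAvailability;

// A provider can bid only if a single node fits the whole profile.
function nodeFits(node: NodeAvailability, req: ResourceRequest): boolean {
  return (
    node.gpus >= req.gpus &&
    node.cpuMillicores >= req.cpuMillicores &&
    node.memoryBytes >= req.memoryBytes &&
    node.storageBytes >= req.storageBytes
  );
}

// "Max requestable per deployment" column: the largest GPU count any single node could serve.
function maxRequestableGpus(p: ProviderAvailability): number {
  return p.nodes.reduce((max, n) => Math.max(max, n.gpus), 0);
}

// "Quick Check": which providers could plausibly bid on this request right now?
function quickCheck(providers: ProviderAvailability[], req: ResourceRequest): string[] {
  return providers
    .filter((p) => p.nodes.some((n) => nodeFits(n, req)))
    .map((p) => p.provider);
}
```

Summing availability across nodes (or across providers) is exactly what hides the fragmentation described in the problem statement, which is why the check has to be per node.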
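For approach 3, one concrete thing to experiment with is the kube-scheduler's NodeResourcesFit scoring strategy: the default LeastAllocated spreads pods across nodes, while MostAllocated prefers nodes that are already heavily utilized, which is the bin-packing behavior described above. Below is a rough sketch of such a KubeSchedulerConfiguration, assuming a recent Kubernetes version (v1 config API) and the standard nvidia.com/gpu resource name; the weights are untested assumptions, not a vetted recommendation for Akash providers.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            # MostAllocated scores already-busy nodes higher (bin packing),
            # instead of the default LeastAllocated spreading behavior.
            type: MostAllocated
            resources:
              - name: nvidia.com/gpu  # assumed GPU resource name; adjust to the provider's device plugin
                weight: 5             # assumption: weight GPU packing more heavily than CPU/memory
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

Whether this can simply be applied to a provider's existing kube-scheduler, or whether it needs a separate scheduler profile, is part of what a test cluster experiment would need to answer; the open question about scheduling latency on large clusters still applies.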
@devalpatel67 @andy108369 - we talked about this issue and approach #3 independently. Turns out Andrey had suggested updating the k8s scheduler policy a while back as well (https://github.com/ovrclk/engineering/issues/320) - I'm wondering if we should try this on a test cluster and see how it goes. Specifically: we will be testing out a new provider (VP) for H100s soon (likely next week), so I was thinking that in addition to testing throughput and IO performance we could also try changing this and see whether it optimizes per-node utilization better than our current k8s config does, without causing any other issues.
cc'ing @troian, @boz, and @chainzero for their thoughts as well. This is something that I think will cause our H100 providers to be underutilized if we have a bunch of users deploying SDLs with 1-2 H100s. And in the extreme case, we will have trouble getting bids for a Llama-3.1-405B type deployment (which needs 8x H100s) even if there are enough GPUs available on the provider as a whole but not enough on a single node.
From the end-user perspective the proposed solution No. 2, "Quick Check before initiating deployment", is a great one. It is exactly what I thought would be nice to have when faced with deployments that never seem to get proper bids from available providers. Most of the time asking for too many CPUs is the culprit; other times the storage request size or type is the issue. So having a suggestion on what in particular needs to be reduced to receive more bids would be nice.
Implementation-wise, maybe the feedback optimization could simply start with "more bids -> better"?
Because surely the recommendation system needs a way to know whether its suggestions are actually getting the user closer to accepting a bid from a provider and deploying successfully. Just to be sure, this query and quick check will be off-chain, so no transaction fees for every query cycle? The user would then only be charged once they actually initiate the real deployment.
Would love to see this take shape and eventually become a feature in Console 2.0.