Investigate outcomes of a user pod using all of a node's CPU #2448
Comments
I'd strongly advocate that any investigation into isolation and multitenancy include some form of load testing (there may be existing tests in SIG Node in the Kubernetes e2e suite to validate assumptions across cloud providers). I'd also like to read more about the tradeoffs of the current node sharing design - i.e. what node utilization we aim for, why minimum starting requests, etc. You mention performance and cost but not reliability, which is also something we're trading off against.
@pnasrat I've failed to reduce your comment into actionable items to take in scope of this issue =/ I'm looking to understand how to provide a reliable setup, ideally independent of CPU requests/limits for user pods, as that compromises performance/cost. I'll probably do some manual testing where a user server is hogging CPU to trial hypotheses formed while investigating this. Is there something else you wish to see done as part of this issue?
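A minimal sketch of such a manual test - the image (`polinux/stress`), pod name, and CPU count are assumptions for illustration, not taken from the issue; the pod is deliberately left without CPU requests/limits to reproduce the unconstrained scenario under investigation:

```yaml
# Hypothetical pod for manual testing: a "user server" hogging all CPUs.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-hog
spec:
  restartPolicy: Never
  containers:
    - name: stress
      image: polinux/stress
      command: ["stress", "--cpu", "8", "--timeout", "300s"]
      # intentionally no resources.requests/limits, so the container can
      # consume all of the node's CPU
```

While it runs, `kubectl top node` and `kubectl top pods` on other pods sharing the node should show the effect.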
perhaps this is more a longer-term thing - but the subject implied to me understanding the scenarios of CPU exhaustion, and that's where I'd generally expect a reproducible load test running against various configuration scenarios
In terms of reliability as a tradeoff, I was thinking around e.g. the tradeoffs of bin packing pods onto nodes and what sort of utilization/availability/latency targets there are, etc.
As I mentioned on Slack these are just suggestions. It's your investigation - I look forward to the results
In #2445 it seems that, at least in utoronto's AKS-based cluster, a single user using all CPUs can cause trouble. This can be mitigated with #2447, which sets limits to half of the node's capacity.
I'd like to explore this situation in depth to ensure that we understand how to provide a reliable setup without compromising more than needed on performance and/or cost. I find this investigation to be critical as we are relying on shared nodes (#2121) in other clusters besides the educational hubs addressed by #2447.
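The #2447-style mitigation could look roughly like this in a user container spec - the values below are illustrative for a 4-CPU node and are not copied from the actual PR:

```yaml
# Sketch of capping a user container at half the node's CPU capacity.
# Values are assumptions (a 4-CPU node), not taken from #2447.
resources:
  requests:
    cpu: "50m"   # small starting request, per the minimum-request discussion above
  limits:
    cpu: "2"     # half of a 4-CPU node
```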
Here are some preliminary concrete questions I'd like to look into as part of this issue:
- kube-proxy
- in AKS/EKS/GCP?