Investigate outcomes of a user pod using all of a node's CPU #2448

Open · 3 tasks

consideRatio (Member) opened this issue Mar 31, 2023 · 5 comments

In #2445 it seems that, in at least utoronto's AKS-based cluster, a single user using all CPUs can cause trouble. This can be mitigated with #2447, which sets limits to half of the node's capacity.
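For context, this kind of mitigation is typically expressed through the jupyterhub helm chart's singleuser resources. A minimal sketch, assuming a 4-core user node and illustrative numbers rather than the exact values from #2447:

```yaml
# Illustrative z2jh-style helm values (numbers are examples, not #2447's):
# cap each user pod at roughly half of a 4-core node's CPU while keeping the
# request (guarantee) low so many users can still be packed onto the node.
singleuser:
  cpu:
    guarantee: 0.05  # what the scheduler reserves per user pod
    limit: 2         # hard ceiling, enforced via the kernel's CFS quota
```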

I'd like to explore this situation in depth to ensure that we understand how to provide a reliable setup without compromising more than needed on performance and/or cost. I find this investigation to be critical as we are relying on shared nodes (#2121) in other clusters besides the educational hubs addressed by #2447.

Here are some preliminary concrete questions I'd like to look into as part of this issue:

  • What CPU requests/limits were granted for user pods in [utoronto] Investigate report of slow/unresponsive hub #2445?
  • What is the minimum CPU capacity needed for a successful startup of a user server with the utoronto-based image?
    • This could possibly be trialed using very low CPU requests matching limits (see the sketch after this list)
  • What requests/limits are granted to system pods like kube-proxy in AKS/EKS/GCP?
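One way the startup trial mentioned above could be set up (a sketch, assuming z2jh's profileList is available on the hub and guessing 0.25 CPU as a starting budget):

```yaml
# Hypothetical trial profile: give the server an equal, very low CPU request
# and limit so startup runs against a known, fixed CPU budget. Adjust 0.25
# up or down until startup reliably succeeds/fails to find the minimum.
singleuser:
  profileList:
    - display_name: "CPU startup trial (0.25 CPU)"
      description: "Equal CPU request and limit, for timing server startup"
      kubespawner_override:
        cpu_guarantee: 0.25
        cpu_limit: 0.25
```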
pnasrat (Contributor) commented Mar 31, 2023

I'd strongly advocate that any investigation into isolation and multi-tenancy include some form of load testing (there may be existing load tests in SIG Node's Kubernetes e2e tests that could be used to validate assumptions across cloud providers).
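For what a reproducible test could look like (my sketch, not an existing e2e test): a plain Kubernetes Job that spins busy loops on a shared user node, rerun against different request/limit configurations:

```yaml
# Sketch of a reproducible CPU-exhaustion workload: the pod spins busy loops
# on several cores while requesting almost no CPU, so it schedules onto a
# shared node and competes with real user pods. Image, duration and loop
# count are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: cpu-exhaustion-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: hog
          image: busybox
          # four busy loops; match the count to the node's core count
          command: ["sh", "-c", "for i in 1 2 3 4; do yes > /dev/null & done; sleep 600"]
          resources:
            requests:
              cpu: 50m  # tiny request so it fits on an already busy node
            # intentionally no CPU limit, mimicking an unthrottled user pod
```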

I'd also like to read more about the tradeoffs of the current node sharing design, i.e. what node utilization we aim for, why we set minimum starting requests, etc.

You mention performance and cost but not reliability, which is also something we're trading off against.

consideRatio (Member, Author) commented:

@pnasrat I've failed to reduce your comment into actionable items to take on within the scope of this issue =/

I'm looking to understand how to provide a reliable setup, ideally independent of CPU requests/limits for user pods, as those compromise performance/cost. I'll probably do some manual testing where a user server hogs CPU, to trial hypotheses formed while investigating this.

Is there something else you wish to see done as part of this issue?

pnasrat (Contributor) commented Mar 31, 2023

Perhaps this is more of a longer-term thing, but the subject implied to me understanding the scenarios of CPU exhaustion, and that's where I'd generally expect a reproducible load test run against various configuration scenarios.

pnasrat (Contributor) commented Mar 31, 2023

In terms of reliability as a tradeoff, I was thinking of e.g. the tradeoffs around bin packing pods onto nodes, and what our utilization/availability/latency targets are, etc.

pnasrat (Contributor) commented Mar 31, 2023

As I mentioned on Slack, these are just suggestions. It's your investigation; I look forward to the results.

Status: Needs Shaping / Refinement