Investigate outcomes of a user pod using all of a node's CPU #2448

Open · 3 tasks

consideRatio (Member) opened this issue Mar 31, 2023 · 5 comments

In #2445 it seems that, in at least utoronto's AKS-based cluster, a single user using all CPUs can cause trouble. This can be mitigated with #2447, which sets limits to half of the node's capacity.
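For context, this kind of mitigation is typically expressed through the jupyterhub helm chart's singleuser resources. A minimal sketch, assuming a 4-core user node and illustrative numbers rather than the exact values from #2447:

```yaml
# Illustrative z2jh-style helm values (numbers are examples, not #2447's):
# cap each user pod at roughly half of a 4-core node's CPU while keeping the
# request (guarantee) low so many users can still be packed onto the node.
singleuser:
  cpu:
    guarantee: 0.05  # what the scheduler reserves per user pod
    limit: 2         # hard ceiling, enforced via the kernel's CFS quota
```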

I'd like to explore this situation in depth to ensure that we understand how to provide a reliable setup without compromising more than needed on performance and/or cost. I find this investigation to be critical as we are relying on shared nodes (#2121) in other clusters besides the educational hubs addressed by #2447.

Here are some preliminary concrete questions I'd like to look into as part of this issue:

  • What CPU requests/limits were granted for user pods in [utoronto] Investigate report of slow/unresponsive hub #2445?
  • What is the minimum CPU capacity needed for a successful startup of a user server with the utoronto-based image?
    • This could possibly be trialed using very low CPU requests matching limits (see the sketch after this list)
  • What requests/limits are granted to system pods like kube-proxy in AKS/EKS/GCP?
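One way the startup trial mentioned above could be set up (a sketch, assuming z2jh's profileList is available on the hub and guessing 0.25 CPU as a starting budget):

```yaml
# Hypothetical trial profile: give the server an equal, very low CPU request
# and limit so startup runs against a known, fixed CPU budget. Adjust 0.25
# up or down until startup reliably succeeds/fails to find the minimum.
singleuser:
  profileList:
    - display_name: "CPU startup trial (0.25 CPU)"
      description: "Equal CPU request and limit, for timing server startup"
      kubespawner_override:
        cpu_guarantee: 0.25
        cpu_limit: 0.25
```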
pnasrat (Contributor) commented Mar 31, 2023

I'd strongly advocate that any investigation into isolation and multi-tenancy include some form of load testing (there may be existing load tests in SIG Node's Kubernetes e2e tests that could be used to validate assumptions across cloud providers).
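For what a reproducible test could look like (my sketch, not an existing e2e test): a plain Kubernetes Job that spins busy loops on a shared user node, rerun against different request/limit configurations:

```yaml
# Sketch of a reproducible CPU-exhaustion workload: the pod spins busy loops
# on several cores while requesting almost no CPU, so it schedules onto a
# shared node and competes with real user pods. Image, duration and loop
# count are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: cpu-exhaustion-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: hog
          image: busybox
          # four busy loops; match the count to the node's core count
          command: ["sh", "-c", "for i in 1 2 3 4; do yes > /dev/null & done; sleep 600"]
          resources:
            requests:
              cpu: 50m  # tiny request so it fits on an already busy node
            # intentionally no CPU limit, mimicking an unthrottled user pod
```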

I'd also like to read more about the tradeoffs of the current node sharing design, i.e. what node utilization we aim for, why we set minimum starting requests, etc.

You mention performance and cost but not reliability, which is also something we're trading off against.

consideRatio (Member, Author) commented:

@pnasrat I've failed to reduce your comment into actionable items to take on within the scope of this issue =/

I'm looking to understand how to provide a reliable setup, ideally independent of CPU requests/limits for user pods, as those compromise performance/cost. I'll probably do some manual testing where a user server hogs CPU, to trial hypotheses formed while investigating this.

Is there something else you wish to see done as part of this issue?

pnasrat (Contributor) commented Mar 31, 2023

Perhaps this is more of a longer-term thing, but the subject implied to me understanding the scenarios of CPU exhaustion, and that's where I'd generally expect a reproducible load test run against various configuration scenarios.

pnasrat (Contributor) commented Mar 31, 2023

In terms of reliability as a tradeoff, I was thinking of e.g. the tradeoffs around bin packing pods onto nodes, and what our utilization/availability/latency targets are, etc.

pnasrat (Contributor) commented Mar 31, 2023

As I mentioned on Slack, these are just suggestions. It's your investigation; I look forward to the results.

Status: Needs Shaping / Refinement