Disruptive maintenance? Consider transitioning to highmem nodes (core / user) #2511

Closed · consideRatio opened this issue Apr 26, 2023 · 3 comments
Labels: tech:cloud-infra (Optimization of cloud infra to reduce costs etc.)

@consideRatio (Member)

In #2488 we got a new set of default instance types for new clusters.

I think we should opportunistically try to transition to highmem nodes when performing disruptive maintenance such as:

Also, specifically for GKE clusters, I think we should combine disruptive maintenance with an attempt to reduce the number of nodes needed by avoiding calico-typha forcing us to use too many nodes:

What highmem nodes?

With a highmem node, I refer to a node with a 1:8 ratio between CPU cores and GB of memory, and I've suggested a group of machine types I think are sensible for each cloud provider below.
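
For concreteness, here is a minimal sketch of what that ratio means, read as GB of memory per CPU core:

```python
# Minimal sketch: the "1:8" ratio read as GB of memory per CPU core.
def gb_per_cpu(cpus: int, memory_gb: float) -> float:
    return memory_gb / cpus

print(gb_per_cpu(4, 32))  # 8.0, i.e. a 1:8 CPU:memory ratio
```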

GKE

  • n2 is used systematically over n1. It provides a bit better performance per CPU and a 1:8 CPU:memory ratio instead of 1:6.5.
  • n2-highmem-2 as core node pool for basehubs
  • n2-highmem-4 as core node pool for daskhubs
  • n2-highmem-4, -16, and -64 for user node pool for basehub and daskhub
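
As a rough sanity check of the ratio claim above, here is a small sketch with the machine specs as I understand them from the GCP docs:

```python
# Sketch: CPU cores and memory (GB) for the GKE machine types discussed above,
# specs as listed in the GCP docs.
gke_machine_types = {
    "n1-highmem-2": (2, 13),     # 1:6.5, the older generation
    "n2-highmem-2": (2, 16),     # 1:8, suggested basehub core pool
    "n2-highmem-4": (4, 32),     # 1:8, daskhub core pool and smallest user pool
    "n2-highmem-16": (16, 128),  # 1:8, user pool
    "n2-highmem-64": (64, 512),  # 1:8, user pool
}
for name, (cpus, memory_gb) in gke_machine_types.items():
    print(f"{name}: {memory_gb / cpus:g} GB per CPU")
```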

EKS

  • r5 is used systematically over m5. It provides a bit better performance per CPU and a 1:8 CPU:memory ratio instead of 1:4.
  • r5.xlarge as core node pool for basehubs and daskhubs (equivalent to n2-highmem-4), where the smaller option r5.large allows for too few pods per node
  • r5.xlarge, r5.4xlarge, and r5.16xlarge for user node pool for basehub and daskhub (equivalent to n2-highmem-4, -16, and -64)
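
The same kind of sketch for the EKS suggestion, with an m5 size next to the r5 sizes that would replace it (specs as I understand them from the AWS docs):

```python
# Sketch: CPU cores and memory (GB) for the EKS instance types discussed above,
# specs as listed in the AWS docs.
eks_instance_types = {
    "m5.xlarge": (4, 16),     # 1:4, the previous default family
    "r5.xlarge": (4, 32),     # 1:8, core pool and smallest user pool
    "r5.4xlarge": (16, 128),  # 1:8, user pool
    "r5.16xlarge": (64, 512), # 1:8, user pool
}
for name, (cpus, memory_gb) in eks_instance_types.items():
    print(f"{name}: {memory_gb / cpus:g} GB per CPU")
```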

AKS

Details still to be figured out for AKS, but I think Standard_EX_v4 is the equivalent of n2-highmem and r5, and that it is suitable to use as the instance type overall.

  • Standard_EX_v4 is used systematically over Standard_EX_v3. It provides a bit better performance per CPU and keeps closer to the 1:8 CPU:memory ratio for the 64 CPU core nodes: the v3 version only has 432 GB of memory, while the v4 is almost at the 1:8 ratio with 504 GB.
  • Standard_E2_v4 as core node pool for basehubs (equivalent to n2-highmem-2)
  • Standard_E4_v4 as core node pool for daskhubs (equivalent to n2-highmem-4)
  • Standard_E4_v4, Standard_E16_v4, and Standard_E64_v4 for user node pool for basehub and daskhub (equivalent to n2-highmem-4, -16, and -64)
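
And the same sketch for the AKS suggestion, which also illustrates the v3 vs v4 difference at the 64 core size (specs as I understand them from the Azure docs):

```python
# Sketch: CPU cores and memory (GB) for the AKS VM sizes discussed above,
# specs as listed in the Azure docs.
aks_vm_sizes = {
    "Standard_E64_v3": (64, 432),  # ~1:6.75, the v3 shortfall noted above
    "Standard_E2_v4": (2, 16),     # 1:8, suggested basehub core pool
    "Standard_E4_v4": (4, 32),     # 1:8, daskhub core pool and smallest user pool
    "Standard_E16_v4": (16, 128),  # 1:8, user pool
    "Standard_E64_v4": (64, 504),  # ~1:7.9, user pool
}
for name, (cpus, memory_gb) in aks_vm_sizes.items():
    print(f"{name}: {memory_gb / cpus:.2f} GB per CPU")
```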
@damianavila (Contributor)

Looks like a nice idea!!
Since "disruptive maintenance" is a process yet to be defined I think this is currently and generally blocked until we push forward with some version of that process. I can see us taking opportunistic disruptions as windows to land this as well but that should not be a pattern, IMHO.

@consideRatio (Member, Author)

> I can see us taking opportunistic disruptions as windows to land this as well but that should not be a pattern, IMHO.

I'm looking to balance the value of getting to highmem nodes systematically against the cost in terms of our time getting it done and the disruption caused for users. I think this may not merit causing a disruption for users on its own, and perhaps not even our time scheduling a disruption with users and performing it. But it would be worth doing if bundled with some other disruptive maintenance!

I think it's critical that we manage to perform k8s upgrades somewhat regularly, even if it's not very often.

@consideRatio (Member, Author)

This issue resolves itself if we help communities transition to node sharing setups over user-dedicated nodes. That should be tracked at some point, but for now I'm not opening another issue; it's on my radar!
