Disruptive maintenance? Consider transitioning to highmem nodes (core / user) #2511

Closed · consideRatio opened this issue Apr 26, 2023 · 3 comments
Labels: tech:cloud-infra (Optimization of cloud infra to reduce costs etc.)

@consideRatio (Member)

In #2488 we got a new set of default instance types for new clusters.

I think we should opportunistically try to transition to highmem nodes when performing disruptive maintenance such as:

Also, specifically for GKE clusters, I think we should combine disruptive maintenance with an attempt to reduce the number of nodes needed by avoiding calico-typha forcing us to use too many nodes:

What highmem nodes?

With a highmem node, I refer to a node with a 1:8 ratio between CPU cores and GB of memory, and I've suggested a group of machine types I think are sensible for each cloud provider below.
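
For concreteness, here is a minimal sketch of what that ratio means, read as GB of memory per CPU core:

```python
# Minimal sketch: the "1:8" ratio read as GB of memory per CPU core.
def gb_per_cpu(cpus: int, memory_gb: float) -> float:
    return memory_gb / cpus

print(gb_per_cpu(4, 32))  # 8.0, i.e. a 1:8 CPU:memory ratio
```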

GKE

  • n2 is used systematically over n1. It provides a bit better performance per CPU and a 1:8 CPU:memory ratio instead of 1:6.5.
  • n2-highmem-2 as core node pool for basehubs
  • n2-highmem-4 as core node pool for daskhubs
  • n2-highmem-4, -16, and -64 for user node pool for basehub and daskhub
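
As a rough sanity check of the ratio claim above, here is a small sketch with the machine specs as I understand them from the GCP docs:

```python
# Sketch: CPU cores and memory (GB) for the GKE machine types discussed above,
# specs as listed in the GCP docs.
gke_machine_types = {
    "n1-highmem-2": (2, 13),     # 1:6.5, the older generation
    "n2-highmem-2": (2, 16),     # 1:8, suggested basehub core pool
    "n2-highmem-4": (4, 32),     # 1:8, daskhub core pool and smallest user pool
    "n2-highmem-16": (16, 128),  # 1:8, user pool
    "n2-highmem-64": (64, 512),  # 1:8, user pool
}
for name, (cpus, memory_gb) in gke_machine_types.items():
    print(f"{name}: {memory_gb / cpus:g} GB per CPU")
```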

EKS

  • r5 is used systematically over m5. It provides a bit better performance per CPU and a 1:8 CPU:memory ratio instead of 1:4.
  • r5.xlarge as core node pool for basehubs and daskhubs (equivalent to n2-highmem-4), where the smaller option r5.large allows for too few pods per node
  • r5.xlarge, r5.4xlarge, and r5.16xlarge for user node pool for basehub and daskhub (equivalent to n2-highmem-4, -16, and -64)
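
The same kind of sketch for the EKS suggestion, with an m5 size next to the r5 sizes that would replace it (specs as I understand them from the AWS docs):

```python
# Sketch: CPU cores and memory (GB) for the EKS instance types discussed above,
# specs as listed in the AWS docs.
eks_instance_types = {
    "m5.xlarge": (4, 16),     # 1:4, the previous default family
    "r5.xlarge": (4, 32),     # 1:8, core pool and smallest user pool
    "r5.4xlarge": (16, 128),  # 1:8, user pool
    "r5.16xlarge": (64, 512), # 1:8, user pool
}
for name, (cpus, memory_gb) in eks_instance_types.items():
    print(f"{name}: {memory_gb / cpus:g} GB per CPU")
```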

AKS

Details still to be figured out for AKS, but I think Standard_EX_v4 is the equivalent of n2-highmem and r5, and that it is suitable to use as the instance type overall.

  • Standard_EX_v4 is used systematically over Standard_EX_v3. It provides a bit better performance per CPU and keeps closer to the 1:8 CPU:memory ratio for the 64 CPU core nodes: the v3 version only has 432 GB of memory, while the v4 is almost at the 1:8 ratio with 504 GB.
  • Standard_E2_v4 as core node pool for basehubs (equivalent to n2-highmem-2)
  • Standard_E4_v4 as core node pool for daskhubs (equivalent to n2-highmem-4)
  • Standard_E4_v4, Standard_E16_v4, and Standard_E64_v4 for user node pool for basehub and daskhub (equivalent to n2-highmem-4, -16, and -64)
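
And the same sketch for the AKS suggestion, which also illustrates the v3 vs v4 difference at the 64 core size (specs as I understand them from the Azure docs):

```python
# Sketch: CPU cores and memory (GB) for the AKS VM sizes discussed above,
# specs as listed in the Azure docs.
aks_vm_sizes = {
    "Standard_E64_v3": (64, 432),  # ~1:6.75, the v3 shortfall noted above
    "Standard_E2_v4": (2, 16),     # 1:8, suggested basehub core pool
    "Standard_E4_v4": (4, 32),     # 1:8, daskhub core pool and smallest user pool
    "Standard_E16_v4": (16, 128),  # 1:8, user pool
    "Standard_E64_v4": (64, 504),  # ~1:7.9, user pool
}
for name, (cpus, memory_gb) in aks_vm_sizes.items():
    print(f"{name}: {memory_gb / cpus:.2f} GB per CPU")
```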
@damianavila (Contributor)

Looks like a nice idea!!
Since "disruptive maintenance" is a process yet to be defined I think this is currently and generally blocked until we push forward with some version of that process. I can see us taking opportunistic disruptions as windows to land this as well but that should not be a pattern, IMHO.

@consideRatio (Member, Author)

> I can see us taking opportunistic disruptions as windows to land this as well but that should not be a pattern, IMHO.

I'm looking to balance the value of getting to highmem nodes systematically against the cost in terms of our time getting it done and the disruption caused for users. I think this may not merit causing a disruption for users on its own, and perhaps not even our time scheduling a disruption with users and performing it. But it would be worth doing if bundled with some other disruptive maintenance!

I think it's critical that we manage to perform k8s upgrades somewhat regularly, even if it's not very often.

@consideRatio (Member, Author)

This issue resolves itself if we help communities transition to node sharing setups over user-dedicated nodes. That should be tracked at some point, but for now I'm not opening another issue; it's on my radar!
