In #2488 we got a new set of default instance types for new clusters. I think we should opportunistically try to transition to highmem nodes when performing disruptive maintenance such as:

- Related: Goal: k8s maintenance #2293
- Example: https://github.com/2i2c-org/meta/issues/539
- Related: New default machine types and profile list options - sharing nodes is great! #2121

Also, specifically for GKE clusters, I think we should combine disruptive maintenance with an attempt to reduce the number of nodes needed by avoiding `calico-typha` forcing us to use too many nodes:

- `calico-typha` pods, and `konnectivity-agent` pods #2490
## What highmem nodes?

With a highmem node, I refer to nodes with a 1:8 ratio between CPU cores and GB of memory, and I've suggested a group of machine types I think are sensible for each cloud provider below:

- `n2-highmem-X` types specifically (GKE)
- `r5.X` types specifically (EKS)
- `Standard_EX_v4` types specifically (AKS)
### GKE

`n2` is used systematically over `n1`. They provide a bit better performance per CPU and a 1:8 CPU:memory ratio instead of 1:6.5.

- `n2-highmem-2` as core node pool for basehubs
- `n2-highmem-4` as core node pool for daskhubs
- `n2-highmem-4`, `-16`, and `-64` for the user node pool for basehub and daskhub
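As a sanity check on those ratios, here's a minimal Python sketch. The vCPU and memory figures are hardcoded from GCP's published machine-type tables, so treat them as assumptions to verify against current docs:

```python
# GB of memory per vCPU for the GCP machine types discussed above.
# Specs (vCPUs, memory in GB) copied from GCP's machine-type docs;
# verify against current documentation before relying on them.
GCP_MACHINE_TYPES = {
    "n1-highmem-2": (2, 13),    # the older family, at a 1:6.5 ratio
    "n2-highmem-2": (2, 16),    # proposed core node pool for basehubs
    "n2-highmem-4": (4, 32),    # proposed core node pool for daskhubs
    "n2-highmem-16": (16, 128),
    "n2-highmem-64": (64, 512),
}

def cpu_mem_ratio(machine_type: str) -> float:
    """Return GB of memory per vCPU for a known machine type."""
    cpus, mem_gb = GCP_MACHINE_TYPES[machine_type]
    return mem_gb / cpus

for mt in GCP_MACHINE_TYPES:
    print(f"{mt}: 1:{cpu_mem_ratio(mt):g}")
```

Every `n2-highmem` size sits exactly at the 1:8 target, which keeps memory-per-user predictable across node sizes.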
### EKS

`r5` is used systematically over `m5`. They provide a bit better performance per CPU and a 1:8 CPU:memory ratio instead of 1:4.

- `r5.xlarge` as core node pool for basehubs and daskhubs (equivalent to `n2-highmem-4`), since the smaller option `r5.large` allows for too few pods per node
- `r5.xlarge`, `r5.4xlarge`, and `r5.16xlarge` for the user node pool for basehub and daskhub (equivalent to `n2-highmem-4`, `-16`, and `-64`)
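The "too few pods per node" point comes from the default AWS VPC CNI, which caps schedulable pods by ENI capacity: `max_pods = ENIs × (IPv4s per ENI − 1) + 2`. A small sketch, with ENI limits hardcoded from AWS's per-instance-type tables (treat these numbers as assumptions and verify against current AWS docs):

```python
# Approximate max pods per EKS node with the default AWS VPC CNI:
#   max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# ENI limits below are copied from AWS's per-instance-type tables;
# verify against current documentation before relying on them.
ENI_LIMITS = {
    # instance type: (max ENIs, IPv4 addresses per ENI)
    "r5.large": (3, 10),
    "r5.xlarge": (4, 15),
    "r5.4xlarge": (8, 30),
    "r5.16xlarge": (15, 50),
}

def max_pods(instance_type: str) -> int:
    """Pod cap imposed by the VPC CNI's per-ENI IP addressing."""
    enis, ips_per_eni = ENI_LIMITS[instance_type]
    return enis * (ips_per_eni - 1) + 2

for it in ENI_LIMITS:
    print(f"{it}: {max_pods(it)} pods")
```

At 29 pods, an `r5.large` core node fills up quickly once system daemonsets and hub services are counted, which is why `r5.xlarge` (58 pods) is the floor here.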
### AKS

Details are still to be figured out for AKS, but I think `Standard_EX_v4` is the equivalent of `n2-highmem` and `r5`, and that it is suitable as the instance type overall.

`Standard_EX_v4` is used systematically over `Standard_EX_v3`. They provide a bit better performance per CPU and keep the 1:8 CPU:memory ratio better for the 64-core nodes, where the v3 version has only 432 GB of memory while the v4 version is almost at the 1:8 ratio with 504 GB.

- `Standard_E2_v4` as core node pool for basehubs (equivalent to `n2-highmem-2`)
- `Standard_E4_v4` as core node pool for daskhubs (equivalent to `n2-highmem-4`)
- `Standard_E4_v4`, `Standard_E16_v4`, and `Standard_E64_v4` for the user node pool for basehub and daskhub (equivalent to `n2-highmem-4`, `-16`, and `-64`)
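The v3-vs-v4 memory gap can be made concrete with the same kind of ratio check. The 432 GB and 504 GB figures are the ones quoted above; verify them against Azure's VM-size documentation before relying on them:

```python
# Memory per vCPU for the 64-core Azure E-series sizes, showing why
# v4 tracks the 1:8 target better than v3. Figures as quoted above;
# verify against Azure's VM-size documentation.
AZURE_VM_SIZES = {
    "Standard_E64_v3": (64, 432),
    "Standard_E64_v4": (64, 504),
}

for size, (cpus, mem_gb) in AZURE_VM_SIZES.items():
    print(f"{size}: 1:{mem_gb / cpus:g} ({mem_gb} GB / {cpus} vCPU)")
```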
Looks like a nice idea!!
Since "disruptive maintenance" is a process yet to be defined, I think this is currently and generally blocked until we push forward with some version of that process. I can see us taking opportunistic disruptions as windows to land this as well, but that should not be a pattern, IMHO.
I'm looking to balance the value of getting to highmem nodes systematically against the cost in terms of our time getting it done and the disruption caused for users. I think this may not merit causing a disruption for users, and perhaps not even our time scheduling a disruption with users and performing it. But it would be worth doing if bundled with some other disruptive maintenance!
I think it's critical that we manage to perform k8s upgrades somewhat regularly, even if not very often.
This issue resolves itself if we help communities transition to node-sharing setups over user-dedicated nodes. That is to be tracked sometime at least, but for now I'm not opening another issue; it's on my radar!