
unclear description of the service interruptions caused by an in-place k8s upgrade #430

Open
aieri opened this issue Jul 14, 2020 · 3 comments


aieri commented Jul 14, 2020

The upgrade documentation states that

For a running cluster, there are two different ways to proceed:

Blue-green upgrade - This requires more resources, but should ensure a safe, zero-downtime transition of workloads to an updated cluster
In-place upgrade - this simply upgrades the workers in situ, which may involve some service interruption but doesn't require extra resources

This level of detail is not sufficient to decide which strategy is appropriate for a given site.

I think the in-place upgrade description should specify more clearly whether the service interruptions would affect the control plane, the data plane, or both, as well as the order of magnitude of the expected outages.

johnsca (Contributor) commented Jul 15, 2020

What about something like

Blue-green upgrade - This amounts to having a duplicate cluster with all of the same workloads running on the new version, and changing routing to point clients at the new cluster to take the upgrade live, or at the old cluster to revert it. This requires twice the resources of running a single cluster, but also provides the greatest ability to validate the upgrade and the most reliable fall-back should something go wrong. Updating routing to switch between clusters is outside the scope of this document, since it depends on the environment in which the cluster is deployed.

Hybrid in-place / blue-green upgrade - This involves adding one or more new workers on the new version and draining workloads off of the non-upgraded nodes onto them (essentially doing a blue-green upgrade of individual workers). This requires the resources for at least one additional worker, but maintains the same level of cluster utilization while still allowing testing on the new version. Reverting is accomplished by draining the workloads back to a non-upgraded node, but this could potentially lead to downtime if non-HA workloads experience issues while on the upgraded node. Additionally, not all workloads can be drained automatically (see kubectl drain) and may require manual handling.

In-place upgrade - This is similar to the hybrid approach, except that it only uses the existing workers: workloads are drained off one worker, the worker is upgraded, and the workloads are then drained back. It has the same potential for downtime of non-HA workloads and the same draining caveats as the hybrid approach, but it can also leave workloads somewhat unevenly distributed across nodes at the end, and the cluster will be short one node for the duration of the upgrade. The upside is that it requires no additional resources.
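
For illustration, the per-node cycle that both the hybrid and the in-place approach boil down to could look roughly like the sketch below (node and unit names are placeholders, and the upgrade step itself depends on how the cluster was deployed):

    # drain cordons the node and evicts its pods; pods with local storage or
    # without a controller may need manual handling
    kubectl drain worker-node-1 --ignore-daemonsets --delete-local-data  # newer kubectl: --delete-emptydir-data

    # upgrade the worker itself; for Charmed Kubernetes this is typically the
    # worker charm's upgrade action, e.g.:
    juju run-action kubernetes-worker/0 upgrade

    # allow workloads to be scheduled back onto the upgraded node
    kubectl uncordon worker-node-1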

aieri (Author) commented Jul 17, 2020

Thanks, the latest revised wording is a lot clearer!

I'm still unsure whether a "hybrid" paragraph is necessary. If I understand correctly, the hybrid strategy would only make sense if:

  1. you want to do blue-green, but only need to do so for a subset of your pods and can't afford 2x the workers
  2. you want to upgrade as quickly as possible by pausing multiple workers at a time
  3. your k8s cluster is at 100% capacity, but you have the ability to temporarily spawn new workers

If the above is correct I would suggest splitting the hybrid section as follows:

Blue-green upgrade - This amounts to having a duplicate cluster with all of the same workloads running on the new version, and changing routing to point clients at the new cluster to take the upgrade live, or the old cluster to revert it. This requires twice as many resources as running a single cluster but also provides the most ability to validate the upgrade and the most reliable fall-back should something go wrong. Updating routing to switch between clusters is outside the scope of this document, since it depends on the environment in which the cluster is deployed.
Note: if resources are limited, adding only a few extra workers to test the most critical workloads can be a viable compromise.

In-place upgrade - This approach only involves draining workloads off one worker, upgrading it, and then draining the workloads back. As such, it does not require coordination between the cluster and the services it is hosting, but it could potentially lead to downtime if non-HA workloads experience issues while on the upgraded node. Furthermore, not all workloads can be drained automatically (see kubectl drain) and may require manual handling, and moving pods around may leave workloads unevenly distributed across nodes.
One or more temporary workers may optionally be added, either to speed up operations or to keep cluster utilization constant.
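
For example, with a Charmed Kubernetes deployment this could be as simple as the following sketch (the unit number is a placeholder):

    # add a spare worker up front so overall capacity stays constant while nodes are drained
    juju add-unit kubernetes-worker

    # ...perform the drain / upgrade / uncordon cycle on each existing worker...

    # once all original workers are upgraded, drain and remove the temporary unit
    juju remove-unit kubernetes-worker/3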

Cynerva (Contributor) commented Aug 3, 2020

Another user reported similar confusion here: https://bugs.launchpad.net/charm-kubernetes-worker/+bug/1853444

In their case, they were more specifically concerned with downtime of the Ingress component during upgrade:

when following the official upgrade manual, you come to the following point: juju upgrade-charm kubernetes-master

Doing this is apparently enough to have your ingresses start to upgrade. This at least happened for me when going to kubernetes-master:v754.

It could also have been the upgrade of the workers -- kubernetes-worker:v590.

The manual does not reference ingresses (except in an unrelated context) and should point out when ingress downtime can happen, since it's such a crucial data-plane component.
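
Until the docs cover this, one crude way to see whether the data plane is affected is to run a simple availability probe against an ingress-backed endpoint for the duration of the upgrade (the hostname below is a placeholder):

    # log one timestamped HTTP status per second; gaps or non-2xx codes indicate ingress downtime
    while true; do
      printf '%s %s\n' "$(date -Is)" \
        "$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 http://my-app.example.com/)"
      sleep 1
    done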
