
Reversible upgrades #1096

Open
kimwnasptd opened this issue Oct 2, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@kimwnasptd
Contributor

kimwnasptd commented Oct 2, 2024

Context

We need to have a story for reversible upgrades. The goal is that if something goes wrong during an upgrade, we can revert to the state before the upgrade.

There are 3 approaches for this:

Canary upgrades

  1. Install the new version in the same cluster, alongside the first
  2. Monitor how the upgraded version behaves
  3. Once everything in the new version is ready, remove the old version

Blue-Green upgrades

  1. Install the new version in a different cluster
  2. Move all the state from previous version to new version
  3. Verify the new version is working as expected
  4. Redirect all traffic to the new cluster and delete the old one

In-Place upgrades

  1. Refresh the components one by one to the new version (how the current upgrade instructions work)
  2. If something goes wrong you'll need to refresh back to an older version

As part of the reversible-upgrades story we'll need a plan for how to approach them.

What needs to get done

  1. Expose the limitations of each approach
  2. Look into how to fill the gaps of each approach

Definition of Done

  1. Fully document our limitations for doing reversible upgrades with each approach
  2. Decide on which approach to follow
  3. Have a list of next steps for achieving reversible upgrades
@kimwnasptd added the enhancement label on Oct 2, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6344.

This message was autogenerated

@kimwnasptd
Contributor Author

kimwnasptd commented Oct 2, 2024

Canary

There are some projects that follow this approach, e.g. Istio. For this strategy we need to ensure that all KF sub-components (Istio, KServe, KFP, etc.) can support canary upgrades. This means the control planes of the different apps will need to:

  1. Allow another instance to be running (i.e. K8s controller, web app etc)
  2. Configure which instance should handle a resource (e.g. the new KServe Controller should handle ISVC B)

Some more mature projects like Istio support canary upgrades, but a lot of Kubeflow components don't provide such a mechanism.
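
For reference, this is roughly what a revision-based canary upgrade looks like in Istio (the revision name and the namespace below are illustrative):

```bash
# Install the new control plane under a revision, alongside the existing one
istioctl install --set revision=canary -y

# Point a namespace at the canary control plane and re-inject its sidecars
kubectl label namespace user-ns istio-injection- istio.io/rev=canary --overwrite
kubectl rollout restart deployment -n user-ns
```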

This lack of support results in the following problematic scenarios, in which we can't have two versions of:

  1. web apps running at the same time, since their VirtualServices will overwrite each other
  2. K8s controllers running at the same time, since their reconciliation loops will conflict with each other

Regarding CRDs: the pattern a lot of K8s controllers follow is webhook conversion, since the K8s API might persist an older version of a CR while a component requests a newer one:
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion
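
As a quick check, we can inspect whether a component's CRD declares webhook conversion and which API versions it has persisted, using KServe's InferenceService CRD as an example:

```bash
# Conversion strategy ("None" or "Webhook") and the versions stored in etcd
kubectl get crd inferenceservices.serving.kserve.io \
  -o jsonpath='{.spec.conversion.strategy}{"\n"}{.status.storedVersions}{"\n"}'
```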

Because of the above, we can't do canary upgrades in Kubeflow.

@kimwnasptd
Contributor Author

Blue / Green

In this case we'll need to install a new KF version in a separate cluster and then move all the state to the new cluster.
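
With Juju this could look roughly like the sketch below (controller name, cloud name, and channel are assumptions):

```bash
# Register the new ("green") cluster with the existing Juju controller
juju add-k8s green-cluster --controller kf-controller

# Deploy the new Kubeflow version into a fresh model on that cluster
juju add-model kubeflow green-cluster
juju deploy kubeflow --channel=1.9/stable --trust
```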

The above method can give us rollback support (reversible upgrades) out of the box, since the original installation is left intact. The main drawbacks to this are:

  1. We need to migrate all state
    1. Need to define what "state" is (control plane data, Profiles, what else?)
    2. Need a huge S3 bucket for holding the state between the clusters
  2. Need to run a second cluster in parallel

The above mainly results in more cost and time to do the upgrade, but is safer in terms of rollback. Going down this approach also means we get a full backup and restore strategy.
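
For the object-store part of the state, a sketch of copying it across clusters with the MinIO client (the aliases, endpoints, and bucket name are assumptions):

```bash
# Point the MinIO client at both clusters' object stores
mc alias set blue  http://blue-minio.example.com:9000  "$ACCESS_KEY" "$SECRET_KEY"
mc alias set green http://green-minio.example.com:9000 "$ACCESS_KEY" "$SECRET_KEY"

# Mirror the KFP artifacts bucket from the old cluster to the new one
mc mirror blue/mlpipeline green/mlpipeline
```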

@kimwnasptd
Contributor Author

In-Place

In this case we will refresh every charm to the new version, which is essentially what our current upgrade instructions do.
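
For example (charm names and channels below are illustrative):

```bash
# Refresh each charm, one by one, to the target channel
juju refresh istio-pilot --channel=1.17/stable
juju refresh kfp-api --channel=2.0/stable
juju refresh katib-controller --channel=0.16/stable
```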

There are some immediate limitations we need to expose:

  1. We can't deploy the charms in a different model (see "Discuss deployment in non-kubeflow namespace" #698)
  2. Even if we could, we need to ensure CRDs are never deleted as this could result in data loss (Profiles being deleted)

The main benefit of this approach is that we don't have to move data across and it's relatively more straightforward. The downside is that we don't have a silver-bullet approach for refreshing back to an older version in case of issues (e.g. charms ending up in blocked state).
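
In practice the best we can do today is attempt a refresh back to the previous channel or revision, which is not guaranteed to work (channel and revision below are illustrative):

```bash
# Attempt to roll a charm back to the channel it was on before the upgrade
juju refresh kfp-api --channel=2.0/stable

# Or pin an exact previous charm revision
juju refresh kfp-api --revision=123
```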

@kimwnasptd
Contributor Author

From the above, the most promising one IMO is the Blue/Green upgrade strategy, as it:

  1. gives us rollback support
  2. pushes us to double down on our backup/restore strategy

To fully implement the above strategy, though, we will need to ensure we can copy over all state from one cluster to the next. By state we consider:

  1. The data in the control plane (MinIO, MySQL for KFP and Katib, MLMD etc)
  2. The user Profiles (cluster-scoped resources)
  3. The contents of the users' namespaces (Notebooks, ISVCs, PVCs and their contents, etc)

For the control plane we already have manual steps for backup and restore:
https://charmed-kubeflow.io/docs/backup
https://charmed-kubeflow.io/docs/restore

The missing piece is to have a story for taking a snapshot of Profile CRs and user namespace objects and contents.
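
A minimal sketch of what that snapshot could look like, assuming each Profile owns a namespace of the same name (the resource list is an assumption and almost certainly incomplete, and PVC contents would need a separate volume-level backup):

```bash
# Dump all Profile CRs (cluster-scoped)
kubectl get profiles.kubeflow.org -o yaml > profiles.yaml

# Dump the workloads in each user's namespace
for ns in $(kubectl get profiles.kubeflow.org -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get notebooks,inferenceservices,pvc -n "$ns" -o yaml > "snapshot-${ns}.yaml"
done
```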
