Add hacluster integration to the ops version of kubernetes-control-plane. WIP.
This adds the `ha` relation endpoint and two config options: `ha-cluster-vip` and `ha-cluster-dns`. The implementation in this PR is able to register VIPs or DNS records with hacluster. If used, those VIPs/hostnames are then used as the Kubernetes API endpoints in the kubeconfigs consumed by kubelet, kube-proxy, and end users.

However, this work is so far missing any failover mechanism. If kube-apiserver goes down on the unit holding the VIP, that unit will continue to hold the VIP. In the reactive charm, failover was handled by the charm during update-status hooks, where it checked the status of the control-plane systemd services and updated the Pacemaker node status accordingly. See here. This is obviously not ideal, since it can mean failovers take up to 5 minutes with the default Juju configuration, and they could stop occurring entirely if the charm is in a bad state.
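For illustration, here is a minimal sketch of what the VIP registration could look like on the ops side. The relation-data keys (`json_resources`, `json_resource_params`) follow the format the reactive interface-hacluster layer uses; the class, handler, and resource names below are placeholders rather than the actual code in this PR:

```python
# Hypothetical sketch of VIP registration over the "ha" relation.
# Relation-data keys mirror the reactive interface-hacluster wire format;
# names are illustrative, not taken from this PR.
import json

import ops


class KubernetesControlPlaneCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.config_changed, self._configure_hacluster)
        self.framework.observe(self.on["ha"].relation_changed, self._configure_hacluster)

    def _configure_hacluster(self, event):
        relation = self.model.get_relation("ha")
        if not relation:
            return
        vip = self.config.get("ha-cluster-vip", "")
        if not vip:
            return
        # One ocf:heartbeat:IPaddr2 resource per configured VIP.
        resources = {}
        resource_params = {}
        for i, address in enumerate(vip.split()):
            name = f"res_kube_api_vip_{i}"
            resources[name] = "ocf:heartbeat:IPaddr2"
            resource_params[name] = f'params ip="{address}" op monitor interval="10s"'
        data = relation.data[self.unit]
        data["json_resources"] = json.dumps(resources)
        data["json_resource_params"] = json.dumps(resource_params)
```

`ocf:heartbeat:IPaddr2` is the standard resource agent Pacemaker uses for floating VIPs; `ha-cluster-dns` registration would follow the same pattern with DNS resource entries instead.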
Prior to that, we registered the systemd services themselves with hacluster. However, that was removed because of a long history of bugs where Pacemaker took control of the systemd services and failed to run them when it should. See here.
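To make the failover gap concrete, the reactive charm's update-status handling boiled down to roughly the following pattern (a sketch only; the systemd unit name and crmsh invocations are illustrative, not code from this PR):

```python
# Sketch of update-status-driven failover: if the local kube-apiserver is
# unhealthy, put this Pacemaker node in standby so the VIP moves elsewhere;
# bring it back online once the service recovers. Unit name is illustrative.
import subprocess


def service_is_active(unit: str) -> bool:
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def update_pacemaker_node_status() -> None:
    if service_is_active("snap.kube-apiserver.daemon.service"):
        subprocess.run(["crm", "node", "online"], check=True)
    else:
        subprocess.run(["crm", "node", "standby"], check=True)
```

Because this only runs from update-status, failover latency is bounded by the update-status interval, which is where the up-to-5-minute window mentioned above comes from.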
At this point, I see three potential ways to resolve this:
I suspect it may also be worth talking to the OpenStack team to figure out what the trajectory of the hacluster charm is. Will it be getting an ops uplift?