Inconsistent control plane state #3735
This morning I'm also seeing this.

The replication controller is broken too: you can't scale things down (here I was trying to scale down ownCloud while my cluster is inconsistent).

All nodes are printing a lot of errors in their logs. Does this point to a loss of dqlite quorum?
Hi @djjudas21, could you try removing the offending node (kube07)?
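For reference, a departed or broken node is normally removed with `microk8s remove-node`; the `--force` flag and the follow-up `kubectl delete node` below are illustrative of the usual cleanup rather than the exact command referenced here:

```bash
# On a healthy control plane node: remove the departed node from the cluster
# record; --force skips waiting for the (unreachable) node to acknowledge.
microk8s remove-node kube07 --force

# Also clear the stale Node object from the Kubernetes API if it lingers.
microk8s kubectl delete node kube07
```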
Thanks @ktsakalozos. I already tried this, but I will try again if you think it will help.
Well, I removed kube07 again, and I confirmed that it is no longer listed as part of the cluster.

I am concerned about doing anything more destructive at the moment because I am using OpenEBS cStor volumes, which place 3 volume replicas across my 4 nodes. Therefore it is safe to destroy 2 nodes but not 3 (cStor can recover from 1 replica). I also can't accurately verify or migrate my cStor replicas because of the api-server problems.
I've left it a few hours and the cluster hasn't settled down: the pods that were on the now-absent kube07 still haven't been rescheduled.
I still can't figure out what's going on with this. I don't have control of my cluster and the dqlite master is running at 100% CPU. I am reluctant to do anything really invasive because I have OpenEBS/cStor volumes on my nodes, which obviously depend on the api-server for their own quorum. My gut instinct is to destroy the cluster and recreate it from scratch to restore service, but I had to do this 3 weeks ago too (#3204 (comment)), and it just broke again on its own. I don't have confidence in recent releases of MicroK8s being stable and durable, especially with hyperconverged storage. At the moment I'm not touching anything, but I'm thinking of improving my off-cluster storage first and rebuilding without OpenEBS/cStor, for the security of my data. I appreciate that MicroK8s is free and that people put a lot of hard work into it, but I feel that there are some serious problems that aren't getting enough attention. Several issues about reliability problems potentially related to dqlite remain open.
@djjudas21 a MicroK8s cluster starts with a single node, to which you join other nodes to form a multi-node cluster. As soon as the cluster has three control plane nodes (nodes joined without the `--worker` flag) it becomes highly available.

As we said, the first 3 nodes in an HA cluster replicate the datastore. Any write to the datastore needs to be acked by the majority of those nodes, the quorum. As we scale the control plane beyond 3 nodes, the next two nodes maintain a replica of the datastore and are on standby in case a node from the first three departs. When a node misses some heartbeats it gets replaced, and the rest of the nodes agree on the role each node plays. This role assignment is stored in the datastore's cluster configuration on each node.
The nodes cannot change IP. Even if they have crashed, left the cluster, or are misbehaving, they are still considered part of the cluster, because they may be rebooting, having network connectivity problems, going through maintenance, etc. If we know a node will not be coming back in its previous state, we must explicitly remove it from the cluster.

Of course the above does not explain why the cluster initially started to freeze, but hopefully it gives you some insight into what is happening under the hood. I am not sure what happened around "Feb 06 00:43:43"; I will keep digging in the logs. A way to reproduce the issue would be ideal. One question I have is why I see errors in the dmesg logs on a couple of nodes.
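A quick way to see these role assignments is the dqlite membership file that each control plane node keeps (path per the standard MicroK8s snap layout; the entry below is made up, and Role 0/1/2 correspond to voter/stand-by/spare in dqlite):

```bash
cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
# Example output:
# - Address: 192.168.0.55:19001
#   ID: 1234567890123456789
#   Role: 0    # 0 = voter, 1 = stand-by, 2 = spare
```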
Thanks @ktsakalozos for the detailed explanation, I appreciate it. I had incorrectly assumed that if you have an HA cluster and remove nodes until only 1 control plane node remains, it would resume quorum automatically because it's the only node left. This explains why removing nodes made my problem worse. I understand now about having to explicitly remove nodes that will not be coming back.

I didn't know about the api-server becoming silently read-only in this situation. That certainly explains some of my symptoms. Is it possible to make this failure mode more prominent, so it's obvious when the datastore has lost quorum?

So the underlying problem is the loss of quorum. At Feb 06 00:43:43 I was asleep and not working on the cluster, and the uptime of the nodes shows there was no power outage. I have two theories; the first is that all four nodes auto-refreshed the MicroK8s snap at the same time (more on that below).
I will look at the snap refresh schedule and history on each node.
All 4 of my kube nodes have the same default snap refresh window of 00:00, 4 times during the day.
Snap only keeps a log of successful refresh events for 24 hours so I can't tell if my nodes updated simultaneously, but it seems likely given that they all have the same window. This is my workaround:
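One way to stagger or pause snap refreshes so that all nodes don't update simultaneously (the windows and hold duration below are illustrative, not necessarily the exact workaround used here):

```bash
# Give each node its own refresh window, e.g. on kube05:
sudo snap set system refresh.timer=tue,02:00-03:00
# ...and a different day/window on kube06, kube07, kube08.

# Or hold refreshes of the microk8s snap for a while (snapd 2.58+):
sudo snap refresh --hold=720h microk8s

# Verify the effective schedule and the last/next refresh times:
snap refresh --time
```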
I think it is worth documenting this somewhere, because it is a simple workaround for a serious problem. I don't know much about how Snap works, but it looks like there are several options you can set which could have prevented my failure. Are you able to set any of these options on the MicroK8s package as a Snap package maintainer?
I do not see a snap refresh around the time of the incident. I do not even see a service restart, let alone a refresh. Did you see anything that leads you to believe it is a snap refresh related issue?

Is it possible that at some point any of the nodes ran out of disk space? I see in the inspection reports that the nodes report no disk pressure, but kube07 has a "lastTransitionTime" to the NoDiskPressure state at "2023-02-06T17:36:22Z" and kube06's "lastTransitionTime" to NoDiskPressure is "2023-02-06T04:34:55Z". Is there a way for us to know what the state of the disks was at that point? Is there a workload that could consume too much disk space?
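For reference, the node conditions mentioned here (DiskPressure and its lastTransitionTime) can be checked directly:

```bash
# Status and lastTransitionTime of the DiskPressure condition for every node.
microk8s kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")]}{"\n"}{end}'

# Current disk usage where MicroK8s keeps its data, for comparison.
df -h /var/snap/microk8s/common
```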
It was just a suspicion, based on the fact that snap auto-refreshes overnight. Running out of disk seems unlikely: they all have plenty of free space and I don't think any workloads use ephemeral storage in any real way. I do have monitoring for disk space and that wasn't firing an alert, but I have since lost my Prometheus volume so I can't check historical data.
So I ran through the guide to recover from lost quorum with 2 nodes, ending up with this node listing:

```yaml
- Address: 192.168.0.57:19001
  ID: 3297041220608546238
  Role: 0
- Address: 192.168.0.58:19001
  ID: 796923914728165793
  Role: 2
```

I started up both nodes but there is no change in behaviour.
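For context, the lost-quorum recovery being attempted is roughly the following (a sketch from memory of the MicroK8s restore-quorum guide; verify against the current documentation before running anything):

```bash
# 1. Stop MicroK8s on every node.
microk8s stop

# 2. On the node with the most recent data, edit the dqlite membership so that
#    only the surviving node(s) remain, listed with Role: 0 (voter).
sudo nano /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

# 3. Start that node first, then the remaining nodes.
microk8s start
```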
So I don't know what I did wrong, but it looks like dqlite is still read-only. Maybe it's worth killing the dqlite service and letting it restart?
I tried again to restore quorum.
@djjudas21 could you share a new inspection tarball? I wonder what the logs are saying.
Do you think it would be possible to somehow share the contents of the datastore directory?
Corruption seems plausible. It took 3m 45s to create this inspection report, and it seemed to be stuck on the dqlite step: inspection-report-20230209_152432.tar.gz. I could share the datastore contents if that helps.
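If sharing the datastore is acceptable, the whole dqlite backend directory can be archived like this (it contains the full cluster state, so treat the archive as sensitive; path per the standard snap layout):

```bash
# Stop the services so the files are consistent, then archive the backend dir.
microk8s stop
sudo tar -czf dqlite-backend-$(hostname).tar.gz \
  -C /var/snap/microk8s/current/var/kubernetes backend
microk8s start
```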
@djjudas21 would it be possible to append the datastore files as well?
@ktsakalozos we've actually had a couple of power outages in the area tonight, so I would've lost quorum anyway! 😅 The nodes are all powered back on now and I've added them.
@djjudas21 could you run a quick test? While dqlite is at 100% CPU utilization, check what happens when you stop kubelite.
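A minimal version of that test, assuming the standard snap service names: stop only the kubelite daemon and see whether the k8s-dqlite process is still busy.

```bash
# dqlite CPU usage with kubelite running.
top -b -n 1 -p "$(pgrep -f k8s-dqlite | head -n 1)"

# Stop the Kubernetes components but leave the datastore running.
sudo systemctl stop snap.microk8s.daemon-kubelite

# dqlite CPU usage again, then bring kubelite back.
top -b -n 1 -p "$(pgrep -f k8s-dqlite | head -n 1)"
sudo systemctl start snap.microk8s.daemon-kubelite
```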
Can you set the debug environment variables like described here #3419 (comment) please? Can you then please provide the dqlite logs for a period of high CPU usage (a couple of minutes is fine)? Don't forget to remove the env variables from the file again afterwards, as the output is very verbose.

Edit: it would be interesting to have those logs as well.
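The linked comment amounts to enabling extra dqlite/raft tracing through the service's environment file; a sketch of the steps, assuming the usual snap paths and that `LIBDQLITE_TRACE`/`LIBRAFT_TRACE` are the variables in question:

```bash
# Enable verbose tracing for the datastore service.
echo -e "LIBDQLITE_TRACE=1\nLIBRAFT_TRACE=1" | \
  sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env
sudo systemctl restart snap.microk8s.daemon-k8s-dqlite

# Capture a few minutes of logs while the CPU usage is high.
journalctl -u snap.microk8s.daemon-k8s-dqlite --since "5 min ago" > dqlite-debug.log

# Afterwards, delete the two lines again and restart the service.
```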
Output with kubelite running, and again after stopping kubelite: no change in dqlite CPU usage.
Thanks. I added the env vars and restarted dqlite at 10:07. Then I let it run for a minute or two and grabbed the logs.
@djjudas21 Can you do the same with kubelite stopped?
@ktsakalozos Do you know who/what causes the "TRIGGERED" messages? The logs look to be spammed with them.
The "TRIGGERED" message is caused by the watchers monitoring the respective keys. It comes from https://github.com/canonical/kine/blob/v0.4.1-k8s-dqlite.3/pkg/logstructured/sqllog/sql.go#L438 |
For those watching/maintaining: trying to install deployKF can sometimes trigger this behavior, probably because deployKF makes a LOT of Kubernetes API calls during the first install. Read more in my write-up here: deployKF/deployKF#39 (comment)
Even on a single-node cluster it crashes after reboot. I'm moving to k3s.
Summary
Something has gone wrong with (I think) the api-server or dqlite backend. Cluster state is inconsistent and replication controllers are not working.
I have 4 nodes (`kube05`-`kube08`) and earlier today I noticed that some of my workloads stopped. I saw that `kube07` had some stuck processes and 100% CPU usage on it, so I attempted to drain it and reboot it. It never drained properly; all the pods that were on `kube07` went to `Unknown` state and never got rescheduled.

I did `microk8s leave` on `kube07` but all the `kube07` resources remained on the cluster. I force-deleted all of those pods but they never got rescheduled. Now I have re-added `kube07` but nothing is being scheduled on it, even where there are daemonsets etc.

For example, in the `monitoring` namespace there are 2 daemonsets and both are broken in different ways: there are only 3 `prometheus-smartctl-exporter` pods and only 2 `prometheus-stack-prometheus-node-exporter` pods, but the Desired/Current/Ready status is wrong. Something is obviously seriously wrong with the Kubernetes control plane, but I can't figure out what.
What Should Happen Instead?
Reproduction Steps
I can't consistently reproduce.
Introspection Report
`microk8s inspect` took a long time to run on all nodes, but reported no errors. It was `kube07` that was removed and re-added to the cluster.

kube05-inspection-report-20230206_185244.tar.gz
kube06-inspection-report-20230206_185337.tar.gz
kube07-inspection-report-20230206_185248.tar.gz
kube08-inspection-report-20230206_185353.tar.gz
Can you suggest a fix?
It may be related to #2724, which I reported over a year ago and which was never resolved.
Are you interested in contributing with a fix?
Yes, if I can. I feel like this is a serious and ongoing problem with MicroK8s.