Failed to update lock in leaderelection - microk8s 1.26.0 #3675
Comments
Sounds like the dqlite store is having a hard time keeping up with all the leases and leader elections of the controller pods. Would you mind sharing an inspection report from the cluster? I am particularly keen to see what the status of the dqlite cluster is. cc @MathieuBordere |
Hi @neoaggelos, I'm afraid I've given up on that install as I'm up against a deadline and it was completely unresponsive. I have however rebuilt the cluster without HA mode (two workers, one master) in a new setup. I'm still getting the leadership election problems: E0119 12:22:38.255418 1 leaderelection.go:367] Failed to update lock: Put "https://10.152.183.1:443/api/v1/namespaces/openebs/endpoints/openebs.io-local": context deadline exceeded This causes cycling in CrashLoopBackOff. I've attached an inspection report from this setup. Let me know if you'd rather this was opened in a separate issue, as it is a fresh deployment from the OP. |
Hi @ctromans, we (if I may speak for the microk8s team) would be very interested in replicating your setup to reproduce your problem. Do you have a script / steps we can follow to replicate your cluster? |
Sorry for being slow getting back to you @MathieuBordere, I've been running various tests to try and get to the bottom of this. The errors detailed have been occurring when microk8s was installed on spinning storage in a software RAID configuration:
Comprising identical disks:
However, after moving from this storage to NVMe:
The problem just disappeared. I can only assume the storage speed was insufficient to support the length of time-out, and frequent race conditions were being lost. |
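For anyone wanting to test the storage theory on their own node, here is a minimal sketch of an fsync latency probe. It is not a MicroK8s or dqlite tool; the function name `fsync_latencies` and its parameters are hypothetical, and the directory you pass in is an assumption (it should sit on the same disk as the datastore, and `/tmp` may be a tmpfs, which would make the numbers meaninglessly fast).

```python
import os
import statistics
import tempfile
import time


def fsync_latencies(directory, writes=50, size=4096):
    """Return the fsync latency in seconds of each small append in `directory`.

    Hypothetical helper: it only approximates the small synchronous writes
    a consensus-backed datastore performs, it does not talk to dqlite.
    """
    payload = b"\0" * size
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(writes):
            f.write(payload)
            f.flush()  # push Python's buffer to the OS
            start = time.perf_counter()
            os.fsync(f.fileno())  # force the OS to commit to the device
            latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Assumption: replace this with a directory on the disk under test.
    samples = fsync_latencies(tempfile.gettempdir())
    print(f"median fsync: {statistics.median(samples) * 1000:.2f} ms")
    print(f"worst fsync:  {max(samples) * 1000:.2f} ms")
```

Comparing the worst-case figure between the RAID array and the NVMe drive would give a rough indication of whether slow synchronous writes are a plausible cause.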
For those watching/maintaining, trying to install deployKF can sometimes trigger this behavior, probably because deployKF makes a LOT of Kubernetes API calls during the first install. Read more in my write-up here: deployKF/deployKF#39 (comment) Personally, I think |
Hi @thesuperzapper, I tried to reproduce the issue with
sudo microk8s start
sudo microk8s status --wait-ready
# install addons which are required to have a functional cluster
microk8s enable dns
microk8s enable hostpath-storage
microk8s enable metallb:10.64.140.43-10.64.140.49
However, I was not able to reproduce the issue - all pods came up after a while. Thanks |
@bschimke95 this is my problem too, I was only able to reproduce the issue the FIRST time I used microk8s. As I was saying to the user in deployKF/deployKF#39 (comment), after a clean install of microk8s I could not get it to happen again. For reference, this was on my Ubuntu 22.04 server, which is not exactly "resource-constrained":
One thing I may have done differently the first time was not setting a specific version of microk8s to install, e.g. something like |
Same issue, the hostpath-storage and dashboard pods always restart:
And the hostpath-provisioner logs:
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This is still an issue in microk8s v1.31.1:
This happens at random intervals of about 2 to 20 min on a 1-node cluster with medium/high load (~50%) on CPU/GPU and disk I/O <100M/s read and write on a 3G/s NVMe. This causes microk8s to fail and die each time:
|
I still experience this issue, any fixes so far?
|
Same here, we very often have an alert about this. Creating a new issue for this |
Summary
Many pods (particularly operators) are continually restarting after failing to conduct leader elections due to being unable to update a lock. For example, csi-nfs-controller and gpu-operator.
microk8s kubectl logs csi-nfs-controller-7bd5678cbc-8n2hl -n kube-system
E0117 11:11:11.212861 1 leaderelection.go:367] Failed to update lock: Put "https://10.152.183.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/nfs-csi-k8s-io": context deadline exceeded
I0117 11:11:11.212898 1 leaderelection.go:283] failed to renew lease kube-system/nfs-csi-k8s-io: timed out waiting for the condition
F0117 11:11:11.212909 1 leader_election.go:182] stopped leading
microk8s kubectl logs gpu-operator-567cf74d9d-vl648 -n gpu-operator-resources
E0117 11:17:15.507355 1 leaderelection.go:367] Failed to update lock: Put "https://10.152.183.1:443/api/v1/namespaces/gpu-operator-resources/configmaps/53822513.nvidia.com": context deadline exceeded
I0117 11:17:15.507400 1 leaderelection.go:283] failed to renew lease gpu-operator-resources/53822513.nvidia.com: timed out waiting for the condition
1.673954235507425e+09 ERROR setup problem running manager {"error": "leader election lost"}
microk8s kubectl get services -A
NAMESPACE     NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes   ClusterIP   10.152.183.1    <none>        443/TCP                  2d22h
kube-system   kube-dns     ClusterIP   10.152.183.10   <none>        53/UDP,53/TCP,9153/TCP   2d22h
k8s-dqlite is also showing up as continual 100% CPU usage in top on the master node, and the logs contain:
microk8s.daemon-k8s-dqlite[740714]: time="2023-01-17T12:00:28Z" level=error msg="error in txn: query (try: 0): context canceled"
microk8s.daemon-k8s-dqlite[740714]: time="2023-01-17T12:00:28Z" level=error msg="error in txn: query (try: 0): context canceled"
microk8s.daemon-k8s-dqlite[740714]: time="2023-01-17T12:00:28Z" level=error msg="error in txn: query (try: 0): context canceled"
microk8s.daemon-k8s-dqlite[740714]: time="2023-01-17T12:00:28Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
microk8s.daemon-k8s-dqlite[740714]: time="2023-01-17T11:59:29Z" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
microk8s.daemon-k8s-dqlite[740714]: time="2023-01-17T11:58:59Z" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
3 node, High Availability cluster, MicroK8s 1.26 on Ubuntu 22.04.
What Should Happen Instead?
I would expect it to be able to update the lock, and for pods dependent on this functionality not to keep crashing and restarting.
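To illustrate why a slow datastore turns into "failed to renew lease" followed by "stopped leading", here is a simplified model of a lease renewal loop. It is a sketch only, not the client-go implementation: the function `first_lost_attempt` is hypothetical, and the `renew_deadline`, `retry_period` and `request_timeout` values are illustrative assumptions rather than settings taken from this cluster.

```python
def first_lost_attempt(latencies, renew_deadline=10.0, retry_period=2.0,
                       request_timeout=5.0):
    """Return the index of the attempt at which leadership is lost, or None.

    `latencies` holds the time in seconds each lock update would take.
    All timing parameters are illustrative assumptions.
    """
    window = 0.0  # seconds elapsed since the last successful renewal
    for i, latency in enumerate(latencies):
        if latency <= request_timeout and window + latency <= renew_deadline:
            window = 0.0  # lock updated in time, so the window restarts
            continue
        # The request timed out ("context deadline exceeded"); wait, then retry.
        window += min(latency, request_timeout) + retry_period
        if window >= renew_deadline:
            return i  # renewal deadline missed: the pod stops leading
    return None


if __name__ == "__main__":
    print(first_lost_attempt([0.05] * 10))           # healthy datastore
    print(first_lost_attempt([6.0, 6.0, 6.0]))       # consistently slow writes
    print(first_lost_attempt([6.0, 0.1, 6.0, 0.1]))  # occasional slow writes
```

In this model a single slow write is survivable, but two consecutive timeouts exhaust the renewal window, which matches the pattern of pods that run for a while and then exit once the API server stalls for several seconds.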