
microk8s is not running - on a 4 node Rasp Pi 3 B+ cluster #4449

Open
steentottrup opened this issue Mar 3, 2024 · 10 comments

@steentottrup

Summary

I've just installed microk8s on 4 Raspberry Pi 3 B+ boards. They were installed with the Ubuntu 22.04.4 64-bit server OS.
The first 3 nodes are joined as control-plane nodes; the 4th node is just a worker. Node 1 boots off a USB HDD, the other 3 are on SD cards.
When I try to get status, all I get back is this text:

"microk8s is not running. Use microk8s inspect for a deeper inspection."

Trying to enable dns, storage, etc. fails; here is the output from 'microk8s enable dns':

Traceback (most recent call last):
  File "/snap/microk8s/6565/scripts/wrappers/enable.py", line 41, in <module>
    enable(prog_name="microk8s enable")
  File "/snap/microk8s/6565/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/6565/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/6565/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/6565/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/6565/scripts/wrappers/enable.py", line 37, in enable
    xable("enable", addons)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 470, in xable
    protected_xable(action, addon_args)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 498, in protected_xable
    unprotected_xable(action, addon_args)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 514, in unprotected_xable
    enabled_addons_info, disabled_addons_info = get_status(available_addons_info, True)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 566, in get_status
    kube_output = kubectl_get("all,ingress")
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 248, in kubectl_get
    return run(KUBECTL, "get", cmd, "--all-namespaces", die=False)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 69, in run
    result.check_returncode()
  File "/snap/microk8s/6565/usr/lib/python3.8/subprocess.py", line 448, in check_returncode
    raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '('/snap/microk8s/6565/microk8s-kubectl.wrapper', 'get', 'all,ingress', '--all-namespaces')' returned non-zero exit status 1.

What Should Happen Instead?

There were no errors when I joined the nodes together, so I was hoping everything was working and I could start putting workloads/services in the cluster.

Reproduction Steps

I've installed microk8s a few times now, first on 2 nodes and most recently on 4, to see if the number of nodes was the issue. It's the same thing every time.
This is what I'm doing on a freshly installed Ubuntu 22.04.4:

sudo apt update && sudo apt upgrade -y && sudo reboot

sudo nano /boot/firmware/cmdline.txt
Adding 'cgroup_enable=memory cgroup_memory=1' to the file

sudo apt install linux-modules-extra-raspi

sudo snap install microk8s --classic

sudo usermod -a -G microk8s rasppi
sudo chown -f -R rasppi ~/.kube

microk8s status --wait-ready

The last command seems to never return/end.
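
As a sanity check that the cgroup flags actually took effect after the reboot (a sketch; the output format varies by kernel):

cat /proc/cmdline
grep memory /proc/cgroups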

Introspection Report

Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite

Building the report tarball
Report tarball is at /var/snap/microk8s/6565/inspection-report-20240303_075153.tar.gz

inspection-report-20240303_075153.tar.gz

@ktsakalozos
Member

Hi @steentottrup,

The error in the logs causing k8s to crashloop is:

Mar 03 07:50:01 pimk8s01 microk8s.daemon-kubelite[2282]: F0303 07:50:01.502923    2282 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory

I think you are missing sudo apt install linux-modules-extra-raspi. Have a look at this docs page: https://microk8s.io/docs/install-raspberry-pi
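
A quick way to verify that the conntrack module can actually be loaded, and that the file from the error above then exists (a sketch):

sudo modprobe nf_conntrack
cat /proc/sys/net/netfilter/nf_conntrack_max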

@steentottrup
Author

Thank you for getting back to me.
I'm using a "playbook" to get them all installed properly, and I was pretty sure I had already installed the raspi extras.

Just to make sure, I ran it again on all 4 nodes:

microk8s-3
microk8s
microk8s-1
microk8s-2

It doesn't seem to be the problem. I'll dig around now that you have located the issue for me.

@bartecargo

@steentottrup did you ever get to the bottom of this?

@steentottrup
Author

No, I'm no closer to a solution. I'm not really a Linux/Ubuntu expert, so I've looked at the logs, but I haven't found the actual problem (or solution) yet.

@bartecargo

I'm experiencing the same problem, but on Ubuntu 22.04. Someone else also appears to have encountered it with a clean install of the same operating system:

It appears that I've been able to temporarily get the node back up by running the following:

modprobe nf_conntrack

@nickbrennan1

nickbrennan1 commented May 3, 2024

@bartecargo that's a great spot, thanks. I've been running stable on Ubuntu 20.04.5 LTS for ~18 months, and took microk8s up to v1.28 ~6 months ago without issue. Took v1.30 last week, and saw the same error stack ~4 days after upgrading:

"Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"

Rebuilt microk8s on Monday @ v1.30, and it just happened again. Bart's modprobe resolved it for me.

@matpen

matpen commented Jul 6, 2024

I can confirm the above after upgrading to microk8s 1.30/stable. Switching to 1.30/edge as suggested in #4361 does not help.

The modprobe command posted in #4449 (comment), followed by microk8s start, will fix the problem. To make the change permanent, follow the instructions in this SO answer.

The same is also outlined in this blog post and appears to be a microk8s shortcoming. If someone on the dev team sees this, they might want to investigate.
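
For reference, a minimal sketch of making the module load persistent on a systemd-based system (the file name here is illustrative):

echo 'nf_conntrack' | sudo tee /etc/modules-load.d/nf_conntrack.conf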

@neoaggelos
Contributor

Hi @matpen

So, MicroK8s should load br_netfilter before the services start, in this snippet:

if ! [ -f /proc/sys/net/bridge/bridge-nf-call-iptables ]
then
    # NOTE(neoaggelos): https://github.com/canonical/microk8s/issues/3085
    # Attempt to use modprobe from the host, otherwise fallback to the one
    # provided with the snap.
    if /sbin/modprobe br_netfilter || modprobe br_netfilter
    then
        echo "Successfully loaded br_netfilter module."
    else
        echo "Failed to load br_netfilter. Calico might not work properly."
    fi
fi

Would you mind sharing some logs from your machine, after the reboot? Can you check if there are any log lines like the ones shown? An inspection report would also do wonders to see what might be up.

For example, I wonder if this code runs early in the boot process, br_netfilter fails to load, and the code just proceeds.
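
Something along these lines would be a start (illustrative, not the exact incantation):

journalctl -b | grep -iE 'br_netfilter|nf_conntrack'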

@matpen

matpen commented Jul 8, 2024

Hi @neoaggelos,

Thank you for following up on this.

Would you mind sharing some logs from your machine, after the reboot? Can you check if there are any log lines like the ones shown?

Here is a grep for br_netfilter. The second set of logs on July 6th is related to the reboot for which I wrote my comment above.

Filtered logs

sudo grep br_netfilter /var/log/syslog.1

Jul  2 17:26:36 kube02 microk8s.daemon-kubelite[2706]: + /sbin/modprobe br_netfilter
Jul  2 17:26:37 kube02 kernel: [  205.575330] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Jul  2 17:26:37 kube02 microk8s.daemon-kubelite[2706]: + echo 'Successfully loaded br_netfilter module.'
Jul  2 17:26:37 kube02 microk8s.daemon-kubelite[2706]: Successfully loaded br_netfilter module.
Jul  6 11:50:54 kube02 microk8s.daemon-kubelite[3267]: + /sbin/modprobe br_netfilter
Jul  6 11:50:54 kube02 kernel: [  231.365695] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Jul  6 11:50:54 kube02 microk8s.daemon-kubelite[3267]: + echo 'Successfully loaded br_netfilter module.'
Jul  6 11:50:54 kube02 microk8s.daemon-kubelite[3267]: Successfully loaded br_netfilter module.

This being a production machine, I am hesitant to share more info on the open channel, but I have a slice around the time when microk8s starts which might be useful. In any case, it looks like the module is loaded properly.

Unfiltered logs

sudo grep 'Jul 6 11:50' /var/log/syslog.1

Jul  6 11:50:21 kube02 systemd[1]: Created slice User Slice of UID 10001.
Jul  6 11:50:21 kube02 systemd[1]: Starting User Runtime Directory /run/user/10001...
Jul  6 11:50:21 kube02 systemd[1]: Finished User Runtime Directory /run/user/10001.
Jul  6 11:50:21 kube02 systemd[1]: Starting User Manager for UID 10001...
Jul  6 11:50:22 kube02 systemd[2693]: Queued start job for default target Main User Target.
Jul  6 11:50:22 kube02 systemd[2693]: Created slice User Application Slice.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Paths.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Timers.
Jul  6 11:50:22 kube02 systemd[2693]: Starting D-Bus User Message Bus Socket...
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG network certificate management daemon.
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG cryptographic agent and passphrase cache.
Jul  6 11:50:22 kube02 systemd[2693]: Listening on debconf communication socket.
Jul  6 11:50:22 kube02 systemd[2693]: Listening on REST API socket for snapd user session agent.
Jul  6 11:50:22 kube02 systemd[2693]: Listening on D-Bus User Message Bus Socket.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Sockets.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Basic System.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Main User Target.
Jul  6 11:50:22 kube02 systemd[2693]: Startup finished in 206ms.
Jul  6 11:50:22 kube02 systemd[1]: Started User Manager for UID 10001.
Jul  6 11:50:22 kube02 systemd[1]: Started Session 1 of User ansible.
Jul  6 11:50:36 kube02 systemd[2693]: Started D-Bus User Message Bus.
Jul  6 11:50:36 kube02 dbus-daemon[2816]: [session uid=10001 pid=2816] AppArmor D-Bus mediation is enabled
Jul  6 11:50:36 kube02 systemd[2693]: Started snap.microk8s.microk8s-e468b3be-a472-49f4-bc7a-632f1224bdfd.scope.
Jul  6 11:50:40 kube02 systemd[2693]: Started snap.microk8s.microk8s-49e263f6-23d1-4b0d-ae2c-c71f4d48ad98.scope.
Jul  6 11:50:40 kube02 dbus-daemon[1963]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.11' (uid=0 pid=1975 comm="/usr/lib/snapd/snapd " label="unconfined")
Jul  6 11:50:40 kube02 systemd[1]: Starting Time & Date Service...
Jul  6 11:50:41 kube02 dbus-daemon[1963]: [system] Successfully activated service 'org.freedesktop.timedate1'
Jul  6 11:50:41 kube02 systemd[1]: Started Time & Date Service.
Jul  6 11:50:41 kube02 systemd[1]: Reloading.
Jul  6 11:50:41 kube02 systemd[1]: Configuration file /run/systemd/system/netplan-ovs-cleanup.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jul  6 11:50:41 kube02 systemd[1]: Started Service for snap application microk8s.daemon-apiserver-kicker.
Jul  6 11:50:41 kube02 systemd[1]: Started Service for snap application microk8s.daemon-apiserver-proxy.
Jul  6 11:50:41 kube02 systemd[1]: Started Service for snap application microk8s.daemon-cluster-agent.
Jul  6 11:50:41 kube02 systemd[1]: Starting Service for snap application microk8s.daemon-containerd...
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: + source /snap/microk8s/6876/actions/common/utils.sh
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: ++ [[ /snap/microk8s/6876/run-containerd-with-args == \/\s\n\a\p\/\m\i\c\r\o\k\8\s\/\6\8\7\6\/\a\c\t\i\o\n\s\/\c\o\m\m\o\n\/\u\t\i\l\s\.\s\h ]]
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: + use_snap_env
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: + export PATH=/snap/microk8s/6876/usr/bin:/snap/microk8s/6876/bin:/snap/microk8s/6876/usr/sbin:/snap/microk8s/6876/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: + PATH=/snap/microk8s/6876/usr/bin:/snap/microk8s/6876/bin:/snap/microk8s/6876/usr/sbin:/snap/microk8s/6876/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

From the above, it looks to me like microk8s correctly loads the module. However, I am also quite confident of what I reported in #4449 (comment). The situation was as follows:

  • microk8s status says "not running"
  • microk8s start takes a long time and exits with code 0
  • microk8s status still says "not running"
  • upgraded to 1.30/edge
  • microk8s status still says "not running"
  • microk8s start takes a long time and exits with code 0
  • issued modprobe nf_conntrack
  • microk8s start takes just a few seconds and exits with code 0
  • microk8s status now correctly reports a running cluster
  • downgraded to 1.30/stable
  • rebooted
  • still all good after reboot

So there is a slight chance that the combination "upgrade to edge + modprobe" somehow fixed the problem.
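
For completeness, a quick check that the module is still loaded after the reboot (a sketch):

lsmod | grep nf_conntrack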

@swagfin

swagfin commented Sep 19, 2024

You can set this up to happen automatically during boot. The command also checks if the config entry already exists:

sudo modprobe nf_conntrack && grep -qxF 'nf_conntrack' /etc/modules || echo 'nf_conntrack' | sudo tee -a /etc/modules
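
Note that, because && and || have equal precedence in the shell, the append above still runs when modprobe itself fails; a variant with explicit grouping (a sketch) avoids that:

sudo modprobe nf_conntrack && { grep -qxF 'nf_conntrack' /etc/modules || echo 'nf_conntrack' | sudo tee -a /etc/modules; }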
