
Can't upgrade k8s using snap refresh #634

Closed
mhalano opened this issue Aug 28, 2024 · 12 comments

@mhalano

mhalano commented Aug 28, 2024

Summary

I tried to upgrade my snap k8s installed on a single node and got this error:

mhalano@skynet:~$ sudo snap refresh k8s
2024-08-28T22:51:36Z INFO Waiting for "snap.k8s.kube-apiserver.service" to stop.
error: cannot perform the following tasks:
- Run configure hook of "k8s" snap if present (run hook "configure": Node is not part of a cluster: <nil>)

What Should Happen Instead?

The upgrade should complete without errors.

Reproduction Steps

  1. Install an older version of k8s. Mine was 1.30.3.
  2. Run sudo snap refresh k8s to get the latest version (1.31.0 at the time of writing).
  3. Observe the error.

System information

inspection-report-20240828_225625.tar.gz

Can you suggest a fix?

No response

Are you interested in contributing with a fix?

I don't know how to fix it, but I can help troubleshoot and validate any possible solution. My cluster is not critical.

@bschimke95
Contributor

Hi @mhalano

Thanks for reporting this issue.
I think I addressed this issue with #633 yesterday evening.

For me, the upgrade to the latest version works:

➜ sudo snap refresh k8s --channel=1.30-classic/candidate --classic
k8s (1.30-classic/candidate) v1.30.3 from Canonical✓ refreshed

➜ sudo k8s bootstrap                                              
Bootstrapping the cluster. This may take a few seconds, please wait.
Bootstrapped a new Kubernetes cluster with node address "192.168.178.36:6400".
The node will be 'Ready' to host workloads after the CNI is deployed successfully.

➜ sudo snap refresh k8s --channel=latest/edge --classic           
2024-08-29T08:20:22+02:00 INFO Waiting for "snap.k8s.kube-apiserver.service" to stop.
k8s (edge) v1.31.0 from Canonical✓ refreshed

Would you mind retrying this upgrade on latest/edge (revision 991)? Thanks!

@mhalano
Author

mhalano commented Aug 29, 2024

@bschimke95 I did the test you mentioned, and the upgrade process works, but it crashes my cluster afterwards:

root@skynet:/var/lib/snapd# k8s kubectl get pods
Error: Failed to retrieve the node status.

The error was: failed to GET /k8sd/node: Get "http://control.socket/1.0/k8sd/node": dial unix /var/snap/k8s/common/var/lib/k8sd/state/control.socket: connect: no such file or directory

I don't know whether this problem is even related to the snap anymore. Any tips?
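For what it's worth, the error points at a missing unix socket. A quick sanity check can be sketched in Python (the helper below is hypothetical, not part of k8s tooling; the path comes from the error message above):

```python
import os
import stat

# Hypothetical helper: classify what is present at the control-socket path.
def socket_status(path):
    if not os.path.exists(path):
        return "missing"  # k8sd is not running, or crashed before creating it
    if stat.S_ISSOCK(os.stat(path).st_mode):
        return "socket"   # the socket exists; the daemon may still be stuck
    return "not a socket"

# Path taken from the error message above:
# socket_status("/var/snap/k8s/common/var/lib/k8sd/state/control.socket")
```

A "missing" result matches the `no such file or directory` error and means the daemon never got far enough to create its socket.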

@bschimke95
Contributor

This looks like something inside k8sd crashed, causing the unix socket to not be available.

Would you mind sharing an inspection report and the output of snap services k8s?

@mhalano
Author

mhalano commented Aug 29, 2024

Here it is:

root@skynet:/var/lib/snapd# snap services k8s
Service                      Startup   Current   Notes
k8s.containerd               enabled   active    -
k8s.k8s-apiserver-proxy      disabled  inactive  -
k8s.k8s-dqlite               enabled   active    -
k8s.k8sd                     enabled   inactive  -
k8s.kube-apiserver           enabled   active    -
k8s.kube-controller-manager  enabled   active    -
k8s.kube-proxy               enabled   active    -
k8s.kube-scheduler           enabled   active    -
k8s.kubelet                  enabled   active    -
root@skynet:/var/lib/snapd# 

@bschimke95
Contributor

Thanks! So, as assumed, k8sd is not running.
Could you share the inspection report for this run?

@mhalano
Author

mhalano commented Aug 29, 2024

Here it is.
inspection-report-20240829_142849.tar.gz

@bschimke95
Contributor

Thanks @mhalano.

It looks like the migrations are not correctly applied.

Aug 29 12:19:31 skynet k8s.k8sd[77766]: Error: Failed to run k8sd: failed to run microcluster: Daemon stopped with error: Daemon failed to start: Failed to re-establish cluster connection: "SELECT\n    t.name, t.expiry\nFROM\n    worker_tokens AS t\nWHERE\n    ( t.token = ? )\nLIMIT 1\n": no such column: t.expiry

The expiry field was added to the worker token database in latest/edge.
We will work on a fix.
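The failure mode can be reproduced in miniature: a query that selects a column before the schema migration adding it has run fails with `no such column`. A minimal sketch using Python's sqlite3 (table and column names are taken from the log above; the schema and migration are illustrative, not k8sd's actual code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Old schema: worker_tokens has no "expiry" column yet.
con.execute("CREATE TABLE worker_tokens (name TEXT, token TEXT)")
con.execute("INSERT INTO worker_tokens VALUES ('w1', 'tok1')")

# The query the new k8sd runs against the old schema:
query = "SELECT t.name, t.expiry FROM worker_tokens AS t WHERE t.token = ? LIMIT 1"
try:
    con.execute(query, ("tok1",))
except sqlite3.OperationalError as e:
    print(e)  # no such column: t.expiry

# After the migration that adds the column, the same query succeeds.
con.execute("ALTER TABLE worker_tokens ADD COLUMN expiry TEXT")
row = con.execute(query, ("tok1",)).fetchone()
print(row)  # ('w1', None)
```

This is why the snap refresh itself appears to succeed but k8sd then fails to start: the new binary expects a schema the database does not yet have.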

@kot0dama

kot0dama commented Oct 4, 2024

I think I got the same issue, or a very similar one, while working on a k8s cluster deployed through https://charmhub.io/k8s.
All k8s juju units ended up in Blocked state with the error Failed to install snaps.

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/reconciler.py", line 35, in reconcile
    self.reconcile_function(event)
  File "/var/lib/juju/agents/unit-k8s-2/charm/./src/charm.py", line 608, in _reconcile
    self._install_snaps()
  File "/usr/lib/python3.10/contextlib.py", line 78, in inner
    with self._recreate_cm():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/contextual_status.py", line 106, in on_error
    raise ReconcilerError(msg) from e
charms.contextual_status.ReconcilerError: Found expected exception: Snap: 'k8s'; command ['snap', 'refresh', 'k8s', '--channel="edge"'] failed with output = '2024-10-04T06:10:28Z INFO Waiting for "snap.k8s.kube-apiserver.service" to stop.\n'
$ snap services k8s
Service                      Startup   Current   Notes
k8s.containerd               enabled   active    -
k8s.k8s-apiserver-proxy      disabled  inactive  -
k8s.k8s-dqlite               enabled   active    -
k8s.k8sd                     enabled   active    -
k8s.kube-apiserver           enabled   active    -
k8s.kube-controller-manager  enabled   active    -
k8s.kube-proxy               enabled   active    -
k8s.kube-scheduler           enabled   active    -
k8s.kubelet                  enabled   active    -

I tried restarting snap.k8s.k8s-apiserver-proxy.service but it didn't help.

@bschimke95
Contributor

Hey @kot0dama
The `k8s-apiserver-proxy` service is expected to be inactive on control-plane nodes. This service is only relevant for worker nodes.

@louiseschmidtgen could you confirm that the microcluster issue should be resolved?

@kot0dama I think you are running into #642. If the snap is refreshed, the relevant services are restarted which causes the apiserver to break (right now).
@berkayoz or @mateoflorido could you confirm? There should be a fix on the way, right?

@kot0dama

kot0dama commented Oct 7, 2024

Thanks @bschimke95, I tried the workaround listed in that bug report, but it doesn't seem to have fixed it for me.

@berkayoz
Member

berkayoz commented Oct 7, 2024

Hey @kot0dama, that bug report might not be exactly related. Can you share inspection reports for the units? Also, can you check whether systemctl stop snap.k8s.kube-apiserver hangs?
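One way to check for a hang without blocking a shell indefinitely is to run the stop command under a deadline. A sketch in Python (the helper and the 30-second deadline are illustrative, not official tooling):

```python
import subprocess

# Hypothetical helper: run a stop command with a deadline so a hang is
# reported instead of blocking forever.
def stop_with_deadline(cmd, seconds):
    try:
        subprocess.run(cmd, timeout=seconds, check=True)
        return "stopped cleanly"
    except subprocess.TimeoutExpired:
        return "stop timed out"   # the unit is likely hanging
    except subprocess.CalledProcessError:
        return "stop failed"

# Example (requires root):
# stop_with_deadline(["systemctl", "stop", "snap.k8s.kube-apiserver"], 30)
```

A "stop timed out" result would point at the stop hook hanging, which matches the `Waiting for "snap.k8s.kube-apiserver.service" to stop.` line in the charm's output.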

@louiseschmidtgen
Contributor

Dear @mhalano,

The fix has been released to all channels; you may now upgrade from 1.30 to 1.31 without any trouble.

Thank you for raising the issue!
