My microk8s cluster is dropping outgoing connections intermittently. To cut it short, I captured the network packets while running multiple curl -kvvvL www.google.com requests in a debug container, and it seems that the cluster is reusing ports for outgoing requests. Here's one of the lines:
[Expert Info (Note/Sequence): A new tcp session is started with the same ports as an earlier session in this trace]
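That Wireshark note corresponds to the tcp.analysis.reused_ports display filter, so the flagged flows can be pulled out of the capture directly. A rough sketch with tshark (capture.pcap is a placeholder name for the capture file, not the actual filename):

```shell
# capture.pcap is a hypothetical name for the capture taken above.
# List every 4-tuple Wireshark flags as "reused ports", counted per tuple:
tshark -r capture.pcap -Y 'tcp.analysis.reused_ports' \
  -T fields -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport \
  | sort | uniq -c | sort -rn
```

If many distinct destinations show up with colliding source ports, that would point at the SNAT hop picking already-in-use ports rather than at any single application.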
An example of where it gets stuck:
❯ kubectl exec -it pod/api-7d6566dfb-x8ltx -- curl -kvvvL www.google.com
* Trying 216.58.210.132:80...
* connect to 216.58.210.164 port 80 failed: Connection timed out
* Failed to connect to www.google.com port 80: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to www.google.com port 80: Connection timed out
This is not specific to pods; I'm having this issue intermittently when pulling images too.
This section adds some context. Using tracepath: the first IP (ending in 219) is the node the debug container is running on. The second one (217) is the gateway (a physical Proxmox Debian server), whose interface is used for internal communication. Right after that, the path moves to the IP ending in 65, which is used for external communication; the rest is presumably a chain of network devices in Hetzner's datacenter.
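Since pod traffic is masqueraded on the node and then NATed again at the Proxmox gateway, colliding source ports at either hop would produce exactly this symptom. One quick check (a sketch; run on a node, needs root) is whether the MASQUERADE rules installed by the CNI/kube-proxy use --random-fully, which fully randomizes SNAT port selection and is the usual mitigation for port collisions behind an upstream NAT:

```shell
# On a node (not in a pod): show the POSTROUTING NAT rules that
# masquerade pod traffic. MASQUERADE entries lacking --random-fully
# make port collisions behind the second (Proxmox) NAT more likely.
sudo iptables -t nat -S POSTROUTING | grep -i MASQUERADE
```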
I tried running all of this with hostNetwork=true and the connection works perfectly every time. Inside a debug container I ran watch curl -I www.google.com (one curl every 2 seconds for 30 seconds):
working.mp4
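hostNetwork=true bypasses the pod SNAT path entirely, so it working every time is consistent with a NAT/conntrack problem rather than a physical-network one. One way to corroborate this (a sketch; needs conntrack-tools installed on the node) is to watch conntrack's failure counters while the pod curls are timing out:

```shell
# On the node, while a pod curl is timing out:
sudo conntrack -S | head -n 4   # per-CPU stats; rising insert_failed/drop
                                # counts suggest source-port collisions
cat /proc/sys/net/netfilter/nf_conntrack_count   # current table usage
cat /proc/sys/net/netfilter/nf_conntrack_max     # table ceiling
```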
What Should Happen Instead?
Outgoing traffic should work every time.
Reproduction Steps
Although I doubt this is easily reproducible, here are the enabled addons:
ubuntu@k8s-prod-0-master:~$ sudo microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 10.10.103.219:19001 10.10.103.220:19001 10.10.103.222:19001
datastore standby nodes: 10.10.103.218:19001
addons:
enabled:
cert-manager # (core) Cloud native certificate management
dashboard # (core) The Kubernetes dashboard
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm # (core) Helm - the package manager for Kubernetes
helm3 # (core) Helm 3 - the package manager for Kubernetes
hostpath-storage # (core) Storage class; allocates storage from host directory
ingress # (core) Ingress controller for external access
metallb # (core) Loadbalancer for your Kubernetes cluster
metrics-server # (core) K8s Metrics Server for API access to service metrics
rook-ceph # (core) Distributed Ceph storage using Rook
storage # (core) Alias to hostpath-storage add-on, deprecated
The nodes communicate through the internal interface. Apart from the above, there is nothing special on the cluster except Grafana, ArgoCD, and a few apps.
Note: I've had this issue since the beginning. I have a similar issue with timeouts in ArgoCD, but now I know that the "Context Timeout" I'm getting there is caused by the Kubernetes cluster rather than a bad ArgoCD configuration. (I didn't really care, since ArgoCD would just retry the sync and my apps would eventually become up to date.)
I've tested on the nodes themselves and they all work perfectly: I ran watch -n 0.2 curl -kvvvL www.google.com and it succeeded every time, while the same loop in the debug container failed most of the time.
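To make the node-vs-pod comparison reproducible without watch, the same probe can be scripted from a throwaway pod (the pod name and image here are assumptions, not from the original report):

```shell
# 30 sequential probes from a disposable pod, counting timeouts:
kubectl run netcheck --rm -it --restart=Never --image=curlimages/curl -- \
  sh -c 'fail=0
         for i in $(seq 1 30); do
           curl -m 5 -so /dev/null www.google.com || fail=$((fail+1))
           sleep 1
         done
         echo "failures: $fail/30"'
```

Running the same inner loop in a node shell and comparing failure counts would isolate the problem to the pod SNAT path.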
Here's the error log I'm getting from ArgoCD, maybe it could help identify the problem:
Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unknown desc = Get "https://github.com/REDACTED/REDACTED/info/refs?service=git-upload-pack": dial tcp 140.82.121.3:443: connect: connection timed out
Introspection Report
ubuntu@k8s-prod-0-master:~$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
cp: cannot stat '/var/snap/microk8s/7180/var/kubernetes/backend/localnode.yaml': No such file or directory
Building the report tarball
Report tarball is at /var/snap/microk8s/7180/inspection-report-20240912_040549.tar.gz
Can you suggest a fix?
Are you interested in contributing with a fix?
Yes