Outbound Traffic Intermittent Failure #4671

Closed · serafdev opened this issue Sep 12, 2024 · 1 comment

serafdev commented Sep 12, 2024

Summary

I would like to start with this superb video:

v.mp4

My microk8s cluster is dropping outgoing connections intermittently. To cut it short, I captured the network packets while running multiple curl -kvvvL www.google.com commands in a debug container, and it seems that the cluster is reusing ports for outgoing requests. Here's one of the lines:

[Expert Info (Note/Sequence): A new tcp session is started with the same ports as an earlier session in this trace]
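
Something along these lines reproduces the capture (pod name and paths are placeholders; tcpdump needs to be available in the debug image):

# capture inside the debug container while repeating the curl for a minute
kubectl exec pod/debug-container-2 -- timeout 60 tcpdump -i any -w /tmp/google.pcap host www.google.com &
kubectl exec pod/debug-container-2 -- sh -c 'for i in $(seq 1 30); do curl -skL -o /dev/null www.google.com; done'
# copy the pcap out and look for the "same ports as an earlier session" expert notes in Wireshark
kubectl cp debug-container-2:/tmp/google.pcap ./google.pcap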

An example of where it gets stuck:

❯ kubectl exec -it pod/api-7d6566dfb-x8ltx -- curl -kvvvL www.google.com
*   Trying 216.58.210.132:80...
* connect to 216.58.210.164 port 80 failed: Connection timed out
* Failed to connect to www.google.com port 80: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to www.google.com port 80: Connection timed out

This is not specific to pods; I'm having this issue intermittently when pulling images too.
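
A way to reproduce the image-pull case directly (bypassing pod scheduling) is to pull through MicroK8s' bundled containerd on a node; the image is just an example:

microk8s ctr image pull docker.io/library/alpine:latest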

Here's some additional context:

Using tracepath:

[root@debug-container-2 ~]# tracepath www.google.com
 1?: [LOCALHOST]                      pmtu 1450
 1:  static.219.REDACTED.clients.your-server.de          0.022ms 
 1:  static.219.REDACTED.clients.your-server.de          0.016ms 
 2:  static.217.REDACTED.clients.your-server.de          0.068ms 
 3:  static.65.REDACTED.clients.your-server.de             0.437ms 
 4:  core32.hel1.hetzner.com                               0.391ms 
 5:  core52.sto.hetzner.com                                6.772ms 
 6:  core40.sto.hetzner.com                                6.910ms 
 7:  142.250.161.204                                       8.503ms 
 8:  no reply
 9:  no reply

The first IP (ending with 219) is the node on which that debug container is running

❯ kubectl get pods -o wide|grep debug-container-2
debug-container-2     1/1     Running     0             10m   10.1.8.82     k8s-prod-1-master   <none>           <none>

The second one (217) is the gateway (a Proxmox Debian physical server); that interface is used for internal communication. Right after that it moves to the IP ending with 65, which is used for external communication. The rest is probably a bunch of network devices in Hetzner's datacenter.
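
To double-check which interface and source address the node actually uses towards that destination, this can be run on the node itself (the IP is just the one from the curl output above):

ip route get 216.58.210.132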

I tried running all of this with hostNetwork=true and the connection works perfectly every time. Inside a debug container I ran watch curl -I www.google.com (a curl every 2 seconds, for 30 seconds):

working.mp4
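
A debug pod along these lines reproduces that hostNetwork=true test (pod name and image are just examples; any image with curl and watch works):

kubectl run debug-hostnet --image=nicolaka/netshoot --restart=Never \
  --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true}}' --command -- sleep infinity
kubectl exec -it debug-hostnet -- watch -n 2 curl -I www.google.com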

What Should Happen Instead?

Outgoing traffic works every time.

Reproduction Steps

Although I doubt this is easily reproducible, here are the enabled addons (a rough sketch of the equivalent enable commands follows the status output):

ubuntu@k8s-prod-0-master:~$ sudo microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 10.10.103.219:19001 10.10.103.220:19001 10.10.103.222:19001
  datastore standby nodes: 10.10.103.218:19001
addons:
  enabled:
    cert-manager         # (core) Cloud native certificate management
    dashboard            # (core) The Kubernetes dashboard
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    rook-ceph            # (core) Distributed Ceph storage using Rook
    storage              # (core) Alias to hostpath-storage add-on, deprecated
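
The same addon set can be enabled with something like this (the MetalLB address range is a placeholder for a free range on the internal network; ha-cluster, helm and helm3 are typically on out of the box):

microk8s enable dns ingress cert-manager dashboard metrics-server hostpath-storage rook-ceph
microk8s enable metallb:10.10.103.240-10.10.103.250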

The nodes communicate through the internal interface. Apart from the above, I have nothing special on the cluster except Grafana, Argo CD, and a few apps.

Note: I've had this issue since the beginning. I have a similar issue with timeouts in my ArgoCD, but now I know that the "Context Timeout" I'm getting there is caused by the Kubernetes cluster rather than a bad ArgoCD configuration (I didn't really care, since ArgoCD would just keep retrying the sync and my apps would eventually become up to date).

I've tested on the nodes and they all work perfectly (I ran watch -n 0.2 curl -kvvvL www.google.com and it would work every time); the same loop in a debug container would fail most of the time.

Here's the error log I'm getting from ArgoCD, maybe it could help identify the problem:

Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unknown desc = Get "https://github.com/REDACTED/REDACTED/info/refs?service=git-upload-pack": dial tcp 140.82.121.3:443: connect: connection timed out

Introspection Report

ubuntu@k8s-prod-0-master:~$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/7180/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/7180/inspection-report-20240912_040549.tar.gz

inspection-report-20240912_040549.tar.gz

Can you suggest a fix?

Are you interested in contributing with a fix?

Yes

serafdev commented Sep 12, 2024

Ok wow, this comment saved me: k3s-io/k3s#5349 (comment)

I switched the ACK rule in the Hetzner firewall to allow ports starting from 0 instead of from 32k, and it worked. Maybe there's no standard on this port range yet?
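
My understanding of the linked comment: the Hetzner firewall is stateless, so return (ACK) packets of an outgoing TCP connection are only let back in if their destination port, i.e. the source port of the original connection, matches a rule, and the default rule only covers 32768-65535. Traffic from pods is SNATed on the node and the rewritten source port can land below 32768, so some replies get dropped; widening the rule to start at port 0 lets them through. To see which source ports are actually in use, something like this on a node helps (conntrack-tools has to be installed):

# ephemeral port range the kernel picks for outgoing connections
sysctl net.ipv4.ip_local_port_range
# connection-tracking entries for outbound HTTPS; the dport= in the reply tuple is the port answers come back to
sudo conntrack -L -p tcp | grep dport=443 | head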

Here I went aggressive, using curl -kvvvL www.google.com with watch -n .2, to be sure it's not a port being reused or some other random thing:

yesss.mp4
