This repository has been archived by the owner on Oct 29, 2021. It is now read-only.

Load balancer example: TCP connection to application server is not working #90

Open
mardim91 opened this issue Apr 27, 2020 · 9 comments

Comments

@mardim91

The nc connection to the application server via the "nc 10.2.2.0 5001" command is not working.

Running tcpdump on the nsm0 interface of the application server, I observe that the source IP of the encapsulated packet is not 10.70.0.0 but some random IP. Something in the load balancer plugin is not working correctly for TCP connections. ICMP connections work fine, and there the source IP is 10.70.0.0.

Steps to reproduce:

  1. Deploy the load-balancer example.
  2. Log in to the application server pod and run "tcpdump -i nsm0".
  3. Log in to the load balancer pod and run "nc 10.2.2.0 5001".
  4. Check the source IPs of the encapsulated packets.
@nickolaev
Member

Could that be a VPP problem?

@edwarnicke
Member

@uablrek thoughts?

@uablrek
Contributor

uablrek commented May 21, 2020

Access works when initiated from outside the cluster, i.e. when the k8s-node forwards the traffic. When traffic is initiated from the k8s-node itself, it seems to fail. I can't see how Linux could mess this up, so IMHO the fault must be in NSM (VPP?).

@uablrek
Contributor

uablrek commented May 21, 2020

tcpdump inside an application-server pod

When traffic is initiated from the k8s-node, the source address is trashed as described:

$ tcpdump -lni nsm0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on nsm0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:45:15.856313 IP 10.60.1.1 > 10.60.1.3: GREv0, length 64: IP 9.74.2.1.37217 > 10.2.2.2.5001: Flags [S], seq 3667427917, win 64240, options [mss 1460,sackOK,TS val 1802552007 ecr 0,nop,wscale 7], length 0
08:45:16.868429 IP 10.60.1.1 > 10.60.1.3: GREv0, length 64: IP 5.84.2.1.37217 > 10.2.2.2.5001: Flags [S], seq 3667427917, win 64240, options [mss 1460,sackOK,TS val 1802553021 ecr 0,nop,wscale 7], length 0

But when traffic is initiated from outside the cluster, it works:

$ tcpdump -lni nsm0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on nsm0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:48:16.305146 IP 10.60.1.1 > 10.60.1.3: GREv0, length 64: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [S], seq 1134753072, win 64240, options [mss 1460,sackOK,TS val 3857117458 ecr 0,nop,wscale 6], length 0
08:48:16.305339 IP 10.2.2.2.5001 > 192.168.1.201.34401: Flags [S.], seq 2880819978, ack 1134753073, win 65160, options [mss 1460,sackOK,TS val 3266929533 ecr 3857117458,nop,wscale 7], length 0
08:48:16.329451 IP 10.60.1.1 > 10.60.1.3: GREv0, length 56: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3857117477 ecr 3266929533], length 0
08:48:16.332074 IP 10.2.2.2.5001 > 192.168.1.201.34401: Flags [P.], seq 1:37, ack 1, win 510, options [nop,nop,TS val 3266929559 ecr 3857117477], length 36
08:48:16.332461 IP 10.2.2.2.5001 > 192.168.1.201.34401: Flags [F.], seq 37, ack 1, win 510, options [nop,nop,TS val 3266929560 ecr 3857117477], length 0
08:48:16.349096 IP 10.60.1.1 > 10.60.1.3: GREv0, length 56: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [.], ack 37, win 1004, options [nop,nop,TS val 3857117501 ecr 3266929559], length 0
08:48:16.389033 IP 10.60.1.1 > 10.60.1.3: GREv0, length 56: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [.], ack 38, win 1004, options [nop,nop,TS val 3857117545 ecr 3266929560], length 0
08:48:17.553320 IP 10.60.1.1 > 10.60.1.3: GREv0, length 56: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [F.], seq 1, ack 38, win 1004, options [nop,nop,TS val 3857118706 ecr 3266929560], length 0
08:48:17.553450 IP 10.2.2.2.5001 > 192.168.1.201.34401: Flags [.], ack 2, win 51

@uablrek
Contributor

uablrek commented May 21, 2020

Note that it is the first 16 bits of the source address that are overwritten with garbage; the last 16 bits are intact.

@mardim91
Author

From some investigation I did a while ago, I was able to isolate the problem, and my conclusion is that it must be in the load balancer VPP plugin. Everything looks fine until the traffic reaches the tunnel that the load balancer creates inside VPP towards the application server; there the traffic gets mangled. So my best bet is that the bug is in the VPP load balancer plugin, in the way it sets up the tunnel.
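For orientation, here is where that corruption sits in the packets captured above. A minimal sketch in plain C (illustrative only, not code from the plugin; it assumes an outer IPv4 header without options and a GREv0 header with no optional fields, which matches the "GREv0, length" lines in the tcpdump output):

#include <stdio.h>

/* Byte layout of the GREv0-encapsulated packets seen on nsm0:
 * outer IPv4 (no options) + GREv0 (no key/checksum/sequence). */
#define OUTER_IP_LEN  20  /* outer IPv4 header, 10.60.1.1 -> 10.60.1.3 */
#define GRE_LEN        4  /* GREv0: 2 bytes flags/version + 2 bytes protocol */
#define INNER_SRC_OFF 12  /* source address offset within the inner IPv4 header */

int main(void) {
    /* The inner source address occupies bytes 36..39 of the packet
     * the tunnel emits; per the captures above, only bytes 36..37
     * (the first 16 bits) come back corrupted. */
    int off = OUTER_IP_LEN + GRE_LEN + INNER_SRC_OFF;
    printf("inner src at bytes %d..%d, garbage in bytes %d..%d\n",
           off, off + 3, off, off + 1);
    return 0;
}

Everything around those two bytes (outer headers, inner destination, ports) is intact in both captures, which is consistent with a stray 16-bit write during encapsulation rather than a general rewrite error.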

@uablrek
Contributor

uablrek commented May 21, 2020

BTW, this problem did not exist when the example was submitted.

@uablrek
Contributor

uablrek commented May 21, 2020

16 bits and very random; a misplaced CRC?
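That theory fits the IPv4 header layout: the 16-bit header checksum lives at byte offset 10, immediately ahead of the 4-byte source address at offset 12, so a checksum written two bytes too far lands exactly on the first 16 bits of the source. A minimal sketch in plain C (not VPP code; the intended address 10.70.2.1 and the checksum value are assumptions for illustration):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* IPv4 header byte offsets per RFC 791 (header without options). */
enum { CSUM_OFF = 10, SRC_OFF = 12, HDR_LEN = 20 };

int main(void) {
    uint8_t hdr[HDR_LEN] = {0};
    const uint8_t src[4] = {10, 70, 2, 1};  /* assumed intended source */
    memcpy(hdr + SRC_OFF, src, 4);

    uint16_t csum = 0x0949;  /* stand-in for a per-packet checksum value */

    /* Correct behaviour writes the checksum at CSUM_OFF (10).
     * A hypothetical off-by-two bug writes it at SRC_OFF (12),
     * i.e. over the first two bytes of the source address: */
    memcpy(hdr + SRC_OFF, &csum, sizeof csum);

    /* The first 16 bits now vary per packet (checksum garbage);
     * the last 16 bits (.2.1) survive, matching the captures. */
    printf("src after bad write: %u.%u.%u.%u\n",
           hdr[SRC_OFF], hdr[SRC_OFF + 1],
           hdr[SRC_OFF + 2], hdr[SRC_OFF + 3]);
    return 0;
}

Whether the VPP lb plugin actually performs such a write is unverified; the sketch only shows that a misplaced 16-bit checksum would reproduce the observed corruption pattern.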

@edwarnicke
Member

@uablrek Might be good to poke vpp-dev
