Interface is not restored after restarting the forwarder #1047

Open · glazychev-art opened this issue Feb 6, 2024 · 10 comments

glazychev-art (Contributor) commented Feb 6, 2024

Steps to reproduce

  1. Create a kind cluster with 2 nodes
  2. Deploy SPIRE
  3. Deploy a basic NSM setup
  4. Deploy 2 NSEs on the same node: one with an IPv4 CIDR and one with an IPv6 CIDR
  5. Deploy 16 NSCs with the liveness checker disabled. Each one connects to both NSEs
  6. Delete the nsmgr located on the NSEs' node. After each restart, check connectivity
  7. Catch a case where the ping does not work (a rough restart/ping loop is sketched below)
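
A minimal sketch of the restart/ping loop from steps 6-7. The nsm-system namespace, the app=nsmgr and app=nsc label selectors, and the NSE_NODE/NSE_IP variables are all assumptions here, not details from the actual test setup; adjust them to your deployment:

# Hypothetical reproduction loop; selectors and addresses are assumptions.
for i in $(seq 1 20); do
  kubectl delete pod -n nsm-system -l app=nsmgr \
    --field-selector spec.nodeName="$NSE_NODE"    # kill the nsmgr on the NSE node
  kubectl wait --for=condition=Ready pod -n nsm-system \
    -l app=nsmgr --timeout=120s                   # wait for the replacement nsmgr
  for nsc in $(kubectl get pods -l app=nsc -o name); do
    kubectl exec "$nsc" -- ping -c 4 "$NSE_IP" \
      || echo "broken after iteration $i: $nsc"   # catch the failing case
  done
done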

Issue: networkservicemesh/sdk-vpp#802
Issue: networkservicemesh/sdk#1586

glazychev-art self-assigned this Feb 6, 2024
glazychev-art (Contributor, Author) commented

This is probably related: #664

glazychev-art (Contributor, Author) commented

@szvincze
Could you please provide additional logs once this problem is reproduced on your side?
We found a couple of interesting behaviors, but I don't see them in your logs,
so it would be great to have more information.
Thank you

szvincze commented

Hi @glazychev-art, let me check when we can schedule these tests again. I will come back with the logs as soon as I get them.

glazychev-art (Contributor, Author) commented

@szvincze
Thanks!
And if possible, please change the logging level from INFO to TRACE for all NSM components; one possible way is sketched below.
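
A hedged example of one way to do this, assuming the components read the NSM_LOG_LEVEL environment variable and assuming the nsm-system namespace and resource names of a typical basic setup (adjust to your deployment):

# Switch NSM components to TRACE logging; namespace and resource names are assumptions.
kubectl set env -n nsm-system daemonset/forwarder-vpp NSM_LOG_LEVEL=TRACE
kubectl set env -n nsm-system daemonset/nsmgr NSM_LOG_LEVEL=TRACE
kubectl set env -n nsm-system deployment/registry-k8s NSM_LOG_LEVEL=TRACE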

szvincze commented

Hi @glazychev-art,

In this case the issue was reproduced after process restarts in multiple pods. Note that a massive number of robustness tests ran before it occurred. So, below I provide the important timestamps.
Forwarder process in pod forwarder-vpp-ffjzv and spire-agent process in pod spire-agent-sb8cp were killed around 2024-02-20T06:58:22.385Z.

The affected pods became ready at 2024-02-20T06:58:29.160Z but the traffic only recovered at 2024-02-20T07:01:18.759Z. The problematic connection was between nse-ipv6-6f976dd8df-688hn and nsc-c58b69c55-sfvc2 [100:100::7]:5003.

I have attached the logs and this time the logging level was set to TRACE.
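
For anyone probing a similar case, a hedged way to check the reported path directly (the pod names come from this report; whether the NSC image ships ping and nc is an assumption that may not hold for every image):

# Probe the reported NSE address and port from the affected NSC.
kubectl exec nsc-c58b69c55-sfvc2 -- ping -c 4 100:100::7
kubectl exec nsc-c58b69c55-sfvc2 -- nc -z -w 3 100:100::7 5003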

glazychev-art (Contributor, Author) commented

Thanks @szvincze,
As I see from the logs, you are using the previous NSM release, v1.11.2.
I think this problem is already fixed in v1.12.

Is it possible to get logs from v1.12? (for example, from our latest v1.12.1-rc.1)

szvincze commented

> As I see from the logs, you are using the previous NSM release, v1.11.2. I think this problem is already fixed in v1.12.
>
> Is it possible to get logs from v1.12? (for example, from our latest v1.12.1-rc.1)

Hi @glazychev-art, the intention was to test it on the latest RC, so let me double-check what happened.

szvincze commented

Here I attach the logs for the case I mentioned.

This time an NSMgr pod (nsmgr-vlnnw) was deleted at 2024-02-21T22:37:23; almost all connections broke immediately, except two traffic instances. When the new pod (nsmgr-w2l6d) came up, the connections were restored fairly quickly.
At the start of the next traffic iteration, one traffic instance was not able to connect at all during the monitored period, which was longer than 10 minutes. The affected pods were nse-ipv6-7c8cd797b5-p98x5 and nsc-6d5476bfbf-ss2rv.

glazychev-art (Contributor, Author) commented Mar 12, 2024

Problem area

We found a problem that is most likely related to VPP tap interfaces.

To confirm it:

  1. Reproduce the problem
  2. Go to the forwarder-vpp pod that serves the endpoint involved in the problem (nse-ipv6, for example)
  3. Run the following and share the output:
# vppctl show int
# vppctl show hardware-interfaces
# vppctl show errors
...
49367508            virtio-input                    buffer alloc error           error
...
# vppctl show buffers
Pool Name            Index NUMA  Size  Data Size  Total  Avail  Cached   Used  
default-numa-0         0     0   2304     2048    16808    0       0     16808 
# vppctl show acl-plugin acl

If you don't see any problems there, try getting similar information from the forwarder that serves the problematic NSC.
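
For context on the sample output above: Avail 0 with Used equal to the pool size, together with the virtio-input buffer alloc errors, points at buffer-pool exhaustion. A quick hedged check, using standard VPP CLI commands inside the affected forwarder pod, to tell a transient burst from a leak:

# Snapshot the pool, clear counters, wait while traffic is idle, then re-check.
# If Used stays pinned at the pool size with no traffic, buffers are leaking
# rather than merely bursting.
vppctl show buffers
vppctl clear errors
sleep 60
vppctl show errors
vppctl show buffers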

denis-tingaikin (Member) commented

cc @VitalyGushin
