Interface is not restored after restarting the forwarder #1047

Open · glazychev-art opened this issue Feb 6, 2024 · 10 comments

glazychev-art (Contributor) commented Feb 6, 2024

Steps to reproduce

  1. Create a kind cluster with 2 nodes
  2. Deploy SPIRE
  3. Deploy a basic NSM setup
  4. Deploy 2 NSEs on the same node: one with an IPv4 CIDR and one with an IPv6 CIDR
  5. Deploy 16 NSCs with the liveness checker disabled. Each one connects to both NSEs
  6. Delete the nsmgr located on the NSEs' node. After each restart, check connectivity
  7. Catch a case where the ping does not work (a rough restart/ping loop is sketched below)
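
A minimal sketch of the restart/ping loop from steps 6-7. The nsm-system namespace, the app=nsmgr and app=nsc label selectors, and the NSE_NODE/NSE_IP variables are all assumptions here, not details from the actual test setup; adjust them to your deployment:

# Hypothetical reproduction loop; selectors and addresses are assumptions.
for i in $(seq 1 20); do
  kubectl delete pod -n nsm-system -l app=nsmgr \
    --field-selector spec.nodeName="$NSE_NODE"    # kill the nsmgr on the NSE node
  kubectl wait --for=condition=Ready pod -n nsm-system \
    -l app=nsmgr --timeout=120s                   # wait for the replacement nsmgr
  for nsc in $(kubectl get pods -l app=nsc -o name); do
    kubectl exec "$nsc" -- ping -c 4 "$NSE_IP" \
      || echo "broken after iteration $i: $nsc"   # catch the failing case
  done
done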

Issue: networkservicemesh/sdk-vpp#802
Issue: networkservicemesh/sdk#1586

glazychev-art self-assigned this Feb 6, 2024
glazychev-art (Contributor, Author) commented

This is probably related: #664

glazychev-art (Contributor, Author) commented

@szvincze
Could you please provide additional logs once this problem is reproduced on your side?
We found a couple of interesting behaviors, but I don't see them in your logs,
so it would be great to have more information.
Thank you

szvincze commented

Hi @glazychev-art, let me check when we can schedule these tests again. I will come back with the logs as soon as I get them.

glazychev-art (Contributor, Author) commented

@szvincze
Thanks!
And if possible, please change the logging level from INFO to TRACE for all NSM components; one possible way is sketched below.
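
A hedged example of one way to do this, assuming the components read the NSM_LOG_LEVEL environment variable and assuming the nsm-system namespace and resource names of a typical basic setup (adjust to your deployment):

# Switch NSM components to TRACE logging; namespace and resource names are assumptions.
kubectl set env -n nsm-system daemonset/forwarder-vpp NSM_LOG_LEVEL=TRACE
kubectl set env -n nsm-system daemonset/nsmgr NSM_LOG_LEVEL=TRACE
kubectl set env -n nsm-system deployment/registry-k8s NSM_LOG_LEVEL=TRACE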

szvincze commented

Hi @glazychev-art,

In this case the issue was reproduced after process restarts in multiple pods. Note that a massive number of robustness tests ran before it occurred. So, below I provide the important timestamps.
Forwarder process in pod forwarder-vpp-ffjzv and spire-agent process in pod spire-agent-sb8cp were killed around 2024-02-20T06:58:22.385Z.

The affected pods became ready at 2024-02-20T06:58:29.160Z but the traffic only recovered at 2024-02-20T07:01:18.759Z. The problematic connection was between nse-ipv6-6f976dd8df-688hn and nsc-c58b69c55-sfvc2 [100:100::7]:5003.

I have attached the logs and this time the logging level was set to TRACE.
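
For anyone probing a similar case, a hedged way to check the reported path directly (the pod names come from this report; whether the NSC image ships ping and nc is an assumption that may not hold for every image):

# Probe the reported NSE address and port from the affected NSC.
kubectl exec nsc-c58b69c55-sfvc2 -- ping -c 4 100:100::7
kubectl exec nsc-c58b69c55-sfvc2 -- nc -z -w 3 100:100::7 5003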

glazychev-art (Contributor, Author) commented

Thanks @szvincze,
As I see from the logs, you are using the previous NSM release, v1.11.2.
I think this problem is already fixed in v1.12.

Is it possible to get logs from v1.12? (for example, from our latest v1.12.1-rc.1)

szvincze commented

> As I see from the logs, you are using the previous NSM release, v1.11.2. I think this problem is already fixed in v1.12.
>
> Is it possible to get logs from v1.12? (for example, from our latest v1.12.1-rc.1)

Hi @glazychev-art, the intention was to test it on the latest RC, so let me double-check what happened.

szvincze commented

Here I attach the logs for the case I mentioned.

This time an NSMgr pod (nsmgr-vlnnw) was deleted at 2024-02-21T22:37:23; almost all connections broke immediately, except two traffic instances. When the new pod (nsmgr-w2l6d) came up, the connections were restored fairly quickly.
At the start of the next traffic iteration, one traffic instance was not able to connect at all during the monitored period, which was longer than 10 minutes. The affected pods were nse-ipv6-7c8cd797b5-p98x5 and nsc-6d5476bfbf-ss2rv.

glazychev-art (Contributor, Author) commented Mar 12, 2024

Problem area

We found a problem that is most likely related to VPP tap interfaces.

To confirm it:

  1. Reproduce the problem
  2. Go to the forwarder-vpp pod that serves the endpoint involved in the problem (nse-ipv6, for example)
  3. Run the following and share the output:
# vppctl show int
# vppctl show hardware-interfaces
# vppctl show errors
...
49367508            virtio-input                    buffer alloc error           error
...
# vppctl show buffers
Pool Name            Index NUMA  Size  Data Size  Total  Avail  Cached   Used  
default-numa-0         0     0   2304     2048    16808    0       0     16808 
# vppctl show acl-plugin acl

If you don't see any problems there, try getting similar information from the forwarder that serves the problematic NSC.
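
For context on the sample output above: Avail 0 with Used equal to the pool size, together with the virtio-input buffer alloc errors, points at buffer-pool exhaustion. A quick hedged check, using standard VPP CLI commands inside the affected forwarder pod, to tell a transient burst from a leak:

# Snapshot the pool, clear counters, wait while traffic is idle, then re-check.
# If Used stays pinned at the pool size with no traffic, buffers are leaking
# rather than merely bursting.
vppctl show buffers
vppctl clear errors
sleep 60
vppctl show errors
vppctl show buffers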

denis-tingaikin (Member) commented

cc @VitalyGushin
