Faulty behavior of forwarder-vpp blocks the heal process #1161

Open
szvincze opened this issue Aug 21, 2024 · 20 comments
Labels
bug Something isn't working

Comments

@szvincze

We observed a problem with forwarder-vpp, which randomly starts ignoring requests for certain node-internal connections.
For example, a refresh seems to fail randomly, which triggers the healing mechanism. As part of it, in most cases even the Close is unsuccessful. Then, no matter how hard heal tries to repair the connection, every request times out in nsmgr: forwarder-vpp receives the requests but does not handle them, for instance it never forwards them to the NSE.

So there are request timeouts in nsmgr every 15 seconds (which comes from the request timeout on the NSC side).
However, in forwarder-vpp we only saw that the request had been received 15 seconds earlier and that forwarder-vpp did nothing with it.
This also results in a leaked interface, but in our opinion that is just a consequence and not the root cause of the problem.
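
For reference, the 15-second cadence should match the request timeout configured on the NSC. Assuming the stock cmd-nsc client is in use, that would be the NSM_REQUEST_TIMEOUT environment variable; it can be inspected on the deployment (my-nsc is just a placeholder name here):

kubectl get deploy my-nsc -o jsonpath='{.spec.template.spec.containers[0].env}'   # look for NSM_REQUEST_TIMEOUT; if unset, the client default applies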

It is important to mention that so far this behavior has been observed only for node-internal connections.
Also note that the request does not reach discover/discoverCandidateServer.Request() in forwarder-vpp: that is where the first debug message would appear, but it never did.

The situation recovers after a restart of forwarder-vpp.

The NSM version used is v1.13.2, while the NSCs use SDK version v1.11.2.

Unfortunately the DEBUG logs (as configured in the NSM release we use) are not sufficient to analyze the case more deeply.
However, the problem seems to occur quite frequently (within a couple of days). The TRACE log level is now set on the system and we are waiting for a reproduction, so hopefully it will appear soon and I can add detailed logs to this issue.

@denis-tingaikin
Member

Could you try to use v1.14.0-rc.1?

@szvincze
Author

This comes from a customer deployment. So far we have not managed to reproduce this situation in lab environments, but we are working on it. If we succeed, we will be able to try v1.14.0-rc.1.

@denis-tingaikin denis-tingaikin added the bug Something isn't working label Aug 27, 2024
@denis-tingaikin denis-tingaikin self-assigned this Aug 27, 2024
@denis-tingaikin
Member

@ljkiraly I found that healing could be triggered by deleting the network service. 

So you may run:

kubectl delete networkservicemesh $networkserviceName

If the endpoint registers its network service itself, then deleting it is enough.
Otherwise, just re-apply the service after some period of time:

kubectl apply -f $networkserviceName
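
If this needs to be repeated, a small wrapper along these lines may help (a sketch only; the sleep length is arbitrary and the names follow the placeholders above):

# resource and file names as in the commands above; adjust them to your cluster
kubectl delete networkservicemesh "$networkserviceName"
sleep 60    # arbitrary pause so heal keeps retrying before the service reappears
kubectl apply -f "$networkserviceName"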

@ljkiraly
Contributor

ljkiraly commented Sep 4, 2024

Unfortunately I cannot reproduce this issue in my environment (neither with NSM v1.13.2 nor with rc.2).
The problem was detected yesterday in a test environment (NSM v1.13.2) and we got some logs which contain TRACE-level printouts.

"dsc-fdr-7bdfc7b8b5-f7js4" pod can not connect to NS: "proxy.vpn1-b.sc-diam.deascmqv02".
The connection request (id:65102507-b8d3-4266-8633-421e569ee7c7) is blocked in beginServer.Request() as you can see in the attached log:
forwarder-vpp-sljd6_forwarder-vpp.txt

Once again: the failure popped up randomly during a long-running traffic case, where no elements were restarted.

@NikitaSkrynnik
Contributor

I've managed to reproduce the problem on NSM v1.13.2, but it looks like the issue is gone on NSM v1.14.0-rc.2. The changes in the begin chain element fix it. But we've encountered a couple of new issues: continuous Closes and occasional recvfd freezes.

@ljkiraly
Contributor

@NikitaSkrynnik That's good news!
Can you describe the reproduction steps?

@NikitaSkrynnik
Contributor

NikitaSkrynnik commented Sep 11, 2024

@ljkiraly, sure

  1. Create a 1-node cluster
  2. Set up NSM (basic setup, v1.13.2)
  3. Deploy 1 NSE
  4. Deploy 40 NSCs with NSM_MAX_TOKEN_LIFETIME=3m
  5. Wait until some NSCs start to get errors on refreshes. Sometimes I don't get any errors, so I repeat steps 2-4 again. Usually after repeating them 3-4 times I get the error (a rough sketch of this loop is shown below).
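
A minimal sketch of that repetition loop, assuming a kind or similar single-node cluster and that nsm-setup/, nse.yaml and nsc.yaml are placeholder manifests for steps 2-4:

for i in 1 2 3 4 5; do
  kubectl apply -k nsm-setup/              # step 2: basic NSM setup (v1.13.2)
  kubectl apply -f nse.yaml                # step 3: one NSE
  kubectl apply -f nsc.yaml                # step 4: 40 NSCs with NSM_MAX_TOKEN_LIFETIME=3m
  sleep 600                                # step 5: wait a few token lifetimes, then check NSC logs for refresh errors
  kubectl delete -f nsc.yaml -f nse.yaml   # tear down before the next attempt
  kubectl delete -k nsm-setup/
done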

@ljkiraly
Contributor

ljkiraly commented Sep 11, 2024

@NikitaSkrynnik, thank you. Based on this description I was able to get different connection errors with NSM v1.13.2. Sometimes only 6 NSCs connected successfully. After checking the forwarder logs, I have concerns about whether we are reproducing the same fault as originally reported. As I noted in a previous comment, the last TRACE-level log output related to a faulty connection was "beginServer.Request()" (#1161 (comment)).
I repeated steps 2-4 five times and got many types of connection errors, but seemingly the forwarder always ran through beginServer. Maybe I have to be more patient and keep trying.

Please confirm if you have seen any connection stuck in 'beginServer' based on forwarder logs.

@NikitaSkrynnik
Contributor

@ljkiraly, you can also try the scripts I used: healbug copy.zip

Run the ./run.sh command and, after it has finished, search for the string "policy failed" in the NSCs' logs. Usually I get this error on the 3rd-4th iteration. This error usually means that beginServer is stuck.
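
One possible way to do that search across all NSC pods from the shell (app=nsc is only a placeholder label; use whatever selector the scripts actually apply):

for pod in $(kubectl get pods -l app=nsc -o name); do
  kubectl logs "$pod" | grep -q "policy failed" && echo "$pod: policy failed found"
done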

@denis-tingaikin
Member

@NikitaSkrynnik Could you please attach the results from our last internal test?

@NikitaSkrynnik
Contributor

@denis-tingaikin, @ljkiraly after 30 runs on rc.6 I can't reproduce the error

@denis-tingaikin
Member

@NikitaSkrynnik And how often does it reproduce without the fix? Within about 1-3 runs, right?

@NikitaSkrynnik
Contributor

NikitaSkrynnik commented Sep 17, 2024

@denis-tingaikin, 3-5 runs usually

@denis-tingaikin
Member

denis-tingaikin commented Sep 17, 2024

@NikitaSkrynnik Could you please also check it with rc.7?

@NikitaSkrynnik
Contributor

NikitaSkrynnik commented Sep 18, 2024

I ran the tests 30 times on rc.7 and couldn't reproduce the problem.

@ljkiraly
Contributor

Also verified on v1.14.0-rc.7 (with 1.11.2, 1.13.2 and 1.14.0-rc.7 endpoints); the connections seem much more stable. I cannot reproduce the problem. Thank you for the fix.

@ljkiraly ljkiraly reopened this Sep 20, 2024
@zolug
Contributor

zolug commented Sep 27, 2024

Hi @NikitaSkrynnik,
Could you point me to the commit(s) in 1.14.0 covering the fix for this problem?
Also, could you share some more information about what was causing this behaviour and how it was resolved?
Thanks in advance.

@denis-tingaikin
Member

Hello @zolug,

We used the patch from the PR networkservicemesh/sdk-kernel#679.
See more details at networkservicemesh/sdk-kernel#679 (comment).

We are currently not planning to merge it into main since we'd like to get a fix via a NetLink release.

@zolug
Contributor

zolug commented Sep 27, 2024

Hi @denis-tingaikin,

Hmm, if I got it right, then it's not resolved in NSM 1.14.0 yet, is that correct?
I'm just somewhat confused because of the label saying 'Done in 1.14.0'. Also, according to some comments, there were no successful reproduction attempts with 1.14.0, or rc.7 that is.

@denis-tingaikin
Member

denis-tingaikin commented Sep 27, 2024

To be clear, 

We actually have two solutions:

  1. begin chain element changes => networkservicemesh/sdk@3016313
  2. workaround for the netlink library => Remove waiting in ipaddress chain element sdk-kernel#679

As far as we know at this moment, both fixes work and can be used together, but the begin chain element change has not been tested enough, so we decided not to include it in the final release.

Situation at this moment:

main branch --> problem is resolved via the begin chain element changes.

releases:
release/v1.14.0-rc.1 --> problem is resolved via the begin chain element changes.
release/v1.14.0-rc.2 --> problem is resolved via the begin chain element changes.
release/v1.14.0-rc.3 --> problem is resolved via the begin chain element changes + netlink workaround.
release/v1.14.0-rc.4 --> problem is resolved via the netlink workaround.
release/v1.14.0-rc.5 --> problem is resolved via the netlink workaround.
release/v1.14.0-rc.6 --> problem is resolved via the netlink workaround.
release/v1.14.0-rc.7 --> problem is resolved via the netlink workaround.
release/v1.14.0 --> problem is resolved via the netlink workaround.

Since we did not get enough time for testing in a customer-like environment, we went with the netlink workaround for release 1.14.0, since it looks more stable and safe.

The future release (v1.15.0) is planned to include the begin fixes plus a newly released version of the netlink library.

Please let me know if anything is still unclear.
