Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP timeout too short #171

Open
mdavidsaver opened this issue Jan 14, 2021 · 8 comments
Open

TCP timeout too short #171

mdavidsaver opened this issue Jan 14, 2021 · 8 comments
Assignees

Comments

@mdavidsaver
Copy link
Member

The 30 second idle timeout introduced by #144 was based a misunderstanding (by me) of the meaning of connectionTimeout in the java code. pvAccessJava clients are sending a echo every 30 seconds, while pvAccessCPP (and now also PVXS) servers timeout after 30 seconds. So there is a race between these two ~equal intervals.

The symptom of this is that otherwise idle connections will sometimes timeout after a multiple of 30 seconds. eg. with client and server both on the same host (my laptop) this can sometimes take several minutes.

I guess the only reasonable course of action is to increase the timeout in pvAccessCPP from 30 seconds to 60, while leaving the echo interval at 15 seconds?

cf. #139 and epics-base/pvxs#13 (comment)

@mdavidsaver mdavidsaver self-assigned this Jan 14, 2021
@anjohnson
Copy link
Member

This kind of issue (different server implementations using different beacon periods) was why Jeff made CA's watch-dog timeout adaptive, measuring how often it normally receives a beacon and setting the timeout to be double that period. Could you do something similar?

@mdavidsaver
Copy link
Member Author

This is too clever for my taste. I dislike coupling UDP beacons and TCP connections in this manner. I also dislike the unpredictability introduced. imo. timeouts on LANs should be bounded and short (in terms of human attention span) to ensure responsive applications.

@anjohnson
Copy link
Member

Don't these conflict though?

increase the timeout in pvAccessCPP from 30 seconds to 60

vs.

short (in terms of human attention span)

I was trying to suggest how you might avoid setting any timeout value to be as long as 60 seconds, but I don't know exactly what effect it has.

What happens if you set it to 35 or 40 seconds say? With CA there's the chance of a lost beacon (UDP packet), is the same true here or is this all TCP traffic?

@mdavidsaver
Copy link
Member Author

mdavidsaver commented Jan 14, 2021

Don't these conflict though?

Absolutely! imo. 30 seconds is too long. If this were a "green field" I would start with a default timeout of <=10 seconds.

What happens if you set it to 35 or 40 seconds say?

I actually tested 40 seconds first. It works, but then decided to go with a more conservative 60 seconds (once bitten, twice shy). I'm happy to go with 40 seconds though. More strictly, this would mean 4/3*$EPICS_PVA_CONN_TMO since $EPICS_PVA_CONN_TMO despite the name has actually meant the keepalive/echo period, which I don't want to change.

is the same true here or is this all TCP traffic?

atm. this is all TCP (on the c++ side at least).

@mdavidsaver
Copy link
Member Author

On further reflection I think I'll go with the shorter default of 40 seconds. I must take notice when @anjohnson is suggesting that I'm being overly conservative.

@mdavidsaver
Copy link
Member Author

2702e60

@anjohnson
Copy link
Member

Since this is all TCP there's no danger of a lost ack, so a timeout must mean either network congestion or the other end was blocked, both of which are cases that should be raised as potential problems.

@kasemir
Copy link
Collaborator

kasemir commented Jan 20, 2021

Finally got around to catching up with this discussion.
What is our conclusion?
The protocol description, https://github.com/epics-base/pvAccessCPP/wiki/protocol, is for now unchanged:

v2 server's 'Echo' reply must include the request payload.
v2 peers must close TCP connections when no data has been received in $EPICS_PVA_CONN_TMO seconds (default 30 sec.).
v2 clients must send 'Echo' more often than $EPICS_PVA_CONN_TMO seconds. The recommended interval is half of $EPICS_PVA_CONN_TMO (default 15 sec.).

That's exactly what the 'new' PVA Java client does:
The default CONN_TMO is 30 seconds, and it's used two ways.

  1. Client checks server

If Java client receives no TCP traffic (monitor updates) from server for half the CONN_TMO (~15 seconds +-3 seconds), it sends an 'ECHO' request.
A received 'ECHO' reply marks the TCP channel as 'alive', and cycle might repeat after CONN_TMO/2 seconds.
If nothing is received for CONN_TMO seconds, i.e. CONN_TMO/2 seconds after 'ECHO' request was sent, we consider the server dead, close the connection, and start over by searching for channels.

  1. Server checks client

The client assumes that if the server sees nothing from client for CONN_TMO seconds, the server will close the connection on us.
The client thus sends an 'echo' every CONN_TMO/2 (15) seconds unless it had sent other traffic like a 'monitor' request, 'put', or the 'echo' triggered by the above client-checks-server mechanism.

With softIocPVA from R7.0.4.1, I can confirm that the C++ PVA server will indeed close the connection on a running monitor after 30 seconds unless the (Java) client sends 'echo'.

So both the C++ server and the (new) Java client use CONN_TMO (default 30) to time out, and they prevent a timeout by sending 'echo' requests after CONN_TMO/2 (default 15), unless there is already other TCP traffic.

pvAccessJava clients send CMD_ECHO every 30 seconds, and timeout after 60 seconds.

I'd say that was a bug/misunderstanding in the original PVA java client. Setting the server's CONN_TMO to 40 should both fix it.

Do we keep the protocol description and basic mechanism, or should that now be updated?

(For what it's worth, just noticed that the new Java client actually reads EPICS_CA_CONN_TMO, not EPICS_PVA_CONN_TMO. Will need to update that to allow separate CA vs. PVA settings).

kasemir pushed a commit to ControlSystemStudio/phoebus that referenced this issue Jan 21, 2021
Had accidentally used the CA setting.
To be compatible with C++ implementation, should use
'EPICS_PVA_CONN_TMO'.
epics-base/pvAccessCPP#171

Also listing key settings in README
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants