Description
A bit of context:
We started using a SockJS based protocol in production a couple of weeks ago. After a couple of days of monitoring we noticed some clients exhibiting weird behaviour in rare cases: it seemed as if newly connected clients started in the middle of the application-layer protocol rather than with a regular handshake.
After a bit of debugging we realized that the cause is that sometimes the XHR polling client sends a delayed (> 5 second delay) poll request, which is treated as a new connection. If in the meantime an /xhr_send also happens then on the server this is treated as the first message of a newly connected client.
The server side SockJS is our own implementation so I checked with the reference sockjs-node implementation and could reproduce the same issue.
Steps to reproduce:
I'll use the example echo service from the https://github.com/sockjs/sockjs-node README. The poor network condition is simulated with a port-forwarding proxy.
- Start server on port 9999
- Start port forward on localhost from 9998 to 9999
$ curl -X POST localhost:9998/echo/000/000/xhr
open frame$ curl -X POST localhost:9998/echo/000/000/xhr
poll$ curl localhost:9998/echo/000/000/xhr_send --data '["Hello"]' -H content-type:text/plain
poll returns, server is waiting 5 seconds for next poll before close- Send SIGSTOP to the proxy. The port will stay open but data sent will be buffered by the OS
$ curl -X POST localhost:9998/echo/000/000/xhr
poll, does not reach server$ curl localhost:9998/echo/000/000/xhr_send --data '["World!"]' -H content-type:text/plain
hangs- Wait 5 seconds so that server drops connection
- Send SIGCONT to port forward. This will cause the /xhr and /xhr_send requests to be delivered (hopefully in this order). The /xhr request will open a new connection with the same session_id (return code 200) and the /xhr_send will send the message (return code 204), most probably causing havoc in the application layer protocol.
Solution proposal 1: Move the "open connection" functionality to a new endpoint, e.g. /xhr_open. If a rogue /xhr request arrives the server can simply disregard it. I think this is The Right Way to solve this, however it is obviously not backwards-compatible. Maybe keep the the original behaviour around as deprecated for a few versions?
Solution proposal 2: Keep around a set of recently-closed-session_ids on the server. New connections opened with such a session_id should be rejected. This has the benefit of being backwards compatible, however it adds additional complexity to the server and I reckon it also makes testing more cumbersome, as they may rely on being able to instantly open a new connection with the same session_id.