XHR Polling misbehaves under (very) poor network conditions #91

Open
@exFalso

A bit of context:
We started using a SockJS-based protocol in production a couple of weeks ago. After a few days of monitoring we noticed some clients exhibiting weird behaviour in rare cases: newly connected clients appeared to start in the middle of the application-layer protocol rather than with a regular handshake.
After a bit of debugging we found the cause: a poll request from the XHR polling client sometimes arrives late (delayed by more than 5 seconds), by which time the server has already dropped the session, so the request is treated as a new connection. If an /xhr_send also arrives in the meantime, the server treats it as the first message of the newly connected client.
The server-side SockJS is our own implementation, so I checked against the reference sockjs-node implementation and could reproduce the same issue there.

Steps to reproduce:
I'll use the example echo service from the https://github.com/sockjs/sockjs-node README. The poor network condition is simulated with a port-forwarding proxy.
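For anyone who wants to try this without a dedicated tool, here is a minimal sketch of such a port-forwarding proxy in Python. It is not the proxy I used; the names serve_forward and run_forever are made up for this example, and it assumes plain TCP on localhost. When the process is stopped with SIGSTOP the kernel keeps accepting and buffering on the listening port, which is the effect the steps below rely on.

```python
import socket
import threading
import time


def _pipe(src, dst):
    """Copy bytes from src to dst until src reaches EOF, then half-close dst."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def _handle(client, target):
    """Forward one accepted connection to target in both directions."""
    try:
        upstream = socket.create_connection(target)
    except OSError:
        client.close()
        return
    back = threading.Thread(target=_pipe, args=(upstream, client), daemon=True)
    back.start()
    _pipe(client, upstream)
    back.join()
    client.close()
    upstream.close()


def serve_forward(listen_port, target_port, host="127.0.0.1"):
    """Start forwarding host:listen_port -> host:target_port in background threads.

    Returns the listening socket so the caller can read the bound port.
    """
    listener = socket.create_server((host, listen_port))

    def accept_loop():
        while True:
            try:
                client, _ = listener.accept()
            except OSError:
                return  # listener was closed
            threading.Thread(
                target=_handle, args=(client, (host, target_port)), daemon=True
            ).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener


def run_forever(listen_port=9998, target_port=9999):
    """Blocking entry point; the default ports match the steps in this report."""
    serve_forward(listen_port, target_port)
    while True:
        time.sleep(3600)
```

Calling run_forever() from a script gives a process that can be paused with SIGSTOP and resumed with SIGCONT.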

  1. Start the server on port 9999
  2. Start a port forward on localhost from 9998 to 9999
  3. $ curl -X POST localhost:9998/echo/000/000/xhr
     (returns the open frame)
  4. $ curl -X POST localhost:9998/echo/000/000/xhr
     (poll)
  5. $ curl localhost:9998/echo/000/000/xhr_send --data '["Hello"]' -H content-type:text/plain
     (the poll from step 4 returns; the server now waits 5 seconds for the next poll before closing the session)
  6. Send SIGSTOP to the proxy. The port stays open, but any data sent to it is buffered by the OS
  7. $ curl -X POST localhost:9998/echo/000/000/xhr
     (poll; does not reach the server)
  8. $ curl localhost:9998/echo/000/000/xhr_send --data '["World!"]' -H content-type:text/plain
     (hangs)
  9. Wait 5 seconds so that the server drops the connection
  10. Send SIGCONT to the port forward. This causes the buffered /xhr and /xhr_send requests to be delivered (hopefully in this order). The /xhr request opens a new connection with the same session_id (return code 200) and the /xhr_send delivers the message to it (return code 204), most probably causing havoc in the application-layer protocol.
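To make the sequence above easier to reason about, here is a toy Python model of a server that opens a session on any /xhr request for an unknown session_id. This is not sockjs-node code: NaiveServer, its methods and the timestamps in replay() are invented for illustration, polls are modelled as instantaneous, and the 404 for an unknown session on xhr_send is my assumption about typical behaviour.

```python
from dataclasses import dataclass, field

DISCONNECT_DELAY = 5.0  # seconds the server waits for the next poll


@dataclass
class Session:
    last_poll_at: float
    inbox: list = field(default_factory=list)  # messages handed to the application


class NaiveServer:
    """Toy model: any /xhr for an unknown session_id opens a new session."""

    def __init__(self):
        self.sessions = {}
        self.opened = []  # (time, session_id) for every session that was opened

    def _expire(self, now):
        # Drop sessions that have not been polled within DISCONNECT_DELAY.
        stale = [sid for sid, s in self.sessions.items()
                 if now - s.last_poll_at > DISCONNECT_DELAY]
        for sid in stale:
            del self.sessions[sid]

    def xhr(self, sid, now):
        self._expire(now)
        session = self.sessions.get(sid)
        if session is None:
            self.sessions[sid] = Session(last_poll_at=now)
            self.opened.append((now, sid))
            return 200, "o"  # open frame
        session.last_poll_at = now
        return 200, "poll"  # placeholder for whatever frames a poll would carry

    def xhr_send(self, sid, payload, now):
        self._expire(now)
        session = self.sessions.get(sid)
        if session is None:
            return 404, None  # assumed response for an unknown session
        session.inbox.append(payload)
        return 204, None


def replay():
    """Replay the steps from this report with made-up timestamps."""
    server = NaiveServer()
    sid = "000/000"
    log = [
        server.xhr(sid, now=0.0),                    # step 3: open frame
        server.xhr(sid, now=1.0),                    # step 4: poll
        server.xhr_send(sid, '["Hello"]', now=2.0),  # step 5
        # steps 6-9: proxy stopped, requests buffered, session times out
        server.xhr(sid, now=9.0),                    # step 10: delayed poll
        server.xhr_send(sid, '["World!"]', now=9.0), # step 10: delayed send
    ]
    return server, log
```

In this model the delayed poll gets an open frame and '["World!"]' ends up as the only message in the inbox of a second session, which matches what we saw in production.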

Solution proposal 1: Move the "open connection" functionality to a new endpoint, e.g. /xhr_open. If a rogue /xhr request arrives for an unknown session, the server can simply disregard it. I think this is The Right Way to solve this; however, it is obviously not backwards-compatible. Maybe keep the original behaviour around as deprecated for a few versions?
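A rough Python sketch of what proposal 1 could look like on the server side. Everything here is hypothetical: StrictServer is a toy model and not an existing API, the status codes for rejected requests (404, and 409 for a duplicate open) are arbitrary choices, and polls are modelled as instantaneous.

```python
DISCONNECT_DELAY = 5.0  # seconds the server waits for the next poll


class StrictServer:
    """Toy model: only /xhr_open may create a session; /xhr never does."""

    def __init__(self):
        self.last_poll = {}   # session_id -> time of the last poll
        self.delivered = []   # (session_id, payload) handed to the application

    def _alive(self, sid, now):
        t = self.last_poll.get(sid)
        if t is not None and now - t > DISCONNECT_DELAY:
            del self.last_poll[sid]  # session timed out
            return False
        return t is not None

    def xhr_open(self, sid, now):
        if self._alive(sid, now):
            return 409, None  # arbitrary choice: refuse to reopen a live session
        self.last_poll[sid] = now
        return 200, "o"  # open frame

    def xhr(self, sid, now):
        if not self._alive(sid, now):
            return 404, None  # rogue or late poll: disregard, do not open
        self.last_poll[sid] = now
        return 200, "poll"  # placeholder for whatever frames a poll would carry

    def xhr_send(self, sid, payload, now):
        if not self._alive(sid, now):
            return 404, None
        self.delivered.append((sid, payload))
        return 204, None


def replay_strict():
    """Same timeline as the reproduction steps, with made-up timestamps."""
    server = StrictServer()
    sid = "000/000"
    log = [
        server.xhr_open(sid, now=0.0),
        server.xhr(sid, now=1.0),
        server.xhr_send(sid, '["Hello"]', now=2.0),
        server.xhr(sid, now=9.0),                     # delayed poll
        server.xhr_send(sid, '["World!"]', now=9.0),  # delayed send
    ]
    return server, log
```

With this split, both delayed requests are rejected and '["World!"]' never reaches the application, so the client can notice the failure and reconnect with a proper handshake.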
Solution proposal 2: Keep a set of recently closed session_ids on the server and reject new connections opened with such a session_id. This has the benefit of being backwards-compatible; however, it adds complexity to the server, and I reckon it also makes testing more cumbersome, as tests may rely on being able to instantly open a new connection with the same session_id.
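The bookkeeping for proposal 2 could be as small as the following Python sketch. RecentlyClosed and its ttl parameter are hypothetical names, the caller is assumed to pass non-decreasing timestamps, and how long an ID should stay blocked is an open question.

```python
class RecentlyClosed:
    """Remembers closed session_ids for ttl seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        # session_id -> close time; dicts keep insertion order, so with
        # non-decreasing timestamps the oldest entry is always first.
        self._closed = {}

    def add(self, sid, now):
        """Record that sid was closed at time now."""
        self._closed.pop(sid, None)  # re-insert so the entry moves to the end
        self._closed[sid] = now

    def contains(self, sid, now):
        """True if sid was closed less than ttl seconds before now."""
        self._purge(now)
        return sid in self._closed

    def _purge(self, now):
        # Drop entries from the front until one is still within the ttl.
        while self._closed:
            sid, closed_at = next(iter(self._closed.items()))
            if now - closed_at < self.ttl:
                break
            del self._closed[sid]
```

A server would call add() when it drops a session and check contains() before opening one for an unknown session_id. Setting ttl to 0 makes every lookup miss, which might be enough for test suites that need to reopen a session_id immediately.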
