
Error from long-running event handler #33

Open
mikegibney opened this issue Mar 19, 2018 · 14 comments
Open

Error from long-running event handler #33

mikegibney opened this issue Mar 19, 2018 · 14 comments

Comments

@mikegibney

(Disclaimer: I'm not sure whether this error is coming from the Event Store client, from Event Store itself, or from something completely different that has not yet been identified.)

We have an event handler that throws the following error when it runs for a sufficiently long time:

Error: write after end
    at writeAfterEnd (_stream_writable.js:166:12)
    at Socket.Writable.write (_stream_writable.js:217:5)
    at Socket.write (net.js:650:40)
    at Connection.sendMessage (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:533:21)
    at receiveMessage (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:160:22)
    at Socket.onData (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:81:21)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:172:18)
    at Socket.Readable.push (_stream_readable.js:130:10)
    at TCP.onread (net.js:542:20)
/home/app/tableservice/tableEventStore/index.js:404
            throw err;
            ^

Error: write after end
    at writeAfterEnd (_stream_writable.js:166:12)
    at Socket.Writable.write (_stream_writable.js:217:5)
    at Socket.write (net.js:650:40)
    at Connection.sendMessage (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:533:21)
    at receiveMessage (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:160:22)
    at Socket.onData (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:81:21)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:172:18)
    at Socket.Readable.push (_stream_readable.js:130:10)
    at TCP.onread (net.js:542:20)

If I restart the service running this event handler, it picks up where it died just fine and continues processing events.

Something that might be relevant: the event handler that appears to be the source of this error is subscribed to one stream. When it picks up a certain class of event, it opens up a different stream to read detailed events relating to the subscribed stream.

@x-cubed (Owner) commented Mar 20, 2018

My suspicion is that Event Store is closing the connection due to a timeout. If you need to maintain a long-running connection to Event Store, it's necessary to send data occasionally for the server to consider the connection to be active. The easiest way to do this is to call the ping method on a regular basis. See #24 for another issue describing this scenario.
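
For example, a minimal keep-alive sketch (the 10-second interval is arbitrary, and connection is assumed to be an already-connected Connection instance from this library):

    // Send a ping every 10 seconds so the server sees traffic on the connection
    setInterval(function () {
        connection.sendPing(function () {
            // Ignore the response; the call exists only to keep the link active
        });
    }, 10 * 1000);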

@mikegibney (Author)

Is there an idiomatic way of checking the current state of the connection, or would I need to track that myself in the onConnect/onError/onClose callbacks?

@x-cubed (Owner) commented Mar 20, 2018

You would need to track that yourself from those events.

@mikegibney (Author)

As near as we've been able to determine, this problem stems from the automated heartbeat messages that Event Store sends (adding the sendPing calls helped, but the connection still eventually died). In event-store-client/lib/connection.js, in the receiveMessage method:

        // Handle the message
        if (command == Commands.HeartbeatRequest) {
            // Automatically respond to heartbeat requests
            instance.sendMessage(correlationId, Commands.HeartbeatResponse);

        }

For whatever reason, it seems that sometimes the connection can be closed between receiving a heartbeat request and sending the response. Is this something that we should be accounting for in our application?
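
For what it's worth, here is a defensive sketch of the pattern in question (this is not the library's actual code; safeSend is a hypothetical helper, and the socket checks are illustrative):

    // Hypothetical guard: only write if the underlying net.Socket is still open.
    // "socket" stands in for whatever socket the Connection holds internally.
    function safeSend(socket, buffer) {
        if (!socket || socket.destroyed || !socket.writable) {
            // The server already closed the connection; drop the message rather
            // than triggering a "write after end" exception.
            return false;
        }
        return socket.write(buffer);
    }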

@mikegibney (Author)

And for any future visitors, this is roughly the ping logic we're using:

    var connected = false;

    connection = new client.Connection({
        // ...other connection settings (host, port, credentials, etc.)...
        onConnect: function () {
            connected = true;
            // Ping every 10 seconds so the server keeps the connection alive
            setInterval(() => {
                if (connection && connected) {
                    connection.sendPing(
                        () => {} // empty callback to avoid "unknown correlation ID" warnings
                    );
                }
            }, 10 * 1000);
            // ...rest of the connection logic...
        },
        onError: function (err) {
            connected = false;
            // ...whatever error-handling logic...
            throw err;
        },
        onClose: (err) => {
            connected = false;
        }
    });

@x-cubed (Owner) commented Mar 23, 2018

Ok, if you're suspecting the heartbeat responses are a problem then we'll need some more information to diagnose it.

I would recommend looking at the Event Store logs to see what it has to say about the connection (that will tell you whether it has closed the connection, or it thinks everything is okay). The other thing that would help would be a packet capture from Wireshark, so that we can see if it's a particular sequence of events that is causing the problem, or if there's something wrong with the heartbeat response packets themselves.

@mikegibney (Author)

Here's the relevant bit from /var/log/eventstore/:

[PID:09313:025 2018.03.27 15:28:05.991 INFO  TcpConnectionManager] Connection 'external-normal' [10.10.0.32:54834, {e4fc4784-93a0-4111-b8bb-d037ce0d8708}] closed: Success.
[PID:09313:031 2018.03.27 15:28:05.992 DEBUG PersistentSubscripti] Lost connection from 10.10.0.32:54834
[PID:09313:032 2018.03.27 15:28:05.998 INFO  TcpConnection       ] ES TcpConnection closed [15:28:05.998: N10.10.0.32:54836, L10.10.0.34:1113, {a1e065d1-662f-429a-8a12-80ef35b81915}]:Received bytes: 106, Sent bytes: 16141
[PID:09313:032 2018.03.27 15:28:05.999 INFO  TcpConnection       ] ES TcpConnection closed [15:28:05.999: N10.10.0.32:54836, L10.10.0.34:1113, {a1e065d1-662f-429a-8a12-80ef35b81915}]:Send calls: 1, callbacks: 1
[PID:09313:032 2018.03.27 15:28:06.000 INFO  TcpConnection       ] ES TcpConnection closed [15:28:06.000: N10.10.0.32:54836, L10.10.0.34:1113, {a1e065d1-662f-429a-8a12-80ef35b81915}]:Receive calls: 2, callbacks: 2
[PID:09313:032 2018.03.27 15:28:06.000 INFO  TcpConnection       ] ES TcpConnection closed [15:28:06.000: N10.10.0.32:54836, L10.10.0.34:1113, {a1e065d1-662f-429a-8a12-80ef35b81915}]:Close reason: [Success] Socket closed
[PID:09313:031 2018.03.27 15:28:06.001 INFO  TcpService          ] External TCP connection accepted: [Normal, 10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}].
[PID:09313:032 2018.03.27 15:28:06.003 INFO  TcpConnectionManager] Connection 'external-normal' [10.10.0.32:54836, {a1e065d1-662f-429a-8a12-80ef35b81915}] closed: Success.
[PID:09313:021 2018.03.27 15:28:06.004 DEBUG PersistentSubscripti] Lost connection from 10.10.0.32:54836
[PID:09313:019 2018.03.27 15:28:09.891 TRACE TcpConnectionManager] Closing connection 'external-normal' [10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 7
[PID:09313:019 2018.03.27 15:28:09.892 INFO  TcpConnection       ] ES TcpConnection closed [15:28:09.892: N10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}]:Received bytes: 382, Sent bytes: 759
[PID:09313:019 2018.03.27 15:28:09.893 INFO  TcpConnection       ] ES TcpConnection closed [15:28:09.893: N10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}]:Send calls: 8, callbacks: 8
[PID:09313:019 2018.03.27 15:28:09.893 INFO  TcpConnection       ] ES TcpConnection closed [15:28:09.893: N10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}]:Receive calls: 8, callbacks: 7
[PID:09313:019 2018.03.27 15:28:09.894 INFO  TcpConnection       ] ES TcpConnection closed [15:28:09.894: N10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}]:Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 7
[PID:09313:019 2018.03.27 15:28:09.895 INFO  TcpConnectionManager] Connection 'external-normal' [10.10.0.32:54504, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}] closed: Success.
[PID:09313:017 2018.03.27 15:28:09.896 DEBUG PersistentSubscripti] Lost connection from 10.10.0.32:54504
[PID:09313:021 2018.03.27 15:28:10.242 INFO  TcpConnection       ] ES TcpConnection closed [15:28:10.242: N10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}]:Received bytes: 106, Sent bytes: 16230
[PID:09313:021 2018.03.27 15:28:10.243 INFO  TcpConnection       ] ES TcpConnection closed [15:28:10.243: N10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}]:Send calls: 2, callbacks: 2
[PID:09313:021 2018.03.27 15:28:10.244 INFO  TcpConnection       ] ES TcpConnection closed [15:28:10.244: N10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}]:Receive calls: 2, callbacks: 2
[PID:09313:021 2018.03.27 15:28:10.244 INFO  TcpConnection       ] ES TcpConnection closed [15:28:10.244: N10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}]:Close reason: [Success] Socket closed
[PID:09313:021 2018.03.27 15:28:10.245 INFO  TcpConnectionManager] Connection 'external-normal' [10.10.0.32:54838, {d2c0aaee-e55a-49b5-883d-e517f673618a}] closed: Success.
[PID:09313:026 2018.03.27 15:28:10.246 DEBUG PersistentSubscripti] Lost connection from 10.10.0.32:54838
[PID:09313:027 2018.03.27 15:36:46.456 TRACE InMemoryBus         ] SLOW BUS MSG [manager input bus]: GetStatistics - 54ms. Handler: ProjectionManager.
[PID:09313:027 2018.03.27 15:36:46.457 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projections Master]: GetStatistics - 54ms. Q: 0/0.
[PID:09313:027 2018.03.27 15:48:10.475 TRACE InMemoryBus         ] SLOW BUS MSG [manager input bus]: GetStatistics - 75ms. Handler: ProjectionManager.
[PID:09313:027 2018.03.27 15:48:10.476 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projections Master]: GetStatistics - 76ms. Q: 0/0.

The exact timestamp of the error from the container was 2018-03-27T15:28:10.225626957Z. 10.10.0.32 is the IP of the consuming service. Still working on getting a packet capture.

@mikegibney (Author)

Sorry, I lost the thread on this; I had to deal with a couple of other issues that popped up. What would you want from a packet capture? This is running on an Ubuntu server, so it looks like I'd be using tshark or tcpdump, but if I'm being completely honest I'm in way over my head with this. I'm far from a TCP expert, so I'm not even sure what I'm looking for or how to go about getting it.

@x-cubed (Owner) commented Apr 4, 2018

So the Event Store logs are suggesting that we didn't reply to the Heartbeat request with a Heartbeat response. A packet capture should be able to tell us if that is the case, and if not, might identify what the confusion is.

All we need to capture is the TCP data on port 1113 when this issue is happening. I'm no expert on tcpdump (I usually use Wireshark myself), but it looks like tcpdump port 1113 -w capture.pcap should capture the data we need and save it to a file that you can then attach to this issue.

@mikegibney (Author)

After a certain degree of bumbling and false starts, I have successfully obtained a packet capture... which is chock full of proprietary data that I don't think I can share. :/ Is there an easy way to extract the heartbeat packets?

@x-cubed (Owner) commented Apr 6, 2018

The heartbeat packets by themselves probably aren't that useful; it's more useful to see the full stream of packets, in case there's something wrong with the order in which they're being sent.

Three possible approaches from here:

  • Share the capture with me privately via email (my first name dot my last name at gmail dot com) so that I can investigate.
  • Create a copy of your application and simplify it down to the code that deals with Event Store, and make sure it still reproduces the issue. Get it to send dummy data rather than proprietary data. Share the code to that application, so that I can reproduce the issue.
  • Dig through the packet capture yourself in Wireshark to see if you can see any odd behaviour.

@mikegibney (Author)

Okay, I managed to randomize/obfuscate the sensitive data with relative ease. Let me know if this is what you need.

capture.zip

@x-cubed (Owner) commented Apr 25, 2018

Ok, there's definitely something wrong with the heartbeat responses. Here's a list of packets from your capture showing the server requesting a heartbeat while the client carries on regardless, without sending a heartbeat response:

[Screenshot: heartbeat request but no response]

@x-cubed (Owner) commented Apr 25, 2018

So the interesting thing to note there is that it does respond to the heartbeat request in packet 81343, but then fails to respond to the second request in 81567. That would suggest that the receive logic is not handling incoming messages after the ReadStreamEventsForwardCompleted message is received.

You could try enabling the debug flag on the connection to get log output from the client, to see if it recognizes the second HeartbeatRequest.
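
As a rough sketch, that might look like the following (the debug option name and the host/port values here are assumptions based on this thread rather than verified settings):

    connection = new client.Connection({
        host: "10.10.0.34",   // Event Store address taken from the logs above
        port: 1113,
        debug: true,          // assumed flag name; logs commands sent and received
        onConnect: function () { /* ... */ },
        onError: function (err) { /* ... */ },
        onClose: function (reason) { /* ... */ }
    });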
