Socket-mode application becomes a zombie when an unhandled exception occurs in a single-threaded thread pool #833
Comments
In addition to the second presented log (
Hello @ddovgal, thanks for writing this in. This is an interesting scenario. I will summarize my observations, and perhaps that can help in identifying a) what is happening, b) what, if anything, we can do to work around this issue in the short term, and c) how to mitigate the problem in the long term. Identifying what is happening is key to addressing the issue properly.

In all three scenarios / stack traces you posted, we think the exception arises from this java-slack-sdk code area. Quick clarifying question: does the third log case scenario presented in your last comment occur in combination with the second log case scenario presented in your original post? I want to make sure I understand the timeline of the stack traces properly. If the third log case is what precedes / causes the second log case, then, because the stack traces for the third and first are identical, it may be the same single underlying cause for the issue. As you mentioned in the first log case, based on the end of the stack trace, it seems a reconnection is in progress (based on the execution of …).

In general, I am curious about your reasoning on why you think this has to do with threading. Could you elaborate more on this point?

As for a workaround or mitigation, it seems that Tyrus allows for the definition of retry behaviour and the use of our own reconnect handler. Perhaps that is one avenue we can explore for this scenario?
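For reference, Tyrus exposes this retry behaviour through a `ReconnectHandler` registered on its `ClientManager`. Below is a minimal standalone sketch of that mechanism using the plain Tyrus client API; the SDK's `SocketModeClientTyrusImpl` creates its own `ClientManager` internally, so this is only an illustration of the Tyrus feature, not something the SDK currently exposes, and the endpoint/URI wiring is intentionally left out.

```java
import javax.websocket.CloseReason;

import org.glassfish.tyrus.client.ClientManager;
import org.glassfish.tyrus.client.ClientProperties;

public class TyrusReconnectExample {
    public static void main(String[] args) {
        ClientManager client = ClientManager.createClient();

        // Tyrus retry behaviour: return true from the callbacks to schedule another attempt.
        ClientManager.ReconnectHandler reconnectHandler = new ClientManager.ReconnectHandler() {
            @Override
            public boolean onDisconnect(CloseReason closeReason) {
                System.out.println("WebSocket closed: " + closeReason + ", reconnecting");
                return true; // keep trying after the server closes the connection
            }

            @Override
            public boolean onConnectFailure(Exception exception) {
                // Also retry when the connection attempt itself fails
                // (e.g. a javax.websocket.DeploymentException during the handshake).
                System.out.println("Connect failed: " + exception + ", retrying");
                return true;
            }

            @Override
            public long getDelay() {
                return 5; // seconds to wait before the next attempt
            }
        };

        client.getProperties().put(ClientProperties.RECONNECT_HANDLER, reconnectHandler);

        // client.connectToServer(endpoint, clientEndpointConfig, serverUri) would go here;
        // in the SDK's case that call happens inside SocketModeClientTyrusImpl, so plugging
        // this handler in would require a change on the SDK side.
    }
}
```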
@ddovgal Thanks for sharing the details (really helpful!)
I'm also curious about this part. As far as I know, we may sometimes see "HTTP 408 Request Timed Out" when the app is behind a proxy or corporate firewall. Even if an app is not behind a proxy, some connectivity issue can arise between the Slack server side and your app's hosting infrastructure. According to the information you've shared in the description, the app is running in a Docker container, so another possibility is that some connectivity issue is happening between the app container and the outside. As @filmaj suggested, Tyrus's retry option may help. That being said, I'm wondering why the reconnection stops working in this scenario. Once we figure out the room for improvement on the SDK side, we are happy to quickly release a fix for it. Your continued help with the investigation would be greatly appreciated! 🙇
Sorry for such a delay in responding. I'm currently on vacation, so I will only be able to give more details on this case next week. I hope it occurs again this week so I can gather more information from the logs.
Hi! I have a similar issue with a Socket Mode app written in Kotlin on top of the Spring Framework. It has occurred rather often, right from the very beginning of the Socket Mode introduction. Is there any way to intercept the event of such a connection drop? Such an event could be used to mark the app as unhealthy.
Are you seeing this error log too? If so, the handshake error should be happening inside the Tyrus library. As I mentioned in my last comment, this can be a kind of connectivity issue between your app and the Slack server-side.
If the close listeners do not work for you, there is nothing else as far as I know. I've never reproduced the situation described here on my end. My understanding of the issue described here is still that, due to the inactive WebSocket connection, the SocketModeApp is no longer able to receive any data from Slack. If we can make the app reconnect to the Slack server-side in some way, the app should work again even in this scenario (because other parts, like the thread pool for message handling, are still working and awaiting incoming messages).
If Tyrus's retry option is useful for you, we may consider enabling the retry option by default. Can anyone try this option?
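Regarding the close-listener suggestion above, a rough sketch of flagging the app as unhealthy on connection loss is shown below. The `addWebSocketCloseListener` / `addWebSocketErrorListener` registration and the listener parameter shapes are assumptions about the `com.slack.api.socket_mode` listener API based on this thread, so treat this as an unverified sketch rather than confirmed SDK usage.

```java
import java.util.concurrent.atomic.AtomicBoolean;

import com.slack.api.socket_mode.SocketModeClient;

// Sketch: flip a health flag whenever the WebSocket is closed or errors out, so an
// external health check (e.g. an HTTP /health endpoint or a Kubernetes probe) can see it
// and an orchestrator can restart the container.
public class SocketModeHealthFlag {

    private final AtomicBoolean healthy = new AtomicBoolean(true);

    public void register(SocketModeClient client) {
        // ASSUMPTION: the close listener receives a status code and a reason string.
        client.addWebSocketCloseListener((code, reason) -> {
            healthy.set(false);
            System.err.println("Socket Mode connection closed: code=" + code + ", reason=" + reason);
        });
        // ASSUMPTION: the error listener receives the Throwable that occurred.
        client.addWebSocketErrorListener(error -> {
            healthy.set(false);
            System.err.println("Socket Mode connection error: " + error);
        });
    }

    public boolean isHealthy() {
        return healthy.get();
    }
}
```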
Hi @seratch!
My application was running with log level DEBUG for com.slack.api.socket_mode.SocketModeClient, and there was simply silence in the logs when the issue occurred. Even the entries about session maintenance were gone. Now I have enabled more detailed logs for Tyrus, so I'm waiting for the next occurrence of this issue. I will also check the close listener and the retry option in Tyrus.
👋 It looks like this issue has been open for 30 days with no activity. We'll mark this as stale for now, and wait 10 days for an update or for further comment before closing this issue out. If you think this issue needs to be prioritized, please comment to get the thread going again! Maintainers also review issues marked as stale on a regular basis and comment or adjust status if the issue needs to be reprioritized.
Hi! The issue with losing the Socket Mode connection still exists. Enabling TRACE on the Tyrus and SDK packages didn't help. The last message in the logs was a single "ping" from my application, without a "pong" received and without any new attempt to maintain the WebSocket session. There is a similar issue in the Python Slack SDK: slackapi/python-slack-sdk#1110
Hey @Havelock-Vetinari, Thanks for the response! I've updated this issue to be ignored by our auto-triage bot. Aside from enabling more logging, did you have an opportunity to try:
Since we've been unable to replicate this on our end, it's challenging to resolve the problem. This certainly looks like a legitimate issue, and I believe @seratch is eager to mitigate it, so he may follow up with more questions. In the meantime, any additional information that you can provide would be very helpful!
I faced a similar issue, and upgrading Tyrus and Bolt did not help. When I inspected the thread dump, I realized a bug in the JDK SSLEngine causes this behaviour. Upgrading OpenJDK 11.0.2 to 17.0.2 immediately fixed the problem.
In our application, we use the Bolt Socket Mode SDK for the bot, and we've found some strange behavior. Sometimes (it doesn't look like it depends on anything) the app becomes a "zombie", meaning that it continues to run but the bot isn't in a working state. For example, let's say that for a message with the text `ping` it would normally respond with a `pong` message; in that "non-working" state the bot won't react to the `ping` message. But at the same time the app isn't "dead" or crashed, and we can still successfully "stop" the `SocketModeApp` later. According to our observations, this always occurs after a `javax.websocket.DeploymentException` (we use `SocketModeClientTyrusImpl`).

I suspect (but still can be wrong 😅) this is due to the facts that:

- `com.slack.api.socket_mode.SocketModeClient#initializeSessionMonitorExecutor` and `com.slack.api.socket_mode.SocketModeClient#initializeMessageProcessorExecutor` both create their thread pools via `Executors.newSingleThreadScheduledExecutor`.
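To illustrate the suspected failure mode in isolation, here is a plain JDK sketch (not the SDK's actual code): when a periodic task submitted to a `ScheduledExecutorService` throws an uncaught exception, the JDK silently suppresses all subsequent executions of that task, so a single-thread scheduler that was, for instance, maintaining the session simply stops doing its job without the process dying.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SuppressedPeriodicTaskDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();

        // Periodic "session maintenance" task that blows up on the third run.
        executor.scheduleAtFixedRate(() -> {
            int n = runs.incrementAndGet();
            System.out.println("maintenance run #" + n);
            if (n == 3) {
                // Uncaught exception: per the ScheduledExecutorService contract,
                // subsequent executions are suppressed. No more runs, and no crash.
                throw new IllegalStateException("boom");
            }
        }, 0, 200, TimeUnit.MILLISECONDS);

        Thread.sleep(2_000);
        System.out.println("total runs: " + runs.get()
                + " (the executor is still alive, but the task never runs again)");
        executor.shutdownNow();
    }
}
```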
The Slack SDK version

OS info

Inside the `openjdk11:jdk-11.0.7_10-alpine-slim` Docker image.

Steps to reproduce:

Encounter `javax.websocket.DeploymentException`s. Please see my assumptions and the presented logs.

Expected result:

The app is "usable", able to react to events in the usual way after `javax.websocket.DeploymentException`s occur.

Actual result:

The app is "unusable". Like a zombie, it is unable to react to events in the usual way after `javax.websocket.DeploymentException`s occur, but it isn't totally crashed.
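Since the thread suggests that forcing a reconnect (or a full restart) brings the app back, one blunt stopgap is an external watchdog that tears down and recreates the `SocketModeApp` when no traffic has been seen for a while. The sketch below uses Bolt's `SocketModeApp` start/stop lifecycle; the `/ping` command, the `SLACK_APP_TOKEN` environment variable, the 30-minute inactivity threshold, and the "recreate the app wholesale" strategy are all illustrative assumptions, not an official recommendation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import com.slack.api.bolt.App;
import com.slack.api.bolt.socket_mode.SocketModeApp;

public class ZombieWatchdog {
    private static final AtomicLong lastInboundAt = new AtomicLong(System.currentTimeMillis());
    private static volatile SocketModeApp socketModeApp;

    public static void main(String[] args) throws Exception {
        String appToken = System.getenv("SLACK_APP_TOKEN"); // assumed env var name

        App app = new App();
        app.command("/ping", (req, ctx) -> {
            lastInboundAt.set(System.currentTimeMillis()); // record inbound activity
            return ctx.ack("pong");
        });

        socketModeApp = new SocketModeApp(appToken, app);
        socketModeApp.startAsync();

        // Watchdog: if nothing has arrived for 30 minutes, assume the connection is a
        // zombie and rebuild the Socket Mode connection from scratch. The trigger and
        // threshold are placeholders; a quiet workspace would cause needless restarts.
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        watchdog.scheduleAtFixedRate(() -> {
            try { // keep exceptions out of the scheduler (see the note above)
                long idleMillis = System.currentTimeMillis() - lastInboundAt.get();
                if (idleMillis > TimeUnit.MINUTES.toMillis(30)) {
                    socketModeApp.stop();
                    socketModeApp = new SocketModeApp(appToken, app);
                    socketModeApp.startAsync();
                    lastInboundAt.set(System.currentTimeMillis());
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 5, 5, TimeUnit.MINUTES);
    }
}
```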