grpc-default-executor threads stop processing tasks: how can we troubleshoot such a situation? #8174
I am a bit confused: are you saying there are no "grpc-default-executor" threads, or the threads are there but there are no tasks to process, or the threads have stopped processing tasks?
Thank you for the response. There were no "grpc-default-executor" threads in the thread dump, which was taken 12 minutes after the tests hung (forever waiting for the CountDownLatch).
Oh, so based on this, it is a regression from 1.26.0 to 1.37 (or 1.36). Would it be possible to pass a custom executor to your channel builder and see what the behavior is? It is unclear why there are no "grpc-default-executor" threads, but it is concerning. Is it possible you have some uncaught exceptions in your …
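A minimal sketch of that suggestion, assuming a hypothetical `localhost:50051` target; `ManagedChannelBuilder.executor(...)` makes gRPC run callbacks on a pool you own instead of the shared cached "grpc-default-executor" pool, so the threads are yours to name and instrument:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class CustomExecutorChannel {
  public static void main(String[] args) {
    // A fixed pool with named threads, instead of gRPC's shared cached pool.
    ExecutorService appExecutor = Executors.newFixedThreadPool(4, r -> {
      Thread t = new Thread(r, "my-grpc-app-executor");
      t.setDaemon(true);
      return t;
    });

    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("localhost", 50051) // hypothetical target
        .usePlaintext()
        .executor(appExecutor)          // callbacks now run on our pool
        .build();
    // ... use the channel, then shut down both the channel and the executor.
  }
}
```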
This is a suggestion by Sanjay from the gRPC team: grpc/grpc-java#8174 (comment)
"Logging client failed unexpectedly." and "CANCELLED: client cancelled" are expected messages in the test. After these messages, the service should call corresponding method (to count down the latches) for the exceptions. Trying the executor option. |
One more thing you may want to try is to catch all Throwables …
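A sketch of that idea (not the actual Beam code): wrap the real `StreamObserver` so any `Throwable` escaping a callback is at least logged before it can silently kill the executor task.

```java
import io.grpc.stub.StreamObserver;

// Delegating observer that logs anything thrown by the wrapped callbacks.
final class LoggingObserver<T> implements StreamObserver<T> {
  private final StreamObserver<T> delegate;

  LoggingObserver(StreamObserver<T> delegate) {
    this.delegate = delegate;
  }

  @Override public void onNext(T value) {
    try {
      delegate.onNext(value);
    } catch (Throwable t) {
      System.err.println("onNext threw: " + t);
      throw t; // rethrow so gRPC still sees the failure
    }
  }

  @Override public void onError(Throwable error) {
    try {
      delegate.onError(error);
    } catch (Throwable t) {
      System.err.println("onError threw: " + t);
    }
  }

  @Override public void onCompleted() {
    try {
      delegate.onCompleted();
    } catch (Throwable t) {
      System.err.println("onCompleted threw: " + t);
    }
  }
}
```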
With the executor option (commit), GrpcLoggingServiceTest passes, but BeamFnLoggingServiceTest fails.
What about modifying your …
Thanks. Let me try that. |
Trying sanjaypujare's advice: grpc/grpc-java#8174 (comment)
I tried catching Throwable in onError (commit) but the method succeeded.
I wish there were a way to log the lifecycle/state of grpc-default-executor threads, or to set a Thread.setUncaughtExceptionHandler.
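Setting an `UncaughtExceptionHandler` is straightforward when the threads come from your own `ThreadFactory`; a sketch with illustrative names:

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// ThreadFactory that names its threads and installs an
// UncaughtExceptionHandler, so a dying executor thread leaves a log trace.
final class ObservableThreadFactory implements ThreadFactory {
  private final AtomicInteger count = new AtomicInteger();

  @Override public Thread newThread(Runnable r) {
    Thread t = new Thread(r, "observed-grpc-executor-" + count.incrementAndGet());
    t.setUncaughtExceptionHandler((thread, e) ->
        System.err.println("Thread " + thread.getName() + " died with: " + e));
    return t;
  }
}
// Usage: Executors.newCachedThreadPool(new ObservableThreadFactory()),
// passed to the channel/server builder via .executor(...).
```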
To summarize: …
Is your test still failing because of disappearing threads? If that's the case, and you are using a custom executor, would you be able to instrument it to see why threads are disappearing?
That's right. Now it's my custom executor that is responsible for counting down the latches. Let me add more logging.
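One way to add that logging, sketched with illustrative names: wrap the executor so every task logs when it starts, finishes, or throws, which shows whether tasks stop arriving or stop completing.

```java
import java.util.concurrent.Executor;

// Executor wrapper that traces the lifecycle of every submitted task.
final class TracingExecutor implements Executor {
  private final Executor delegate;

  TracingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override public void execute(Runnable task) {
    delegate.execute(() -> {
      String name = Thread.currentThread().getName();
      System.out.println(name + " starting " + task);
      try {
        task.run();
        System.out.println(name + " finished " + task);
      } catch (Throwable t) {
        System.out.println(name + " failed " + task + ": " + t);
        throw t;
      }
    });
  }
}
```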
The executor (ThreadFactory) is yours, and the …
The thread that I created from my custom executor for the … https://gist.github.com/suztomo/bb1bf0137e391f472075baeb622328e2 Next question: what happened to the thread?
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool() says: "Threads that have not been used for sixty seconds are terminated and removed from the cache."
Maybe this is why I didn’t see the thread names in the thread dump. Next question: why had the thread been idle for 60 seconds (without counting down the latches)? |
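The 60-second idle timeout is easy to see in isolation. `Executors.newCachedThreadPool()` is the equivalent of the `ThreadPoolExecutor` configuration below, whose idle workers time out and vanish from thread dumps after a minute:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolDemo {
  public static void main(String[] args) throws InterruptedException {
    // Equivalent to Executors.newCachedThreadPool(): zero core threads,
    // 60-second keep-alive for idle workers.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());

    pool.execute(() ->
        System.out.println("ran on " + Thread.currentThread().getName()));

    Thread.sleep(1_000);
    System.out.println("pool size after 1s:  " + pool.getPoolSize()); // 1
    Thread.sleep(65_000);
    System.out.println("pool size after 65s: " + pool.getPoolSize()); // 0: idle worker removed
    pool.shutdown();
  }
}
```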
So the doc for onError says: "May only be called once and if called it must be the last method called. In particular if an exception is thrown by an implementation of onError no further calls to any method are allowed." So after processing …
Thank you for the document. In the test, the …
Okay, but then you should have another RPC following that, which would trigger the next …
Thanks. I think I'm getting the idea; …
The test creates 3 tasks. Each task creates a channel, creates the service stub, and calls the … My memo: the use of a latch to wait for communication between clients and servers also appears in gRPC's tutorial: https://grpc.io/docs/languages/java/basics/#bidirectional-streaming-rpc-1 (not using multiple clients)
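A minimal sketch of that latch pattern (type and method names are hypothetical, not the actual Beam test code): the response observer counts down when the stream terminates, and the waiting thread uses a timeout so a lost count-down fails fast instead of hanging forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import io.grpc.stub.StreamObserver;

public class LatchPatternSketch {
  // Observer that counts down once the stream terminates, via either path.
  static <RespT> StreamObserver<RespT> terminationObserver(CountDownLatch done) {
    return new StreamObserver<RespT>() {
      @Override public void onNext(RespT value) { /* record the response */ }
      @Override public void onError(Throwable t) { done.countDown(); }
      @Override public void onCompleted() { done.countDown(); }
    };
  }

  // Await with a timeout so a missing count-down fails the test visibly.
  static void awaitOrFail(CountDownLatch done) throws InterruptedException {
    if (!done.await(30, TimeUnit.SECONDS)) {
      throw new AssertionError("stream did not terminate within 30s");
    }
  }
}
```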
Given a bidirectional streaming RPC, is it guaranteed that calling …?
It is a "short-circuit" as described here #7558 (comment). See the whole issue for a detailed discussion of this. If you have any questions/comments regarding the behavior feel free to add them in that issue. |
As per the comment, the listener (onError) should be called immediately.
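For illustration, one way a client can cancel such a call is through a `CancellableContext`; cancelling it propagates to the server, whose `onError` should then fire promptly with CANCELLED (the RPC inside `run` is a hypothetical placeholder):

```java
import io.grpc.Context;

public class ClientCancelSketch {
  public static void main(String[] args) {
    // RPCs started inside this context are cancelled when it is cancelled.
    Context.CancellableContext withCancel = Context.current().withCancellation();
    withCancel.run(() -> {
      // start the streaming call here, e.g. stub.logging(responseObserver)
    });
    // Propagates cancellation to the server's StreamObserver#onError.
    withCancel.cancel(new RuntimeException("client cancelled"));
  }
}
```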
Update: a new fix to the test resolved the problem. (I don't know why it is better; to me it does the same thing.) If the problem recurs, I will reopen this. @sanjaypujare, thank you for the help.
Do you know a good way to troubleshoot "grpc-default-executor" threads' status?
In apache/beam#14768 (comment), when I tried to upgrade Beam's vendored (shaded) gRPC dependency from 1.26.0 to 1.37.0 (or 1.36), I observed that some tests (GrpcLoggingServiceTest or BeamFnLoggingServiceTest, randomly) do not finish. Borrowing Kenn's words, BeamFnLoggingServiceTest does the following: …
(GrpcLoggingServiceTest has a similar structure.)
Unfortunately it occurs only in Beam's CI Jenkins environment (which takes ~1 hour to finish). I cannot reproduce the problem locally.
From the observation of the trace log and the previous thread dump, it seems that the grpc-default-executor threads stop processing tasks (the thread dump showed no "grpc-default-executor" threads in the JVM while the test was waiting for them to count down a CountDownLatch), and one of the latches is never counted down. This leaves the test threads waiting forever for the remaining latch. I cannot tell why the "grpc-default-executor" threads stop working (or disappear).
Do you know how to troubleshoot such a situation?
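One thing that may help, sketched here as a suggestion rather than an established gRPC facility: dump the live threads programmatically from inside the hung test and filter for the executor's name prefix, to confirm whether the workers still exist and what state they are in.

```java
import java.util.Map;

public class ThreadDumpHelper {
  // Print name, state, and stack of every live "grpc-default-executor" thread.
  public static void dumpGrpcExecutorThreads() {
    Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
    for (Map.Entry<Thread, StackTraceElement[]> e : all.entrySet()) {
      Thread t = e.getKey();
      if (t.getName().startsWith("grpc-default-executor")) {
        System.err.println(t.getName() + " state=" + t.getState());
        for (StackTraceElement frame : e.getValue()) {
          System.err.println("    at " + frame);
        }
      }
    }
  }
}
```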