Description
Do you know a good way to troubleshoot "grpc-default-executor" threads' status?
In apache/beam#14768 (comment), when I tried to upgrade Beam's vendored (shaded) gRPC dependency from gRPC 1.26.0 to 1.37.0 (or 1.36), I observed that some tests (randomly either GrpcLoggingServiceTest or BeamFnLoggingServiceTest) do not finish. Borrowing Kenn's words, BeamFnLoggingServiceTest does the following:
- start a logging service
- set up some stub clients, each with onError wired up to release a countdown latch
- send error responses to all three of them (actually it sends the error in the same task it creates the stub)
- each task waits on the latch
(GrpcLoggingServiceTest has similar structure)
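To make the structure above concrete, here is a minimal sketch of the latch pattern in plain Java. It is not the actual Beam test: `LatchPatternSketch`, `runClients`, and the `dropOne` flag are hypothetical names, a plain `ExecutorService` stands in for whatever executor runs the gRPC callbacks, and a single shared latch is used for simplicity. The one deliberate difference is the bounded `await`, which turns a lost callback into a `false` return instead of a hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LatchPatternSketch {
    // Hypothetical stand-in for the test: each "client" gets a callback that
    // counts the latch down, the way onError is wired up in the real test.
    // Returns true only if every callback fired before the timeout.
    static boolean runClients(int clients, long timeoutMillis, boolean dropOne) {
        CountDownLatch latch = new CountDownLatch(clients);
        // Assumption: a cached thread pool plays the role of the callback executor.
        ExecutorService callbackExecutor = Executors.newCachedThreadPool();
        try {
            for (int i = 0; i < clients; i++) {
                // When dropOne is set, the first callback never counts down,
                // simulating a callback that is never delivered.
                boolean skip = dropOne && i == 0;
                callbackExecutor.execute(() -> {
                    if (!skip) {
                        latch.countDown();
                    }
                });
            }
            // Bounded wait: a missing callback yields false instead of blocking forever.
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            callbackExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("all callbacks delivered: " + runClients(3, 1000, false));
        System.out.println("one callback dropped:    " + runClients(3, 200, true));
    }
}
```

With a bounded wait like this, a CI run could fail fast and record diagnostics at the moment the latch times out, rather than stalling until the job is killed.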
Unfortunately, it occurs only in Beam's Jenkins CI environment (where a run takes ~1 hour to finish). I cannot reproduce the problem locally.
From the trace log and an earlier thread dump, it seems that the grpc-default-executor threads stop processing tasks: the thread dump showed no "grpc-default-executor" threads in the JVM while the test was waiting for them to count down a CountDownLatch, and one of the latches is never counted down. As a result, the test threads wait forever on the remaining latch. I cannot tell why the "grpc-default-executor" threads stop working (or disappear).
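One way to capture the executor threads' status from inside the JVM, for example when a latch wait times out, is to filter `Thread.getAllStackTraces()` by thread name. The sketch below uses only the JDK; `ExecutorThreadProbe`, `threadStates`, and `describe` are hypothetical helper names, not part of gRPC or Beam. As far as I know, gRPC's default executor is a cached thread pool, so idle threads are expected to exit after a while; an empty result may therefore mean that no task was submitted recently rather than that the threads died.

```java
import java.util.Map;
import java.util.TreeMap;

public class ExecutorThreadProbe {
    // Returns the state of every live thread whose name starts with the given
    // prefix, e.g. "grpc-default-executor". Terminated threads are not listed.
    static Map<String, Thread.State> threadStates(String prefix) {
        Map<String, Thread.State> states = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(prefix)) {
                states.put(t.getName(), t.getState());
            }
        }
        return states;
    }

    // Formats matching threads with their state and stack, suitable for a test log.
    static String describe(String prefix) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (!t.getName().startsWith(prefix)) {
                continue;
            }
            out.append(t.getName()).append(" [").append(t.getState()).append("]\n");
            for (StackTraceElement frame : e.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        if (out.length() == 0) {
            out.append("no live threads with prefix ").append(prefix).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("grpc-default-executor"));
    }
}
```

Calling `describe("grpc-default-executor")` at the point where the test gives up waiting would show whether the threads are blocked, idle, or absent at that exact moment, which a thread dump taken later may not.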
Do you know how to troubleshoot such a situation?