
grpc-default-executor threads stop processing tasks: how can we troubleshoot in such situation? #8174

Closed
suztomo opened this issue May 13, 2021 · 23 comments

@suztomo
Contributor

suztomo commented May 13, 2021

Do you know a good way to troubleshoot "grpc-default-executor" threads' status?

In apache/beam#14768 (comment), when I tried to upgrade Beam's vendored (shaded) gRPC dependency from 1.26.0 to 1.37.0 (or 1.36), I observed that some tests (GrpcLoggingServiceTest or BeamFnLoggingServiceTest, randomly) do not finish. Borrowing Kenn's words, BeamFnLoggingServiceTest does the following:

  • start a logging service
  • set up some stub clients, each with onError wired up to release a countdown latch
  • send error responses to all three of them (actually it sends the error in the same task that creates the stub)
  • each task waits on the latch

(GrpcLoggingServiceTest has a similar structure; a rough sketch of the latch pattern follows below.)
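
A rough, hypothetical sketch of that wiring, just to make the pattern concrete. The stub, message, and method names (LoggingServiceGrpc, LogControl, LogEntry, logging) are placeholders, not the actual Beam test code:

    import io.grpc.stub.StreamObserver;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    class LatchPatternSketch {
      // LoggingServiceGrpc, LogControl, and LogEntry stand in for the generated
      // Beam stub and proto types; this is not the actual Beam test code.
      static boolean runClients(LoggingServiceGrpc.LoggingServiceStub stub) throws InterruptedException {
        CountDownLatch errorLatch = new CountDownLatch(3);
        for (int i = 0; i < 3; i++) {
          // Each client's response observer releases the shared latch from onError.
          StreamObserver<LogControl> responseObserver =
              new StreamObserver<LogControl>() {
                @Override public void onNext(LogControl value) {}
                @Override public void onError(Throwable t) { errorLatch.countDown(); }
                @Override public void onCompleted() {}
              };
          // The error is sent from the same task that creates the stub and the RPC.
          StreamObserver<LogEntry> requestObserver = stub.logging(responseObserver);
          requestObserver.onError(new RuntimeException("client error " + i));
        }
        // The test hangs at this point when one of the onError callbacks never fires.
        return errorLatch.await(60, TimeUnit.SECONDS);
      }
    }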

Unfortunately it occurs only in Beam's CI Jenkins environment (which takes ~1 hour to finish). I cannot reproduce the problem locally.

From the trace log and the earlier thread dump, it seems that the grpc-default-executor threads stop processing tasks (the thread dump showed no "grpc-default-executor" threads in the JVM while the test was waiting for them to count down a CountDownLatch), and one of the latches is never counted down. This results in the test threads waiting forever for the remaining latch. I cannot tell why the "grpc-default-executor" threads stop working (or disappear).

Do you know how to troubleshoot such a situation?

@sanjaypujare
Contributor

it seems that the grpc-default-executor threads stop processing tasks (the thread dump showed no "grpc-default-executor" threads in the JVM while the test was waiting for them to count down a CountDownLatch)

I am a bit confused: are you saying there are no "grpc-default-executor" threads, or that the threads are there but have no tasks to process, or that the threads have stopped processing tasks?

@suztomo
Contributor Author

suztomo commented May 13, 2021

Thank you for the response. There were no "grpc-default-executor" threads in the thread dump, which was taken 12 minutes after the tests hung (waiting forever on the CountDownLatch).

@sanjaypujare
Contributor

sanjaypujare commented May 13, 2021

Oh, so based on this, it is a regression from 1.26.0 to 1.37 (or 1.36). Would it be possible to pass a custom executor to your channel builder and see what the behavior is?
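
For reference, a minimal sketch of that custom-executor suggestion, assuming the in-process channel mentioned later in the thread (the channel and thread names here are purely illustrative):

    import io.grpc.ManagedChannel;
    import io.grpc.inprocess.InProcessChannelBuilder;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class CustomExecutorSketch {
      static ManagedChannel buildChannel() {
        // Named threads make the pool easy to spot in a thread dump.
        ExecutorService customExecutor =
            Executors.newCachedThreadPool(r -> new Thread(r, "test-grpc-executor"));
        // executor(...) makes the channel use this pool instead of grpc-default-executor.
        return InProcessChannelBuilder.forName("logging-test")
            .executor(customExecutor)
            .build();
      }
    }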

It is unclear why there are no "grpc-default-executor" threads, but it is concerning. Is it possible you have some uncaught exceptions in your onError() that cause the thread to die? Also, I looked at apache/beam#14768 (comment): are the messages "Logging client failed unexpectedly." and "CANCELLED: client cancelled" part of the test, or could they somehow be related to this issue?

suztomo added a commit to suztomo/beam that referenced this issue May 13, 2021
@suztomo
Contributor Author

suztomo commented May 13, 2021

"Logging client failed unexpectedly." and "CANCELLED: client cancelled" are expected messages in the test. After these messages, the service should call corresponding method (to count down the latches) for the exceptions.

Trying the executor option.

@sanjaypujare
Contributor

"Logging client failed unexpectedly." and "CANCELLED: client cancelled" are expected messages in the test. After these messages, the service should call corresponding method (to count down the latches) for the exceptions.

Trying the executor option.

One more thing you may want to try is to catch all Throwables in your onError, log them, and see if that was the issue. It is possible that, because of an unrelated behavior change between 1.26 and 1.37, there is an uncaught exception that causes the thread to exit/die.
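
A sketch of that suggestion, assuming an SLF4J-style LOG field and a latch countdown inside onError (both placeholders for whatever the Beam code actually does):

    @Override
    public void onError(Throwable t) {
      try {
        // Existing handling: log the expected failure and release the latch.
        LOG.debug("Logging client failed unexpectedly.", t);
        errorLatch.countDown();
      } catch (Throwable unexpected) {
        // Surface anything that would otherwise silently kill the executor thread.
        LOG.error("Unexpected failure inside onError", unexpected);
        throw unexpected;
      }
    }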

@suztomo
Contributor Author

suztomo commented May 14, 2021

With the executor option (commit), GrpcLoggingServiceTest passes, but BeamFnLoggingServiceTest failed.

@sanjaypujare
Contributor

With the executor option (commit), GrpcLoggingServiceTest passes, but BeamFnLoggingServiceTest failed.

What about modifying your onError to make sure you catch all exceptions and log them?

@suztomo
Contributor Author

suztomo commented May 14, 2021

Thanks. Let me try that.

suztomo added a commit to suztomo/beam that referenced this issue May 14, 2021
@suztomo
Contributor Author

suztomo commented May 14, 2021

I tried catching Throwable in onError (commit) but the method succeeded.

suztomo-macbookpro44% grep 'onError succeeded'  17724.log
    14:37:16.420 [grpc-default-executor-1] DEBUG o.a.b.r.f.logging.GrpcLoggingService - onError succeeded
    14:37:16.423 [grpc-default-executor-2] DEBUG o.a.b.r.f.logging.GrpcLoggingService - onError succeeded
    14:37:16.430 [grpc-default-executor-3] DEBUG o.a.b.r.f.logging.GrpcLoggingService - onError succeeded
    14:52:04.255 [grpc-default-executor-6] DEBUG o.a.b.r.d.w.f.l.BeamFnLoggingService - onError succeeded
    14:52:04.255 [grpc-default-executor-3] DEBUG o.a.b.r.d.w.f.l.BeamFnLoggingService - onError succeeded
    14:52:04.255 [grpc-default-executor-0] DEBUG o.a.b.r.d.w.f.l.BeamFnLoggingService - onError succeeded

I wish there were a way to log the lifecycle/state of the grpc-default-executor threads, or to install an UncaughtExceptionHandler on them via Thread.setUncaughtExceptionHandler.
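
One possible way to get that visibility, sketched with a custom ThreadFactory (all names illustrative):

    import java.util.concurrent.ThreadFactory;
    import java.util.concurrent.atomic.AtomicInteger;

    class LoggingThreadFactory implements ThreadFactory {
      private final AtomicInteger count = new AtomicInteger();

      @Override
      public Thread newThread(Runnable r) {
        Thread t = new Thread(r, "test-grpc-executor-" + count.getAndIncrement());
        // Report any exception that would otherwise silently terminate the thread.
        t.setUncaughtExceptionHandler(
            (thread, e) -> System.err.println("Thread " + thread.getName() + " died: " + e));
        System.out.println("Created thread " + t.getName());
        return t;
      }
    }

Passing this factory to Executors.newCachedThreadPool(...) (or wherever the channel's executor is created) would at least show when the threads are created and whether they die from an exception.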

@sanjaypujare
Contributor

I tried catching Throwable in onError (commit) but the method succeeded.

To summarize:

  • you are using a custom executor now and that allows one test to pass but another test is still failing
  • based on the latest commit it looks like there are no uncaught exceptions happening in onError

Is your test still failing because of disappearing threads? If so, and since you are using a custom executor, would you be able to instrument it to see why the threads are disappearing?
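
In case it helps, a sketch of one way to instrument such an executor: a delegating wrapper that logs each task's start, finish, and thread (purely illustrative):

    import java.util.concurrent.Executor;

    class InstrumentedExecutor implements Executor {
      private final Executor delegate;

      InstrumentedExecutor(Executor delegate) {
        this.delegate = delegate;
      }

      @Override
      public void execute(Runnable task) {
        delegate.execute(() -> {
          String thread = Thread.currentThread().getName();
          System.out.println("task start on " + thread);
          try {
            task.run();
            System.out.println("task done on " + thread);
          } catch (Throwable t) {
            // Make failures visible instead of letting the pool swallow them.
            System.out.println("task failed on " + thread + ": " + t);
            throw t;
          }
        });
      }
    }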

@suztomo
Contributor Author

suztomo commented May 14, 2021

That's right. Now it's my custom executor that is responsible for counting down the latches. Let me add more logging.

@sanjaypujare
Contributor

That's right. Now it's my custom executor that is responsible for counting down the latches. Let me add more logging.

The executor (ThreadFactory) is yours and the onError code executed is also yours. This is unlikely to be a gRPC issue then. But it will be good to know what the root cause is.

@suztomo
Contributor Author

suztomo commented May 15, 2021

The thread that I created from my custom executor for the InProcessChannelBuilder was not there when I took the thread dump (27 minutes after the Jenkins console stopped showing progress).

https://gist.github.com/suztomo/bb1bf0137e391f472075baeb622328e2

Next question: what happened to the thread?

@suztomo
Contributor Author

suztomo commented May 15, 2021

https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool() says:

Threads that have not been used for sixty seconds are terminated and removed from the cache.

Maybe this is why I didn’t see the thread names in the thread dump.

Next question: why had the thread been idle for 60 seconds (without counting down the latches)?
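
For context on the 60-second timeout: Executors.newCachedThreadPool() is documented to be equivalent to a ThreadPoolExecutor with a 60-second keep-alive and a SynchronousQueue, so idle workers vanish from thread dumps after a minute. For debugging, a pool with a longer keep-alive keeps them visible (a sketch, not a suggested production setting):

    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    class DebugPoolSketch {
      static ThreadPoolExecutor newDebugCachedPool() {
        // Same shape as Executors.newCachedThreadPool(), but idle threads are kept
        // for 10 minutes so they still show up in a later thread dump.
        return new ThreadPoolExecutor(
            0, Integer.MAX_VALUE,
            10, TimeUnit.MINUTES,
            new SynchronousQueue<Runnable>());
      }
    }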

@sanjaypujare
Contributor

Maybe this is why I didn’t see the thread names in the thread dump.

Next question: why had the thread been idle for 60 seconds (without counting down the latches)?

So the doc for onError says

"May only be called once and if called it must be the last method called. In particular if an exception is thrown by an implementation of onError no further calls to any method are allowed."

So after processing onError you need to re-initiate the RPC for your next onError in order for your countdown latch to proceed. But I can't explain the difference in behavior between 1.26 and 1.36.

@suztomo
Contributor Author

suztomo commented May 17, 2021

Thank you for the document. In the test, the onError is the last call for each thread. apache/beam@070550d#r633667135

@sanjaypujare
Contributor

Okay, but then you should have another RPC following that which would trigger the next onError. And that's the only way the count-down latch will reach zero.

@suztomo
Contributor Author

suztomo commented May 17, 2021

Thanks. I think I'm getting the idea; onError may be called once per gRPC server, not per StreamObserver.

@suztomo
Contributor Author

suztomo commented May 18, 2021

The test creates 3 tasks. Each task creates a channel, creates the service stub, and calls the logging RPC (link). 3 tasks, 3 stubs, and 3 RPCs. Therefore I now think the existing test implementation should work.

My memo: the use of a latch to wait for communication between clients and servers also appears in gRPC's tutorial: https://grpc.io/docs/languages/java/basics/#bidirectional-streaming-rpc-1 (not using multiple clients)

@suztomo
Contributor Author

suztomo commented May 18, 2021

Given a bidirectional streaming RPC stub.m() and responseObserver and requestObserver such that

StreamObserver<...> requestObserver = stub.m(responseObserver);

Is it guaranteed that calling requestObserver.onError(exception) invokes responseObserver's onError() handler?

@sanjaypujare
Contributor

Given a bidirectional streaming RPC stub.m() and responseObserver and requestObserver such that

StreamObserver<...> requestObserver = stub.m(responseObserver);

Is it guaranteed that calling requestObserver.onError(exception) invokes responseObserver's onError() handler?

It is a "short-circuit" as described here #7558 (comment). See the whole issue for a detailed discussion of this. If you have any questions/comments regarding the behavior feel free to add them in that issue.

@suztomo
Contributor Author

suztomo commented May 18, 2021

As per the comment,

Information passed to client's outbound onError() is never transmitted to the server, server side only sees the cancellation. ... The ClientCall.Listener is immediately notified with CANCELLED status including the original cause.

The listener (onError) should be called immediately.
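
Continuing the placeholder sketch from earlier in the thread (the stub and proto types remain hypothetical, not the actual Beam code), the short-circuit would look roughly like this from the client side:

    static void cancelFromClient(LoggingServiceGrpc.LoggingServiceStub stub) {
      StreamObserver<LogControl> responseObserver =
          new StreamObserver<LogControl>() {
            @Override public void onNext(LogControl value) {}
            @Override public void onError(Throwable t) {
              // Per the quoted comment, this fires locally and immediately with
              // Status.CANCELLED; the cause should be the RuntimeException below.
              System.out.println("client saw: " + io.grpc.Status.fromThrowable(t));
            }
            @Override public void onCompleted() {}
          };
      StreamObserver<LogEntry> requestObserver = stub.logging(responseObserver);
      // The exception itself is never transmitted; the server only sees a cancellation.
      requestObserver.onError(new RuntimeException("client cancelled"));
    }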

@suztomo
Contributor Author

suztomo commented May 20, 2021

Update: a new fix to the test resolved the problem. (I don't know why it is better; to me it does the same thing.)

apache/beam#14833 (review)

If the problem recurs, I will reopen this. @sanjaypujare, thank you for the help.

@suztomo suztomo closed this as completed May 20, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 20, 2021