-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[client][fix] Bookie WatchTask may be stuck #4481
[client][fix] Bookie WatchTask may be stuck #4481
Conversation
@merlimat @eolivelli @hangc0276 @dlg99 @zymap @shoothzj PTAL. This problem has certain harmful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
great catch
I have changed the title, this is not an "improvement", but actually a "bugfix"
can we consider to prevent outside call of |
|
According to @hangc0276 suggestion, change the execution thread pool name to: |
Motivation
Before understanding the problem solved by this PR, you can try to use this Test method to execute it in the existing master branch, and the unit test will fail.
Next, let me tell you my problem:
We execute bookie offline and then go online in ReadOnly state, hoping that these bookies can continue to accept client read requests after they come back online.
Then, in some of our brokers, we found that some bookies were not re-listed in the BookKeeper client cache in ManagedLedger after they were online. The following picture shows that Update BookieInfoCache was only executed once, but it should be executed twice in theory, because in addition to ManagedLedger in Pulsar's Broker, there is also a BookKeeper Client in BookKeeperSchemaStroge. As shown in the second figure.
According to the stack analysis, our Scheduler thread is frequently performing the following operations:
Since the thread has been performing this operation, the watch event triggered by ZK has been in the queue of the scheduler thread and has not been executed, and no new watch listeners will be registered in ZK. Finally, the online Bookie node will no longer be updated in the cache of the bk client.
···
BookKeeperClientScheduler-OrderedScheduler-0-0
at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap$Section.removeIf(Ljava/util/function/BiPredicate;)I (ConcurrentOpenHashMap.java:406)
at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap.removeIf(Ljava/util/function/BiPredicate;)I (ConcurrentOpenHashMap.java:172)
at org.apache.bookkeeper.proto.PerChannelBookieClient.checkTimeoutOnPendingOperations()V (PerChannelBookieClient.java:1015)
at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.checkTimeoutOnPendingOperations()V (DefaultPerChannelBookieClientPool.java:132)
at org.apache.bookkeeper.proto.BookieClientImpl.monitorPendingOperations()V (BookieClientImpl.java:572)
at org.apache.bookkeeper.proto.BookieClientImpl.lambda$new$0()V (BookieClientImpl.java:131)
at org.apache.bookkeeper.proto.BookieClientImpl$$Lambda$77.run()V (Unknown Source)
at org.apache.bookkeeper.util.SafeRunnable$1.safeRun()V (SafeRunnable.java:43)
at org.apache.bookkeeper.common.util.SafeRunnable.run()V (SafeRunnable.java:36)
at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run()V (MoreExecutors.java:705)
at java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset()Z (FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Ljava/util/concurrent/ScheduledThreadPoolExecutor$ScheduledFutureTask;)Z (ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V (ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.FastThreadLocalRunnable.run()V (FastThreadLocalRunnable.java:30)
at java.lang.Thread.run()V (Thread.java:748)
···
So I suggest that WatchTask provide an internal independent thread pool.
Changes
Add inner thread for WatchTask.