-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the completionObjects leak problem. #4285
Fix the completionObjects leak problem. #4285
Conversation
@@ -1209,6 +1209,7 @@ private void writeAndFlush(final Channel channel, | |||
} | |||
} else { | |||
nettyOpLogger.registerFailedEvent(MathUtils.elapsedNanos(startTime), TimeUnit.NANOSECONDS); | |||
errorOut(key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #4278, the PerChannelBookieClient#addEntryTimeoutNanos is enabled, So you will see that 5 seconds(default timeout) passed after bk1 died, the step 9 trigger PendingAddOp timeout, then cause the issue in #4278.
I have a question, since we have a default thread to detect timeout requests every 5 seconds and remove the timedout CompletionKey
from completionObjects
, even if the write fails, will the timeout detection task not remove the key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, the timeout detection will remove the key.
# Conflicts: # bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java
Nice Catch! |
(cherry picked from commit 772b162)
(cherry picked from commit 772b162)
Fixes #4278, you need to go through #4278 for the context.
When a connection is broken, it will trigger PerChannelBookieClient#channelInactive.
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java
Lines 1292 to 1312 in 8516d0a
Point 1. line_1303, it will drain all completionObjects, and complete with BookieHandleNotAvailableException.
Point 2. line_1310, set PerChannelBookieClient#channel = null.
When PerChannelBookieClient#addEntry.
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java
Lines 776 to 854 in 8516d0a
Point 3. line_840, it will put completionKeyValue to completionObjects.
Point 4. line_852, if the channel is not null, invoke writeAndFlush.
There is a race condition between PerChannelBookieClient#channelInactive and PerChannelBookieClient#addEntry.
There is the timeline.
Point 1 -> Point 3 -> Point 4 -> Point 2.
It will write and flush the AddRequest to the netty channel. In the bookkeeper, there is a weakness in PerChannelBookieClient#writeAndFlush.
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java
Lines 1176 to 1279 in 8516d0a
At line_1230, we define a promise for the netty write and flush, if the write and flush failed, we only record the metrics at line_1211, not remove the completionKey from completionObjects. The completionKey will leak in the completionObjects.
If the PerChannelBookieClient#addEntryTimeoutNanos is disabled, the timeoutCheck won't work, so the completionKey exists in the completionObjects forever.
In #4278, the PerChannelBookieClient#addEntryTimeoutNanos is enabled, So you will see that 5 seconds(default timeout) passed after bk1 died, the step 9 trigger PendingAddOp timeout, then cause the issue in #4278.
So, as long as we remove the completionKey from completionObjects when write and flush AddRequest failed, we can solve the problem.