BP-66: support throttling for zookeeper read during rereplication #4258
Conversation
Could you help to review this BP? Thanks!
List some experimental data.
200 100MB Ledgers
Experimental Conditions
Comparison conditions:
Give the result as follows:
Experimental Result
We can see that the read delay only reaches 60ms. The peak read delay is 3ms, which is indeed a significant decrease compared to the previous 60ms. But because the second-level delay (delays on the order of seconds) has not been reproduced here, this result alone is not persuasive enough. Let's take a look at the experiment below.
5000 2MB Ledgers
Experimental Conditions
Comparison conditions:
Experimental Result
The read delay reaches 2 seconds, and the traffic for reading zk reaches 146KB/s. The peak read latency is 1.38s, and the peak read traffic significantly decreases to 73KB/s. The peak read latency significantly decreased to 40ms, and the peak read traffic was 73KB/s.
Could you help to review this BP? Thanks.
Thanks for working on this BP.
### Configuration
add the following configuration:
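The configuration listing itself lives in the BP-66 document and is not reproduced in this excerpt. Purely as a sketch of the general shape such a knob could take in bk_server.conf, assuming a per-second rate limit on zk metadata reads (the property name and value below are invented for illustration, not the ones the BP defines):

```
# Hypothetical property name, for illustration only -- see the BP-66 document for the real one.
# Upper bound on ZooKeeper metadata reads per second issued while rereplicating ledgers.
# A non-positive value would keep the current default behavior of no throttling.
rereplicationZkReadRateLimit=20
```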
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your work. Currently I maintain a bookie cluster with 200 nodes, with the speed-limit PR below applied. The bookie processes have autorecovery disabled, and about 10 standalone AutoRecovery processes are deployed.
#2778
So far, operating and maintaining the cluster has gone relatively well. I think you can separate out the AutoRecovery service and set the corresponding replication limit, which may help you.
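A minimal sketch of that kind of deployment, using the standard bk_server.conf options (the rate value is illustrative, assuming the limit is expressed in bytes per second):

```
# On every bookie: do not run the autorecovery daemon inside the bookie process.
autoRecoveryDaemonEnable=false

# On each of the ~10 standalone AutoRecovery nodes: cap replication throughput.
# replicationRateByBytes is the speed-limit knob discussed in this thread; the value is made up.
replicationRateByBytes=10485760
```

The standalone AutoRecovery processes themselves can be started with the regular `bin/bookkeeper autorecovery` command.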
As for this proposal: by individually limiting the frequency of zk reads, the zk service can be reasonably limited and protected. However, this is not a convenient way to limit the byte rate of reading and copying entries, because entry sizes vary.
By contrast, I feel that PR #2778 can protect zk's read load through its speed limiting.
Let's hear what others have to say.
We set replicationRateByBytes to only 3MB because we have 100+ replicators, but the zk read latency is still very high, at the minute level.
Each time we decommission a bookie in the production cluster, the read latency soars to the minute level.
Limiting the byte rate of replication can't relieve the pressure on zk; it only prevents the replication throughput from getting so high that it impacts normal client throughput.
If the throughput of a single RW is only 3MB, consider reducing the number of RWs to 10 and adjusting the throughput of a single RW to 30MB. There is no need to maintain 100 replicators. It is generally recommended to separate AR and Bookie.
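To illustrate that suggestion with the same knob (assuming replicationRateByBytes is expressed in bytes per second; 30MB ≈ 31457280 bytes, an illustrative value):

```
# On each of the ~10 dedicated replication workers: allow roughly 30MB/s of replication traffic.
replicationRateByBytes=31457280
```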
- ZK can't scale out. As the Pulsar & BookKeeper cluster scales out, the number of replicators inevitably grows with it. When the number of replicators reaches tens or hundreds, zk latency soars to an unacceptable level.
- For the convenience of operation and maintenance, we always set autoRecoveryDaemonEnable=true for every bookie. We do not adopt the other two options:
  - Deploying another cluster for AutoRecovery increases the complexity of the whole system.
  - Making a small part of the bookies in the BookKeeper cluster work as replicators and setting a high value of replicationRateByBytes is dangerous, because the throughput of normal clients would be impacted by the replication throughput.
- Actually, there is currently no way to relieve the pressure on zk during replication.
You've mentioned 400 bookies in the cluster.
In such a configuration (and in general) I'd recommend not running autorecovery on every bookie, and not running it as part of the bookie process. I'd go as far as to call that the best practice.
E.g. when autorecovery needs to run, it will compete for resources with the bookie, potentially OOMing it (though that has been improved over the years IIRC), etc.
Normally one does not need 400 AR services anyway.
I'd run 3, maybe 5, as separate processes, even if they run on a subset of the bookie nodes (better: just separately).
With 400 ARs you also get into the case where they frequently collide trying to grab a ledger for rereplication from ZK and then back off/wait, so many of the AR services won't be productive anyway.
With all that in mind, you can tune the dedicated AR service to have stricter settings for some of the existing throttles, such as:
- zkRequestRateLimit
- auditorMaxNumberOfConcurrentOpenLedgerOperations
- rereplicationEntryBatchSize
- and others
See detailed descriptions in https://github.com/apache/bookkeeper/blob/master/conf/bk_server.conf and in the corresponding code for the configs.
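For illustration only, a dedicated AR node's bk_server.conf could tighten those throttles along the following lines (every value below is a made-up example, not a recommendation; check bk_server.conf for the defaults and exact semantics):

```
# Example: cap the rate of ZooKeeper requests issued by this AR/auditor process.
zkRequestRateLimit=10
# Example: bound how many ledgers the auditor opens concurrently while checking.
auditorMaxNumberOfConcurrentOpenLedgerOperations=100
# Example: replicate entries in smaller batches to smooth out load.
rereplicationEntryBatchSize=10
```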
I think this should cover your use case without any changes, unless I have missed some nuanced point.
Deploying a dedicated cluster for AR is one solution to relieve the pressure on zk. But we prefer to solve this problem without adding to the complexity of the cluster and the difficulty of maintenance. And this PR fixes our problem pretty well without any negative effects.
I am ok with the change if we keep the default behavior the same as now (no throttling).
Thanks a lot. I have fixed the problems pointed out in the comments.
Signed-off-by: ZhangJian He <[email protected]>
…ache#4258) Signed-off-by: ZhangJian He <[email protected]> Co-authored-by: ZhangJian He <[email protected]>
Includes the BP-66 design proposal markdown document.
Master Issue: #4257
Implementation PR: #4256