
BP-66: support throttling for zookeeper read during rereplication #4258

Merged (3 commits, Apr 30, 2024)

Conversation

@thetumbled (Member) commented Apr 1, 2024

Includes the BP-66 design proposal markdown document.

Master Issue: #4257
Implementation PR: #4256

@thetumbled (Member Author)

Could you help review this BP? Thanks!
@hangc0276 @ivankelly @shoothzj @eolivelli @horizonzy

@thetumbled (Member Author) commented Apr 28, 2024

Here is some experimental data.
A 20-node BookKeeper cluster is deployed with autoRecoveryDaemonEnable=true and an ensemble size, write quorum, and ack quorum of 3/3/2, so each bookie acts as a replicator.

200 100MB Ledgers

Experimental Conditions

  • Create 200 ledgers of 100 MB each. Approximately 200 * 3 / 20 = 30 ledgers land on each bookie, occupying about 3000 MB of storage per bookie.

Comparison conditions:

  • replicationAcquireTaskPerSecond is set to the default value of 0, i.e. no rate limit.
  • replicationAcquireTaskPerSecond is set to 1, limiting task acquisition to one ZK read per second.
    We compare the ZK read latency and read throughput while decommissioning a bookie (the corresponding bk_server.conf settings are sketched after this list).
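
For reference, the relevant bk_server.conf settings look roughly like this. replicationAcquireTaskPerSecond is the key proposed in this BP; the snippet is an illustrative sketch, not the exact test configuration (ensemble/quorum settings are client-side and omitted):

```
# every bookie runs the autorecovery daemon, so every bookie acts as a replicator
autoRecoveryDaemonEnable=true

# proposed in BP-66: how many times per second a replicator may read ZK to acquire
# a rereplication task; 0 keeps the current behavior (no throttling)
replicationAcquireTaskPerSecond=0
# throttled runs use replicationAcquireTaskPerSecond=1 (and 0.5 in the second experiment)
```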

The results are as follows:

Experimental Result

  • No speed limit (replicationAcquireTaskPerSecond=0)
    (charts: ZK read latency and ZK read throughput)

We can see that the read latency reaches only 60 ms here.

  • Speed limit (replicationAcquireTaskPerSecond=1)
    (charts: ZK read latency and ZK read throughput)

The peak read latency is 3 ms, a significant decrease from the previous 60 ms. However, since the seconds-level latency seen in production was not reproduced here, this result alone is not very convincing, so let's look at the experiment below.

5000 2MB Ledgers

Experimental Conditions

  • Create 5000 ledgers of 2 MB each. Approximately 5000 * 3 / 20 = 750 ledgers land on each bookie, occupying about 1500 MB of storage per bookie.

Comparison conditions:

  • replicationAcquireTaskPerSecond is set to the default value of 0, i.e. no rate limit.
  • replicationAcquireTaskPerSecond is set to 1, limiting task acquisition to one ZK read per second.
  • replicationAcquireTaskPerSecond is set to 0.5, limiting task acquisition to one ZK read every 2 seconds.
    We compare the ZK read latency and read throughput while decommissioning a bookie.

Experimental Result

  • No speed limit (replicationAcquireTaskPerSecond=0)
    (charts: ZK read latency and ZK read throughput)

The read latency reaches 2 seconds, and the ZK read traffic reaches 146 KB/s.

  • replicationAcquireTaskPerSecond=1
    (charts: ZK read latency and ZK read throughput)

The peak read latency is 1.38 s, and the peak read traffic drops significantly to 73 KB/s.

  • replicationAcquireTaskPerSecond=0.5
    (charts: ZK read latency and ZK read throughput)

The peak read latency drops significantly to 40 ms, and the peak read traffic is 73 KB/s.
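
To make the semantics of the new knob concrete, here is a minimal sketch of how a per-second task-acquisition throttle can be built with Guava's RateLimiter. This is only an illustration of the idea, not the code from the implementation PR #4256; the class and method names below are hypothetical.

```java
import com.google.common.util.concurrent.RateLimiter;

/**
 * Illustrative sketch only: throttle how often the replication worker reads
 * ZooKeeper to acquire a rereplication task. Names are hypothetical; the real
 * implementation lives in PR #4256.
 */
public class ThrottledTaskAcquirer {

    private final RateLimiter limiter; // null when throttling is disabled

    public ThrottledTaskAcquirer(double replicationAcquireTaskPerSecond) {
        // 0 (the default) or a negative value means "no limit", matching current behavior
        this.limiter = replicationAcquireTaskPerSecond > 0
                ? RateLimiter.create(replicationAcquireTaskPerSecond)
                : null;
    }

    /** Blocks until a permit is available, then performs one ZK read for a task. */
    public long acquireNextTask() throws Exception {
        if (limiter != null) {
            limiter.acquire(); // e.g. rate=1 allows at most one acquisition attempt per second
        }
        // hypothetical placeholder for the existing ZK read path,
        // e.g. LedgerUnderreplicationManager#getLedgerToRereplicate()
        return readTaskFromZooKeeper();
    }

    private long readTaskFromZooKeeper() throws Exception {
        return -1L; // placeholder
    }
}
```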

@thetumbled (Member Author)

Could you help review this BP? Thanks.
@lhotari @hangc0276 @eolivelli @wenbingshen @zymap @shoothzj @horizonzy

@wenbingshen (Member) left a comment


Thanks for working on this BP.



### Configuration
add the following configuration:
Member

Thank you very much for your work. I currently maintain a bookie cluster with 200 nodes and have applied the speed-limit PR below. Autorecovery is disabled in the bookie processes, and about 10 standalone AutoRecovery processes are deployed:
#2778

So far, cluster operation and maintenance have been going relatively well. I think you could separate out the AutoRecovery service and set the corresponding replication limit, which may help you.

As for this proposal: limiting the frequency of ZK reads on its own can reasonably throttle and protect the ZK service, but it is not convenient for limiting the byte rate of reading and copying entries, because entry sizes vary. Conversely, I feel that PR #2778 can also protect ZK's read rate through its speed limiting.
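
For reference, the two throttles act on different resources; an illustrative bk_server.conf sketch follows (values are examples only, and the second key is the one proposed in this BP):

```
# from PR #2778: limits the byte rate at which entries are copied during rereplication
replicationRateByBytes=3145728

# proposed in BP-66: limits how often a replicator reads ZK to acquire a rereplication task
replicationAcquireTaskPerSecond=1
```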

Let's hear what others have to say.

@thetumbled (Member Author) commented Apr 28, 2024


We set replicationRateByBytes to only 3 MB because we have 100+ replicators, but the ZK read latency is still very high, at the level of minutes. Every time we decommission a bookie in the production cluster, the read latency soars to the minute level.
(chart: ZK read latency during a bookie decommission)

Limiting the byte rate of replication cannot relieve the pressure on ZK; it only prevents the replication throughput from being so high that it impacts normal client throughput.

Member

If the throughput of a single RW (replication worker) is only 3 MB, consider reducing the number of RWs to 10 and adjusting the throughput of a single RW to 30 MB. There is no need to maintain 100 replicators. It is generally recommended to separate AR from the bookie processes.

@thetumbled (Member Author) commented Apr 28, 2024

  • ZK can't scale out. As the Pulsar & BookKeeper clusters scale out, the number of replicators inevitably grows as well. When the number of replicators reaches tens or hundreds, ZK latency soars to an unacceptable level.
  • For convenience of operation and maintenance, we always set autoRecoveryDaemonEnable=true on every bookie. We do not adopt the other two options:
    • Deploying another cluster for AutoRecovery increases the complexity of the whole system.
    • Making a small subset of bookies in the BookKeeper cluster work as replicators and setting a high value of replicationRateByBytes is dangerous, because normal client throughput would be impacted by the replication throughput.
  • In fact, there is currently no way to relieve the pressure on ZK during replication.

@dlg99 (Contributor) left a comment

You've mentioned 400 bookies in the cluster.
In such a configuration (and in general) I'd recommend not running autorecovery on every bookie, and not running it as part of the bookie process. I'd go as far as to call that the best practice.

E.g. when autorecovery needs to run it will compete for resources with the bookie, potentially OOMing it (though that has been improved over the years, IIRC), etc.
Normally one does not need 400 AR services anyway.
I'd run 3, maybe 5, as separate processes, even if they run on a subset of bookie nodes (better: just separately).

With 400 ARs you also get into the situation where they frequently collide trying to grab a ledger for rereplication from ZK and then back off/wait, so many of the AR services won't be productive anyway.

With all that in mind, you can tune a dedicated AR service to have stricter settings for some of the existing throttles, such as:

zkRequestRateLimit
auditorMaxNumberOfConcurrentOpenLedgerOperations
rereplicationEntryBatchSize

and others

See detailed descriptions in https://github.com/apache/bookkeeper/blob/master/conf/bk_server.conf and in the corresponding code for the configs.

I think this should cover your use case without any changes, unless I have missed some nuanced point.
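
For illustration, a dedicated AR service could tighten those existing knobs roughly as follows; the values are examples only, not recommendations, and the authoritative descriptions are in bk_server.conf:

```
# cap the rate of ZooKeeper requests issued by this process (only enabled for positive values)
zkRequestRateLimit=20
# bound the auditor's concurrent open-ledger operations
auditorMaxNumberOfConcurrentOpenLedgerOperations=500
# number of entries rereplicated in a single batch
rereplicationEntryBatchSize=10
```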

@thetumbled (Member Author) commented Apr 29, 2024

> You've mentioned 400 bookies in the cluster. In such configuration (and in general) I'd recommend to not run autorecovery on every bookie, and not run it as a part of a bookie process. [...]

Deploying a dedicated cluster for AR is one way to relieve the pressure on ZK, but we prefer to solve this problem without adding cluster complexity and maintenance burden, and this PR fixes our problem pretty well without any negative effect.
As for the concern about replicator collisions, I have studied this issue before: when a replicator tries to acquire a task, it shuffles the znode list before iterating over it, so we have not run into that problem yet.
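
For context, here is a minimal sketch of the shuffle-before-iterate behaviour described above, assuming a plain ZooKeeper client and a hypothetical tryLock helper; this is not the actual ReplicationWorker or LedgerUnderreplicationManager code.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

/**
 * Illustrative sketch: each replicator shuffles the list of underreplicated-ledger
 * znodes before trying to lock one, so concurrent replicators rarely race for the
 * same ledger. The path and the tryLock helper are hypothetical.
 */
public class ShuffledTaskScan {

    public static String pickCandidate(ZooKeeper zk, String urLedgersPath)
            throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(urLedgersPath, false);
        Collections.shuffle(children);                      // randomize iteration order
        for (String child : children) {
            if (tryLock(zk, urLedgersPath + "/" + child)) { // hypothetical lock helper
                return child;                               // found an unclaimed task
            }
        }
        return null; // nothing available right now
    }

    private static boolean tryLock(ZooKeeper zk, String znodePath) {
        // placeholder: BookKeeper does this by creating an ephemeral lock znode
        return false;
    }
}
```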

@dlg99 (Contributor) commented Apr 29, 2024

I am OK with the change if we keep the default behavior the same as now (no throttling).
I left a couple of comments on the PR and will vote on the mailing list.

@thetumbled (Member Author)

> I am ok with the change if we maintain default behavior same as now (no throttling). I left couple of comments on PR and will vote on the maillist

Thanks a lot. I have fixed the issues corresponding to the comments.
The voting thread is: https://lists.apache.org/thread/llblggrr5rdr5fgqq45sq31qjg2rlb7n
Thanks.

@shoothzj shoothzj changed the title BP-66: support throttling for zookeeper read of rereplication BP-66: support throttling for zookeeper read during rereplication Apr 30, 2024
@shoothzj shoothzj merged commit c2defe2 into apache:master Apr 30, 2024
22 checks passed
@hangc0276 hangc0276 added this to the 4.18.0 milestone May 25, 2024
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024