
BP-66: support throttling for zookeeper read during rereplication #4258

Merged (3 commits, Apr 30, 2024)

Conversation

@thetumbled (Member) commented Apr 1, 2024

Includes the BP-66 design proposal markdown document.

Master Issue: #4257
Implementation PR: #4256

@thetumbled (Member Author)

Could you help review this BP? Thanks!
@hangc0276 @ivankelly @shoothzj @eolivelli @horizonzy

@thetumbled (Member Author) commented Apr 28, 2024

Here is some experimental data.
A 20-node BookKeeper cluster is deployed with autoRecoveryDaemonEnable=true and an ensemble size, write quorum, and ack quorum of 3/3/2, so each bookie acts as a replicator.

200 100MB Ledgers

Experimental Conditions

  • Create 200 ledgers of 100 MB each. Approximately 200 * 3 / 20 = 30 ledgers land on each bookie, occupying about 3000 MB of storage per bookie.

Comparison conditions:

  • replicationAcquireTaskPerSecond is set to the default value of 0, i.e. no rate limit.
  • replicationAcquireTaskPerSecond is set to 1, limiting task acquisition to one ZK read per second.
    We compare the ZK read latency and read throughput while decommissioning a bookie (the corresponding bk_server.conf settings are sketched after this list).
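
For reference, the relevant bk_server.conf settings look roughly like this. replicationAcquireTaskPerSecond is the key proposed in this BP; the snippet is an illustrative sketch, not the exact test configuration (ensemble/quorum settings are client-side and omitted):

```
# every bookie runs the autorecovery daemon, so every bookie acts as a replicator
autoRecoveryDaemonEnable=true

# proposed in BP-66: how many times per second a replicator may read ZK to acquire
# a rereplication task; 0 keeps the current behavior (no throttling)
replicationAcquireTaskPerSecond=0
# throttled runs use replicationAcquireTaskPerSecond=1 (and 0.5 in the second experiment)
```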

The results are as follows:

Experimental Result

  • No speed limit (replicationAcquireTaskPerSecond=0)
    (charts: ZK read latency and ZK read throughput)

We can see that the read latency reaches only 60 ms here.

  • Speed limit (replicationAcquireTaskPerSecond=1)
    (charts: ZK read latency and ZK read throughput)

The peak read latency is 3 ms, a significant decrease from the previous 60 ms. However, since the seconds-level latency seen in production was not reproduced here, this result alone is not very convincing, so let's look at the experiment below.

5000 2MB Ledgers

Experimental Conditions

  • Create 5000 ledgers of 2 MB each. Approximately 5000 * 3 / 20 = 750 ledgers land on each bookie, occupying about 1500 MB of storage per bookie.

Comparison conditions:

  • replicationAcquireTaskPerSecond is set to the default value of 0, i.e. no rate limit.
  • replicationAcquireTaskPerSecond is set to 1, limiting task acquisition to one ZK read per second.
  • replicationAcquireTaskPerSecond is set to 0.5, limiting task acquisition to one ZK read every 2 seconds.
    We compare the ZK read latency and read throughput while decommissioning a bookie.

Experimental Result

  • No speed limit (replicationAcquireTaskPerSecond=0)
    (charts: ZK read latency and ZK read throughput)

The read latency reaches 2 seconds, and the ZK read traffic reaches 146 KB/s.

  • replicationAcquireTaskPerSecond=1
    (charts: ZK read latency and ZK read throughput)

The peak read latency is 1.38 s, and the peak read traffic drops significantly to 73 KB/s.

  • replicationAcquireTaskPerSecond=0.5
    (charts: ZK read latency and ZK read throughput)

The peak read latency drops significantly to 40 ms, and the peak read traffic is 73 KB/s.
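
To make the semantics of the new knob concrete, here is a minimal sketch of how a per-second task-acquisition throttle can be built with Guava's RateLimiter. This is only an illustration of the idea, not the code from the implementation PR #4256; the class and method names below are hypothetical.

```java
import com.google.common.util.concurrent.RateLimiter;

/**
 * Illustrative sketch only: throttle how often the replication worker reads
 * ZooKeeper to acquire a rereplication task. Names are hypothetical; the real
 * implementation lives in PR #4256.
 */
public class ThrottledTaskAcquirer {

    private final RateLimiter limiter; // null when throttling is disabled

    public ThrottledTaskAcquirer(double replicationAcquireTaskPerSecond) {
        // 0 (the default) or a negative value means "no limit", matching current behavior
        this.limiter = replicationAcquireTaskPerSecond > 0
                ? RateLimiter.create(replicationAcquireTaskPerSecond)
                : null;
    }

    /** Blocks until a permit is available, then performs one ZK read for a task. */
    public long acquireNextTask() throws Exception {
        if (limiter != null) {
            limiter.acquire(); // e.g. rate=1 allows at most one acquisition attempt per second
        }
        // hypothetical placeholder for the existing ZK read path,
        // e.g. LedgerUnderreplicationManager#getLedgerToRereplicate()
        return readTaskFromZooKeeper();
    }

    private long readTaskFromZooKeeper() throws Exception {
        return -1L; // placeholder
    }
}
```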

@thetumbled (Member Author)

Could you help review this BP? Thanks.
@lhotari @hangc0276 @eolivelli @wenbingshen @zymap @shoothzj @horizonzy

@wenbingshen (Member) left a comment


Thanks for working on this BP.



### Configuration
add the following configuration:
Member

Thank you very much for your work. I currently maintain a bookie cluster with 200 nodes and have applied the speed-limit PR below. Autorecovery is disabled in the bookie processes, and about 10 standalone AutoRecovery processes are deployed:
#2778

So far, cluster operation and maintenance have been going relatively well. I think you could separate out the AutoRecovery service and set the corresponding replication limit, which may help you.

As for this proposal: limiting the frequency of ZK reads on its own can reasonably throttle and protect the ZK service, but it is not convenient for limiting the byte rate of reading and copying entries, because entry sizes vary. Conversely, I feel that PR #2778 can also protect ZK's read rate through its speed limiting.
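
For reference, the two throttles act on different resources; an illustrative bk_server.conf sketch follows (values are examples only, and the second key is the one proposed in this BP):

```
# from PR #2778: limits the byte rate at which entries are copied during rereplication
replicationRateByBytes=3145728

# proposed in BP-66: limits how often a replicator reads ZK to acquire a rereplication task
replicationAcquireTaskPerSecond=1
```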

Let's hear what others have to say.

@thetumbled (Member Author) commented Apr 28, 2024


We set replicationRateByBytes to only 3 MB because we have 100+ replicators, but the ZK read latency is still very high, at the level of minutes. Every time we decommission a bookie in the production cluster, the read latency soars to the minute level.
(chart: ZK read latency during a bookie decommission)

Limiting the byte rate of replication cannot relieve the pressure on ZK; it only prevents the replication throughput from being so high that it impacts normal client throughput.

Member

If the throughput of a single RW (replication worker) is only 3 MB, consider reducing the number of RWs to 10 and adjusting the throughput of a single RW to 30 MB. There is no need to maintain 100 replicators. It is generally recommended to separate AR from the bookie processes.

@thetumbled (Member Author) commented Apr 28, 2024

  • ZK can't scale out. As the Pulsar & BookKeeper clusters scale out, the number of replicators inevitably grows as well. When the number of replicators reaches tens or hundreds, ZK latency soars to an unacceptable level.
  • For convenience of operation and maintenance, we always set autoRecoveryDaemonEnable=true on every bookie. We do not adopt the other two options:
    • Deploying another cluster for AutoRecovery increases the complexity of the whole system.
    • Making a small subset of bookies in the BookKeeper cluster work as replicators and setting a high value of replicationRateByBytes is dangerous, because normal client throughput would be impacted by the replication throughput.
  • In fact, there is currently no way to relieve the pressure on ZK during replication.

@dlg99 (Contributor) left a comment

You've mentioned 400 bookies in the cluster.
In such a configuration (and in general) I'd recommend not running autorecovery on every bookie, and not running it as part of the bookie process. I'd go as far as to call that the best practice.

E.g. when autorecovery needs to run it will compete for resources with the bookie, potentially OOMing it (though that has been improved over the years, IIRC), etc.
Normally one does not need 400 AR services anyway.
I'd run 3, maybe 5, as separate processes, even if they run on a subset of bookie nodes (better: just separately).

With 400 ARs you also get into the situation where they frequently collide trying to grab a ledger for rereplication from ZK and then back off/wait, so many of the AR services won't be productive anyway.

With all that in mind, you can tune a dedicated AR service to have stricter settings for some of the existing throttles, such as:

zkRequestRateLimit
auditorMaxNumberOfConcurrentOpenLedgerOperations
rereplicationEntryBatchSize

and others

See detailed descriptions in https://github.com/apache/bookkeeper/blob/master/conf/bk_server.conf and in the corresponding code for the configs.

I think this should cover your use case without any changes, unless I have missed some nuanced point.
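
For illustration, a dedicated AR service could tighten those existing knobs roughly as follows; the values are examples only, not recommendations, and the authoritative descriptions are in bk_server.conf:

```
# cap the rate of ZooKeeper requests issued by this process (only enabled for positive values)
zkRequestRateLimit=20
# bound the auditor's concurrent open-ledger operations
auditorMaxNumberOfConcurrentOpenLedgerOperations=500
# number of entries rereplicated in a single batch
rereplicationEntryBatchSize=10
```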

@thetumbled (Member Author) commented Apr 29, 2024

> You've mentioned 400 bookies in the cluster. In such configuration (and in general) I'd recommend to not run autorecovery on every bookie, and not run it as a part of a bookie process. [...]

Deploying a dedicated cluster for AR is one way to relieve the pressure on ZK, but we prefer to solve this problem without adding cluster complexity and maintenance burden, and this PR fixes our problem pretty well without any negative effect.
As for the concern about replicator collisions, I have studied this issue before: when a replicator tries to acquire a task, it shuffles the znode list before iterating over it, so we have not run into that problem yet.
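
For context, here is a minimal sketch of the shuffle-before-iterate behaviour described above, assuming a plain ZooKeeper client and a hypothetical tryLock helper; this is not the actual ReplicationWorker or LedgerUnderreplicationManager code.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

/**
 * Illustrative sketch: each replicator shuffles the list of underreplicated-ledger
 * znodes before trying to lock one, so concurrent replicators rarely race for the
 * same ledger. The path and the tryLock helper are hypothetical.
 */
public class ShuffledTaskScan {

    public static String pickCandidate(ZooKeeper zk, String urLedgersPath)
            throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(urLedgersPath, false);
        Collections.shuffle(children);                      // randomize iteration order
        for (String child : children) {
            if (tryLock(zk, urLedgersPath + "/" + child)) { // hypothetical lock helper
                return child;                               // found an unclaimed task
            }
        }
        return null; // nothing available right now
    }

    private static boolean tryLock(ZooKeeper zk, String znodePath) {
        // placeholder: BookKeeper does this by creating an ephemeral lock znode
        return false;
    }
}
```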

@dlg99 (Contributor) commented Apr 29, 2024

I am OK with the change if we keep the default behavior the same as now (no throttling).
I left a couple of comments on the PR and will vote on the mailing list.

@thetumbled (Member Author)

> I am ok with the change if we maintain default behavior same as now (no throttling). I left couple of comments on PR and will vote on the maillist

Thanks a lot. I have fixed the issues corresponding to the comments.
The voting thread is: https://lists.apache.org/thread/llblggrr5rdr5fgqq45sq31qjg2rlb7n
Thanks.

@shoothzj shoothzj changed the title BP-66: support throttling for zookeeper read of rereplication BP-66: support throttling for zookeeper read during rereplication Apr 30, 2024
@shoothzj shoothzj merged commit c2defe2 into apache:master Apr 30, 2024
22 checks passed
@hangc0276 hangc0276 added this to the 4.18.0 milestone May 25, 2024
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024