
3.x: Add RandomTwoChoice load balancing policy #198

Open · Gor027 wants to merge 1 commit into scylla-3.x from random_two_choice
Conversation


@Gor027 Gor027 commented Feb 7, 2023

TL;DR

The RandomTwoChoice policy is a wrapper load balancing policy that adds the "Power of 2 Choices" algorithm to its child policy. It compares the first two hosts returned by the child policy's query plan and returns first the host whose target shard has fewer in-flight requests. The rest of the child query plan is left intact.

It is intended to be used with TokenAwarePolicy and RoundRobinPolicy, to send queries to the replica nodes by always avoiding the worst option (the replica with the target shard having the most inflight requests will not be chosen).
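A minimal usage sketch of that intent (the RandomTwoChoicePolicy constructor shown here is an assumption about how the wrapper is wired; only Cluster, TokenAwarePolicy and RoundRobinPolicy are existing driver 3.x API):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// Hypothetical wiring: the two-choice wrapper sits on top of the token-aware,
// round-robin query plan, so only the first two replicas of each plan are reordered.
Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withLoadBalancingPolicy(
        new RandomTwoChoicePolicy(               // name taken from this PR, constructor assumed
            new TokenAwarePolicy(new RoundRobinPolicy())))
    .build();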

The original problem

The current default load-balancing policy is the TokenAwarePolicy, which fetches the replica set for a prepared statement and iterates over the replicas in round-robin fashion. This approach has a limitation when one of the replicas becomes overloaded by background processing such as compactions: the driver keeps sending queries to the overloaded replica, which can increase latencies and even cause queries to time out.

The solution

To avoid sending queries to replicas under high load, we need to collect metrics about the request queues of each replica, per shard. The shard-awareness feature lets us identify the target shard for a prepared statement and then choose the replica whose target shard has the fewest in-flight requests (the shortest request queue). This is particularly effective in environments where latency to each replica varies unpredictably. However, it can fall apart with multiple clients: at any given moment, queries from several clients may all be sent to the same replica whose target shard currently has the fewest in-flight requests, producing an unbalanced and unfair distribution. So, instead of always making the absolute best choice, we follow the "power of two choices" algorithm: pick two replicas at random and choose the better of the two, thereby always avoiding the worst option. This works well with multiple clients, since it simply avoids the worst choice and distributes the traffic with some degree of randomness.
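A minimal sketch of the selection step described above (pick two replicas at random, compare per-shard in-flight counts, return the better one); the inflightForTargetShard accessor is a hypothetical stand-in for the driver's shard-aware counters:

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.ToIntFunction;

// "Power of two choices": sample two distinct replicas and keep the one whose
// target shard has fewer in-flight requests, so the worst option is never picked.
static <H> H pickBetterOfTwo(List<H> replicas, ToIntFunction<H> inflightForTargetShard) {
    if (replicas.size() < 2) {
        return replicas.get(0); // assumes at least one replica is available
    }
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    int i = rnd.nextInt(replicas.size());
    int j = rnd.nextInt(replicas.size() - 1);
    if (j >= i) {
        j++; // make sure the two sampled indices are distinct
    }
    H a = replicas.get(i);
    H b = replicas.get(j);
    return inflightForTargetShard.applyAsInt(a) <= inflightForTargetShard.applyAsInt(b) ? a : b;
}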

Benchmarks

  • Start a Scylla cluster with 5 i3.4xlarge nodes with Scylla 5.1.3 (AMI-07d0ff1841b1a9d8a)
  • Start 2 c5n.9xlarge loaders with Ubuntu 22.04 LTS (AMI-0ab0629dba5ae551d)
  • Run on each loader a c-s process with the following command:
    ./cassandra-stress read duration=150m -schema 'replication(strategy=SimpleStrategy, factor=3)' -mode native cql3 -rate "threads=500" -pop 'dist=UNIFORM(1..100000000)' -node NODE_IP
All the nodes and loaders were in us-east-2a

To test how well the new load-balancing policy performs in comparison with the TokenAwarePolicy, shard 4 (a randomly chosen shard) of one of the nodes was overloaded with a simple C application that kept the CPU core busy and thus increased the number of in-flight requests for that specific shard. Four copies of the overloading application were run, using approximately 75% of the CPU core.

Additionally, data was collected about the in-flight requests for all nodes and shards, and about how many times each node was chosen to send a request to. Charts were generated from this data, highlighting the better performance of the RandomTwoChoicePolicy over the TokenAwarePolicy.


Results of the TokenAwarePolicy

(Charts: Overview; Load and Throughput by instance, shard; Latencies by instance, shard; Inflight Requests and Host Chosen, for all nodes and shard 4.)


Results of the RandomTwoChoicePolicy

(Charts: Overview; Load and Throughput by instance, shard; Latencies by instance, shard; Inflight Requests and Host Chosen/Not Chosen, for all nodes and shard 4.)

Conclusion

The benchmark with a single overloaded shard on one node shows that the RandomTwoChoice policy performs better in terms of throughput, latencies, and smarter distribution of requests across the nodes. The replica with the overloaded shard was chosen less frequently with the RandomTwoChoice policy because its number of in-flight/hanging requests was higher. This resulted in low latencies by instance and shard, compared to the TokenAwarePolicy, where the P95 and P99 latencies were almost 300 ms. The overall throughput was also significantly higher, and the loaders were able to maximize their use of CPU cores.

However, the RandomTwoChoice policy may turn out to be less efficient for write workloads, as writes cause a lot of background processing such as compactions and hints. As a result, latencies may rapidly increase and even cause frequent timeouts. In fact, the average and P95 latencies were not that different, but the P99 latency was significantly higher for the RandomTwoChoice policy. The throughput, however, was more or less the same as with the TokenAwarePolicy, so benchmarking write workloads without throttling the throughput will probably not show significant differences, as the cluster easily becomes overloaded with background processing, resulting in skyrocketing latencies, hints, and background and foreground writes:

(Charts: Cluster Throughput, RandomTwoChoice vs TokenAwarePolicy; Cluster Latencies, RandomTwoChoice vs TokenAwarePolicy.)

@Gor027 Gor027 force-pushed the random_two_choice branch 2 times, most recently from 9fa2fbd to 727a80d Compare February 7, 2023 01:08
@Gor027 Gor027 requested a review from avelanarius February 7, 2023 13:53
@mykaul mykaul changed the title Add RandomTwoChoice load balancing policy 3.x: Add RandomTwoChoice load balancing policy Feb 22, 2023

dorlaor commented Dec 25, 2024

Hmm, how come we didn't merge it?

@dkropachev (Collaborator)

Hmm, how come we didn't merge it?

It will not work when drivers are running on different VMs.


mykaul commented Dec 26, 2024

Hmm, how come we didn't merge it?

It will not work when drivers are running on different VMs.

@dkropachev - can you elaborate on this? How are VMs related to this?

@dkropachev (Collaborator)

In order for it to work properly, all instances of the driver need to be synchronized when they choose one node over another.
Otherwise one client picks the first node, another picks the second, LWT congestion increases, and performance degrades.

In this PR it is done here:

return new AbstractIterator<Host>() {
  // Of the two candidate hosts, return first the one whose target shard
  // currently has fewer in-flight requests; the other one comes second.
  private final Host firstChosenHost =
      host1ShardInflightRequests.get() < host2ShardInflightRequests.get() ? host1 : host2;
  private final Host secondChosenHost =
      host1ShardInflightRequests.get() < host2ShardInflightRequests.get() ? host2 : host1;
  private int index = 0;

Synchronization is done via a shared global metric, perShardInflightRequestInfo, which is the number of in-flight requests for a given shard; it is shared across the session.

Say you run clients on different VMs (or even on the same machine): one client has perShardInflightRequestInfo all at 0, the other has 1000 for the first node and 0 for the second. Now the first client schedules its queries to node1 and the second to node2.
LWT congestion increases, performance degrades.

@dkropachev (Collaborator)

A perfect solution would involve the server reporting some metrics back to clients that would help them not only see when a cluster/shard/host is overloaded, but also understand how big the overload is and decide whether to rearrange the query plan, or not even try and instead throw an error/exception that signals to the application that it needs to slow down.

@dkropachev (Collaborator)

Or we could resort to a solution where clients sync with each other via a 3rd-party service (like Redis or Scylla).


dorlaor commented Dec 26, 2024 via email

@roydahan (Collaborator)

I think the PR got neglected for two reasons:
  1. The write-workload results weren't good enough and it was not followed up.
  2. Probably there is not much demand for it? Do we see problems where one shard on one node is busier than the others? I'm not aware of such.

@dkropachev (Collaborator)

Well, in such a case the implementation is partial. We would either need to get the real queue from the server, or estimate it, though that can hurt the case with many clients. Where does the client get the in-flight queue size from?

It is just an atomic counter: when the driver sends a query to the shard, it increments the in-flight counter; when it receives a response, it decrements it.
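A minimal sketch of that bookkeeping, assuming a per-(host, shard) counter map; the field and method names here are illustrative, not the PR's actual ones:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// One counter per (host, shard): incremented when a request is written to that
// shard, decremented when its response (or failure) comes back.
final ConcurrentMap<String, AtomicInteger> perShardInflight = new ConcurrentHashMap<>();

void onRequestSent(String hostId, int shardId) {
    perShardInflight
        .computeIfAbsent(hostId + "/" + shardId, k -> new AtomicInteger())
        .incrementAndGet();
}

void onResponseReceived(String hostId, int shardId) {
    AtomicInteger counter = perShardInflight.get(hostId + "/" + shardId);
    if (counter != null) {
        counter.decrementAndGet();
    }
}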


mykaul commented Dec 29, 2024

@dkropachev - few thoughts:

  1. This CANNOT be a coordinated effort, for multiple reasons: not all clients will receive the information at the same time, the coordination itself will slow everyone (clients and servers), the coordination is 'too late' most likely for most workloads, etc. The whole beauty of this algorithm is the lack of coordination between clients.
  2. LWT is an edge case that probably needs a different treatment. It should probably have a 'sticky random' node: you choose randomly and stick to it (for a specific period of time / no. of transactions / ...); a rough sketch follows this list.
  3. I don't think we have a test demonstrating it is NOT a good LB policy.
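A rough sketch of the 'sticky random' idea from point 2, assuming the stickiness is keyed on a time window; all names here are hypothetical, not existing driver API:

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Pick a node at random and keep reusing it until the sticky window expires,
// so LWT traffic from one client does not bounce between coordinators.
final class StickyRandomChooser<H> {
    private final long stickyWindowNanos;
    private H current;
    private long chosenAtNanos;

    StickyRandomChooser(long stickyWindowNanos) {
        this.stickyWindowNanos = stickyWindowNanos;
    }

    synchronized H choose(List<H> candidates) {
        long now = System.nanoTime();
        if (current == null
            || now - chosenAtNanos > stickyWindowNanos
            || !candidates.contains(current)) {
            current = candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
            chosenAtNanos = now;
        }
        return current;
    }
}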
