
Add testcase for scaling-in while 3-node cluster having 90% storage utilization #9131

Closed · pehala opened this issue Nov 5, 2024 · 8 comments
Labels: area/elastic cloud (Issues related to the elastic cloud project), area/tablets, P1 Urgent

pehala (Contributor) commented Nov 5, 2024

  • Create a 3-node cluster with RF=3.
  • Reach 70% disk usage.
  • Keep it running for 30 minutes (no writes to the cluster, only reads).
  • Perform scale-out and wait for tablet migration.
  • Drop 20% of data and then perform scale-in by removing node 3 (sketched below).
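
A minimal sketch of this flow, assuming a hypothetical cluster/stress helper API (`populate_to_disk_usage`, `add_node`, `decommission_node`, etc. are placeholders for illustration, not the actual SCT code):

```python
def run_scale_in_scenario(cluster, stress):
    """Sketch of the scenario above; `cluster` and `stress` are hypothetical helpers."""
    # A 3-node cluster with RF=3 is assumed to exist already.
    stress.populate_to_disk_usage(target_percent=70)   # reach ~70% per-node disk usage

    # Idle period of 30 minutes: no writes, only reads.
    stress.run_reads(duration_seconds=30 * 60)

    # Scale-out: add a 4th node and wait until tablets are rebalanced.
    cluster.add_node()
    cluster.wait_for_tablet_balance()

    # Drop ~20% of the data, then scale in by removing "node 3".
    stress.drop_data(fraction=0.2)
    cluster.decommission_node(cluster.nodes[2])
    cluster.wait_for_tablet_balance()
```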
pehala assigned Lakshmipathi and unassigned pehala on Nov 5, 2024
swasik added the area/elastic cloud label on Nov 5, 2024
paszkow (Contributor) commented Nov 5, 2024

@pehala This scenario seems to be incorrect. Without deletes you will hit an out-of-space error once you scale in. I think we should rather aim at having 5 nodes at ~72% disk utilization and then scale in. As a result you will end up with 4 nodes at ~90%.
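
A quick back-of-the-envelope check of those numbers, assuming the total data volume stays constant and is rebalanced evenly across the remaining nodes:

```python
# 5 nodes at ~72% hold 5 * 0.72 = 3.6 "node-capacities" of data.
total_data = 5 * 0.72
# After removing one node, the same data spreads over 4 nodes:
print(f"{total_data / 4:.0%}")   # -> 90%
```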

swasik commented Nov 5, 2024

> @pehala This scenario seems to be incorrect. Without deletes you will hit an out-of-space error once you scale in. I think we should rather aim at having 5 nodes at ~72% disk utilization and then scale in. As a result you will end up with 4 nodes at ~90%.

But in this scenario you perform scale-out before scale-in. So, if I understand correctly, it is: add node 4, then remove node 3, so in practice node 3 is swapped for node 4.

Lakshmipathi changed the title from "Add testcase for scaling-in while having 90% storage utilization" to "Add testcase for scaling-in while 3-node cluster having 90% storage utilization" on Nov 11, 2024
Lakshmipathi commented Nov 11, 2024

I updated this description a bit. Based on the suggestion in the test plan document, we have two variants for scale-in: a) 3-node cluster scale-in at 90%, b) 4-node cluster scale-in at 67%.

For the 3-node cluster scale-in at 90%: add a new node and wait for tablet migration to complete, drop 20% of the data from the cluster, and then scale in by removing a node.
For the 4-node cluster scale-in at 67%: scale in by removing a node; after tablet migration, the cluster will be at around 90% storage utilization.
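
A quick sanity check of the expected utilization for variant (b), assuming even rebalancing and no data growth during the operation:

```python
def util_after_scale_in(nodes_before, util_before, nodes_after):
    """Per-node utilization after redistributing the same data over fewer nodes."""
    return nodes_before * util_before / nodes_after

# Variant (b): 4-node cluster at ~67%, remove one node.
print(f"{util_after_scale_in(4, 0.67, 3):.0%}")   # -> ~89%, i.e. around 90%
```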

Lakshmipathi commented Nov 11, 2024

Reached 92% disk usage and started the 30-minute wait, with no writes or reads.

< t:2024-11-05 09:36:49,314 f:full_storage_utilization_test.py l:121  c:FullStorageUtilizationTest p:INFO  > Current max disk usage after writing to keyspace10: 92% (398 GB / 392.40000000000003 GB)
< t:2024-11-05 09:36:50,342 f:full_storage_utilization_test.py l:87   c:FullStorageUtilizationTest p:INFO  > Wait for 1800 seconds

After the 30-minute idle time, started throttled writes:

< t:2024-11-05 10:08:01,521 f:stress_thread.py l:325  c:sdcm.stress_thread   p:INFO  > cassandra-stress write no-warmup duration=30m -rate threads=10 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema keyspace=keyspace1 "replication(strategy=NetworkTopologyStrategy,replication_factor=3)" -node 10.4.1.62,10.4.3.97,10.4.1.100 -errors skip-unsupported-columns

Scale-out by adding a new node at 90% disk usage:

< t:2024-11-05 10:09:57,086 f:full_storage_utilization_test.py l:35   c:FullStorageUtilizationTest p:INFO  > Adding a new node
< t:2024-11-05 10:12:55,534 f:common.py       l:43   c:sdcm.utils.tablets.common p:INFO  > Waiting for tablets to be balanced
< t:2024-11-05 10:40:55,031 f:common.py       l:48   c:sdcm.utils.tablets.common p:INFO  > Tablets are balanced

Later, dropped some data before the scale-in:

< t:2024-11-05 10:40:55,031 f:full_storage_utilization_test.py l:48   c:FullStorageUtilizationTest p:INFO  > Dropping some data

A few minutes later, removed a node, bringing the cluster back to 3 nodes:

< t:2024-11-05 10:41:00,079 f:full_storage_utilization_test.py l:40   c:FullStorageUtilizationTest p:INFO  > Removing a node
< t:2024-11-05 10:41:00,080 f:full_storage_utilization_test.py l:133  c:FullStorageUtilizationTest p:INFO  > Removing a second node from the cluster
< t:2024-11-05 10:41:00,080 f:full_storage_utilization_test.py l:135  c:FullStorageUtilizationTest p:INFO  > Node to be removed: df-test-master-db-node-1ffa6d64-2

Tablet migration over time: [graph]

Max/avg disk utilization: [graph]

Latency
99th percentile write and read latency by cluster (max, at 90% disk utilization):

| syscall | value   |
|---------|---------|
| writes  | 1.79 ms |
| read    | 3.58 ms |

The final 3-node cluster has disk usage at 92%, 91%, and 87%.

https://argus.scylladb.com/tests/scylla-cluster-tests/1ffa6d64-004a-4443-a3c9-d52a18ea08e1

swasik commented Nov 12, 2024

> The final 3-node cluster has disk usage at 92%, 91%, and 87%.

But if we drop 20% of the data as suggested in the test plan, shouldn't we get ca. 70% here? It was stated incorrectly in the doc; I fixed it. The idea behind it is to simulate a scenario where we lose plenty of data and, because of that, can scale in to save resources.

pehala added the P1 Urgent label on Nov 21, 2024
pehala (Contributor, Author) commented Nov 26, 2024

> But if we drop 20% of the data as suggested in the test plan, shouldn't we get ca. 70% here? It was stated incorrectly in the doc; I fixed it. The idea behind it is to simulate a scenario where we lose plenty of data and, because of that, can scale in to save resources.

@Lakshmipathi ping

Lakshmipathi commented:

> But if we drop 20% of the data as suggested in the test plan, shouldn't we get ca. 70% here?

@swasik, here is the flow for this case:

  1. In a 3-node cluster, we reached 92% disk usage.
  2. Wait for 30 minutes.
  3. Start throttled writes.
  4. Add a new node at 90%; total nodes in the cluster = 4.
  5. From the graph, we can see that average disk usage drops after this operation.
  6. Wait for 30 minutes.
  7. Drop 20% of the data.
  8. Start throttled writes.
  9. Perform scale-in.

If I'm not wrong, the throttled writes we run during the scaling operations (steps 3 and 8) contribute additional disk usage. Let me add more graphs to this issue.
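
For a rough sense of how much those throttled writes add, here is a back-of-the-envelope estimate using the cassandra-stress parameters from the log above (1400 ops/s, 10 KiB rows, 30 minutes, RF=3); this is illustrative arithmetic, not a measured value:

```python
ops_per_sec = 1400             # "throttle=1400/s"
row_bytes = 10 * 1024          # -col "size=FIXED(10240)"
duration_sec = 30 * 60         # duration=30m
rf = 3                         # replication_factor=3

raw_bytes = ops_per_sec * row_bytes * duration_sec * rf
print(f"~{raw_bytes / 1e9:.0f} GB written across the cluster")   # -> ~77 GB
```

Spread across 3-4 nodes, that is roughly 19-26 GB per node per 30-minute stress window (before compression and compaction), which could explain why utilization does not drop to ~70% after the 20% data drop.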

pehala (Contributor, Author) commented Dec 9, 2024

Merged into #9156

pehala closed this as completed on Dec 9, 2024