
[disrupt_destroy_data_then_rebuild] nemesis caused lots of raft_topology errors that never lead to failure #9031

Closed
timtimb0t opened this issue Oct 23, 2024 · 6 comments · Fixed by #9580
Assignees
Labels
on_core_qa (tasks that should be solved by Core QA team), tests/longevity-tier1

Comments

@timtimb0t
Contributor

Packages

Scylla version: 6.3.0~dev-20241018.b11d50f59191 with build-id d5fe38f8fd12d9b834688320151f36fb8c1e050d

Kernel Version: 6.8.0-1016-gcp

Issue description


During the disrupt_destroy_data_then_rebuild nemesis, the test initiates a rebuild process (as the last step, after the data has been destroyed), which caused a bunch of Scylla errors:

2024-10-19T09:16:26.885+00:00 longevity-large-partitions-200k-pks-db-node-26155658-0-1     !INFO | scylla[4507]:  [shard  0: gms] raft_topology - executing global topology command barrier_and_drain, excluded nodes: {}
2024-10-19T09:16:29.383+00:00 longevity-large-partitions-200k-pks-db-node-26155658-0-1      !ERR | scylla[4507]:  [shard  0: gms] raft_topology - drain rpc failed, proceed to fence old writes: std::runtime_error (raft topology: exec_global_command(barrier_and_drain) failed with seastar::rpc::closed_error (connection is closed))
2024-10-19T09:16:29.383+00:00 longevity-large-partitions-200k-pks-db-node-26155658-0-1     !INFO | scylla[4507]:  [shard  0: gms] raft_topology - updating topology state: advance fence version to 2108
2024-10-19T09:16:29.384+00:00 longevity-large-partitions-200k-pks-db-node-26155658-0-1     !INFO | scylla[4507]:  [shard  0: gms] raft_topology - executing global topology command barrier, excluded nodes: {}
2024-10-19T09:16:29.384+00:00 longevity-large-partitions-200k-pks-db-node-26155658-0-1  !WARNING | scylla[4507]:  [shard  0: gms] raft_topology - barrier for tablet 27470250-8dca-11ef-ae5d-04fec4937667:79 failed: seastar::broken_promise (broken promise)
2024-10-19T09:16:29.384+00:00 longevity-large-partitions-200k-pks-db-node-26155658-0-1      !ERR | scylla[4507]:  [shard  0: gms] raft_topology - topology change coordinator fiber got error std::runtime_error (raft topology: exec_global_command(barrier) failed with seastar::rpc::closed_error (connection is closed))

Scylla itself is alive and these errors never caused a failure. It seems they occurred because of the deleted data.


Impact

No visible impact


Installation details

Cluster size: 5 nodes (n2-highmem-16)

Scylla Nodes used in this run:

  • longevity-large-partitions-200k-pks-db-node-26155658-0-9 (35.231.165.46 | 10.142.0.67) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-26155658-0-8 (35.237.86.48 | 10.142.0.7) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-26155658-0-7 (34.23.199.5 | 10.142.0.5) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-26155658-0-6 (35.237.86.48 | 10.142.0.3) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-26155658-0-5 (35.227.3.176 | 10.142.0.181) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-26155658-0-4 (35.243.229.232 | 10.142.0.179) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-26155658-0-3 (34.138.154.236 | 10.142.0.175) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-26155658-0-2 (34.74.144.170 | 10.142.0.161) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-26155658-0-1 (34.73.4.23 | 10.142.0.155) (shards: 14)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-6-3-0-dev-x86-64-2024-10-19t02-11-39 (gce: undefined_region)

Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 26155658-0ac7-449d-8d60-ed91dba49ce0
Test name: scylla-master/tier1/longevity-large-partition-200k-pks-4days-gce-test
Test method: longevity_large_partition_test.LargePartitionLongevityTest.test_large_partition_longevity
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 26155658-0ac7-449d-8d60-ed91dba49ce0
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 26155658-0ac7-449d-8d60-ed91dba49ce0

Logs:

Jenkins job URL
Argus

@timtimb0t timtimb0t changed the title disrupt_destroy_data_then_rebuild nemesis caused lots of raft_topology errors that never lead to failure [disrupt_destroy_data_then_rebuild] nemesis caused lots of raft_topology errors that never lead to failure Oct 23, 2024
@temichus
Contributor

@kbr-scylla could you please take a look at this issue? It is created as an SCT issue because
"Scylla itself is alive and these errors never caused a failure" and the nemesis itself passed.
Can we ignore this error in SCT during the disrupt_destroy_data_then_rebuild nemesis?

@kbr-scylla

This nemesis shuts down a Scylla node for a while.

During this time the topology coordinator could try communicating with this node, e.g. during tablet migrations. This is what happened here: the cluster was doing tablet migrations when the node was killed. If the communication attempt fails due to a closed connection, we print an error.

So it's expected that the error happens if one of the nodes is down.

@timtimb0t
Contributor Author

There are a few nemeses that generate closed-connection errors during runtime:

  • disrupt_destroy_data_then_rebuild
  • disrupt_destroy_data_then_repair

Such errors are expected for these nemeses and may be ignored by SCT to avoid redundant error messages in Argus.

@roydahan
Contributor

roydahan commented Nov 4, 2024

@timtimb0t if you know it happens during a specific part of a specific nemesis, you can wrap the relevant part of the nemesis in an "ignore_..." context manager and it won't produce errors.
e.g. in the enospc nemesis (we have it in many others as well):

with ignore_no_space_errors(node=node):
    ...  # nemesis steps that may hit "no space left on device" errors
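
For illustration, here is a minimal sketch of what a dedicated context manager for this case could look like, modeled on the existing ignore_* helpers. The name ignore_raft_topology_errors, the import paths, the filter signature, and the regex are assumptions for the sketch, not existing SCT code:

from contextlib import contextmanager

# Assumed imports, modeled on how SCT's other ignore_* helpers are built;
# adjust to the real event classes / filter primitive in sdcm.
from sdcm.sct_events.database import DatabaseLogEvent
from sdcm.sct_events.filters import EventsFilter


@contextmanager
def ignore_raft_topology_errors():
    """Suppress the expected raft_topology closed-connection / broken-promise
    errors emitted while a node is intentionally down."""
    with EventsFilter(
        event_class=DatabaseLogEvent,
        regex=r".*raft_topology.*(connection is closed|broken promise).*",
        extra_time_to_expiration=30,
    ):
        yield

The nemesis step that takes the node down would then be wrapped the same way as the enospc example above, e.g. (hypothetical placement):

with ignore_raft_topology_errors():
    self._destroy_data_and_restart_scylla()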

@timtimb0t
Contributor Author

I'm not sure, but it seems there is one more nemesis that needs the same handling (disrupt_stop_wait_start_scylla_server):
https://argus.scylladb.com/tests/scylla-cluster-tests/a84727c0-ee78-40ba-b03e-5c27df02a2f5

@fruch
Contributor

fruch commented Nov 17, 2024

I'm not sure, but it seems there is one more nemesis that needs the same handling (disrupt_stop_wait_start_scylla_server): https://argus.scylladb.com/tests/scylla-cluster-tests/a84727c0-ee78-40ba-b03e-5c27df02a2f5

According to @kbr-scylla's explanation, this can happen every time we take down a node.
There are multiple places like that in the nemesis code:

  • _destroy_data_and_restart_scylla
  • _terminate_cluster_node
  • disrupt_kill_scylla

and more

Seems like the reason is very similar to ignore_ycsb_connection_refused; I would recommend filtering those prints in the exact same way, in the same places.
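
As a rough sketch of that recommendation (the placement and body below are assumptions, not the actual fix), each such place would get the same wrapper around the window where the node is down:

# Hypothetical placement inside one of the nemeses listed above, reusing the
# ignore_raft_topology_errors sketch from the earlier comment.
def disrupt_kill_scylla(self):
    # wrap only the window where the node is down, as with ignore_no_space_errors
    with ignore_raft_topology_errors():
        ...  # existing kill / wait-for-restart logic stays unchanged

This mirrors how ignore_ycsb_connection_refused is used: the filter is scoped to the exact window where the error is expected, so genuine raft_topology errors outside that window still surface.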
