
After multiple hard reboots a node eventually loses connection to all other nodes in the cluster #8553

Closed
1 of 2 tasks
dimakr opened this issue Aug 15, 2024 · 23 comments · Fixed by #8883

@dimakr
Contributor

dimakr commented Aug 15, 2024

Packages

Scylla version: 6.2.0~dev-20240809.3745d0a53457 with build-id d168ac61c2ac38ef15c6deba485431e666a8fdef
Kernel Version: 6.8.0-1013-aws

Issue description

The multiple_hard_reboot_node nemesis (disrupt_method) performed 10 hard reboots of node-5 in a row (the number of reboots is randomly selected from the range 2 to 10 for each test run). The 1st reboot started at 08:41:02 and the 10th finished at 08:54:07. All reboots finished successfully, and Scylla started after the 10th reboot at 08:54:07:

08:54:07.223699 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:main] init - Scylla version 6.2.0~dev-0.20240809.3745d0a53457 initialization completed. 

Then the post-nemesis cluster health check was performed and passed, and the nemesis was marked as successfully finished at 09:18:14.

Next, the replace_service_level_using_detach_during_load nemesis was scheduled at 09:23:14. A cluster health check is performed at the beginning of each nemesis; this nemesis passed its pre-nemesis health check, but the disruption itself was not executed because it is dedicated to SLA tests only. At 09:47:15 the nemesis was marked as skipped.

Next, the network_reject_node_exporter nemesis was scheduled at 09:47:15 and its pre-nemesis cluster health check failed: the nodetool status check on node-5 (the node that had been hard rebooted multiple times earlier) reported at 09:54:26 that it sees all other nodes as down:

< t:2024-08-10 09:54:26,720 f:cluster.py      l:2701 c:sdcm.cluster         p:DEBUG > Check the health of the node `longevity-tls-50gb-3d-master-db-node-7b8958f1-5' [attempt #1]
< t:2024-08-10 09:54:26,896 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: Datacenter: eu-west
< t:2024-08-10 09:54:26,897 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: ===================
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: Status=Up/Down
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: |/ State=Normal/Leaving/Joining/Moving
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: -- Address     Load     Tokens Owns Host ID                              Rack
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.21.103 19.41 GB 256    ?    af2bd628-7745-4085-baa8-6199972679e8 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: UN 10.4.21.183 19.54 GB 256    ?    b4273168-eea5-459e-8567-ee7350612c57 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.21.250 20.67 GB 256    ?    ce1395db-9b0e-47b8-9853-8806c70459f4 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.22.115 19.49 GB 256    ?    15b3be5f-54bc-473a-b823-2ea2945a6bf2 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.23.56  19.74 GB 256    ?    c26fe030-4ef8-4d18-84f1-a13d65fbcf01 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.23.78  24.07 GB 256    ?    8f147dbe-f7d4-409a-be66-32937d802a8c 1c

At around 09:54:23, system.log on node-5 contains records showing the other nodes being marked down in gossip:

Aug 10 09:54:22.939111 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  3:strm] gossip - failure_detector_loop: Send echo to node 10.4.22.115, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:22.939121 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  3:strm] gossip - failure_detector_loop: Mark node 10.4.22.115 as DOWN
Aug 10 09:54:22.940995 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress 15b3be5f-54bc-473a-b823-2ea2945a6bf2/10.4.22.115 is now DOWN, status = NORMAL
Aug 10 09:54:23.147016 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - failure_detector_loop: Send echo to node 10.4.23.78, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:23.147026 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - failure_detector_loop: Mark node 10.4.23.78 as DOWN
Aug 10 09:54:23.148989 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress 8f147dbe-f7d4-409a-be66-32937d802a8c/10.4.23.78 is now DOWN, status = NORMAL
Aug 10 09:54:23.535627 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  1:strm] gossip - failure_detector_loop: Send echo to node 10.4.21.250, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:23.535637 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  1:strm] gossip - failure_detector_loop: Mark node 10.4.21.250 as DOWN
Aug 10 09:54:23.536990 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress ce1395db-9b0e-47b8-9853-8806c70459f4/10.4.21.250 is now DOWN, status = NORMAL
Aug 10 09:54:23.757903 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  2:strm] gossip - failure_detector_loop: Send echo to node 10.4.21.103, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:23.757913 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  2:strm] gossip - failure_detector_loop: Mark node 10.4.21.103 as DOWN
Aug 10 09:54:23.759997 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress af2bd628-7745-4085-baa8-6199972679e8/10.4.21.103 is now DOWN, status = NORMAL
Aug 10 09:54:23.890642 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  4:strm] gossip - failure_detector_loop: Send echo to node 10.4.23.56, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:23.890653 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  4:strm] gossip - failure_detector_loop: Mark node 10.4.23.56 as DOWN
Aug 10 09:54:23.892488 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress c26fe030-4ef8-4d18-84f1-a13d65fbcf01/10.4.23.56 is now DOWN, status = NORMAL

To summarize: after the multiple reboots of node-5 no other disruptions were executed during the test, and the cluster passed a few health checks, but within about an hour node-5 lost connection to the other nodes.

  • This issue is a regression.
  • It is unknown if this issue is a regression.


Impact

The node lost connection to the other cluster nodes over time, after multiple hard reboots.

How frequently does it reproduce?

No occurrences of the issue were noticed previously.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-7b8958f1-7 (54.217.202.232 | 10.4.21.250) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-6 (52.212.55.136 | 10.4.23.56) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-5 (54.73.59.86 | 10.4.21.183) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-4 (52.48.211.247 | 10.4.22.17) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-3 (46.137.188.213 | 10.4.22.115) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-2 (34.246.25.227 | 10.4.23.78) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-1 (63.35.206.244 | 10.4.21.103) (shards: 14)

OS / Image: ami-05a75d3ca4e6ebd9f (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 7b8958f1-4a7f-4cef-a68d-32c1393493b0
Test name: scylla-master/tier1/longevity-50gb-3days-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 7b8958f1-4a7f-4cef-a68d-32c1393493b0
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7b8958f1-4a7f-4cef-a68d-32c1393493b0

Logs:

Jenkins job URL
Argus

@mykaul
Contributor

mykaul commented Aug 15, 2024

Looks like a network-down event - not sure, but even the Amazon agent gave up. Afterwards the network came back up (you can see external SSH attackers trying to log in!), but Scylla did not recover. Not sure this is a very interesting scenario.

@kbr-scylla

ssh access was available all the time, I think. SCT could still connect to this node.

But one of the interfaces, the one used by Scylla (ipv4=10.4.21.183), went down and, judging from SCT logs, never went back up again: drivers kept trying to connect to this node until the end of the test (~18:40) and never succeeded.
(Also, the Java driver likes to spam the logs with absolutely no rate limiting... it prints thousands of lines in a single millisecond.) Other nodes were also printing that they fail to apply view updates on this node.

Not enough evidence to say that it's a Scylla problem; it can be explained by a networking problem. So I'm closing the issue.

@kbr-scylla closed this as not planned Aug 21, 2024
@juliayakovlev
Contributor

I faced the same problem in another test run.
It's hard to say what happened there.
The only thing that I found: around the start of "Rebuilding bloom filter", the node longevity-tls-50gb-3d-master-db-node-1b45877b-2 lost connection with the other nodes (rpc call timed out), and vice versa.

2024-08-17T12:15:37.423+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  8:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1e_0rfnk2ejcgiw7vtgxa-big-Filter.db: resizing bitset from 7397608 bytes to 3901872 bytes. sstable origin: compaction
2024-08-17T12:15:37.423+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  8:comp] compaction - [Compact keyspace1.standard1 5b743de0-5c92-11ef-9e12-1d88475c75ee] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1e_0rfnk2ejcgiw7vtgxa-big-Data.db:level=0]. 1GB to 789MB (~52% of original) in 23311ms = 64MB/s. ~5918080 total partitions merged to 3121490.
2024-08-17T12:15:40.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  3:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1q_3edv42l75q8n848yr2-big-Filter.db: resizing bitset from 7274728 bytes to 3903792 bytes. sstable origin: compaction
2024-08-17T12:15:40.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  3:comp] compaction - [Compact keyspace1.standard1 62dfa7e0-5c92-11ef-aa3d-1d7f475c75ee] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1q_3edv42l75q8n848yr2-big-Data.db:level=0]. 1GB to 790MB (~53% of original) in 13708ms = 107MB/s. ~5819776 total partitions merged to 3123030.
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2  !WARNING | scylla[926]:  [shard  4:strm] gossip - failure_detector_loop: Send echo to node 10.4.20.5, status = failed: seastar::rpc::timeout_error (rpc call timed out)
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  4:strm] gossip - failure_detector_loop: Mark node 10.4.20.5 as DOWN
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  0:strm] gossip - InetAddress 1a9165b4-a930-4c8c-8274-8b479554f43e/10.4.20.5 is now DOWN, status = NORMAL
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1w_1pii82m5nd9bvcmmzy-big-Filter.db: resizing bitset from 7335848 bytes to 3906168 bytes. sstable origin: compaction
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2  !WARNING | scylla[926]:  [shard  1:strm] gossip - failure_detector_loop: Send echo to node 10.4.23.20, status = failed: seastar::rpc::timeout_error (rpc call timed out)
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  1:strm] gossip - failure_detector_loop: Mark node 10.4.23.20 as DOWN
2024-08-17T12:15:47.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  0:strm] gossip - InetAddress 30cc1c48-4f93-441c-af38-87f36baa0d96/10.4.23.20 is now DOWN, status = NORMAL
2024-08-17T12:15:47.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2  !WARNING | scylla[926]:  [shard  2:strm] gossip - failure_detector_loop: Send echo to node 10.4.20.109, status = failed: seastar::rpc::timeout_error (rpc call timed out)
2024-08-17T12:15:47.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  2:strm] gossip - failure_detector_loop: Mark node 10.4.20.109 as DOWN
2024-08-17T12:15:47.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  0:strm] gossip - InetAddress bf865007-0310-4aee-a755-259daafeffbb/10.4.20.109 is now DOWN, status = NORMAL

External SSH attackers trying to log in, as @mykaul mentioned, were also observed, but later:

2024-08-17T12:18:17.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2      !ERR | sshd[2380]: error: kex_exchange_identification: read: Connection reset by peer
2024-08-17T12:18:17.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sshd[2380]: Connection reset by 211.186.118.31 port 22842
2024-08-17T12:18:24.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sshd[2387]: Connection closed by 211.186.118.31 port 26821 [preauth]
2024-08-17T12:18:28.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2   !NOTICE | sudo[2392]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2024-08-17T12:18:28.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sudo[2392]: pam_unix(sudo:session): session opened for user root(uid=0) by scyllaadm(uid=1000)
2024-08-17T12:18:28.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sudo[2392]: pam_unix(sudo:session): session closed for user root
2024-08-17T12:18:30.979+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sshd[2389]: Invalid user orangepi from 211.186.118.31 port 37739
2024-08-17T12:18:31.479+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sshd[2389]: Connection closed by invalid user orangepi 211.186.118.31 port 37739 [preauth]

Packages

Scylla version: 6.2.0~dev-20240816.afee3924b3dc with build-id c01d2a55a9631178e3fbad3869c20ef3c8dcf293

Kernel Version: 6.8.0-1013-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-1b45877b-7 (54.220.171.0 | 10.4.20.109) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-6 (54.216.84.210 | 10.4.23.43) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-5 (54.155.51.83 | 10.4.21.227) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-4 (63.33.182.152 | 10.4.20.5) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-3 (54.78.114.46 | 10.4.23.20) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-2 (54.246.121.156 | 10.4.21.135) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-1 (108.129.84.60 | 10.4.22.71) (shards: 14)

OS / Image: ami-0f440d7175113787f (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 1b45877b-8dfa-418e-b56c-43be70362fd3
Test name: scylla-master/tier1/longevity-50gb-3days-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 1b45877b-8dfa-418e-b56c-43be70362fd3
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 1b45877b-8dfa-418e-b56c-43be70362fd3

Logs:

Jenkins job URL
Argus

@fruch
Contributor

fruch commented Sep 4, 2024

@juliayakovlev

this case is using the two-interfaces setup; it is not doing public communication at all, so how come it can be attacked by anyone external? We need to investigate what's going on with this one.

@fruch
Contributor

fruch commented Sep 4, 2024

@mykaul can you transfer it to SCT?

@mykaul transferred this issue from scylladb/scylladb Sep 4, 2024
@fruch assigned juliayakovlev and unassigned dimakr and kbr-scylla Sep 4, 2024
@fruch
Contributor

fruch commented Sep 4, 2024

@roy and I found the reason for the external communication: the SG (security group) was open for ssh; we closed it, as it should have been.

As for the network issue we are facing here:

@juliayakovlev, let's try to reproduce it with a nemesis that hard-reboots a node.

My gut feeling is that we might have an issue with the network setup after a node reboot; we need a quicker reproducer.
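A rough sketch of the kind of quick standalone reproducer meant here, assuming AWS CLI and SSH access over the primary interface; the instance ID, SSH target and Scylla IP below are placeholders, and `aws ec2 reboot-instances` only approximates the nemesis's hard reboot (a stop/start may be closer):

```bash
#!/usr/bin/env bash
# Hypothetical standalone reproducer: reboot the node in a loop and verify
# that the secondary (Scylla-facing) interface comes back with its address.
set -euo pipefail

INSTANCE_ID="i-0123456789abcdef0"            # placeholder: node under test
SSH_TARGET="scyllaadm@<primary-public-ip>"   # placeholder: reachable over the 1st interface
SCYLLA_IP="10.4.21.183"                      # example: private IP Scylla listens on

for i in $(seq 1 10); do
    echo "=== reboot #${i} ==="
    aws ec2 reboot-instances --instance-ids "${INSTANCE_ID}"
    sleep 180   # give the node time to boot and bring its interfaces up

    # Does any interface still carry the Scylla-facing address?
    if ssh "${SSH_TARGET}" "ip -br addr show" | grep -qF "${SCYLLA_IP}"; then
        echo "secondary interface OK after reboot #${i}"
    else
        echo "secondary interface lost ${SCYLLA_IP} after reboot #${i}" >&2
        ssh "${SSH_TARGET}" "ip -br addr show; ip route" >&2
        exit 1
    fi
done
```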

@fruch reopened this Sep 15, 2024
@fruch
Contributor

fruch commented Sep 15, 2024

@juliayakovlev

please look at this one; it seems we have an issue with the multiple-network setup not working after a node is rebooted.

@juliayakovlev
Contributor

This test started to use the multi-network configuration on 3.6.2024, but this issue first happened on 24.7.2024 (https://argus.scylladb.com/test/98050732-dfe3-464c-a66a-f235bad30829/runs?additionalRuns[]=856e2796-0eb3-4051-a5d0-305442ceb57a).
In this run from 13.7.2024 the issue did not happen.

Trying to reproduce the issue with MultipleHardRebootNodeMonkey.

@juliayakovlev
Contributor

juliayakovlev commented Sep 16, 2024

< t:2024-08-03 10:37:09,300 f:cluster.py      l:2691 c:sdcm.cluster         p:DEBUG > Check the health of the node `longevity-tls-50gb-3d-master-db-node-6f2ed542-7' [attempt #3]

< t:2024-08-03 10:37:09,484 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: Datacenter: eu-west
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: ===================
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: Status=Up/Down
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: |/ State=Normal/Leaving/Joining/Moving
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: -- Address     Load     Tokens Owns Host ID                              Rack
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.21.188 16.75 GB 256    ?    3e2c0462-dcc8-4bae-9f4a-d848da35c8e5 1c  
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.21.218 18.00 GB 256    ?    1dfedffc-26d6-4605-b8d0-c25bc3aeaf86 1c  
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.21.245 17.94 GB 256    ?    6f319656-cf33-4058-9d1f-865bda65bbd9 1c  
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.22.148 18.35 GB 256    ?    565296e7-bb69-46f6-939c-ccfb3b932f05 1c  
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.22.222 16.29 GB 256    ?    ce53f8cb-2f70-4cbe-b3db-3537b33336ae 1c  
< t:2024-08-03 10:37:09,490 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: DN 10.4.23.44  18.55 GB 256    ?    660b47f7-92be-44c8-b5ff-c5486cd5a0de 1c  
< t:2024-08-03 10:37:12,301 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.21.245","error":"setup: dial tcp :0->10.4.21.245:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
< t:2024-08-03 10:37:12,312 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.21.188","error":"setup: dial tcp :0->10.4.21.188:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
< t:2024-08-03 10:37:12,312 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.21.218","error":"setup: dial tcp :0->10.4.21.218:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
< t:2024-08-03 10:37:12,312 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.22.148","error":"setup: dial tcp :0->10.4.22.148:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
< t:2024-08-03 10:37:12,312 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.22.222","error":"setup: dial tcp :0->10.4.22.222:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}

< t:2024-08-03 10:37:12,017 f:cluster_aws.py  l:497  c:sdcm.cluster_aws     p:DEBUG > Node longevity-tls-50gb-3d-master-db-node-6f2ed542-7 [79.125.27.189 | 10.4.22.222]: Sorted interfaces: [NetworkInterface(ipv4
_public_address='79.125.27.189', ipv6_public_addresses=['2a05:d018:12e3:f002:46da:2b5d:eb8f:d34b'], ipv4_private_addresses=['10.4.9.156'], ipv6_private_address='', dns_private_name='ip-10-4-9-156.eu-west-1.compu
te.internal', dns_public_name='ec2-79-125-27-189.eu-west-1.compute.amazonaws.com', device_index=0), NetworkInterface(ipv4_public_address=None, ipv6_public_addresses=['2a05:d018:12e3:f005:7b0a:1e0d:e8d1:0329'], i
pv4_private_addresses=['10.4.22.222'], ipv6_private_address='', dns_private_name='ip-10-4-22-222.eu-west-1.compute.internal', dns_public_name=None, device_index=1)]
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG > /10.4.22.222
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   generation:1722656360
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   heartbeat:76957
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   LOAD:17492227739
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   STATUS:NORMAL,9222693218404302345
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   CDC_STREAMS_TIMESTAMP:v2;1722656428071;f8bd2926-5149-11ef-44a5-d34f93b96fc5
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   IGNOR_MSB_BITS:12
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   SHARD_COUNT:14
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   DC:eu-west
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   SCHEMA_TABLES_VERSION:3
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   RACK:1c
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   RPC_READY:1
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   NET_VERSION:0
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   HOST_ID:ce53f8cb-2f70-4cbe-b3db-3537b33336ae
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   RPC_ADDRESS:10.4.22.222
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   RELEASE_VERSION:3.0.8
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   VIEW_BACKLOG:19593:877238681:1722681431300
< t:2024-08-03 10:37:13,735 f:cluster.py      l:2698 c:sdcm.cluster         p:DEBUG > One or more node `longevity-tls-50gb-3d-master-db-node-6f2ed542-7' health validation has failed

@fruch
Contributor

fruch commented Sep 16, 2024

@juliayakovlev what is this log, in the last comment? The context of it isn't clear

@juliayakovlev
Contributor

@juliayakovlev what is this log, in the last comment? The context of it isn't clear

I am collecting the info in the meantime

@juliayakovlev
Contributor

Reproduced https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=fb21a056-02c8-4c86-888e-f3fb5303c1ff

Branch: issue_sct_8553
The issue is reproduced with 2 nemeses: [disrupt_no_corrupt_repair, disrupt_multiple_hard_reboot_node] after disrupt_no_corrupt_repair

@juliayakovlev
Contributor

Reproduced https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=fb21a056-02c8-4c86-888e-f3fb5303c1ff

Branch: issue_sct_8553 The issue is reproduced with 2 nemeses: [disrupt_no_corrupt_repair, disrupt_multiple_hard_reboot_node] after disrupt_no_corrupt_repair

Actually no - it is the problem from scylladb/scylladb#18059

@juliayakovlev
Contributor

I ran the tests with disrupt_multiple_hard_reboot_node only - the problem was not reproduced

@fruch
Contributor

fruch commented Sep 23, 2024

I ran the tests with disrupt_multiple_hard_reboot_node only - the problem was not reproduced

Can you try running it with the Scylla version from the reported failures?

If that doesn't reproduce, we'll leave it for now

@juliayakovlev
Contributor

I ran the tests with disrupt_multiple_hard_reboot_node only - the problem was not reproduced

Can you try running it with the Scylla version from the reported failures?

If that doesn't reproduce, we'll leave it for now

I ran it with the AMI from the issue

@juliayakovlev
Contributor

juliayakovlev commented Sep 26, 2024

https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=23a99995-959d-4c8a-9067-f49c326f5a15
One network interface.
The test ran for about 23 hours with disrupt_multiple_hard_reboot_node only. The issue was not reproduced.
With ami-0071c4f2153cf06c2 (the AMI from the issue).

@juliayakovlev
Contributor

https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=641f6728-624b-418d-9205-b8d4beac763c

2 network interfaces.
The test failed after about 55 minutes, with the disrupt_multiple_hard_reboot_node nemesis only.
With ami-0071c4f2153cf06c2 (the AMI from the issue).

@fruch
Contributor

fruch commented Sep 26, 2024

Just to sum up the observations so far, @juliayakovlev and I found:

After a node reboot, the 2nd interface isn't functioning; running sudo netplan apply seems to fix it.
We are now trying to clear the netplan configuration that SCT was adding.

We also noticed a warning about the permissions of the SCT netplan config (maybe it isn't loaded at boot because of that?)
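For reference, a minimal sketch of the manual checks and the workaround above, run on the rebooted node; the netplan file path and interface names are assumptions, not taken from the SCT code:

```bash
# After the reboot: is the 2nd NIC up and does it still have its private address?
ip -br link show
ip -br addr show

# Workaround that appeared to restore connectivity: re-apply the netplan config
sudo netplan apply

# netplan warns when its YAML files are readable by others; the suspicion above
# is that this might be related to the config not being applied at boot.
# Tightening the permissions silences the warning (path is an assumption):
sudo chmod 600 /etc/netplan/*.yaml
sudo netplan apply
```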

@juliayakovlev
Contributor

Created branch issue_sct_8553. The change there: do not run sdcm.provision.aws.utils.configure_eth1_script, i.e. do not re-configure the secondary network interface.

Ran 2 tests:

  1. A test with disrupt_multiple_hard_reboot_node only - PASSED. The test ran for 2 days.
  2. A regular test with all nemeses (according to the test yaml) - still running:
    https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=21961db0-285c-4b84-a4da-4d70be149e06
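As a rough verification (an assumption of mine, not something from the branch itself), on a freshly booted node from issue_sct_8553 one could check that the secondary interface comes up correctly without the eth1 script:

```bash
# Only the stock (cloud-init generated) netplan config should be present now
ls -l /etc/netplan/

# The secondary NIC should get its private address and routes straight after boot
ip -br addr show
ip route show

# And Scylla should be listening on that address (7000 inter-node, 9042 CQL)
ss -ltn | grep -E ':7000|:9042'
```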

@fruch
Contributor

fruch commented Sep 29, 2024

Created branch issue_sct_8553. The change there: do not run sdcm.provision.aws.utils.configure_eth1_script, i.e. do not re-configure the secondary network interface.

Ran 2 tests:

  1. A test with disrupt_multiple_hard_reboot_node only - PASSED. The test ran for 2 days.
  2. A regular test with all nemeses (according to the test yaml) - still running:
    https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=21961db0-285c-4b84-a4da-4d70be149e06

Let's open a PR and discuss the testing needed there

juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Sep 30, 2024
After multiple hard reboots a node eventually loses connection to all other nodes in the cluster.
After a node reboot, the 2nd interface isn't functioning.
Found that our `sdcm.provision.aws.utils.configure_eth1_script` re-configures the second network interface, and this is what causes the lost connection.
Since the secondary network interface is already configured, we do not need this eth1 configuration script any more.

Fixes: scylladb#8553
@juliayakovlev
Contributor

Created branch issue_sct_8553. The change there: do not run sdcm.provision.aws.utils.configure_eth1_script, i.e. do not re-configure the secondary network interface.

Ran 2 tests:

  1. A test with disrupt_multiple_hard_reboot_node only - PASSED. The test ran for 2 days.
  2. A regular test with all nemeses (according to the test yaml) - still running:
    https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=21961db0-285c-4b84-a4da-4d70be149e06

Let's open a PR and discuss the testing needed there

@fruch #8883

@juliayakovlev
Contributor

Running the final test

@fruch closed this as completed in e059688 Oct 13, 2024
mergify bot pushed commits that referenced this issue Oct 13, 2024 (cherry picked from commit e059688)
fruch pushed commits that referenced this issue Oct 14, 2024 (cherry picked from commit e059688)
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Nov 14, 2024