
After multiple hard reboots a node eventually loses connection to all other nodes in the cluster #8553

Closed
1 of 2 tasks
dimakr opened this issue Aug 15, 2024 · 23 comments · Fixed by #8883

@dimakr
Contributor

dimakr commented Aug 15, 2024

Packages

Scylla version: 6.2.0~dev-20240809.3745d0a53457 with build-id d168ac61c2ac38ef15c6deba485431e666a8fdef
Kernel Version: 6.8.0-1013-aws

Issue description

The multiple_hard_reboot_node nemesis (disrupt_method) performed 10 hard reboots of node-5 in a row (the number of reboots is randomly selected from the range 2 to 10 for each test run). The 1st reboot started at 08:41:02 and the 10th finished at 08:54:07. All reboots finished successfully, and Scylla started after the 10th reboot at 08:54:07:

08:54:07.223699 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:main] init - Scylla version 6.2.0~dev-0.20240809.3745d0a53457 initialization completed. 

Then the post-nemesis cluster health check was performed and passed, and the nemesis was marked as successfully finished at 09:18:14.

Next, the replace_service_level_using_detach_during_load nemesis was scheduled at 09:23:14. A cluster health check is performed at the beginning of each nemesis; this nemesis passed its pre-nemesis health check, but the disruption itself was not executed because it is dedicated to SLA tests only. At 09:47:15 the nemesis was marked as skipped.

Next, the network_reject_node_exporter nemesis was scheduled at 09:47:15 and its pre-nemesis cluster health check failed: the nodetool status check on node-5 (the node that had been hard rebooted multiple times earlier) reported at 09:54:26 that it sees all other nodes as down:

< t:2024-08-10 09:54:26,720 f:cluster.py      l:2701 c:sdcm.cluster         p:DEBUG > Check the health of the node `longevity-tls-50gb-3d-master-db-node-7b8958f1-5' [attempt #1]
< t:2024-08-10 09:54:26,896 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: Datacenter: eu-west
< t:2024-08-10 09:54:26,897 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: ===================
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: Status=Up/Down
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: |/ State=Normal/Leaving/Joining/Moving
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: -- Address     Load     Tokens Owns Host ID                              Rack
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.21.103 19.41 GB 256    ?    af2bd628-7745-4085-baa8-6199972679e8 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: UN 10.4.21.183 19.54 GB 256    ?    b4273168-eea5-459e-8567-ee7350612c57 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.21.250 20.67 GB 256    ?    ce1395db-9b0e-47b8-9853-8806c70459f4 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.22.115 19.49 GB 256    ?    15b3be5f-54bc-473a-b823-2ea2945a6bf2 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.23.56  19.74 GB 256    ?    c26fe030-4ef8-4d18-84f1-a13d65fbcf01 1c
< t:2024-08-10 09:54:26,902 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.69>: DN 10.4.23.78  24.07 GB 256    ?    8f147dbe-f7d4-409a-be66-32937d802a8c 1c

At around 09:54:23, system.log on node-5 contains records showing the other nodes being marked down in gossip:

Aug 10 09:54:22.939111 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  3:strm] gossip - failure_detector_loop: Send echo to node 10.4.22.115, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:22.939121 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  3:strm] gossip - failure_detector_loop: Mark node 10.4.22.115 as DOWN
Aug 10 09:54:22.940995 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress 15b3be5f-54bc-473a-b823-2ea2945a6bf2/10.4.22.115 is now DOWN, status = NORMAL
Aug 10 09:54:23.147016 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - failure_detector_loop: Send echo to node 10.4.23.78, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:23.147026 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - failure_detector_loop: Mark node 10.4.23.78 as DOWN
Aug 10 09:54:23.148989 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress 8f147dbe-f7d4-409a-be66-32937d802a8c/10.4.23.78 is now DOWN, status = NORMAL
Aug 10 09:54:23.535627 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  1:strm] gossip - failure_detector_loop: Send echo to node 10.4.21.250, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:23.535637 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  1:strm] gossip - failure_detector_loop: Mark node 10.4.21.250 as DOWN
Aug 10 09:54:23.536990 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress ce1395db-9b0e-47b8-9853-8806c70459f4/10.4.21.250 is now DOWN, status = NORMAL
Aug 10 09:54:23.757903 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  2:strm] gossip - failure_detector_loop: Send echo to node 10.4.21.103, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:23.757913 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  2:strm] gossip - failure_detector_loop: Mark node 10.4.21.103 as DOWN
Aug 10 09:54:23.759997 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress af2bd628-7745-4085-baa8-6199972679e8/10.4.21.103 is now DOWN, status = NORMAL
Aug 10 09:54:23.890642 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  4:strm] gossip - failure_detector_loop: Send echo to node 10.4.23.56, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Aug 10 09:54:23.890653 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  4:strm] gossip - failure_detector_loop: Mark node 10.4.23.56 as DOWN
Aug 10 09:54:23.892488 longevity-tls-50gb-3d-master-db-node-7b8958f1-5 scylla[926]:  [shard  0:strm] gossip - InetAddress c26fe030-4ef8-4d18-84f1-a13d65fbcf01/10.4.23.56 is now DOWN, status = NORMAL

To summarize: after the multiple reboots of node-5 no other disruptions were executed during the test, and the cluster passed a few health checks, but within about an hour node-5 lost connection to the other nodes.

  • This issue is a regression.
  • It is unknown if this issue is a regression.


Impact

The node lost connection to the other cluster nodes over time, after multiple hard reboots.

How frequently does it reproduce?

No occurrences of the issue were noticed previously.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-7b8958f1-7 (54.217.202.232 | 10.4.21.250) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-6 (52.212.55.136 | 10.4.23.56) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-5 (54.73.59.86 | 10.4.21.183) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-4 (52.48.211.247 | 10.4.22.17) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-3 (46.137.188.213 | 10.4.22.115) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-2 (34.246.25.227 | 10.4.23.78) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7b8958f1-1 (63.35.206.244 | 10.4.21.103) (shards: 14)

OS / Image: ami-05a75d3ca4e6ebd9f (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 7b8958f1-4a7f-4cef-a68d-32c1393493b0
Test name: scylla-master/tier1/longevity-50gb-3days-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 7b8958f1-4a7f-4cef-a68d-32c1393493b0
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7b8958f1-4a7f-4cef-a68d-32c1393493b0

Logs:

Jenkins job URL
Argus

@mykaul
Contributor

mykaul commented Aug 15, 2024

Looks like a network-down event - not sure, but even the Amazon agent gave up. Afterwards the network came back up (you can see external SSH attackers trying to log in!), but Scylla did not recover. Not sure this is a very interesting scenario.

@kbr-scylla

ssh access was available all the time, I think. SCT could still connect to this node.

But one of the interfaces, the one used by Scylla (ipv4=10.4.21.183), went down and, judging from SCT logs, never went back up again: drivers kept trying to connect to this node until the end of the test (~18:40) and never succeeded.
(Also, the Java driver likes to spam the logs with absolutely no rate limiting... it prints thousands of lines in a single millisecond.) Other nodes were also printing that they fail to apply view updates on this node.

Not enough evidence to say that it's a Scylla problem; it can be explained by a networking problem. So I'm closing the issue.

@kbr-scylla closed this as not planned Aug 21, 2024
@juliayakovlev
Contributor

I faced the same problem in another test run.
It's hard to say what happened there.
The only thing that I found: around the start of "Rebuilding bloom filter", the node longevity-tls-50gb-3d-master-db-node-1b45877b-2 lost connection with the other nodes (rpc call timed out), and vice versa.

2024-08-17T12:15:37.423+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  8:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1e_0rfnk2ejcgiw7vtgxa-big-Filter.db: resizing bitset from 7397608 bytes to 3901872 bytes. sstable origin: compaction
2024-08-17T12:15:37.423+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  8:comp] compaction - [Compact keyspace1.standard1 5b743de0-5c92-11ef-9e12-1d88475c75ee] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1e_0rfnk2ejcgiw7vtgxa-big-Data.db:level=0]. 1GB to 789MB (~52% of original) in 23311ms = 64MB/s. ~5918080 total partitions merged to 3121490.
2024-08-17T12:15:40.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  3:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1q_3edv42l75q8n848yr2-big-Filter.db: resizing bitset from 7274728 bytes to 3903792 bytes. sstable origin: compaction
2024-08-17T12:15:40.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  3:comp] compaction - [Compact keyspace1.standard1 62dfa7e0-5c92-11ef-aa3d-1d7f475c75ee] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1q_3edv42l75q8n848yr2-big-Data.db:level=0]. 1GB to 790MB (~53% of original) in 13708ms = 107MB/s. ~5819776 total partitions merged to 3123030.
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2  !WARNING | scylla[926]:  [shard  4:strm] gossip - failure_detector_loop: Send echo to node 10.4.20.5, status = failed: seastar::rpc::timeout_error (rpc call timed out)
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  4:strm] gossip - failure_detector_loop: Mark node 10.4.20.5 as DOWN
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  0:strm] gossip - InetAddress 1a9165b4-a930-4c8c-8274-8b479554f43e/10.4.20.5 is now DOWN, status = NORMAL
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/keyspace1/standard1-ac26ebc05c4711efbb3a3041fdca8f85/me-3gir_0y1w_1pii82m5nd9bvcmmzy-big-Filter.db: resizing bitset from 7335848 bytes to 3906168 bytes. sstable origin: compaction
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2  !WARNING | scylla[926]:  [shard  1:strm] gossip - failure_detector_loop: Send echo to node 10.4.23.20, status = failed: seastar::rpc::timeout_error (rpc call timed out)
2024-08-17T12:15:46.951+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  1:strm] gossip - failure_detector_loop: Mark node 10.4.23.20 as DOWN
2024-08-17T12:15:47.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  0:strm] gossip - InetAddress 30cc1c48-4f93-441c-af38-87f36baa0d96/10.4.23.20 is now DOWN, status = NORMAL
2024-08-17T12:15:47.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2  !WARNING | scylla[926]:  [shard  2:strm] gossip - failure_detector_loop: Send echo to node 10.4.20.109, status = failed: seastar::rpc::timeout_error (rpc call timed out)
2024-08-17T12:15:47.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  2:strm] gossip - failure_detector_loop: Mark node 10.4.20.109 as DOWN
2024-08-17T12:15:47.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | scylla[926]:  [shard  0:strm] gossip - InetAddress bf865007-0310-4aee-a755-259daafeffbb/10.4.20.109 is now DOWN, status = NORMAL

External SSH attackers trying to log in, as @mykaul mentioned, were also observed, but later:

2024-08-17T12:18:17.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2      !ERR | sshd[2380]: error: kex_exchange_identification: read: Connection reset by peer
2024-08-17T12:18:17.229+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sshd[2380]: Connection reset by 211.186.118.31 port 22842
2024-08-17T12:18:24.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sshd[2387]: Connection closed by 211.186.118.31 port 26821 [preauth]
2024-08-17T12:18:28.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2   !NOTICE | sudo[2392]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2024-08-17T12:18:28.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sudo[2392]: pam_unix(sudo:session): session opened for user root(uid=0) by scyllaadm(uid=1000)
2024-08-17T12:18:28.729+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sudo[2392]: pam_unix(sudo:session): session closed for user root
2024-08-17T12:18:30.979+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sshd[2389]: Invalid user orangepi from 211.186.118.31 port 37739
2024-08-17T12:18:31.479+00:00 longevity-tls-50gb-3d-master-db-node-1b45877b-2     !INFO | sshd[2389]: Connection closed by invalid user orangepi 211.186.118.31 port 37739 [preauth]

Packages

Scylla version: 6.2.0~dev-20240816.afee3924b3dc with build-id c01d2a55a9631178e3fbad3869c20ef3c8dcf293

Kernel Version: 6.8.0-1013-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-1b45877b-7 (54.220.171.0 | 10.4.20.109) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-6 (54.216.84.210 | 10.4.23.43) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-5 (54.155.51.83 | 10.4.21.227) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-4 (63.33.182.152 | 10.4.20.5) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-3 (54.78.114.46 | 10.4.23.20) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-2 (54.246.121.156 | 10.4.21.135) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-1b45877b-1 (108.129.84.60 | 10.4.22.71) (shards: 14)

OS / Image: ami-0f440d7175113787f (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 1b45877b-8dfa-418e-b56c-43be70362fd3
Test name: scylla-master/tier1/longevity-50gb-3days-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 1b45877b-8dfa-418e-b56c-43be70362fd3
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 1b45877b-8dfa-418e-b56c-43be70362fd3

Logs:

Jenkins job URL
Argus

@fruch
Contributor

fruch commented Sep 4, 2024

@juliayakovlev

this case is using the two-interfaces setup; it is not doing public communication at all, so how come it can be attacked by anyone external? We need to investigate what's going on with this one.

@fruch
Contributor

fruch commented Sep 4, 2024

@mykaul can you transfer it to SCT?

@mykaul transferred this issue from scylladb/scylladb Sep 4, 2024
@fruch assigned juliayakovlev and unassigned dimakr and kbr-scylla Sep 4, 2024
@fruch
Contributor

fruch commented Sep 4, 2024

@roy and I found the reason for the external communication: the SG (security group) was open for ssh; we closed it, as it should have been.

As for the network issue we are facing here:

@juliayakovlev, let's try to reproduce it with a nemesis that hard-reboots a node.

My gut feeling is that we might have an issue with the network setup after a node reboot; we need a quicker reproducer.
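A rough sketch of the kind of quick standalone reproducer meant here, assuming AWS CLI and SSH access over the primary interface; the instance ID, SSH target and Scylla IP below are placeholders, and `aws ec2 reboot-instances` only approximates the nemesis's hard reboot (a stop/start may be closer):

```bash
#!/usr/bin/env bash
# Hypothetical standalone reproducer: reboot the node in a loop and verify
# that the secondary (Scylla-facing) interface comes back with its address.
set -euo pipefail

INSTANCE_ID="i-0123456789abcdef0"            # placeholder: node under test
SSH_TARGET="scyllaadm@<primary-public-ip>"   # placeholder: reachable over the 1st interface
SCYLLA_IP="10.4.21.183"                      # example: private IP Scylla listens on

for i in $(seq 1 10); do
    echo "=== reboot #${i} ==="
    aws ec2 reboot-instances --instance-ids "${INSTANCE_ID}"
    sleep 180   # give the node time to boot and bring its interfaces up

    # Does any interface still carry the Scylla-facing address?
    if ssh "${SSH_TARGET}" "ip -br addr show" | grep -qF "${SCYLLA_IP}"; then
        echo "secondary interface OK after reboot #${i}"
    else
        echo "secondary interface lost ${SCYLLA_IP} after reboot #${i}" >&2
        ssh "${SSH_TARGET}" "ip -br addr show; ip route" >&2
        exit 1
    fi
done
```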

@fruch reopened this Sep 15, 2024
@fruch
Contributor

fruch commented Sep 15, 2024

@juliayakovlev

please look at this one; it seems we have an issue with the multiple-network setup not working after a node is rebooted.

@juliayakovlev
Contributor

This test started to use the multi-network configuration on 3.6.2024, but this issue first happened on 24.7.2024 (https://argus.scylladb.com/test/98050732-dfe3-464c-a66a-f235bad30829/runs?additionalRuns[]=856e2796-0eb3-4051-a5d0-305442ceb57a).
In this run from 13.7.2024 the issue did not happen.

Trying to reproduce the issue with MultipleHardRebootNodeMonkey.

@juliayakovlev
Contributor

juliayakovlev commented Sep 16, 2024

< t:2024-08-03 10:37:09,300 f:cluster.py      l:2691 c:sdcm.cluster         p:DEBUG > Check the health of the node `longevity-tls-50gb-3d-master-db-node-6f2ed542-7' [attempt #3]

< t:2024-08-03 10:37:09,484 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: Datacenter: eu-west
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: ===================
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: Status=Up/Down
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: |/ State=Normal/Leaving/Joining/Moving
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: -- Address     Load     Tokens Owns Host ID                              Rack
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.21.188 16.75 GB 256    ?    3e2c0462-dcc8-4bae-9f4a-d848da35c8e5 1c  
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.21.218 18.00 GB 256    ?    1dfedffc-26d6-4605-b8d0-c25bc3aeaf86 1c  
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.21.245 17.94 GB 256    ?    6f319656-cf33-4058-9d1f-865bda65bbd9 1c  
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.22.148 18.35 GB 256    ?    565296e7-bb69-46f6-939c-ccfb3b932f05 1c  
< t:2024-08-03 10:37:09,489 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: UN 10.4.22.222 16.29 GB 256    ?    ce53f8cb-2f70-4cbe-b3db-3537b33336ae 1c  
< t:2024-08-03 10:37:09,490 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.9.156>: DN 10.4.23.44  18.55 GB 256    ?    660b47f7-92be-44c8-b5ff-c5486cd5a0de 1c  
< t:2024-08-03 10:37:12,301 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.21.245","error":"setup: dial tcp :0->10.4.21.245:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
< t:2024-08-03 10:37:12,312 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.21.188","error":"setup: dial tcp :0->10.4.21.188:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
< t:2024-08-03 10:37:12,312 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.21.218","error":"setup: dial tcp :0->10.4.21.218:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
< t:2024-08-03 10:37:12,312 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.22.148","error":"setup: dial tcp :0->10.4.22.148:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
< t:2024-08-03 10:37:12,312 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.8.224>: Aug 03 10:37:12 longevity-tls-50gb-3d-master-monitor-node-6f2ed542-1 scylla-manager[9562]: {"L":"ERROR","T":"
2024-08-03T10:37:12.006Z","N":"healthcheck.CQL healthcheck","M":"Parallel hosts check failed","host":"10.4.22.222","error":"setup: dial tcp :0->10.4.22.222:9142: connect: connection refused","_trace_id":"zMxUxZM
CRzWFShZ0hPjmkw","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.c
om/scylladb/scylla-manager/v3/pkg/service/healthcheck.runner.checkHosts.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/healthcheck/runner.go:101\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parall
el.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}

< t:2024-08-03 10:37:12,017 f:cluster_aws.py  l:497  c:sdcm.cluster_aws     p:DEBUG > Node longevity-tls-50gb-3d-master-db-node-6f2ed542-7 [79.125.27.189 | 10.4.22.222]: Sorted interfaces: [NetworkInterface(ipv4
_public_address='79.125.27.189', ipv6_public_addresses=['2a05:d018:12e3:f002:46da:2b5d:eb8f:d34b'], ipv4_private_addresses=['10.4.9.156'], ipv6_private_address='', dns_private_name='ip-10-4-9-156.eu-west-1.compu
te.internal', dns_public_name='ec2-79-125-27-189.eu-west-1.compute.amazonaws.com', device_index=0), NetworkInterface(ipv4_public_address=None, ipv6_public_addresses=['2a05:d018:12e3:f005:7b0a:1e0d:e8d1:0329'], i
pv4_private_addresses=['10.4.22.222'], ipv6_private_address='', dns_private_name='ip-10-4-22-222.eu-west-1.compute.internal', dns_public_name=None, device_index=1)]
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG > /10.4.22.222
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   generation:1722656360
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   heartbeat:76957
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   LOAD:17492227739
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   STATUS:NORMAL,9222693218404302345
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   CDC_STREAMS_TIMESTAMP:v2;1722656428071;f8bd2926-5149-11ef-44a5-d34f93b96fc5
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   IGNOR_MSB_BITS:12
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   SHARD_COUNT:14
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   DC:eu-west
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   SCHEMA_TABLES_VERSION:3
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   RACK:1c
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   RPC_READY:1
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   NET_VERSION:0
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   HOST_ID:ce53f8cb-2f70-4cbe-b3db-3537b33336ae
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   RPC_ADDRESS:10.4.22.222
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   RELEASE_VERSION:3.0.8
< t:2024-08-03 10:37:12,125 f:cluster.py      l:2772 c:sdcm.cluster         p:DEBUG >   VIEW_BACKLOG:19593:877238681:1722681431300
< t:2024-08-03 10:37:13,735 f:cluster.py      l:2698 c:sdcm.cluster         p:DEBUG > One or more node `longevity-tls-50gb-3d-master-db-node-6f2ed542-7' health validation has failed

@fruch
Contributor

fruch commented Sep 16, 2024

@juliayakovlev what is this log, in the last comment? The context of it isn't clear

@juliayakovlev
Contributor

@juliayakovlev what is this log, in the last comment? The context of it isn't clear

I am collecting the info in the meantime

@juliayakovlev
Contributor

Reproduced https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=fb21a056-02c8-4c86-888e-f3fb5303c1ff

Branch: issue_sct_8553
The issue is reproduced with 2 nemeses: [disrupt_no_corrupt_repair, disrupt_multiple_hard_reboot_node] after disrupt_no_corrupt_repair

@juliayakovlev
Contributor

Reproduced https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=fb21a056-02c8-4c86-888e-f3fb5303c1ff

Branch: issue_sct_8553 The issue is reproduced with 2 nemeses: [disrupt_no_corrupt_repair, disrupt_multiple_hard_reboot_node] after disrupt_no_corrupt_repair

Actually no - it is the problem from scylladb/scylladb#18059

@juliayakovlev
Contributor

I ran the tests with disrupt_multiple_hard_reboot_node only - the problem was not reproduced

@fruch
Contributor

fruch commented Sep 23, 2024

I ran the tests with disrupt_multiple_hard_reboot_node only - the problem was not reproduced

Can you try running it with the Scylla version from the reported failures?

If that doesn't reproduce, we'll leave it for now

@juliayakovlev
Contributor

I ran the tests with disrupt_multiple_hard_reboot_node only - the problem was not reproduced

Can you try running it with the Scylla version from the reported failures?

If that doesn't reproduce, we'll leave it for now

I ran it with the AMI from the issue

@juliayakovlev
Contributor

juliayakovlev commented Sep 26, 2024

https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=23a99995-959d-4c8a-9067-f49c326f5a15
One network interface.
The test ran for about 23 hours with disrupt_multiple_hard_reboot_node only. The issue was not reproduced.
With ami-0071c4f2153cf06c2 (the AMI from the issue).

@juliayakovlev
Contributor

https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=641f6728-624b-418d-9205-b8d4beac763c

2 network interfaces.
The test failed after about 55 minutes, with the disrupt_multiple_hard_reboot_node nemesis only.
With ami-0071c4f2153cf06c2 (the AMI from the issue).

@fruch
Contributor

fruch commented Sep 26, 2024

Just to sum up the observations so far, @juliayakovlev and I found:

After a node reboot, the 2nd interface isn't functioning; running sudo netplan apply seems to fix it.
We are now trying to clear the netplan configuration that SCT was adding.

We also noticed a warning about the permissions of the SCT netplan config (maybe it isn't loaded at boot because of that?)
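For reference, a minimal sketch of the manual checks and the workaround above, run on the rebooted node; the netplan file path and interface names are assumptions, not taken from the SCT code:

```bash
# After the reboot: is the 2nd NIC up and does it still have its private address?
ip -br link show
ip -br addr show

# Workaround that appeared to restore connectivity: re-apply the netplan config
sudo netplan apply

# netplan warns when its YAML files are readable by others; the suspicion above
# is that this might be related to the config not being applied at boot.
# Tightening the permissions silences the warning (path is an assumption):
sudo chmod 600 /etc/netplan/*.yaml
sudo netplan apply
```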

@juliayakovlev
Contributor

Created branch issue_sct_8553. The change there: do not run sdcm.provision.aws.utils.configure_eth1_script, i.e. do not re-configure the secondary network interface.

Ran 2 tests:

  1. A test with disrupt_multiple_hard_reboot_node only - PASSED. The test ran for 2 days.
  2. A regular test with all nemeses (according to the test yaml) - still running:
    https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=21961db0-285c-4b84-a4da-4d70be149e06
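As a rough verification (an assumption of mine, not something from the branch itself), on a freshly booted node from issue_sct_8553 one could check that the secondary interface comes up correctly without the eth1 script:

```bash
# Only the stock (cloud-init generated) netplan config should be present now
ls -l /etc/netplan/

# The secondary NIC should get its private address and routes straight after boot
ip -br addr show
ip route show

# And Scylla should be listening on that address (7000 inter-node, 9042 CQL)
ss -ltn | grep -E ':7000|:9042'
```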

@fruch
Contributor

fruch commented Sep 29, 2024

Created branch issue_sct_8553. The change there: do not run sdcm.provision.aws.utils.configure_eth1_script, i.e. do not re-configure the secondary network interface.

Ran 2 tests:

  1. A test with disrupt_multiple_hard_reboot_node only - PASSED. The test ran for 2 days.
  2. A regular test with all nemeses (according to the test yaml) - still running:
    https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=21961db0-285c-4b84-a4da-4d70be149e06

Let's open a PR and discuss the testing needed there

juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Sep 30, 2024
After multiple hard reboots a node eventually loses connection to all other nodes in the cluster.
After a node reboot, the 2nd interface isn't functioning.
Found that our `sdcm.provision.aws.utils.configure_eth1_script` re-configures the second network interface, and this is what causes the lost connection.
Since the secondary network interface is already configured, we do not need this eth1 configuration script any more.

Fixes: scylladb#8553
@juliayakovlev
Contributor

Created branch issue_sct_8553. The change there: do not run sdcm.provision.aws.utils.configure_eth1_script, i.e. do not re-configure the secondary network interface.

Ran 2 tests:

  1. A test with disrupt_multiple_hard_reboot_node only - PASSED. The test ran for 2 days.
  2. A regular test with all nemeses (according to the test yaml) - still running:
    https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=21961db0-285c-4b84-a4da-4d70be149e06

Let's open a PR and discuss the testing needed there

@fruch #8883

@juliayakovlev
Contributor

Running the final test

@fruch closed this as completed in e059688 Oct 13, 2024
mergify bot pushed commits that referenced this issue Oct 13, 2024 (cherry picked from commit e059688)
fruch pushed commits that referenced this issue Oct 14, 2024 (cherry picked from commit e059688)
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Nov 14, 2024