After multiple hard reboots a node eventually loses connection to all other nodes in the cluster #8553
Comments
Looks like a network-down event. Not sure, but even the Amazon agent gave up. Afterwards it came back up (you can see external SSH attacks trying to log in!), but Scylla did not. Not sure this is a very interesting scenario.
SSH access was available all the time, I think; SCT could still connect to this node. But one of the interfaces, the one used by Scylla (with ipv4=10.4.21.183), went down and, judging from the SCT logs, never came back up: the drivers also kept trying to connect to this node until the end of the test (~18:40) and never succeeded. There is not enough evidence to say it's a Scylla problem; it can be explained by a networking problem. So I'm closing the issue.
I ran into the same problem in another test run.
External SSH attacks trying to log in, like @mykaul mentioned, were also observed, but later.
Packages
Scylla version:
Kernel Version:

Issue description
Impact
How frequently does it reproduce?

Installation details
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:

Logs and commands
Logs:
This case is using the two-interfaces setup; it's not doing any public communication at all, so how can it be attacked by anyone external? We need to investigate what's going on with this one.
@mykaul can you transfer it to SCT?
@roy and I found the reason for the external communication: the SG was open for SSH; we closed it, as it should have been. As for the network issue we are facing here, @juliayakovlev, let's try to reproduce it with the nemesis that hard-restarts a node. My gut feeling is that we might have issues with the network setup after a node reboot; we need a quicker reproducer.
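As a rough idea of what a quicker reproducer could look like (a sketch only, not SCT code; the instance id, region and IP below are placeholders), one could loop a forced stop/start as an approximation of a hard reboot and then check reachability of the second interface from a peer node:

```python
# Hypothetical quick reproducer (not SCT code): force stop/start an EC2
# instance several times and check that its second interface still answers.
import subprocess
import time

import boto3

INSTANCE_ID = "i-0123456789abcdef0"   # placeholder
REGION = "us-east-1"                  # placeholder
SECOND_IFACE_IP = "10.4.21.183"       # example private IP of the 2nd interface

ec2 = boto3.client("ec2", region_name=REGION)

for attempt in range(1, 11):
    # Approximate a hard reboot: force-stop the instance, then start it again.
    ec2.stop_instances(InstanceIds=[INSTANCE_ID], Force=True)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
    time.sleep(60)  # give the OS time to bring the interfaces up

    # Run from a peer node in the same VPC: can we still reach the 2nd interface?
    reachable = subprocess.run(
        ["ping", "-c", "3", "-W", "2", SECOND_IFACE_IP],
        capture_output=True,
    ).returncode == 0
    print(f"reboot #{attempt}: second interface reachable = {reachable}")
    if not reachable:
        break
```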
Please look at this one; it seems we have an issue with the multiple-network setup not working after a node is rebooted.
This test started to use the multi-network configuration on 3.6.2024, but this issue first happened on 24.7.2024 (https://argus.scylladb.com/test/98050732-dfe3-464c-a66a-f235bad30829/runs?additionalRuns[]=856e2796-0eb3-4051-a5d0-305442ceb57a). Trying to reproduce the issue with MultipleHardRebootNodeMonkey.
@juliayakovlev what is this log in the last comment? Its context isn't clear.
I am collecting the info in the meantime.
Branch:
Actually, no. It is the problem from scylladb/scylladb#18059.
I ran the tests with
Can you try running it with the Scylla version from the reported failures? If that doesn't reproduce, we'll leave it for now.
I ran it with the AMI from the issue:
https://argus.scylladb.com/test/b08fa1c7-ae96-4a5b-be2d-98db30813a00/runs?additionalRuns[]=23a99995-959d-4c8a-9067-f49c326f5a15
2 network interfaces
Just to sum up the observations so far: @juliayakovlev and I found that after a node reboot the 2nd interface isn't functioning. We also noticed a warning about the SCT netplan config permissions (maybe it doesn't load it at boot because of that?).
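For reference, recent netplan versions warn when a file under /etc/netplan is readable by group or others; as far as I know the warning alone doesn't prevent the config from being applied, but a quick check like the sketch below (hypothetical, file names assumed) could rule that out:

```python
# Hypothetical check (not SCT code): netplan warns with something like
# "Permissions for /etc/netplan/50-sct.yaml are too open" when a config file
# is accessible by others. Tightening the mode to 0600 silences the warning.
import os
import stat
from pathlib import Path

for cfg in Path("/etc/netplan").glob("*.yaml"):
    mode = stat.S_IMODE(cfg.stat().st_mode)
    if mode & 0o077:  # readable or writable by group/others
        print(f"{cfg}: mode {oct(mode)} is too open, tightening to 0600")
        os.chmod(cfg, 0o600)
```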
Created a branch. Ran 2 tests:
Let's open a PR and discuss the testing needed there.
After multiple hard reboots a node eventually loses connection to all other nodes in the cluster: after a node reboot, the 2nd interface isn't functioning. Found that our `sdcm.provision.aws.utils.configure_eth1_script` re-configures the second network interface, and this causes the connection loss. With an already existing and configured network interface we no longer need this eth1 configuration script. Fixes: scylladb#8553
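Not part of the actual change, but to illustrate the reasoning: the secondary interface is already configured when the node comes up, so regenerating an eth1 config on every boot is unnecessary and can interfere with the existing configuration. A defensive variant could have checked first whether the address is already in place (the function, interface name and IP below are assumptions):

```python
# Illustrative sketch only, not the real SCT code.
import json
import subprocess

def iface_has_address(iface: str, expected_ip: str) -> bool:
    """Return True if `iface` already carries `expected_ip`, i.e. no re-config is needed."""
    out = subprocess.run(
        ["ip", "-json", "addr", "show", "dev", iface],
        capture_output=True, text=True, check=True,
    ).stdout
    addrs = [
        addr_info["local"]
        for entry in json.loads(out)
        for addr_info in entry.get("addr_info", [])
    ]
    return expected_ip in addrs

# Hypothetical usage: only touch the interface when it is not configured yet.
# if not iface_has_address("eth1", "10.4.21.183"):
#     run_configure_eth1_script()   # placeholder for the legacy script
```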
Running the final test
After multiple hard reboots a node eventually loses connection to all other nodes in the cluster: after a node reboot, the 2nd interface isn't functioning. Found that our `sdcm.provision.aws.utils.configure_eth1_script` re-configures the second network interface, and this causes the connection loss. With an already existing and configured network interface we no longer need this eth1 configuration script. Fixes: #8553 (cherry picked from commit e059688)
Packages
Scylla version: 6.2.0~dev-20240809.3745d0a53457 with build-id d168ac61c2ac38ef15c6deba485431e666a8fdef
Kernel Version: 6.8.0-1013-aws
Issue description
The `multiple_hard_reboot_node` disrupt_method performed 10 hard reboots of node-5 in a row (the number of reboots is randomly selected from the range 2 to 10 for each test run). The 1st one started at 08:41:02 and the 10th finished at 08:54:07. All reboots finished successfully; Scylla started after the 10th reboot at 08:54:07:

Then the post-nemesis health check of the cluster was performed and passed, and the nemesis was marked as successfully finished at 09:18:14.

Next, the `replace_service_level_using_detach_during_load` nemesis was scheduled at 09:23:14. At the beginning of each nemesis a cluster health check is performed. This nemesis passed the pre-nemesis health check, but the nemesis disruption itself was not executed, as it is dedicated to SLA tests only. At 09:47:15 the nemesis was marked as skipped.

Next, the `network_reject_node_exporter` nemesis was scheduled at 09:47:15, and its pre-nemesis cluster health check failed: the `nodetool status` check on node-5 (the one that was hard rebooted multiple times previously) returned at 09:54:26 that it sees all other nodes as down:

At 09:54:24, in system.log on node-5, there are records that the other nodes are marked down in gossip:

To summarize: after the multiple reboots of node-5 there were no other disruptions executed during the test, and the cluster was able to pass a few cluster health checks, but eventually, within an hour, node-5 lost connection to the other nodes.
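For illustration only (the real SCT cluster health check is more involved than this), the failing check essentially boils down to running `nodetool status` on node-5 and flagging peers it reports as DN; a minimal sketch:

```python
# Minimal sketch of the kind of check that failed here, assuming nodetool is
# run locally on the node; SCT's actual health-check code is not shown.
import subprocess

def nodes_seen_down() -> list[str]:
    out = subprocess.run(
        ["nodetool", "status"], capture_output=True, text=True, check=True
    ).stdout
    down = []
    for line in out.splitlines():
        # Data lines look like: "DN  10.4.21.17  256.5 GB  256  ...  rack1"
        parts = line.split()
        if parts and parts[0] == "DN":
            down.append(parts[1])
    return down

if __name__ == "__main__":
    down = nodes_seen_down()
    if down:
        print(f"this node sees {len(down)} peers as DOWN: {down}")
```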
Impact
The node lost connection to the other cluster nodes over time, after multiple hard reboots.
How frequently does it reproduce?
No occurrences of the issue were noticed previously.
Installation details
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-05a75d3ca4e6ebd9f (aws: undefined_region)
Test: longevity-50gb-3days-test
Test id: 7b8958f1-4a7f-4cef-a68d-32c1393493b0
Test name: scylla-master/tier1/longevity-50gb-3days-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 7b8958f1-4a7f-4cef-a68d-32c1393493b0
$ hydra investigate show-logs 7b8958f1-4a7f-4cef-a68d-32c1393493b0
Logs:
Jenkins job URL
Argus