
nodetool removenode should call with long_running=True #9494

Closed
2 tasks
fruch opened this issue Dec 8, 2024 · 0 comments · Fixed by #9518
fruch commented Dec 8, 2024

Packages

Scylla version: 6.3.0~dev-20241206.7e2875d6489d with build-id 5227dd2a3fce4d2beb83ec6c17d47ad2e8ba6f5c

Kernel Version: 6.8.0-1019-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

node-5 was removed with the `nodetool removenode` command running on node-6.
The command's session disconnected from node-6 because of the known SSH overload issue:

2024-12-07T06:42:24.802+00:00 longevity-parallel-topology-schema--db-node-f1480f18-6     !INFO | sshd[966]: Bad packet length 64330216.

and the nemesis failed with:

2024-12-07 06:42:53.463: (DisruptionEvent Severity.ERROR) period_type=end event_id=2cb3e6a3-c061-4467-b75d-ed80856b22e5 duration=54m56s: nemesis_name=RemoveNodeThenAddNode target_node=Node longevity-parallel-topology-schema--db-node-f1480f18-5 [108.128.207.201 | 10.4.8.89] errors=Node was not removed properly (Node status:{'state': 'DL', 'load': '24.85GB', 'tokens': '256', 'owns': '?', 'host_id': 'ad4da1e5-ec36-4546-bcd0-8fcf3cf468fb', 'rack': '1c'})
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5441, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 188, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3740, in disrupt_remove_node_then_add_node
assert removed_node_status is None, \
AssertionError: Node was not removed properly (Node status:{'state': 'DL', 'load': '24.85GB', 'tokens': '256', 'owns': '?', 'host_id': 'ad4da1e5-ec36-4546-bcd0-8fcf3cf468fb', 'rack': '1c'})

No new node was added as a replacement, so the rest of the test failed because of the missing extra node:

2024-12-07 07:09:04.615: (DisruptionEvent Severity.ERROR) period_type=end event_id=6a245da1-8b11-4880-a507-dc1d1650fa87 duration=16m11s: nemesis_name=GrowShrinkCluster target_node=Node longevity-parallel-topology-schema--db-node-f1480f18-1 [52.30.25.88 | 10.4.10.172] errors=Not enough nodes for decommission
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5441, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 188, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4261, in disrupt_grow_shrink_cluster
self._shrink_cluster(rack=None, new_nodes=new_nodes)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4313, in _shrink_cluster
raise Exception(error)
Exception: Not enough nodes for decommission

Installation details

Cluster size: 5 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • longevity-parallel-topology-schema--db-node-f1480f18-9 (52.49.139.251 | 10.4.9.212) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-8 (34.253.253.174 | 10.4.10.218) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-7 (18.203.20.199 | 10.4.11.243) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-6 (54.229.234.249 | 10.4.11.215) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-5 (108.128.207.201 | 10.4.8.89) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-47 (34.253.166.213 | 10.4.11.115) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-46 (34.254.72.195 | 10.4.11.79) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-45 (52.209.40.236 | 10.4.8.12) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-44 (54.216.187.35 | 10.4.10.37) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-43 (52.208.185.113 | 10.4.9.123) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-42 (52.51.239.136 | 10.4.11.208) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-41 (79.125.35.245 | 10.4.9.189) (shards: -1)
  • longevity-parallel-topology-schema--db-node-f1480f18-40 (52.30.4.221 | 10.4.8.111) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-4 (52.215.212.129 | 10.4.11.30) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-39 (34.246.48.102 | 10.4.11.168) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-38 (52.214.168.223 | 10.4.11.158) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-37 (52.208.166.96 | 10.4.11.39) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-36 (54.154.223.246 | 10.4.9.212) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-35 (34.252.237.171 | 10.4.8.32) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-34 (52.215.174.232 | 10.4.8.26) (shards: -1)
  • longevity-parallel-topology-schema--db-node-f1480f18-33 (34.252.15.252 | 10.4.11.105) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-32 (52.208.1.155 | 10.4.8.127) (shards: -1)
  • longevity-parallel-topology-schema--db-node-f1480f18-31 (52.208.102.221 | 10.4.8.130) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-30 (54.155.111.214 | 10.4.8.22) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-3 (52.209.207.193 | 10.4.8.83) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-29 (52.19.174.175 | 10.4.10.43) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-28 (54.72.250.189 | 10.4.9.9) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-27 (54.76.22.128 | 10.4.10.212) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-26 (34.253.6.123 | 10.4.10.130) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-25 (34.252.184.92 | 10.4.8.224) (shards: -1)
  • longevity-parallel-topology-schema--db-node-f1480f18-24 (52.19.203.116 | 10.4.8.116) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-23 (54.228.57.232 | 10.4.10.38) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-22 (52.212.50.251 | 10.4.11.63) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-21 (52.18.181.205 | 10.4.9.75) (shards: -1)
  • longevity-parallel-topology-schema--db-node-f1480f18-20 (52.30.55.187 | 10.4.11.252) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-2 (34.247.14.31 | 10.4.8.29) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-19 (54.246.131.51 | 10.4.8.229) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-18 (46.137.126.74 | 10.4.8.19) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-17 (52.210.234.53 | 10.4.10.228) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-16 (52.214.130.114 | 10.4.9.97) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-15 (18.203.167.215 | 10.4.8.254) (shards: -1)
  • longevity-parallel-topology-schema--db-node-f1480f18-14 (99.81.10.144 | 10.4.9.11) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-13 (54.216.174.147 | 10.4.10.173) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-12 (34.248.105.27 | 10.4.8.241) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-11 (34.240.128.223 | 10.4.11.120) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-10 (63.33.180.3 | 10.4.11.213) (shards: 7)
  • longevity-parallel-topology-schema--db-node-f1480f18-1 (52.30.25.88 | 10.4.10.172) (shards: 7)

OS / Image: ami-0c7b4b0835c9342f7 (aws: undefined_region)

Test: longevity-schema-topology-changes-12h-test
Test id: f1480f18-df28-4af1-848e-c73e68da3a85
Test name: scylla-master/tier1/longevity-schema-topology-changes-12h-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor f1480f18-df28-4af1-848e-c73e68da3a85
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs f1480f18-df28-4af1-848e-c73e68da3a85

Logs:

Jenkins job URL
Argus

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Dec 9, 2024
… long_running

Since this command can take a long time and the SSH connection might get interrupted,
we should use the `long_running=True` option, so the command
is executed on the node and the output is retrieved when it
finishes.

Fixes: scylladb#9494
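The idea behind a `long_running` option can be illustrated with a minimal sketch: instead of holding one fragile SSH session open for the whole duration of `nodetool removenode`, start the command detached on the node, write its output and a completion marker to files, and poll for the marker. The `run_long_running` helper below is hypothetical and runs locally via `subprocess` for demonstration; the real SCT helper executes over SSH on the remote node and its actual API may differ.

```python
import os
import subprocess
import tempfile
import time


def run_long_running(cmd: str, poll_interval: float = 0.1, timeout: float = 30.0) -> str:
    """Sketch of the `long_running` pattern: detach the command, then
    poll for a completion marker instead of keeping a session open."""
    workdir = tempfile.mkdtemp()
    out_file = os.path.join(workdir, "out.log")
    done_file = os.path.join(workdir, "done")
    # Detach: the wrapped shell keeps running even if our "session" drops.
    wrapped = f"({cmd}) > {out_file} 2>&1; touch {done_file}"
    subprocess.Popen(["sh", "-c", wrapped], start_new_session=True)
    # Poll for the completion marker; after a reconnect, polling could
    # simply resume here without losing the command or its output.
    deadline = time.monotonic() + timeout
    while not os.path.exists(done_file):
        if time.monotonic() > deadline:
            raise TimeoutError(f"command did not finish: {cmd}")
        time.sleep(poll_interval)
    with open(out_file) as f:
        return f.read()
```

With this shape, a dropped connection mid-`removenode` loses only the polling loop, not the command itself: the node-side process keeps running and the output file is still there to collect afterwards.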
fruch added a commit that referenced this issue Dec 10, 2024
… long_running

Since this command can take a long time and the SSH connection might get interrupted,
we should use the `long_running=True` option, so the command
is executed on the node and the output is retrieved when it
finishes.

Fixes: #9494
(cherry picked from commit ecc2c7e)