
decommission_streaming_err nemesis times out too early when some END-around log message is awaited before rebooting target DB node #8144

Open
2 tasks
vponomaryov opened this issue Jul 25, 2024 · 5 comments · May be fixed by #8843
Assignees
Labels
Bug Something isn't working right tier1

Comments

@vponomaryov
Contributor

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

When running the disrupt_decommission_streaming_err nemesis, SCT picks one of the DB log messages to await before rebooting the target node.

And when an END-around message gets picked, like the following:

2024-07-24 19:52:55,353 f:nemesis.py      l:3873 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'Finished token ring movement'

Then it gets assigned a short timeout (80m) and fails because of it.

The problem is that the decommission was proceeding fine; it just required more time to reach the required step.
The target node is the yellow one in the screenshot below:

Screenshot from 2024-07-24 17-36-38
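
To illustrate the pattern, here is a hypothetical sketch of the trigger-message/timeout pairing. Only the 80-minute value and the 'Finished token ring movement' message come from this issue; the other message and value are illustrative and this is not the actual SCT code:

```python
# Hypothetical sketch, not the real SCT implementation: the nemesis picks a
# DB log message to await before rebooting the target node, and each message
# carries its own timeout.
import random

REBOOT_TRIGGER_MESSAGES = {
    # DB log message awaited before reboot -> timeout in seconds
    "started decommission": 6 * 60 * 60,      # illustrative START-around message
    "Finished token ring movement": 80 * 60,  # END-around message, too short
}

message, timeout = random.choice(list(REBOOT_TRIGGER_MESSAGES.items()))
# The nemesis then waits up to `timeout` for `message` in the DB log before
# rebooting the node; on a loaded cluster 80 minutes can expire even though
# the decommission itself is healthy and still progressing.
```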

Steps to Reproduce

  1. Run the longevity-multi-keyspaces-with-tablets CI job with the disrupt_decommission_streaming_err nemesis
  2. See error

Expected behavior: the timeout value should be closer to real-life decommission durations.

Actual behavior: the timeout is too small.

Impact

How frequently does it reproduce?

100%

Installation details

SCT Version: master
Scylla version (or git commit hash): master

Logs

@vponomaryov vponomaryov removed their assignment Jul 25, 2024
@soyacz
Contributor

soyacz commented Jul 25, 2024

Is this test passing without tablets?
To me it looks like a very slow decommission.
And the question is: is the Finished token ring movement log message valid for tablets?

@vponomaryov
Contributor Author

Is this test passing without tablets?

I didn't run it without tablets.

To me it looks like a very slow decommission.

Cluster state directly influences the speed of decommission.
Data size, CPU load, disk load and so on...

And the question is: is the Finished token ring movement log message valid for tablets?

This message is raft-specific and it is enabled in this test.

IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster.
Do we have limits for the decommission operation defined anywhere?

@soyacz
Contributor

soyacz commented Jul 26, 2024

IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster. Do we have limits for the decommission operation defined anywhere?

I just wonder how it passed for so long in the past - it's not that new a nemesis.

Recently, by mistake, I used a 2h timeout when introducing parallel node operations, and indeed it was too low.
I switched to MAX_TIME_WAIT_FOR_DECOMMISSION, which is set to 6h.
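
For illustration, a minimal sketch of that direction: the constant name comes from the comment above, while the node helper and its signature are assumptions, not the actual SCT API:

```python
# Sketch only: MAX_TIME_WAIT_FOR_DECOMMISSION is mentioned in the comment
# above; `wait_for_log_message` is a hypothetical helper, not the real SCT API.
MAX_TIME_WAIT_FOR_DECOMMISSION = 6 * 60 * 60  # 6 hours, in seconds

def wait_before_reboot(node, message: str) -> None:
    # Wait for the chosen DB log message using the shared decommission limit
    # instead of a short, message-specific timeout such as 80 minutes.
    node.wait_for_log_message(message, timeout=MAX_TIME_WAIT_FOR_DECOMMISSION)
```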

@fruch
Contributor

fruch commented Aug 15, 2024

@aleksbykov

Please look into this one.

@fruch
Contributor

fruch commented Aug 15, 2024

IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster. Do we have limits for the decommission operation defined anywhere?

I just wonder how it passed for so long in the past - it's not that new a nemesis.

The test case Valeri is running isn't used that often, and the refactoring @aleksbykov did on this nemesis
wasn't done that long ago.
I.e. this nemesis is relatively new, so it's no surprise that 1h isn't enough for all cases.

Recently, by mistake, I used a 2h timeout when introducing parallel node operations, and indeed it was too low. I switched to MAX_TIME_WAIT_FOR_DECOMMISSION, which is set to 6h.

@temichus temichus added Bug Something isn't working right tier1 labels Sep 15, 2024
aleksbykov added a commit to aleksbykov/scylla-cluster-tests that referenced this issue Sep 25, 2024
A decommission operation can be terminated for various reasons while the
decommission process keeps running on the node, and the node may still be
decommissioned successfully. But because the nemesis is terminated by an
exception, the cluster health validator can abort the whole test run, since
the node status is Decommissioning. This started happening with the
DecommissionStreamingErr nemesis: when the expected log message is not found,
the nemesis is aborted by timeout while the decommission continues to run.

To catch such cases, FailedDecommissionOperationMonitoring is introduced.
It is a context manager which can be used to safely run a decommission
operation, check the node status, and wait for the decommission to finish
if the command was aborted or terminated.

Fix: scylladb#8144
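
A rough sketch of how such a context manager could be shaped, inferred only from the commit message above; the helper names, the status string, and the waiting logic are assumptions, not the actual PR code:

```python
# Hypothetical sketch of FailedDecommissionOperationMonitoring, inferred from
# the commit message above; the real implementation in the PR may differ.
import time

class FailedDecommissionOperationMonitoring:
    def __init__(self, target_node, timeout: int = 6 * 60 * 60):
        self.target_node = target_node
        self.timeout = timeout

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None:
            return False  # decommission finished normally, nothing to do
        # The nemesis failed (e.g. by timeout), but the decommission may still
        # be running on the node: wait for it to finish so the cluster health
        # validator does not see a lingering "Decommissioning" node status.
        deadline = time.time() + self.timeout
        while time.time() < deadline:
            if self.target_node.get_node_status() != "Decommissioning":  # hypothetical helper
                break
            time.sleep(60)
        return False  # do not swallow the original exception
```

Usage would then look roughly like `with FailedDecommissionOperationMonitoring(target_node): run_decommission(target_node)`, so an aborted decommission is waited out instead of leaving the node mid-operation.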