decommission_streaming_err nemesis times out too early when some END-around log message is awaited before rebooting target DB node #8144
Comments
Is this test passing without tablets?
I didn't run it without tablets
Cluster state directly influences the speed of decommission.
This message is raft-specific and it is enabled in this test. IMHO,
I just wonder how it passed for so long; it's not that new a nemesis. Recently, by mistake, I set a 2h timeout when introducing parallel node operations, and indeed it was too low.
Please look into this one.
The test case Valeri is running is not used that often, and the refactoring @aleksbykov did to this nemesis
A decommission operation can be terminated for different reasons while the decommission process keeps running on the node and can still complete successfully. But because the nemesis is terminated by an exception, the cluster health validator can abort the whole test run, since the node status is Decommissioning. This started happening with the DecommissionStreamingErr nemesis, when the log message could not be found and the process was aborted by timeout while the decommission continued to run. To catch such cases, FailedDecommissionOperationMonitoring is introduced. It is a context manager that can be used to safely run a decommission operation, check the node status, and wait for the decommission to finish if the command was aborted or terminated. Fix: scylladb#8144
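The idea behind the context manager described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the SCT implementation: `FakeNode`, `monitor_failed_decommission`, the status strings, and the timeout values are all hypothetical names and numbers chosen for the example.

```python
import time
from contextlib import contextmanager


class FakeNode:
    """Hypothetical stand-in for a DB node; not an SCT class."""

    def __init__(self, statuses):
        # Statuses reported on successive polls.
        self._statuses = list(statuses)

    def status(self):
        # Return the next status; repeat the last one once the list is exhausted.
        if len(self._statuses) > 1:
            return self._statuses.pop(0)
        return self._statuses[0]


@contextmanager
def monitor_failed_decommission(node, timeout=10.0, poll_interval=0.01):
    """If the wrapped block raises, wait until the node leaves the
    Decommissioning state (or the timeout expires), then re-raise."""
    try:
        yield node
    except Exception:
        deadline = time.monotonic() + timeout
        while node.status() == "Decommissioning":
            if time.monotonic() >= deadline:
                break  # give up waiting; the original error is still re-raised
            time.sleep(poll_interval)
        raise


# Usage: the nemesis step times out while the decommission is still running.
node = FakeNode(["Decommissioning", "Decommissioning", "Decommissioned"])
try:
    with monitor_failed_decommission(node):
        raise TimeoutError("awaited log message was not found in time")
except TimeoutError:
    pass
final_status = node.status()
```

The original exception is still propagated, so the nemesis is reported as failed, but by that point the node is no longer left in the Decommissioning state for the health validator to trip over.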
Issue description
When running the disrupt_decommission_streaming_err nemesis, SCT picks one of the DB log messages to await before rebooting the target node. When one of the END-around messages gets picked, it is assigned a short timeout (80m) and the nemesis fails because of it.
The problem is that the decommission was proceeding fine; it just required more time to reach the required step.
The target node is yellow on the screenshot below:
Steps to Reproduce
longevity-multi-keyspaces-with-tablets CI job with the disrupt_decommission_streaming_err nemesis
Expected behavior: the timeout value should be closer to real-life decommission times.
Actual behavior: the timeout is too small.
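One way to express the expected behavior is to choose the wait timeout based on which log message was picked. The sketch below is only an illustration under stated assumptions: the function name, the `"END"` marker check, and the 4h value are hypothetical; only the 80m figure comes from this report.

```python
# 80 minutes, the value reported in this issue.
DEFAULT_TIMEOUT_S = 80 * 60

# Hypothetical: markers of late-stage messages mapped to a longer timeout.
# The 4h value is an assumption and should come from observed decommission times.
LATE_MARKER_TIMEOUTS_S = {"END": 4 * 60 * 60}


def timeout_for_message(message, default=DEFAULT_TIMEOUT_S, late=LATE_MARKER_TIMEOUTS_S):
    """Return the wait timeout in seconds for the picked log message."""
    for marker, timeout in late.items():
        if marker in message:
            # Never go below the default for a late-stage message.
            return max(timeout, default)
    return default


# Placeholder messages, not real Scylla log lines.
late_timeout = timeout_for_message("placeholder step: END")
early_timeout = timeout_for_message("placeholder step: START")
```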
Impact
How frequently does it reproduce?
100%
Installation details
SCT Version: master
Scylla version (or git commit hash): master
Logs