
decommission_streaming_err nemesis times out too early when some END-around log message is awaited before rebooting target DB node #8144

Open
2 tasks
vponomaryov opened this issue Jul 25, 2024 · 5 comments · May be fixed by #8843
Assignees
Labels
Bug Something isn't working right tier1

Comments

@vponomaryov
Contributor

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

When running the disrupt_decommission_streaming_err nemesis, SCT picks one of the DB log messages to await before rebooting the target node.

And when an END-around message gets picked, like the following:

2024-07-24 19:52:55,353 f:nemesis.py      l:3873 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'Finished token ring movement'

Then it gets assigned a short timeout (80m) and fails because of it.

The problem is that the decommission was proceeding fine; it just required more time to reach the required step.
The target node is the yellow one in the screenshot below:

Screenshot from 2024-07-24 17-36-38
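
To illustrate the pattern, here is a hypothetical sketch of the trigger-message/timeout pairing. Only the 80-minute value and the 'Finished token ring movement' message come from this issue; the other message and value are illustrative and this is not the actual SCT code:

```python
# Hypothetical sketch, not the real SCT implementation: the nemesis picks a
# DB log message to await before rebooting the target node, and each message
# carries its own timeout.
import random

REBOOT_TRIGGER_MESSAGES = {
    # DB log message awaited before reboot -> timeout in seconds
    "started decommission": 6 * 60 * 60,      # illustrative START-around message
    "Finished token ring movement": 80 * 60,  # END-around message, too short
}

message, timeout = random.choice(list(REBOOT_TRIGGER_MESSAGES.items()))
# The nemesis then waits up to `timeout` for `message` in the DB log before
# rebooting the node; on a loaded cluster 80 minutes can expire even though
# the decommission itself is healthy and still progressing.
```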

Steps to Reproduce

  1. Run the longevity-multi-keyspaces-with-tablets CI job with the disrupt_decommission_streaming_err nemesis
  2. See error

Expected behavior: the timeout value should be closer to real-life decommission durations.

Actual behavior: the timeout is too small.

Impact

How frequently does it reproduce?

100%

Installation details

SCT Version: master
Scylla version (or git commit hash): master

Logs

@vponomaryov vponomaryov removed their assignment Jul 25, 2024
@soyacz
Contributor

soyacz commented Jul 25, 2024

Is this test passing without tablets?
To me it looks like a very slow decommission.
And the question is: is the Finished token ring movement log message valid for tablets?

@vponomaryov
Contributor Author

Is this test passing without tablets?

I didn't run it without tablets.

To me it looks like a very slow decommission.

Cluster state directly influences the speed of decommission.
Data size, CPU load, disk load and so on...

And the question is: is the Finished token ring movement log message valid for tablets?

This message is raft-specific and it is enabled in this test.

IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster.
Do we have limits for the decommission operation defined anywhere?

@soyacz
Contributor

soyacz commented Jul 26, 2024

IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster. Do we have limits for the decommission operation defined anywhere?

I just wonder how it passed for so long in the past - it's not that new a nemesis.

Recently, by mistake, I used a 2h timeout when introducing parallel node operations, and indeed it was too low.
I switched to MAX_TIME_WAIT_FOR_DECOMMISSION, which is set to 6h.
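
For illustration, a minimal sketch of that direction: the constant name comes from the comment above, while the node helper and its signature are assumptions, not the actual SCT API:

```python
# Sketch only: MAX_TIME_WAIT_FOR_DECOMMISSION is mentioned in the comment
# above; `wait_for_log_message` is a hypothetical helper, not the real SCT API.
MAX_TIME_WAIT_FOR_DECOMMISSION = 6 * 60 * 60  # 6 hours, in seconds

def wait_before_reboot(node, message: str) -> None:
    # Wait for the chosen DB log message using the shared decommission limit
    # instead of a short, message-specific timeout such as 80 minutes.
    node.wait_for_log_message(message, timeout=MAX_TIME_WAIT_FOR_DECOMMISSION)
```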

@fruch
Contributor

fruch commented Aug 15, 2024

@aleksbykov

Please look into this one.

@fruch
Contributor

fruch commented Aug 15, 2024

IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster. Do we have limits for the decommission operation defined anywhere?

I just wonder how it passed for so long in the past - it's not that new a nemesis.

The test case Valeri is running isn't used that often, and the refactoring @aleksbykov did on this nemesis
wasn't done that long ago.
I.e. this nemesis is relatively new, so it's no surprise that 1h isn't enough for all cases.

Recently, by mistake, I used a 2h timeout when introducing parallel node operations, and indeed it was too low. I switched to MAX_TIME_WAIT_FOR_DECOMMISSION, which is set to 6h.

@temichus temichus added Bug Something isn't working right tier1 labels Sep 15, 2024
aleksbykov added a commit to aleksbykov/scylla-cluster-tests that referenced this issue Sep 25, 2024
A decommission operation can be terminated for various reasons while the
decommission process keeps running on the node, and the node may still be
decommissioned successfully. But because the nemesis is terminated by an
exception, the cluster health validator can abort the whole test run, since
the node status is Decommissioning. This started happening with the
DecommissionStreamingErr nemesis: when the expected log message is not found,
the nemesis is aborted by timeout while the decommission continues to run.

To catch such cases, FailedDecommissionOperationMonitoring is introduced.
It is a context manager which can be used to safely run a decommission
operation, check the node status, and wait for the decommission to finish
if the command was aborted or terminated.

Fix: scylladb#8144
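
A rough sketch of how such a context manager could be shaped, inferred only from the commit message above; the helper names, the status string, and the waiting logic are assumptions, not the actual PR code:

```python
# Hypothetical sketch of FailedDecommissionOperationMonitoring, inferred from
# the commit message above; the real implementation in the PR may differ.
import time

class FailedDecommissionOperationMonitoring:
    def __init__(self, target_node, timeout: int = 6 * 60 * 60):
        self.target_node = target_node
        self.timeout = timeout

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None:
            return False  # decommission finished normally, nothing to do
        # The nemesis failed (e.g. by timeout), but the decommission may still
        # be running on the node: wait for it to finish so the cluster health
        # validator does not see a lingering "Decommissioning" node status.
        deadline = time.time() + self.timeout
        while time.time() < deadline:
            if self.target_node.get_node_status() != "Decommissioning":  # hypothetical helper
                break
            time.sleep(60)
        return False  # do not swallow the original exception
```

Usage would then look roughly like `with FailedDecommissionOperationMonitoring(target_node): run_decommission(target_node)`, so an aborted decommission is waited out instead of leaving the node mid-operation.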