Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(FullScan): choose non-running_nemesis node #9370

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

temichus
Copy link
Contributor

@temichus temichus commented Nov 26, 2024

this commit has the following changes

1 introduce common targed_node_lock mechanism
that can be used in nemesis and Scan operations

2 FullScan operation now run only on free of nemeses node

3 change all node.running_nemesis settings to use common methods
set/unset_running_nemesis from common targed_node_lock file (except unit tests)

fixes: #9284

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@temichus
Copy link
Contributor Author

@fruch, @soyacz, @aleksbykov Coul you please take a look at this idea before I start testing it

@temichus temichus force-pushed the node_operations_lock branch 5 times, most recently from 19d83c6 to 169726c Compare November 26, 2024 13:21
@temichus temichus added backport/2023.1 Need to backport to 2023.1 backport/2024.1 Need backport to 2024.1 backport/2024.2 Need backport to 2024.2 backport/6.1 Need backport to 6.1 backport/6.2 labels Nov 26, 2024
Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, still needs testing as setting target node is messy (e.g. we set target node in disrupt_method_wrapper and unset it in run method.)

sdcm/nemesis.py Outdated
@@ -453,10 +431,10 @@ def set_target_node(self, dc_idx: Optional[int] = None, rack: Optional[int] = No
self.target_node = random.choice(nodes)

if current_disruption:
self.target_node.running_nemesis = current_disruption
set_running_nemesis(self.target_node, current_disruption, self.disruptive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_target_nodes anyway works the old way - never selecting any node that runs any nemesis, regardless of descriptiveness. So selecting node during nemesis startup will work the old way.
I'm not saying it's wrong, just noting it.

@@ -28,7 +29,7 @@
from sdcm.utils.decorators import retrying, Retry

if TYPE_CHECKING:
from sdcm.cluster import BaseNode
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just remove the whole if

sdcm/nemesis.py Outdated
@@ -381,14 +362,11 @@ def publish_event(self, disrupt, status=True, data=None):
DisruptionEvent(nemesis_name=disrupt, severity=severity, **data).publish()

def set_current_running_nemesis(self, node):
with NEMESIS_TARGET_SELECTION_LOCK:
node.running_nemesis = self.current_disruption
set_running_nemesis(node, self.current_disruption, self.disruptive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does self.disruptive would work ?

all those are set on only on nemesis classes, not the Nemesis class.

aleksbykov
aleksbykov previously approved these changes Nov 26, 2024
Copy link
Contributor

@aleksbykov aleksbykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short remarks.
Also, because running_nemesis is list now, need to be carefully with passing to (un)set_running_nemesis function name of nemesis.



@contextmanager
def run_nemesis(node_list: list[BaseNode], nemesis_label: str, is_disruptive=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think better have is_disruptive=False as in set_running_nemesis

Comment on lines 31 to 32
if TYPE_CHECKING:
from sdcm.cluster import BaseNode
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be it is not needed at all

@temichus temichus force-pushed the node_operations_lock branch 3 times, most recently from 33fd377 to 6c1792d Compare November 27, 2024 12:20
@temichus temichus changed the title fix(FullScan): choose only non-disruptive node fix(FullScan): choose non-running_nemesis node Nov 27, 2024
@temichus temichus force-pushed the node_operations_lock branch from 6c1792d to 793489d Compare November 27, 2024 12:26
@temichus
Copy link
Contributor Author

temichus commented Nov 27, 2024

@soyacz @fruch @aleksbykov, Thank you for your comments. I've changed the logic to choose only a non running_nemesis node for run_nemesis, and it will help to do FullScan safe when several nemesis are running. But I don't think it will help for RollingRestartCluster because it 'disrupts' all nodes in the cluster. What do you think regarding updating disrupt_rolling_restart_cluster to restart only non running_nemesis nodes?

@temichus temichus force-pushed the node_operations_lock branch 3 times, most recently from 27ead1f to 6c23dee Compare December 1, 2024 11:25
@temichus temichus force-pushed the node_operations_lock branch 4 times, most recently from 3eb3f14 to 1184ff1 Compare December 10, 2024 16:36
@temichus
Copy link
Contributor Author

temichus commented Dec 10, 2024

@fruch @soyacz @aleksbykov

according to @fruch, I've back to solution with locking nodes 1 by one during RollingRestartCluster and increasing timouts

The last test run sows only 2 error messages but all nemesis pass

2024-12-10 13:54:40.470: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=5ff46034-9a64-4978-a370-a35effc5c1f4 during_nemesis=RollingRestartCluster duration=40m0s node=longevity-10gb-3h-node-ope-db-node-d1a01caa-3 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed: OperationTimedOut("errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.4.0.234:9042"),session execution timeout 2400 is exceeded

can be #8705

and

2024-12-10 11:22:38.945: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=eadc36fd-dd71-4b8c-adc5-46e1f6b664ff duration=1m56s node=longevity-10gb-3h-node-ope-db-node-d1a01caa-2 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed: Fullscan failed - 'mapreduce_service_requests_dispatched_to_other_nodes' was not triggered

is scylladb/scylladb#21578 - will change it to warning in the next PR

@temichus temichus marked this pull request as ready for review December 10, 2024 17:12
@fruch
Copy link
Contributor

fruch commented Dec 10, 2024

@temichus
https://github.com/scylladb/scylla-enterprise/issues/4740 is something happens in upgrade, I don't understand how it's related here...

@temichus
Copy link
Contributor Author

@temichus scylladb/scylla-enterprise#4740 is something happens in upgrade, I don't understand how it's related here...

wrolg link scylladb/scylladb#21578

sdcm/nemesis.py Outdated
@@ -984,7 +963,12 @@ def disrupt_soft_reboot_node(self):

@decorate_with_context(ignore_ycsb_connection_refused)
def disrupt_rolling_restart_cluster(self, random_order=False):
self.cluster.restart_scylla(random_order=random_order)
try:
for node in self.cluster.nodes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not the only place in SCT we have a rolling restart

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • _enable_disable_table_encryption
  • disrupt_mgmt_restore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should consider moving this logic into restart_scylla

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node.restart_scylla and cluster.restart_scylla is used in many places(test_functional.py, mng_cli_test.py, audit.py, artifacts_test.py, stop_compaction_test.py). But the lock mechanism is needed only when we have nemeses(in longevity tests). So let's keep it for nemeses only(at least for now)

will update _enable_disable_table_encryption and disrupt_mgmt_restore

@roydahan
Copy link
Contributor

First, does it solve the issue?

@temichus temichus force-pushed the node_operations_lock branch from 1184ff1 to e3e7af0 Compare December 11, 2024 12:34
@temichus
Copy link
Contributor Author

First, does it solve the issue?

yes, there is no conflict anymore between FullScan and RollingRestart. Each chooses only the "free" node and locks it while doing an operation on ti

this commit has the following changes

1 introduce common targed_node_lock mechanism
that can be used in nemesis and Scan operations

2 FullScan operation now run only on free of nemeses node

3 change all node.running_nemesis settings to use common methods
set/unset_running_nemesis from common targed_node_lock file (except unit tests)

4 change disrupt_rolling_restart_cluster nemesis to lock all nodes in
the cluster befo performing restart

fixes: scylladb#9284
@temichus temichus force-pushed the node_operations_lock branch from e3e7af0 to 267cc0a Compare December 11, 2024 12:39
@temichus
Copy link
Contributor Author

after a discussion with @fruch lock mechanism is not the best because Nemeses just will not run and wait for the lock. we heed a solution to let full-scan know (somehow) that the node was rebooted during full-scan

cc @aleksbykov @roydahan

@temichus temichus marked this pull request as draft December 11, 2024 16:27
@roydahan
Copy link
Contributor

I'm suggesting again, to have a set of errors we can ignore approved by relevant developer and apply it in all relevant nemesis.

@fruch
Copy link
Contributor

fruch commented Dec 11, 2024

I'm suggesting again, to have a set of errors we can ignore approved by relevant developer and apply it in all relevant nemesis.

  1. This is about a failure in fullscan, not nemesis

  2. We already have code there that ignore fullscan results nemesis targeted the node that was the coordinator.

  3. The problem we are facing is, that in rolling restart you target all nodes, without letting other parts of the code know about it, so the full scan thread doesn't know it should ignore the failure.

Lock suggested in this PR might cause high changes the nemesis doing rolling restart to be skipped in parallel nemesis cases.

I suggested that we need some way to identify that nodes restarted during the fullscan operation, but this not trivial with current event system.

One other option, since fullscan is just a stress tool, we implemented in python.
Maybe we should take the same approach, and apply retries with exponential backkoff

@roydahan
Copy link
Contributor

One other option, since fullscan is just a stress tool, we implemented in python.
Maybe we should take the same approach, and apply retries with exponential backkoff

This is an option but it may hide problems.

@roydahan
Copy link
Contributor

We already have code there that ignore fullscan results nemesis targeted the node that was the coordinator.

So what's the problem to ignore also the "map reduce" specific error we are getting now?

@fruch
Copy link
Contributor

fruch commented Dec 12, 2024

We already have code there that ignore fullscan results nemesis targeted the node that was the coordinator.

So what's the problem to ignore also the "map reduce" specific error we are getting now?

The problem is that ignore logic kicked in only if fullscan was using the same node as the current nemesis target node. which isn't the case in rolling restarts, cause we target all nodes.

If we're gonna just ignore the validation of the map reduce metric, we might as well remove it completely.
But it's just one of the symptoms we are seeing, we have there also failures of the fullscan queries themselves.

@fruch
Copy link
Contributor

fruch commented Dec 12, 2024

One other option, since fullscan is just a stress tool, we implemented in python.
Maybe we should take the same approach, and apply retries with exponential backkoff

This is an option but it may hide problems.

How it's different from the problems it might hide in c-s when we are retrying ?

Our from current implemention that make anything into a warning if it's during a nemesis targeting the same node ?

We use retry when there's an inerit cause of failure which is expected, and we can specifically cross match.
Consider if we had implement fullscan as stress tools in multiple languages, we would have implement retries, wouldn't we ?

@roydahan
Copy link
Contributor

The problem is that ignore logic kicked in only if fullscan was using the same node as the current nemesis target node. which isn't the case in rolling restarts, cause we target all nodes.

If we're gonna just ignore the validation of the map reduce metric, we might as well remove it completely. But it's just one of the symptoms we are seeing, we have there also failures of the fullscan queries themselves.

We don't need to ignore the validation, we just need to ignore that specific error we're getting now.
If there are more errors, we need to report issue about them and ignore them only if this is what's decided.
In that case, what's the problem to run full scans inside a context manager that ignores this error?

@roydahan
Copy link
Contributor

How it's different from the problems it might hide in c-s when we are retrying ?

It's a background thread that one hardly notice if it runs or not.
E.g. when we had aggregation issue at field and we didn't notice.

Our from current implemention that make anything into a warning if it's during a nemesis targeting the same node ?

We use retry when there's an inerit cause of failure which is expected, and we can specifically cross match. Consider if we had implement fullscan as stress tools in multiple languages, we would have implement retries, wouldn't we ?

You may use retry which will hide 99% of the problems and you wouldn't know, can't tell from monitor or anywhere else.

@yarongilor
Copy link
Contributor

yarongilor commented Dec 18, 2024

if it is expected for scan to fail following server down then it shouldn't raise an error event. Instead, it could become "disruptive nemesis aware" somehow or alternatively "log aware" so it will check for node down during scan.
IIRC c-s also had such similar failures (timeout/quorum) during rolling restart nemeses.

perhaps this case shows it:

CassandraStressEvent
CRITICAL
disrupt_rolling_config_change_internode_compression
2024-12-16 04:22:15.715
end
Node longevity-10gb-3h-master-loader-node-a4d508f0-eastus-2 [None | 10.0.0.12]
2024-12-16 04:22:15.715: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=db57ab41-6899-4414-8f72-f05782a6158b during_nemesis=RollingConfigChangeInternodeCompression duration=3h0m13s: node=Node longevity-10gb-3h-master-loader-node-a4d508f0-eastus-2 [None | 10.0.0.12]
stress_cmd=cassandra-stress write cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-a4d508f0-eastus-3.internal.cloudapp.net/10.0.0.7:9042] Connection has been closed
FullScanAggregateEvent
ERROR
disrupt_rolling_config_change_internode_compression
2024-12-16 04:21:39.040
end
longevity-10gb-3h-master-db-node-a4d508f0-eastus-4
2024-12-16 04:21:39.040: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=38537323-9966-4c96-8e0b-9a02dd4ab6d6 during_nemesis=RollingConfigChangeInternodeCompression duration=56s node=longevity-10gb-3h-master-db-node-a4d508f0-eastus-4 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed, ReadTimeout error: ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation failed for keyspace1.standard1 - received 0 responses and 1 failures from 1 CL=ONE." info={\'consistency\': \'ONE\', \'required_responses\': 1, \'received_responses\': 0}')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/6.1 Need backport to 6.1 backport/6.2 backport/2023.1 Need to backport to 2023.1 backport/2024.1 Need backport to 2024.1 backport/2024.2 Need backport to 2024.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix Fullscanoperation thread to choose only alive node
6 participants