-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(FullScan): choose non-running_nemesis node #9370
base: master
Are you sure you want to change the base?
Conversation
@fruch, @soyacz, @aleksbykov Coul you please take a look at this idea before I start testing it |
19d83c6
to
169726c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, still needs testing as setting target node is messy (e.g. we set target node in disrupt_method_wrapper
and unset it in run
method.)
sdcm/nemesis.py
Outdated
@@ -453,10 +431,10 @@ def set_target_node(self, dc_idx: Optional[int] = None, rack: Optional[int] = No | |||
self.target_node = random.choice(nodes) | |||
|
|||
if current_disruption: | |||
self.target_node.running_nemesis = current_disruption | |||
set_running_nemesis(self.target_node, current_disruption, self.disruptive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_get_target_nodes
anyway works the old way - never selecting any node that runs any nemesis, regardless of descriptiveness. So selecting node during nemesis startup will work the old way.
I'm not saying it's wrong, just noting it.
sdcm/scan_operation_thread.py
Outdated
@@ -28,7 +29,7 @@ | |||
from sdcm.utils.decorators import retrying, Retry | |||
|
|||
if TYPE_CHECKING: | |||
from sdcm.cluster import BaseNode | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just remove the whole if
sdcm/nemesis.py
Outdated
@@ -381,14 +362,11 @@ def publish_event(self, disrupt, status=True, data=None): | |||
DisruptionEvent(nemesis_name=disrupt, severity=severity, **data).publish() | |||
|
|||
def set_current_running_nemesis(self, node): | |||
with NEMESIS_TARGET_SELECTION_LOCK: | |||
node.running_nemesis = self.current_disruption | |||
set_running_nemesis(node, self.current_disruption, self.disruptive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does self.disruptive would work ?
all those are set on only on nemesis classes, not the Nemesis class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short remarks.
Also, because running_nemesis is list now, need to be carefully with passing to (un)set_running_nemesis function name of nemesis.
sdcm/target_node_lock.py
Outdated
|
||
|
||
@contextmanager | ||
def run_nemesis(node_list: list[BaseNode], nemesis_label: str, is_disruptive=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think better have is_disruptive=False as in set_running_nemesis
sdcm/scan_operation_thread.py
Outdated
if TYPE_CHECKING: | ||
from sdcm.cluster import BaseNode | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be it is not needed at all
33fd377
to
6c1792d
Compare
6c1792d
to
793489d
Compare
@soyacz @fruch @aleksbykov, Thank you for your comments. I've changed the logic to choose only a non running_nemesis node for run_nemesis, and it will help to do FullScan safe when several nemesis are running. But I don't think it will help for RollingRestartCluster because it 'disrupts' all nodes in the cluster. What do you think regarding updating disrupt_rolling_restart_cluster to restart only non running_nemesis nodes? |
27ead1f
to
6c23dee
Compare
3eb3f14
to
1184ff1
Compare
according to @fruch, I've back to solution with locking nodes 1 by one during RollingRestartCluster and increasing timouts The last test run sows only 2 error messages but all nemesis pass
can be #8705 and
is scylladb/scylladb#21578 - will change it to warning in the next PR |
@temichus |
wrolg link scylladb/scylladb#21578 |
sdcm/nemesis.py
Outdated
@@ -984,7 +963,12 @@ def disrupt_soft_reboot_node(self): | |||
|
|||
@decorate_with_context(ignore_ycsb_connection_refused) | |||
def disrupt_rolling_restart_cluster(self, random_order=False): | |||
self.cluster.restart_scylla(random_order=random_order) | |||
try: | |||
for node in self.cluster.nodes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not the only place in SCT we have a rolling restart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- _enable_disable_table_encryption
- disrupt_mgmt_restore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should consider moving this logic into restart_scylla
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
node.restart_scylla and cluster.restart_scylla is used in many places(test_functional.py, mng_cli_test.py, audit.py, artifacts_test.py, stop_compaction_test.py). But the lock mechanism is needed only when we have nemeses(in longevity tests). So let's keep it for nemeses only(at least for now)
will update _enable_disable_table_encryption and disrupt_mgmt_restore
First, does it solve the issue? |
1184ff1
to
e3e7af0
Compare
yes, there is no conflict anymore between FullScan and RollingRestart. Each chooses only the "free" node and locks it while doing an operation on ti |
this commit has the following changes 1 introduce common targed_node_lock mechanism that can be used in nemesis and Scan operations 2 FullScan operation now run only on free of nemeses node 3 change all node.running_nemesis settings to use common methods set/unset_running_nemesis from common targed_node_lock file (except unit tests) 4 change disrupt_rolling_restart_cluster nemesis to lock all nodes in the cluster befo performing restart fixes: scylladb#9284
e3e7af0
to
267cc0a
Compare
after a discussion with @fruch lock mechanism is not the best because Nemeses just will not run and wait for the lock. we heed a solution to let full-scan know (somehow) that the node was rebooted during full-scan |
I'm suggesting again, to have a set of errors we can ignore approved by relevant developer and apply it in all relevant nemesis. |
Lock suggested in this PR might cause high changes the nemesis doing rolling restart to be skipped in parallel nemesis cases. I suggested that we need some way to identify that nodes restarted during the fullscan operation, but this not trivial with current event system. One other option, since fullscan is just a stress tool, we implemented in python. |
This is an option but it may hide problems. |
So what's the problem to ignore also the "map reduce" specific error we are getting now? |
The problem is that ignore logic kicked in only if fullscan was using the same node as the current nemesis target node. which isn't the case in rolling restarts, cause we target all nodes. If we're gonna just ignore the validation of the map reduce metric, we might as well remove it completely. |
How it's different from the problems it might hide in c-s when we are retrying ? Our from current implemention that make anything into a warning if it's during a nemesis targeting the same node ? We use retry when there's an inerit cause of failure which is expected, and we can specifically cross match. |
We don't need to ignore the validation, we just need to ignore that specific error we're getting now. |
It's a background thread that one hardly notice if it runs or not.
You may use retry which will hide 99% of the problems and you wouldn't know, can't tell from monitor or anywhere else. |
if it is expected for scan to fail following server down then it shouldn't raise an error event. Instead, it could become "disruptive nemesis aware" somehow or alternatively "log aware" so it will check for node down during scan. perhaps this case shows it:
|
this commit has the following changes
1 introduce common targed_node_lock mechanism
that can be used in nemesis and Scan operations
2 FullScan operation now run only on free of nemeses node
3 change all node.running_nemesis settings to use common methods
set/unset_running_nemesis from common targed_node_lock file (except unit tests)
fixes: #9284
Testing
PR pre-checks (self review)
backport
labelsReminders
sdcm/sct_config.py
)unit-test/
folder)