'disrupt_resetlocalschema' nemesis fails for longevity-mv-si-4days test #8534

Closed
1 of 2 tasks
dimakr opened this issue Sep 2, 2024 · 10 comments

Comments

dimakr commented Sep 2, 2024

Packages

Scylla version: 2024.1.9-20240829.d583605198a7 with build-id c0a1a483aad4949fe2ed11479d5a99e92672bb2b
Kernel Version: 5.15.0-1068-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The longevity-mv-si-4days longevity scenario failed with the error:

2024-08-30 14:07:58.937: (DisruptionEvent Severity.ERROR) period_type=end event_id=2f19ab99-2c50-4f82-ab20-2cac3936876f duration=43s: nemesis_name=Resetlocalschema target_node=Node longevity-mv-si-4d-2024-1-db-node-7f557fdd-7 [44.220.153.199 | 10.12.9.193] errors=Schema version has not been recalculated
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5122, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 872, in disrupt_resetlocalschema
    assert wait_for(
AssertionError: Schema version has not been recalculated
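The traceback suggests the nemesis runs `nodetool resetlocalschema` and then asserts on a `wait_for`-style poll for a "Schema version changed" line in the followed log. A minimal sketch of that shape — the helper names and polling logic here are assumptions for illustration, not SCT's actual implementation:

```python
import re
import time


def wait_for(func, timeout=30, step=1.0):
    """Poll `func` until it returns a truthy value or `timeout` expires.

    Hypothetical stand-in for SCT's wait_for helper; the real
    signature may differ.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = func()
        if result:
            return result
        time.sleep(step)
    return None


def schema_version_recalculated(log_lines):
    """True once the followed log shows the schema version changed."""
    pattern = re.compile(r"schema_tables - Schema version changed to")
    return any(pattern.search(line) for line in log_lines)


# Simulated excerpt of the node log after `nodetool resetlocalschema`:
log = [
    "api - reset_local_schema",
    "migration_manager - Reloading schema",
    "schema_tables - Schema version changed to d75fd83a-...",
]
assert wait_for(lambda: schema_version_recalculated(log), timeout=5), \
    "Schema version has not been recalculated"
```

If the pattern never reaches the followed log within the timeout, `wait_for` returns `None` and the assertion raises exactly the error seen above.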

Impact

disrupt_resetlocalschema nemesis should pass

How frequently does it reproduce?

No occurrences of the issue had been noticed since it was reported in #6229 in 2023.

Installation details

Cluster size: 5 nodes (i4i.8xlarge)

Scylla Nodes used in this run:

  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-8 (54.175.41.236 | 10.12.11.212) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-7 (44.220.153.199 | 10.12.9.193) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-6 (54.242.13.153 | 10.12.8.18) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-5 (18.209.49.207 | 10.12.8.227) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-4 (34.207.110.157 | 10.12.9.70) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-3 (3.80.226.82 | 10.12.8.192) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-2 (184.72.164.84 | 10.12.9.127) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-1 (34.203.28.1 | 10.12.11.69) (shards: 30)

OS / Image: ami-0571d896a052e46a3 (aws: undefined_region)

Test: longevity-mv-si-4days-test
Test id: 7f557fdd-8e48-40f3-bb8f-7c1524f8abd0
Test name: enterprise-2024.1/longevity/longevity-mv-si-4days-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 7f557fdd-8e48-40f3-bb8f-7c1524f8abd0
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7f557fdd-8e48-40f3-bb8f-7c1524f8abd0

Logs:

Jenkins job URL
Argus

@dimakr dimakr removed their assignment Sep 2, 2024
soyacz commented Sep 2, 2024

Why do you think it's an SCT issue instead of a ScyllaDB one?

dimakr commented Sep 2, 2024

Why do you think it's an SCT issue instead of a ScyllaDB one?

I was led by the example of how the same issue/symptoms were reported in the past, for a K8S-related config, in #6229 (and comments in that issue suggest that it is an SCT issue).

fruch commented Sep 29, 2024

Why do you think it's an SCT issue instead of a ScyllaDB one?

I was led by the example of how the same issue/symptoms were reported in the past, for a K8S-related config, in #6229 (and comments in that issue suggest that it is an SCT issue).

I'm not sure that's enough to figure out why it is failing, i.e. something else is happening that causes the failure, and we don't know what it is...

fruch commented Sep 30, 2024

@dimakr

It seems this is happening again and again.

Please cross-check the logs of those runs in 2024.1; it sounds like we are missing the expected prints (too short a timeout, or starting to look too late), or the operation doesn't happen on the Scylla side.

This flow hasn't been working for quite some time in master, since this command isn't supported in the new scylla nodetool, but this seems to be a regression in 2024.1 that we should chase down.

dimakr commented Oct 8, 2024

The issue occurred again in 2024.1.11 - longevity-mv-si-4days-test.

As per the system.log of the target node-7, the pattern that disrupt_resetlocalschema waits for appeared as expected, within the 30s timeout:

  • per sct.log, the command was issued at 01:11:57 and finished at 01:12:00
< t:2024-09-30 01:11:57,409 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.10.215>: Running command "/usr/bin/nodetool  resetlocalschema "...
< t:2024-09-30 01:12:00,998 f:base.py         l:143  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.10.215>: Command "/usr/bin/nodetool  resetlocalschema " finished with status 0
  • node-7's system.log shows that the reset_local_schema API request appeared at 01:12:00.678512, and the schema_tables - Schema version changed to message appeared at 01:12:00.829882
Sep 30 01:12:00.678512 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7 scylla[7245]:  [shard  0:stre] api - reset_local_schema
Sep 30 01:12:00.734965 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7 scylla[7245]:  [shard  4:comp] compaction - [Compact mview.users fe7e3e20-7ec8-11ef-afe7-ff74ce421117] Compacted 2 sstables to [/var/lib/scylla/data/mview/users-ea783c807e7011ef9b931323aedade42/me-3gjz_03c0_00n5c2oauvemxubhuf-big-Data.db:level=0]. 128MB to 96MB (~75% of original) in 731ms = 176MB/s. ~12288 total partitions merged to 9105.
Sep 30 01:12:00.750382 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7 scylla[7245]:  [shard  0:stre] migration_manager - Reloading schema
...
Sep 30 01:12:00.829882 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7 scylla[7245]:  [shard  0:stre] schema_tables - Schema version changed to d75fd83a-1d54-3027-a068-26a2a7867261

The problem is that in node-7's messages.log, the records about the reset_local_schema API request and the corresponding schema_tables - Schema version changed to message appear only at 02:16:16:

2024-09-30T02:16:16.348+00:00 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7     !INFO | scylla[7245]:  [shard  0:stre] api - reset_local_schema
...
2024-09-30T02:16:16.447+00:00 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7     !INFO | scylla[7245]:  [shard  0:stre] schema_tables - Schema version changed to d75fd83a-1d54-3027-a068-26a2a7867261

And as per the following line in SCT:

syslogng_log_path = os.path.join(self.test_config.logdir(), 'hosts', self.short_hostname, 'messages.log')

the test was following messages.log.
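Plugging the quoted timestamps into a quick sanity check shows why following messages.log makes the 30s wait fail even though Scylla performed the reset on time (all values are taken from the log excerpts above; the 30s timeout is the one mentioned earlier):

```python
from datetime import datetime, timedelta

# Timestamps quoted in the excerpts above (all UTC, 2024-09-30).
cmd_issued = datetime(2024, 9, 30, 1, 11, 57, 409000)            # sct.log
seen_in_system_log = datetime(2024, 9, 30, 1, 12, 0, 829882)     # system.log
seen_in_messages_log = datetime(2024, 9, 30, 2, 16, 16, 447000)  # messages.log

timeout = timedelta(seconds=30)

# Against system.log, the pattern lands well inside the timeout...
assert seen_in_system_log - cmd_issued <= timeout
# ...but against messages.log, which the test actually follows,
# it arrives more than an hour too late.
assert seen_in_messages_log - cmd_issued > timeout
print(seen_in_messages_log - cmd_issued)  # → 1:04:19.038000
```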


In general, the logs in node-7's messages.log are a mess (probably the same issue as #6682):

  • there are a lot of 1-2 minute gaps in logging, starting from 23:48:14 (only from 02:16:14 onward do the logs look consistent again)
  • there are a lot of syslog-ng connection drops in node-7's messages.log
❯ grep -E 'syslog-ng.*connection broken' db-cluster-8a8bfd56/longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7/messages.log | wc -l
70

As per messages.log, the 1st syslog-ng connection drop message is at 01:17:14. But that is probably just a wrong timestamp, and the drops likely started somewhere around 23:48:14.
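Such gaps can be spotted mechanically by scanning messages.log for consecutive timestamps that jump by more than a threshold. A hypothetical helper (not part of SCT) for syslog-ng-style ISO timestamps like the ones quoted above:

```python
from datetime import datetime, timedelta


def find_gaps(lines, threshold=timedelta(minutes=1)):
    """Return (prev_ts, ts) pairs where consecutive leading ISO
    timestamps (e.g. '2024-09-30T02:16:16.348+00:00 ...') jump by
    more than `threshold`. Hypothetical helper for illustration."""
    prev = None
    gaps = []
    for line in lines:
        try:
            ts = datetime.fromisoformat(line.split()[0])
        except (ValueError, IndexError):
            continue  # skip lines without a leading timestamp
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# Synthetic sample mimicking the messages.log format:
sample = [
    "2024-09-29T23:48:14.000+00:00 node-7 scylla: last line before a gap",
    "2024-09-29T23:50:30.000+00:00 node-7 scylla: resumed after ~2 min",
    "2024-09-29T23:50:31.000+00:00 node-7 scylla: normal cadence",
]
for start, end in find_gaps(sample):
    print(f"gap of {end - start} after {start.time()}")  # → gap of 0:02:16 after 23:48:14
```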

soyacz commented Oct 9, 2024

maybe we should backport #8743 also to 2024.1?

fruch commented Oct 9, 2024

maybe we should backport #8743 also to 2024.1?

let's try

and test this specific nemesis for a couple of hours

soyacz commented Oct 9, 2024

maybe we should backport #8743 also to 2024.1?

let's try

and test this specific nemesis for a couple of hours

Backported. But I'm not sure which test to run - it looks like a very random issue and I don't see the point of retrying just for this purpose. Can we wait for another round?

@roydahan

Is this still relevant?
Does scylla even support this command?
I think lately we just removed it from dtest.

fruch commented Jan 2, 2025

Not clear enough; a suspected fix was backported.

Closing for now.

@fruch fruch closed this as completed Jan 2, 2025