'disrupt_resetlocalschema' nemesis fails for longevity-mv-si-4days test #8534

Closed
1 of 2 tasks
dimakr opened this issue Sep 2, 2024 · 10 comments

Comments

dimakr commented Sep 2, 2024

Packages

Scylla version: 2024.1.9-20240829.d583605198a7 with build-id c0a1a483aad4949fe2ed11479d5a99e92672bb2b
Kernel Version: 5.15.0-1068-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The longevity-mv-si-4days longevity scenario failed with the error:

2024-08-30 14:07:58.937: (DisruptionEvent Severity.ERROR) period_type=end event_id=2f19ab99-2c50-4f82-ab20-2cac3936876f duration=43s: nemesis_name=Resetlocalschema target_node=Node longevity-mv-si-4d-2024-1-db-node-7f557fdd-7 [44.220.153.199 | 10.12.9.193] errors=Schema version has not been recalculated
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5122, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 872, in disrupt_resetlocalschema
    assert wait_for(
AssertionError: Schema version has not been recalculated
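The traceback suggests the nemesis runs `nodetool resetlocalschema` and then asserts on a `wait_for`-style poll for a "Schema version changed" line in the followed log. A minimal sketch of that shape — the helper names and polling logic here are assumptions for illustration, not SCT's actual implementation:

```python
import re
import time


def wait_for(func, timeout=30, step=1.0):
    """Poll `func` until it returns a truthy value or `timeout` expires.

    Hypothetical stand-in for SCT's wait_for helper; the real
    signature may differ.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = func()
        if result:
            return result
        time.sleep(step)
    return None


def schema_version_recalculated(log_lines):
    """True once the followed log shows the schema version changed."""
    pattern = re.compile(r"schema_tables - Schema version changed to")
    return any(pattern.search(line) for line in log_lines)


# Simulated excerpt of the node log after `nodetool resetlocalschema`:
log = [
    "api - reset_local_schema",
    "migration_manager - Reloading schema",
    "schema_tables - Schema version changed to d75fd83a-...",
]
assert wait_for(lambda: schema_version_recalculated(log), timeout=5), \
    "Schema version has not been recalculated"
```

If the pattern never reaches the followed log within the timeout, `wait_for` returns `None` and the assertion raises exactly the error seen above.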

Impact

disrupt_resetlocalschema nemesis should pass

How frequently does it reproduce?

No occurrences of the issue had been noticed since it was reported in #6229 in 2023.

Installation details

Cluster size: 5 nodes (i4i.8xlarge)

Scylla Nodes used in this run:

  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-8 (54.175.41.236 | 10.12.11.212) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-7 (44.220.153.199 | 10.12.9.193) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-6 (54.242.13.153 | 10.12.8.18) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-5 (18.209.49.207 | 10.12.8.227) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-4 (34.207.110.157 | 10.12.9.70) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-3 (3.80.226.82 | 10.12.8.192) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-2 (184.72.164.84 | 10.12.9.127) (shards: 30)
  • longevity-mv-si-4d-2024-1-db-node-7f557fdd-1 (34.203.28.1 | 10.12.11.69) (shards: 30)

OS / Image: ami-0571d896a052e46a3 (aws: undefined_region)

Test: longevity-mv-si-4days-test
Test id: 7f557fdd-8e48-40f3-bb8f-7c1524f8abd0
Test name: enterprise-2024.1/longevity/longevity-mv-si-4days-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 7f557fdd-8e48-40f3-bb8f-7c1524f8abd0
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7f557fdd-8e48-40f3-bb8f-7c1524f8abd0

Logs:

Jenkins job URL
Argus

@dimakr dimakr removed their assignment Sep 2, 2024
soyacz commented Sep 2, 2024

Why do you think it's an SCT issue instead of a ScyllaDB one?

dimakr commented Sep 2, 2024

Why do you think it's an SCT issue instead of a ScyllaDB one?

I was led by the example of how the same issue/symptoms were reported in the past, for a K8S-related config, in #6229 (and comments in that issue suggest that it is an SCT issue).

fruch commented Sep 29, 2024

Why do you think it's an SCT issue instead of a ScyllaDB one?

I was led by the example of how the same issue/symptoms were reported in the past, for a K8S-related config, in #6229 (and comments in that issue suggest that it is an SCT issue).

I'm not sure that's enough to figure out why it is failing, i.e. something else is happening that causes the failure, and we don't know what it is...

fruch commented Sep 30, 2024

@dimakr

It seems this is happening again and again.

Please cross-check the logs of those runs in 2024.1; it sounds like we are missing the expected prints (too short a timeout, or starting to look too late), or the operation doesn't happen on the Scylla side.

This flow hasn't been working for quite some time in master, since this command isn't supported in the new scylla nodetool, but this seems to be a regression in 2024.1 that we should chase down.

dimakr commented Oct 8, 2024

The issue occurred again in 2024.1.11 - longevity-mv-si-4days-test.

As per the system.log of the target node-7, the pattern that disrupt_resetlocalschema waits for appeared as expected, within the 30s timeout:

  • per sct.log, the command was issued at 01:11:57 and finished at 01:12:00
< t:2024-09-30 01:11:57,409 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.10.215>: Running command "/usr/bin/nodetool  resetlocalschema "...
< t:2024-09-30 01:12:00,998 f:base.py         l:143  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.10.215>: Command "/usr/bin/nodetool  resetlocalschema " finished with status 0
  • node-7's system.log shows that the reset_local_schema API request appeared at 01:12:00.678512, and the schema_tables - Schema version changed to message appeared at 01:12:00.829882
Sep 30 01:12:00.678512 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7 scylla[7245]:  [shard  0:stre] api - reset_local_schema
Sep 30 01:12:00.734965 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7 scylla[7245]:  [shard  4:comp] compaction - [Compact mview.users fe7e3e20-7ec8-11ef-afe7-ff74ce421117] Compacted 2 sstables to [/var/lib/scylla/data/mview/users-ea783c807e7011ef9b931323aedade42/me-3gjz_03c0_00n5c2oauvemxubhuf-big-Data.db:level=0]. 128MB to 96MB (~75% of original) in 731ms = 176MB/s. ~12288 total partitions merged to 9105.
Sep 30 01:12:00.750382 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7 scylla[7245]:  [shard  0:stre] migration_manager - Reloading schema
...
Sep 30 01:12:00.829882 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7 scylla[7245]:  [shard  0:stre] schema_tables - Schema version changed to d75fd83a-1d54-3027-a068-26a2a7867261

The problem is that in node-7's messages.log, the records about the reset_local_schema API request and the corresponding schema_tables - Schema version changed to message appear only at 02:16:16:

2024-09-30T02:16:16.348+00:00 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7     !INFO | scylla[7245]:  [shard  0:stre] api - reset_local_schema
...
2024-09-30T02:16:16.447+00:00 longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7     !INFO | scylla[7245]:  [shard  0:stre] schema_tables - Schema version changed to d75fd83a-1d54-3027-a068-26a2a7867261

And as per the following line in SCT:

syslogng_log_path = os.path.join(self.test_config.logdir(), 'hosts', self.short_hostname, 'messages.log')

the test was following messages.log.
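Plugging the quoted timestamps into a quick sanity check shows why following messages.log makes the 30s wait fail even though Scylla performed the reset on time (all values are taken from the log excerpts above; the 30s timeout is the one mentioned earlier):

```python
from datetime import datetime, timedelta

# Timestamps quoted in the excerpts above (all UTC, 2024-09-30).
cmd_issued = datetime(2024, 9, 30, 1, 11, 57, 409000)            # sct.log
seen_in_system_log = datetime(2024, 9, 30, 1, 12, 0, 829882)     # system.log
seen_in_messages_log = datetime(2024, 9, 30, 2, 16, 16, 447000)  # messages.log

timeout = timedelta(seconds=30)

# Against system.log, the pattern lands well inside the timeout...
assert seen_in_system_log - cmd_issued <= timeout
# ...but against messages.log, which the test actually follows,
# it arrives more than an hour too late.
assert seen_in_messages_log - cmd_issued > timeout
print(seen_in_messages_log - cmd_issued)  # → 1:04:19.038000
```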


In general, the logs in node-7's messages.log are a mess (probably the same issue as #6682):

  • there are a lot of 1-2 minute gaps in logging, starting from 23:48:14 (only from 02:16:14 onward do the logs look consistent again)
  • there are a lot of syslog-ng connection drops in node-7's messages.log
❯ grep -E 'syslog-ng.*connection broken' db-cluster-8a8bfd56/longevity-mv-si-4d-2024-1-db-node-8a8bfd56-7/messages.log | wc -l
70

As per messages.log, the 1st syslog-ng connection drop message is at 01:17:14. But that is probably just a wrong timestamp, and the drops likely started somewhere around 23:48:14.
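Such gaps can be spotted mechanically by scanning messages.log for consecutive timestamps that jump by more than a threshold. A hypothetical helper (not part of SCT) for syslog-ng-style ISO timestamps like the ones quoted above:

```python
from datetime import datetime, timedelta


def find_gaps(lines, threshold=timedelta(minutes=1)):
    """Return (prev_ts, ts) pairs where consecutive leading ISO
    timestamps (e.g. '2024-09-30T02:16:16.348+00:00 ...') jump by
    more than `threshold`. Hypothetical helper for illustration."""
    prev = None
    gaps = []
    for line in lines:
        try:
            ts = datetime.fromisoformat(line.split()[0])
        except (ValueError, IndexError):
            continue  # skip lines without a leading timestamp
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# Synthetic sample mimicking the messages.log format:
sample = [
    "2024-09-29T23:48:14.000+00:00 node-7 scylla: last line before a gap",
    "2024-09-29T23:50:30.000+00:00 node-7 scylla: resumed after ~2 min",
    "2024-09-29T23:50:31.000+00:00 node-7 scylla: normal cadence",
]
for start, end in find_gaps(sample):
    print(f"gap of {end - start} after {start.time()}")  # → gap of 0:02:16 after 23:48:14
```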

soyacz commented Oct 9, 2024

maybe we should backport #8743 also to 2024.1?

fruch commented Oct 9, 2024

maybe we should backport #8743 also to 2024.1?

let's try

and test this specific nemesis for a couple of hours

soyacz commented Oct 9, 2024

maybe we should backport #8743 also to 2024.1?

let's try

and test this specific nemesis for a couple of hours

Backported. But I'm not sure which test to run - it looks like a very random issue and I don't see the point of retrying just for this purpose. Can we wait for another round?

@roydahan

Is this still relevant?
Does scylla even support this command?
I think lately we just removed it from dtest.

fruch commented Jan 2, 2025

Not clear enough; a suspected fix was backported.

Closing for now.

@fruch fruch closed this as completed Jan 2, 2025