Scylla Manager controller will update tasks despite no changes in spec #1827

Open
Tracked by #1939
rzetelskik opened this issue Mar 13, 2024 · 10 comments · May be fixed by #2142
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@rzetelskik
Member

What happened?

Scylla Manager controller decides whether to update the tasks defined in ScyllaCluster's spec by checking deep equality between the task definition and the task obtained from the Manager's state.

update = !reflect.DeepEqual(backupTask, managerTask)

Since some fields are converted when translating them into requests to Scylla Manager, but not converted back when reading the Manager's state, the deep equality check can, for such fields, never succeed. This in turn means that tasks can be updated indefinitely in a loop, despite their specification not changing, which causes superfluous additional load on both Scylla Manager and the controller.

The same situation can also be caused by the Manager defaulting some fields or not returning their values in API call responses.

Example logs:

I0313 12:00:21.129683       1 manager/sync.go:134] "Executing action" action="add task &{ClusterID: Enabled:true ID: Name:weekly Properties:map[intensity:1 parallel:1 small_table_threshold:1073741824] Schedule:0xc00069d5e0 Tags:[] Type:repair}"
...
I0313 12:00:21.291902       1 manager/sync.go:93] "Started syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-vkzsg-nwz28/basic-55h26" startTime="2024-03-13 12:00:21.291882268 +0000 UTC m=+1712.897695871"
I0313 12:00:21.306972       1 manager/sync.go:134] "Executing action" action="update task &{ClusterID: Enabled:true ID:c0aa282b-63ef-4dc5-87c3-475e3dcec9e0 Name:weekly Properties:map[intensity:1 parallel:1 small_table_threshold:1073741824] Schedule:0xc0000b20e0 Tags:[] Type:repair}"
I0313 12:00:21.483593       1 manager/sync.go:95] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-vkzsg-nwz28/basic-55h26" duration="191.693011ms"
...
I0313 12:03:22.862395       1 manager/sync.go:93] "Started syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-vkzsg-nwz28/basic-55h26" startTime="2024-03-13 12:03:22.862358661 +0000 UTC m=+1894.468172253"
I0313 12:03:22.885635       1 manager/sync.go:134] "Executing action" action="update task &{ClusterID: Enabled:true ID:c0aa282b-63ef-4dc5-87c3-475e3dcec9e0 Name:weekly Properties:map[intensity:1 parallel:1 small_table_threshold:1073741824] Schedule:0xc0002448c0 Tags:[] Type:repair}"
...
I0313 12:04:41.037507       1 manager/sync.go:93] "Started syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-vkzsg-nwz28/basic-55h26" startTime="2024-03-13 12:04:41.037464223 +0000 UTC m=+1972.643277820"
I0313 12:04:41.058417       1 manager/sync.go:134] "Executing action" action="update task &{ClusterID: Enabled:true ID:c0aa282b-63ef-4dc5-87c3-475e3dcec9e0 Name:weekly Properties:map[intensity:1 parallel:1 small_table_threshold:1073741824] Schedule:0xc000244070 Tags:[] Type:repair}"
I0313 12:04:41.201111       1 manager/sync.go:95] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-vkzsg-nwz28/basic-55h26" duration="163.634093ms"
I0313 12:04:41.201153       1 manager/sync.go:93] "Started syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-vkzsg-nwz28/basic-55h26" startTime="2024-03-13 12:04:41.201142592 +0000 UTC m=+1972.806956179"
I0313 12:04:41.223481       1 manager/sync.go:134] "Executing action" action="update task &{ClusterID: Enabled:true ID:c0aa282b-63ef-4dc5-87c3-475e3dcec9e0 Name:weekly Properties:map[intensity:1 parallel:1 small_table_threshold:1073741824] Schedule:0xc0002444d0 Tags:[] Type:repair}"
I0313 12:04:41.367677       1 manager/sync.go:95] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-vkzsg-nwz28/basic-55h26" duration="166.520159ms"

In the above scenario the infinite updates come from a discrepancy in the small_table_threshold value between the ScyllaCluster's spec and the Manager's state, caused by the value being converted before the request is sent.
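
To illustrate the asymmetry, here is a minimal, self-contained sketch (not the actual operator code; the concrete value representations are assumptions): the spec-side task carries the threshold as a human-readable quantity, the Manager stores the value converted to bytes, and since nothing converts it back, the deep-equality check can never succeed.

package main

import (
	"fmt"
	"reflect"
)

func main() {
	// Task properties derived from the ScyllaCluster spec: the threshold is a human-readable quantity.
	specProperties := map[string]interface{}{"small_table_threshold": "1GiB"}

	// Task properties read back from Scylla Manager: the value was converted to bytes
	// when the request was built and is never converted back.
	managerProperties := map[string]interface{}{"small_table_threshold": int64(1073741824)}

	// The two representations can never be deeply equal, so an update is scheduled on every sync.
	update := !reflect.DeepEqual(specProperties, managerProperties)
	fmt.Println("update:", update) // always true
}

Converting the Manager's state back to the spec's representation (or comparing canonicalized forms on both sides) would avoid this particular loop, but it would still not cover fields the Manager defaults or omits in responses.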

What did you expect to happen?

The tasks should not be updated when there are no changes in their spec.

How can we reproduce it (as minimally and precisely as possible)?

Schedule any task using ScyllaCluster's API.

Scylla Operator version

master

Kubernetes platform name and version

n/a

Please attach the must-gather archive.

n/a

Anything else we need to know?

No response

@rzetelskik rzetelskik added the kind/bug Categorizes issue or PR as related to a bug. label Mar 13, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 13, 2024
@rzetelskik
Member Author

rzetelskik commented Mar 13, 2024

Although this can be worked around by tweaking the deep equality test, the ideal approach would be to annotate the tasks with a checksum of the most recently sent spec. There's an existing feature request which would allow us to implement this: scylladb/scylla-manager#3645.

@zimnx @tnozicka what do you suggest?
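
As a rough illustration of the checksum idea mentioned above (the hashing scheme is an assumption, not an agreed design): compute a stable hash of the task definition when it is sent to the Manager, store it alongside the task, and skip the update when the stored hash matches the hash of the current spec.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// taskSpecHash computes a stable checksum of a task definition; json.Marshal sorts map keys,
// so the result does not depend on map iteration order.
func taskSpecHash(task map[string]interface{}) (string, error) {
	data, err := json.Marshal(task)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	desired := map[string]interface{}{"name": "weekly", "intensity": 1, "parallel": 1}

	desiredHash, _ := taskSpecHash(desired)
	// Hypothetical: the checksum stored with the task when it was last created or updated in the Manager.
	storedHash := desiredHash

	// Update only when the spec actually changed since the last successful sync.
	fmt.Println("needs update:", desiredHash != storedHash)
}

This only works if the Manager exposes a place to persist the checksum, which is what the referenced feature request is about.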

@rzetelskik rzetelskik changed the title Scylla Manager will update tasks despite no changes in spec Scylla Manager controller will update tasks despite no changes in spec Mar 13, 2024
@rzetelskik
Member Author

Another issue is that this slips past the unit tests, since they use a crafted "manager state" which doesn't correspond to what would normally come from the manager client. Should we maybe use a mock client instead?

I can't come up with a way to verify this trivially in our e2e suite.

@tnozicka
Member

scylladb/scylla-manager#3645

Not sure if that's enough; at some point we'd collide with the user's note. An annotations-like map would be best.
I even think the issue is broader, applying to the cluster definition itself and to backups.

Should we maybe use a mock client instead?

Mocks are usually not good when it comes to the level of API admission / defaulting / conversion.

I can't come up with a way to verify this trivially in our e2e suite.

I suppose a progressing condition might show this

@tnozicka tnozicka added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 13, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 13, 2024
@rzetelskik
Member Author

I suppose a progressing condition might show this

We can't use them from the manager controller, can we?

@tnozicka
Member

The manager controller already sets status on ScyllaClusters, so it can add its own progressing condition: if it detects a change, it sets the condition to true; if the next sync detects no change, it sets it to false.
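
A minimal sketch of what flipping such a condition could look like, using the standard apimachinery helper (the condition type, reason strings, and function name are assumptions, not the operator's actual code):

package example // illustrative package, not the operator's code

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setManagerProgressing records whether the manager controller still has pending task changes;
// an e2e test could then wait for this condition to settle to False after a sync with no changes.
func setManagerProgressing(conditions *[]metav1.Condition, pendingChanges bool, generation int64) {
	cond := metav1.Condition{
		Type:               "ManagerControllerProgressing", // hypothetical condition type
		Status:             metav1.ConditionFalse,
		Reason:             "AsExpected",
		Message:            "No pending task changes.",
		ObservedGeneration: generation,
	}
	if pendingChanges {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "TaskChangesPending"
		cond.Message = "Manager tasks are being created or updated."
	}
	// SetStatusCondition updates LastTransitionTime only when the status actually changes.
	meta.SetStatusCondition(conditions, cond)
}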

@rzetelskik
Member Author

Not sure if that's enough; at some point we'd collide with the user's note. An annotations-like map would be best.
I even think the issue is broader, applying to the cluster definition itself and to backups.

Sure, I wasn't proposing we do exactly this, only pointing out that there's already a need for it; a labels/annotations map definitely seems more fitting. There's even another issue for clusters already: scylladb/scylla-manager#3219. It's closer to what we need, so I'll update this one instead.

@rzetelskik
Member Author

Just for the record, this is waiting for scylladb/scylla-manager#3219. The manager team agreed to add a metadata/labels map to the clusters/tasks API, and we'll use it to decide whether the operator controls a given object and to compare hashes of the objects to decide whether we need to update them. It won't come in 3.2.8 though, so we'll have to wait a bit longer.
Xref: scylladb/scylla-manager#3828 (comment)
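
Once that lands, the decision could look roughly like this (the label keys and the shape of the check are assumptions, not the agreed design):

package example // illustrative package, not the operator's code

// Hypothetical label keys; the actual keys are up to the implementation.
const (
	managedByLabel = "scylla-operator.scylladb.com/managed-by"
	specHashLabel  = "scylla-operator.scylladb.com/task-spec-hash"
)

// taskNeedsUpdate decides whether a Manager task has to be updated by comparing the
// spec hash recorded in the task's labels with the hash of the currently desired spec.
// Tasks without the managed-by label are not owned by the operator and are left untouched.
func taskNeedsUpdate(taskLabels map[string]string, desiredHash string) bool {
	if taskLabels[managedByLabel] == "" {
		return false
	}
	return taskLabels[specHashLabel] != desiredHash
}

Keying the decision on a stored hash rather than on field-by-field deep equality sidesteps the conversion and defaulting discrepancies described above.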

@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2024
@rzetelskik
Member Author

/remove-lifecycle stale
/triage accepted

@scylla-operator-bot scylla-operator-bot bot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 9, 2024
@rzetelskik
Member Author

scylladb/scylla-manager#3219 was closed as completed with scylladb/scylla-manager#3934, so this is no longer blocked on SM.
