Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move overriding Scylla Operator's log level to CI scripts and make it configurable #2297

Merged

Conversation

rzetelskik
Copy link
Member

@rzetelskik rzetelskik commented Jan 3, 2025

Description of your changes: Currently, the Operator's and manager controller's loglevels are patched after they have already been deployed, which causes a rolling restart. It prevents us from giving operator guaranteed QoS in CI with resources >= half of what's available because PDB blocks the restart and everything gets stuck.

This PR moves the adjustment to CI scripts and makes the value configurable through e2e scripts.

Which issue is resolved by this Pull Request:
Resolves #2296

/cc

Copy link
Contributor

@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Description of your changes: wip

Which issue is resolved by this Pull Request:
Resolves #2296

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@scylla-operator-bot scylla-operator-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 3, 2025
@rzetelskik rzetelskik force-pushed the operator-loglevel-ci-script branch from 674ff94 to 4570486 Compare January 8, 2025 11:59
@scylla-operator-bot scylla-operator-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 8, 2025
@rzetelskik rzetelskik force-pushed the operator-loglevel-ci-script branch 4 times, most recently from 6e104a8 to 7e42f88 Compare January 8, 2025 13:26
@scylla-operator-bot scylla-operator-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 8, 2025
@rzetelskik rzetelskik changed the title [WIP] Move overriding Operator's and Manager's log level to CI scripts and make it configurable [WIP] Move overriding Scylla Operator's and Scylla Manager controller's log level to CI scripts and make it configurable Jan 8, 2025
@rzetelskik rzetelskik changed the title [WIP] Move overriding Scylla Operator's and Scylla Manager controller's log level to CI scripts and make it configurable [WIP] Move overriding Scylla Operator's log level to CI scripts and make it configurable Jan 8, 2025
@rzetelskik rzetelskik force-pushed the operator-loglevel-ci-script branch 2 times, most recently from 4f970b5 to 67740d8 Compare January 8, 2025 15:10
@rzetelskik
Copy link
Member Author

rzetelskik commented Jan 8, 2025

Proof that loglevel propagated in ci-deploy.sh:

I added a transitional commit setting the env vars in hack/.ci/deploy-release-gke.sh to test it. The loglevel was propagated:

@rzetelskik rzetelskik force-pushed the operator-loglevel-ci-script branch 6 times, most recently from 76ad542 to 235da58 Compare January 9, 2025 09:51
@rzetelskik
Copy link
Member Author

/priority important-soon
/kind machinery

@scylla-operator-bot scylla-operator-bot bot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/machinery Categorizes issue or PR as related to Makefile, scripts or similar changes. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jan 9, 2025
@scylla-operator-bot scylla-operator-bot bot requested a review from zimnx January 9, 2025 09:54
@rzetelskik rzetelskik changed the title [WIP] Move overriding Scylla Operator's log level to CI scripts and make it configurable Move overriding Scylla Operator's log level to CI scripts and make it configurable Jan 9, 2025
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2025
Copy link
Collaborator

@mflendrich mflendrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that the in place editing pattern for the operator & manager kustomizations is already present in these scripts (as opposed to creating a dependent kustomization), I agree that following the established way of doing things is the right decision (whilst a rewrite to a declarative approach is strongly preferable for maintainability reasons).

Comment on lines 26 to 28
SO_SCYLLA_OPERATOR_LOGLEVEL="${SO_SCYLLA_OPERATOR_LOGLEVEL:-4}"
export SO_SCYLLA_OPERATOR_LOGLEVEL

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about creating a _run-e2e-common.inc.sh file with some of the bash boilerplate? (to get rid of the "repeated constants" code smell)

that would then be included using something like

source "$( dirname "${BASH_SOURCE[0]}" )/_run-e2e-common.inc.sh"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind separate scripts was to make the config platform-dependent and allow for configuring it in the CI jobs. The values here are meant as platform-dependent defaults, but as you can see, most of them repeat across different platforms.
This particular one could be embedded in ci-deploy.sh and ci-deploy-release.sh scripts, but it won't hold for some other examples (e.g. resources, as in #2276, because it will prevent us from using it locally and/or on less powerful setups).
I think having the shared sourced script could clean things up a bit, but I'm wondering if it's not going to obfuscate where the values are coming from in the end (given they could then come from the CI job, the shared script, or the e2e script)?
I'll make it your call. 😉

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option of embedding the default log level in ci-deploy.sh and ci-deploy-release.sh directly (and inlining the loglevel patch in the kustomization yaml) looks most promising to me (because this config is unlikely to be environment-dependent and adding excessive logic here doesn't add clear benefit)

Let's do it (that is: drop the conditional, embed the loglevel patch directly in the kustomization) if you think the same way. Structuring the config in a streamlined way would require a fundamental restructuring of the deploy machinery - something we won't do in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

hack/ci-deploy.sh Show resolved Hide resolved
@rzetelskik
Copy link
Member Author

whilst a rewrite to a declarative approach is strongly preferable for maintainability reasons

I have a couple of similar PRs in queue, so maybe it's worth considering. I'm not sure how would you see this done though?

@mflendrich
Copy link
Collaborator

I'm not sure how would you see this done though?

I have started looking and it was more complex than I managed to produce a complete recipe for in a timeboxed effort. But basically the idea is that:

  • the kustomization file is either static or generated from a single "flat" template
  • the patches are static (and - where appropriate - replaced with kustomize transformers like images, replacements etc.
  • what varies between executions is: decisions whether to apply specific patches or not and what values to fill in - but keeping the structure of the kustomization fixed.

I am more than happy to try explore this tomorrow, or pair program. Kustomize has one important limitation that it doesn't like accepting external context that we'd need to work around somehow.

@rzetelskik
Copy link
Member Author

I'm not sure how would you see this done though?

I have started looking and it was more complex than I managed to produce a complete recipe for in a timeboxed effort. But basically the idea is that:

  • the kustomization file is either static or generated from a single "flat" template
  • the patches are static (and - where appropriate - replaced with kustomize transformers like images, replacements etc.
  • what varies between executions is: decisions whether to apply specific patches or not and what values to fill in - but keeping the structure of the kustomization fixed.

I though about it before but I figured it wouldn't make that much of a difference since the patches would still have to be conditionally modified and concatenated to the kustomize file. It might make things more understandable, but I don't think it will change the flow in these scripts too much, so it seems mostly orthogonal to this PR.
How about we take this as a followup to these and other PRs in queue?

@rzetelskik rzetelskik force-pushed the operator-loglevel-ci-script branch from 235da58 to 721987b Compare January 10, 2025 10:58
@rzetelskik rzetelskik changed the title Move overriding Scylla Operator's log level to CI scripts and make it configurable [WIP] Move overriding Scylla Operator's log level to CI scripts and make it configurable Jan 10, 2025
@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 10, 2025
@rzetelskik rzetelskik force-pushed the operator-loglevel-ci-script branch 2 times, most recently from cc4e7d3 to ce2750c Compare January 10, 2025 14:30
@scylla-operator-bot scylla-operator-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 10, 2025
@rzetelskik rzetelskik force-pushed the operator-loglevel-ci-script branch from ce2750c to ecb7300 Compare January 10, 2025 15:21
@rzetelskik rzetelskik force-pushed the operator-loglevel-ci-script branch 2 times, most recently from 0228d81 to e981aa9 Compare January 10, 2025 17:15
@rzetelskik rzetelskik requested a review from mflendrich January 10, 2025 17:16
@rzetelskik rzetelskik changed the title [WIP] Move overriding Scylla Operator's log level to CI scripts and make it configurable Move overriding Scylla Operator's log level to CI scripts and make it configurable Jan 10, 2025
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 10, 2025
@rzetelskik
Copy link
Member Author

@mflendrich I believe I addressed all of your comments - let me know if we're on the same page now

Copy link
Collaborator

@mflendrich mflendrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good 👍

/lgtm
/approve

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2025
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mflendrich, rzetelskik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scylla-operator-bot scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2025
@rzetelskik
Copy link
Member Author

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel-clusterip e981aa9 link true /test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.

#2304 (comment)
known manager flake
/test images
/retest

@scylla-operator-bot scylla-operator-bot bot merged commit 7ed4ab4 into scylladb:master Jan 13, 2025
14 checks passed
@rzetelskik rzetelskik deleted the operator-loglevel-ci-script branch January 13, 2025 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/machinery Categorizes issue or PR as related to Makefile, scripts or similar changes. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

E2E scripts cause unnecessary rolling restart of the operator by patching log level
2 participants