Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Refactor certificate watcher to use polling, instead of fsnotify #3020

Merged
merged 3 commits into from
Nov 26, 2024

Conversation

m-messiah
Copy link
Member

Summary

In the current controller-runtime/certwatcher behaviour if any of the files (cert or key) are deleted before the replacement, the code removes the watcher. It gets stuck with watching nothing, so when the replacement arrives - no goroutine will receive that event and all the future certificate updates (including this) are missed until the code is restarted.

Details

The added test in the PR shows the example scenario of updating the certificate in two steps:

  1. rm /certs/mtls.pem /certs/mtls.key
  2. tar xvf -C /certs new_mtls.tar.gz (or similar).

Expected behaviour:
Return the previous certificate until the new files appear, reestablish the watch and reread the certs when ready.

Current behaviour:
The watch is removed and new files do not trigger anything.

Possible options to consider

  1. The ensureAllFilesAreWatched could be called in select/default branch, but it increases the load drastically. 1 second seems reasonable enough.
  2. The ticker could be increased to 10 seconds to reduce load (or be customizable), but don't see a big issue here as it checks the filesystem and its internal locks only.
  3. Instead of reestablishing watches we could stop the watcher and trigger controller code to be reinitialised from the beginning - it is a nice behaviour we use for Kube caches and watchers, but does not seem good for certificates to reinitialise the whole context after such small change.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 22, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 22, 2024
@@ -139,6 +163,8 @@ func (cw *CertWatcher) Watch() {
}

log.Error(err, "certificate watch error")
case <-watcherHealthTimer.C:
cw.ensureAllFilesAreWatched()
Copy link
Member

@alvaroaleman alvaroaleman Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Permanently polling once per second seems pretty aggressive. Can we not watch the dir intead if the file gets removed and re-add the filewatch once it exists again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidenote: wondering at this point if we should generally avoid relying of fs events, and rather say poll every N seconds and reload the certificates if the hash changed or something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also make the polling configurable, and default it to something reasonable where both old (loaded) and new certificates are supposed to be valid

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not watch the dir intead if the file gets removed and re-add the filewatch once it exists again?

It does not watch the dir, it just tries to reinstall the notify watch that fails if the file is not present.
So technically, it is the same as "waiting until the file exists again" as the first step of addWatch is os.Lstat(name).

The other option is to watch the directory events instead of file events and parse all events for the file names but it:

  1. Increases the number of events to parse and handle
  2. Does not solve the case when the directory would be removed and recreated again (or have mount issues, whatever).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidenote: wondering at this point if we should generally avoid relying of fs events, and rather say poll every N seconds and reload the certificates if the hash changed or something?

It makes sense, and in some of our internal projects, we rely on this behaviour instead. The downside is that you need to find a balance between polling interval, certificate expiration time, and number of file reads in this case; so ideally this param should also be configurable across all usages of certwatcher (webhook server and metrics server at least)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the main point is with the aggressive 1 second polling. I like vinces idea of just re-reading the file every 10 seconds and caching in between, that is also what client-go does for the SA token

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestions.
@alvaroaleman @vincepri I updated the PR with full removal of fsnotify from certwatcher and having a simple read/cache/repeat ticker loop. Please check and leave your opinions.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 26, 2024
@k8s-ci-robot
Copy link
Contributor

@m-messiah: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-controller-runtime-apidiff cbb3859 link false /test pull-controller-runtime-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Member

@alvaroaleman alvaroaleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!
/hold

@vincepri does this look good to you?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 26, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3ab463016c247d6ff9787de7b6df1faffe0d50d0

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2024
@alvaroaleman alvaroaleman added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Nov 26, 2024
@m-messiah
Copy link
Member Author

Do you know what I should do with apidiff check?

The method Watch does not look to be important outside of the Start, but it was Exported. Shall I keep it there and do nothing?

sigs.k8s.io/controller-runtime/pkg/certwatcher
  Incompatible changes:
  - (*CertWatcher).Watch: removed
  Compatible changes:
  - (*CertWatcher).WithWatchInterval: added

@alvaroaleman
Copy link
Member

Do you know what I should do with apidiff check?

It is just informative and tells us that there was a breaking change to exported interfaces. It won't block merge

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, m-messiah, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alvaroaleman,vincepri]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vincepri
Copy link
Member

/retitle 🐛 Refactor certificate watcher to use polling, instead of fsnotify

@k8s-ci-robot k8s-ci-robot changed the title 🐛 Reestablish watch for the certificate paths 🐛 Refactor certificate watcher to use polling, instead of fsnotify Nov 26, 2024
@vincepri
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2024
@k8s-ci-robot k8s-ci-robot merged commit 8e44a43 into kubernetes-sigs:main Nov 26, 2024
15 of 16 checks passed
@vincepri
Copy link
Member

/cherry-pick release-0.19

@k8s-infra-cherrypick-robot

@vincepri: new pull request created: #3023

In response to this:

/cherry-pick release-0.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@skhalash
Copy link

@m-messiah After this change our tests showed the following problem - our operator starts, a certificate is generated on startup and saved in the cert dir, and only after 10 seconds it is actually getting picked up. Which was not the case with the old fsnotify approach. We can fix it by decreasing the timeout, but is still feels wrong to me.

@m-messiah
Copy link
Member Author

@m-messiah After this change our tests showed the following problem - our operator starts, a certificate is generated on startup and saved in the cert dir, and only after 10 seconds it is actually getting picked up. Which was not the case with the old fsnotify approach. We can fix it by decreasing the timeout, but is still feels wrong to me.

The expectation of certwatcher is that files are already present, or be present in the next loop after interval.
I understand the problem, but it looks like your operator is trying to start the watcher before the certs are generated and to solve this you should be able to fix the order of prerequisites and the actual start.

@m-messiah m-messiah deleted the re-establish-watch branch December 18, 2024 15:15
@skhalash
Copy link

skhalash commented Dec 18, 2024

@m-messiah Hmm, we are not doing anything fancy there and as far as I understand the initial cert read and the timer creation happen here

certWatcher, err := certwatcher.New(certPath, keyPath)
and at that point our cert is already there.

@skhalash
Copy link

I found the root cause in our codebase, but I’m still questioning this change. Our operator manages its own webhook certificates (rotation, etc.) instead of using an external service like cert-manager. The issue is that we assume updating the cert file makes it immediately usable by the HTTPS server. This worked before but doesn’t anymore after this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants