-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Refactor certificate watcher to use polling, instead of fsnotify #3020
🐛 Refactor certificate watcher to use polling, instead of fsnotify #3020
Conversation
pkg/certwatcher/certwatcher.go
Outdated
@@ -139,6 +163,8 @@ func (cw *CertWatcher) Watch() { | |||
} | |||
|
|||
log.Error(err, "certificate watch error") | |||
case <-watcherHealthTimer.C: | |||
cw.ensureAllFilesAreWatched() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Permanently polling once per second seems pretty aggressive. Can we not watch the dir intead if the file gets removed and re-add the filewatch once it exists again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sidenote: wondering at this point if we should generally avoid relying of fs events, and rather say poll every N seconds and reload the certificates if the hash changed or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also make the polling configurable, and default it to something reasonable where both old (loaded) and new certificates are supposed to be valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not watch the dir intead if the file gets removed and re-add the filewatch once it exists again?
It does not watch the dir, it just tries to reinstall the notify watch that fails if the file is not present.
So technically, it is the same as "waiting until the file exists again" as the first step of addWatch
is os.Lstat(name)
.
The other option is to watch the directory events instead of file events and parse all events for the file names but it:
- Increases the number of events to parse and handle
- Does not solve the case when the directory would be removed and recreated again (or have mount issues, whatever).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sidenote: wondering at this point if we should generally avoid relying of fs events, and rather say poll every N seconds and reload the certificates if the hash changed or something?
It makes sense, and in some of our internal projects, we rely on this behaviour instead. The downside is that you need to find a balance between polling interval, certificate expiration time, and number of file reads in this case; so ideally this param should also be configurable across all usages of certwatcher (webhook server and metrics server at least)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the main point is with the aggressive 1 second polling. I like vinces idea of just re-reading the file every 10 seconds and caching in between, that is also what client-go does for the SA token
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestions.
@alvaroaleman @vincepri I updated the PR with full removal of fsnotify from certwatcher and having a simple read/cache/repeat ticker loop. Please check and leave your opinions.
@m-messiah: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
/hold
@vincepri does this look good to you?
LGTM label has been added. Git tree hash: 3ab463016c247d6ff9787de7b6df1faffe0d50d0
|
Do you know what I should do with The method Watch does not look to be important outside of the Start, but it was Exported. Shall I keep it there and do nothing?
|
It is just informative and tells us that there was a breaking change to exported interfaces. It won't block merge |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alvaroaleman, m-messiah, vincepri The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retitle 🐛 Refactor certificate watcher to use polling, instead of fsnotify |
/hold cancel |
/cherry-pick release-0.19 |
@vincepri: new pull request created: #3023 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@m-messiah After this change our tests showed the following problem - our operator starts, a certificate is generated on startup and saved in the cert dir, and only after 10 seconds it is actually getting picked up. Which was not the case with the old |
The expectation of certwatcher is that files are already present, or be present in the next loop after interval. |
@m-messiah Hmm, we are not doing anything fancy there and as far as I understand the initial cert read and the timer creation happen here controller-runtime/pkg/webhook/server.go Line 207 in aea2e32
|
I found the root cause in our codebase, but I’m still questioning this change. Our operator manages its own webhook certificates (rotation, etc.) instead of using an external service like cert-manager. The issue is that we assume updating the cert file makes it immediately usable by the HTTPS server. This worked before but doesn’t anymore after this change. |
Summary
In the current controller-runtime/certwatcher behaviour if any of the files (cert or key) are deleted before the replacement, the code removes the watcher. It gets stuck with watching nothing, so when the replacement arrives - no goroutine will receive that event and all the future certificate updates (including this) are missed until the code is restarted.
Details
The added test in the PR shows the example scenario of updating the certificate in two steps:
rm /certs/mtls.pem /certs/mtls.key
tar xvf -C /certs new_mtls.tar.gz
(or similar).Expected behaviour:
Return the previous certificate until the new files appear, reestablish the watch and reread the certs when ready.
Current behaviour:
The watch is removed and new files do not trigger anything.
Possible options to consider
ensureAllFilesAreWatched
could be called inselect/default
branch, but it increases the load drastically. 1 second seems reasonable enough.