-
Notifications
You must be signed in to change notification settings - Fork 236
API-1802: cert-rotation: allow specifying multiple target certs in CertRotationController #1722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
API-1802: cert-rotation: allow specifying multiple target certs in CertRotationController #1722
Conversation
9130e0c
to
d10d787
Compare
@vrutkovs: This pull request references API-1802 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/cc @tkashem @p0lyn0mial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a) RotatedSigningCASecret
creates the signer CA secret
b) CABundleConfigMap
creates a configmap, and
c) RotatedSelfSignedCertKeySecret
creates a secret
I like the idea of a controller doing one thing, can we explore the idea of the individual controllers?
a) SignerCAController
: this controller manager the signer secret object.
b) CABundleController
: it watched the secret object from a
and creates a single configmap and manages it.
c) CertKeySecretController
: this can watch objects from a
and b
and creates a secret with cert/key and manages it.
With this, we can have N instances of CertKeySecretController
, where each instance derives it cert/key from a single instance of a
and b
WithPostStartHooks( | ||
c.targetCertRecheckerPostRunHook, | ||
). | ||
ToController("CertRotationController", recorder.WithComponentSuffix("cert-rotation-controller").WithComponentSuffix(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MultipleTargetCertRotationController
, so we have two distinct names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
} | ||
}(ch) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to avoid making any runtime behavioral changes to NewCertRotationController
if possible, we have the following options:
a) completely separate implementations: CertRotationController
for NewCertRotationController
, and MultipleTargetCertRotationController
for NewCertRotationControllerMultipleTargets
b) abstract out the targetCertRecheckerPostRunHook
implementations: singleTargetCertRecheckerPostRunHook
and multiTargetCertRecheckerPostRunHook
. This way we can reuse CertRotationController
for both single and multiple. NewCertRotationControllerMultipleTargets
will use multiTargetCertRecheckerPostRunHook
.
c) can we have a single channel <-chan time.Time
to be shared by multiple instances of CertCreator
(for example ServingRotation
), then the logic inside targetCertRecheckerPostRunHook
does not need to change at all.
I prefer c
, if doable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked this to make it look like b). Added a test which verifies goroutines don't leak
I don't quite understand what c) is meant for - make controller accept a single channel reused across all CertCreators
? Not sure what's the benefit of that
targetRefresh := refresher.RecheckChannel() | ||
aggregateTargetRefresher := make(chan struct{}) | ||
for _, ch := range targetRefreshers { | ||
go func(c <-chan struct{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go func
does not have the same guarantee as go wait.Until(func() {}, time.Minute, ctx.Done())
for _, ch := range targetRefreshers { | ||
go func(c <-chan struct{}) { | ||
for msg := range c { | ||
aggregateTargetRefresher <- msg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
goroutine leaking: <-ctx.Done()
, could be a problem for integration tests that check goroutine leakages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx.Done
would close them iiuc, but yeah, worth adding a unit test which uses runhook and ensures no goroutines are leaking
That's possible, but the goal is to create target certs, signers and CA are merely prerequisites to it. We could have separate signer/ca controllers, but we might end up with signer certs not producing any target certs or CA bundles without any new signers etc. |
d10d787
to
a528215
Compare
New changes are detected. LGTM label has been removed. |
a528215
to
d3c0949
Compare
|
||
// Ensure both target certs have been called exactly three times | ||
// initial sync and two hook calls for target certs | ||
// TODO[vrutkovs]: informers make unpredictable number of calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how to tackle that - or how to make sure two hook syncs were included in regular informer sync. informerFactory
and NewCertRotationControllerMultipleTargets
promise to sync every minute, but it happens much more often
An alternative design would be to have a controller per certificate – this would actually preserve the current behaviour. I prefer having a single controller per certificate as it is easier to debug, report status, retry on error, and reason about. Thoughts? Our issue is that both the I think the least invasive change would be to make both Another idea that comes to mind (I think Abu suggested the same) is to turn both |
Similar to Abu's idea in #1722 (review)? That would prevent races, but would make sequence of events to do a proper rollout complicated (three different controllers would need to be properly synced so that CA bundle would be updated before target cert etc.). Also may lead to "orphan" controllers managing CA bundle without a target cert.
Yes, this issue still potentially remains. The PR focuses on solving a much more widespread issue of multiple target certs
I don't know if its feasible, as its not just thread-safety but also process-safety we're concerned about |
Don't we do it all the time? The aggregator controller waits until a service is created before it wires an HTTP handler. What I like about these and other controllers is that when you look inside, you will see that these controllers are reconciling a single resource. They are simply reading their prerequisites and reacting to any changes before reconciling. In our case, it would boil down to three separate controllers that reconcile their resources, where the second and the third controller read the crypto material from the lister before reconciling. For example: NewCertRotationController would:
When there are issues with the signerCA you go to |
/test unit |
RotatedSelfSignedCertKeySecret RotatedSelfSignedCertKeySecret | ||
|
||
// RotatedTargetSecrets contains a list of key and cert signed by a signing CA to rotate. | ||
RotatedTargetSecrets []RotatedSelfSignedCertKeySecret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change the external API and clients would need to update if they're accessing this field directly.
Since you're already adding a new "Create" function for the multi rotation controller, would it make more sense to have two cert controllers? One for single rotation and one for multi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change the external API and clients would need to update if they're accessing this field directly
Correct, this would require users to update the code potentially. So far we identified that only cluster-kube-apiserver-operator and cluster-kube-control-manager-operator use multitarget controllers (and they don't need to access RotatedTargetSecrets
directly)
would it make more sense to have two cert controllers? One for single rotation and one for multi.
It may be beneficial later to move common parts into functions and reuse them, however at this point only two different New...
functions are used externally
6a2734e
to
26bcb48
Compare
Is the goal of this change to avoid races between multiple updating controllers? If so, this solution appears to miss the critical change required to resolve the cross-binary active-active problem. The core mistake is using the |
This prevents races within one controller (thread-level). Component PRs like this ensure that multiple controllers don't run simultaneously.
We don't use
|
By having it read the live secret and use its RV instead of the RV the controller used to make decisions, conflicts are avoided and produce a situation that looks like
Using .Update instead should eliminate the read in 4.1 and result in a conflict instead of the race. |
Am I missing something? We already use .Update there - also discussed it with Lukasz in #1763 (comment) and he wants me to change this to use .Update too. Also, this PR would prevent client/1 and client/2 from ever happening. If |
@vrutkovs let's set up a pair programming session tomorrow/this-week where I can show you how we could use optimistic concurrency control built into kube to solve the race. |
Ah, the problem here was:
Without second However, its orthogonal to this PR - we can both remove racing controllers and make sure |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
088b03d
to
51c0990
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dinhxuanvu, vrutkovs The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test unit |
/remove-lifecycle rotten |
…Controller Instead of defining several controllers managing the same signer/CA bundle pair and different target certs the same controller can accept a list of target certs to create.
Collect a list of secrets/configmaps tracked in controllers to make sure any secret/configmap is not being declared twice to avoid races when updating it
51c0990
to
a6fca12
Compare
@vrutkovs: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Instead of defining several controllers managing the same signer/CA bundle pair and different target certs the same controller can accept a list of target certs to create.
Tested with
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=300","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=600","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=900","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1200","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1500","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1800","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=2100","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=2400","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"