-
Notifications
You must be signed in to change notification settings - Fork 623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Gating Flux reconciliation #3158
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Stefan Prodan <[email protected]>
Signed-off-by: Paulo Gomes <[email protected]>
I'm curious whether this proposal would easily be able to support
|
|
||
```yaml | ||
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1 | ||
kind: Kustomization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose gates are to be applied at sources only instead of Kustomization
and HelmRelease
objects, as otherwise may face the non-trivial challenges below:
- Incompatible desired state when
Kustomization
/HelmRelease
objects are gated whilst others are not, specially when they depend on the same Flux source that is not gated.- Example 1:
Kustomization
objectsapp2
andapp1
depends onGitRepository
team-source
. A change is merged into the repository and reconciled for objectteam-source
, howeverapp2
has a close Gate andapp1
does not. In this case the desired state at both Git repository and the Flux source in the cluster are valid, but only one Kustomizations is being applied leading to an invalid observed state in the cluster. Same applies if both Kustomizations are behind a Gate which has its state changed before the second object is reconcilied (e.g. example 2 below).
- Example 1:
- Enforcing the desired state which was allowed before the Gate closing (User Story 2) will be difficult to manage from any point beyond the source, which could lead to the cluster to be in an invalid state.
- Example 1: user manually deletes Kubernetes object managed in the hope Flux will re-create it.
- Example 2: two kustomizations (
parent
andchild
) having the same Gate, the latter has adependsOn
pointing to the former. If the Gate closes afterparent
but beforechild
is applied, the cluster observed state may be invalid (considering the desired state at source).
- How can an operator determine if the feature is in use? | ||
- Are there any drawbacks when enabling this feature? | ||
--> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few ideas on design details to be discussed:
Multiple Gates
Flux objects that support gating can specify multiple gates. By default,
all gates specified must be open for the reconciliation to go ahead. To
change the behavior spec.gates.require
can be set to oneOf
instead:
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
name: flux-system
namespace: flux-system
spec:
gates:
# all (default): all gates must be open for the reconciliation to go ahead.
# oneOf: at least one of the gates must be open for the reconciliation to go ahead.
require: oneOf # <all|oneOf>
refs:
- change-freeze # gate that enforces a change freeze time window
- bypass-signoff # gate that allows other gates to be overriden.
When oneOf
is used, a single open Gate is required for reconcilations
to proceed.
Recovering from wiped Storage
The source artifact storage lives in the running source controller instance. If its Pod
is recreated, all the artifacts will need to be reacquired. This process happens
automatically as part of each source reconciliation, however, when sources
are restricted by a gate they will require a new process.
In such situation the controller will need to fetch the version used before the Gate was
closed, and that was in the storage before it being wiped. Not all types will support this
and therefore here are the outcomes expected:
- GitRepository: restores artifact based on the last known revision (commit ID).
- S3 Buckets: fail reconciliation with "artifact not found artifact whilst under closed Gate".
- OCI: TBC.
- Helm Repository: TBC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For OCI we have the upstream digest in status, so like with Git, we can use it to pull the data.
``` | ||
|
||
You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvmnd...moved...
> As a member of the SRE team, I want existing deployments to still be | ||
> reconciled during a change freeze. | ||
|
||
Gates can be used to block Flux sources from being refreshed, resulting in Flux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify that block Flux sources from being refreshed
in this case means the gate prevent the source controller from making the new assets available, rather than to suspend the polling in the source controller?
It is desirable for the source to still be able to detect changes, such that in-cluster logic that is tied to the gating mechanisms can be notified of available changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it stopped polling, it would avoid all the resource consumption (CPU, memory, Network and storage) for an operation that would not cause any side effect to the cluster, which would stream line the controllers to better handle reconciliations that were not gated.
@hassenius is there any specific scenario you have in mind?
|
||
### Goals | ||
|
||
- Offer a dedicated API for defining time-based gates in a declarative manner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that the gates should not be exclusively time based, but also support specific revisions
This would ensure that Gate changes would not impact the eventual consistency of | ||
mid-flight reconciliations that were already deployed in the cluster. Flux would also | ||
continue to re-create Flux managed objects that were manually deleted from the cluster. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose an additional use case, bringing in the reference dimension in addition to the time dimension for gates, such that a gate can be defined as open to a specific reference, either in spec, or as an annotation
Story 3
As a multi-cluster operator, I want to control how cluster configuration changes
are rolled out across the cluster estate from a single configuration source
Gates can be used to control the timing and order of where configurations are reconciled
For example
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
name: approved-revisions
namespace: flux-system
spec:
openTo: 4f08fc93e31
or
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
annotations:
open.gate.fluxcd.io/revision: 4f08fc93e31
name: approved-revisions
namespace: flux-system
spec:
default: closed
Have been thinking about the current proposal, and am wondering if the ecosystem would benefit from a simple static By doing this, it would be much easier for others to write their own "gatekeeper" which e.g. listens to button presses in a UI. |
objects. | ||
|
||
## Design Details | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In highly regulated environments Gating can be used to ensure specific processes were observed before a new change made its way to a given environment. Flux controllers that support such mechanism should extend their log messages to express what Gates were taken into account at the time of a reconciliation, and their states.
Just throwing in another idea. It would be great if a Gate could evaluate a metric. This way we could decline reconciliation on high load or when some long running task is not yet completed. A metrics based Gate would make checks, whether a release can be rolled out, more automatic. Of course as already mentioned by @pjbgf it would be highly appreciated if we would know the reason why a reconciliation did not happen and which Gate was involved - maybe even what the evaluation of the said metric was. Also it would be great to force-open one or all Gates. This could make especially sense when a rollout starts with Gates opened but errors come up during the rollout process and the maintenance window reaches its end. Sometimes the way forward is better than a rollback and Operators should be able to decide that the release should be fixed and rolled out never the less. Maybe this could be done with a label or annotation overriding the close-time of the Gate. |
|
||
### Goals | ||
|
||
- Offer a dedicated API for defining time-based gates in a declarative manner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Offer a dedicated API for defining time-based gates in a declarative manner. | |
- Offer a dedicated API for defining gates in a declarative manner. |
|
||
The `Gate` API would replace Flagger's current | ||
[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional audit trail for this feature:
- Enriching existing logs so that actions that can be gated should log:
- What gate(s) were checked (if any).
- What gate(s) caused the final result of (not) going ahead with the action.
- If multiple gates can be defined, changes to which gates are being observed and what logical operators (and/or) are being used, should be descriptively logged.
- Logs for the new Gate controller:
- Descriptive summary of the gate state at creation time.
- Any change of Gate state should log a descriptive summary.
- The additional logs above would be an "opt-in" feature, so that users that do not require this level of governance don't have to process the extra data.
I agree with @hiddeco, this would make the implementation quite extensible. Users could implement whatever logic they need to toggle the state on top of the static Gate. In-tree "GateKeeper" implementations could be added based on most popular requests amongst the community. |
@rupelu The Gate happens before a new state is applied into the cluster. So I think they are a better fit for defining whether or not sources should be refreshed, for the reasons I mentioned here. However, you could have similar results by using Flagger to do the progressive roll-out side of things for you. |
I have worked out an example on what this could look like if we were to follow @hiddeco's suggestion on using static gates. This is what I got to: https://hackmd.io/N-xK-y5fTCiFFnusF22Sfw. I talked with a few users to ascertain whether this would mitigate their key use cases, and so far the response has been positive. The example is not meant to define the implementation details, but rather to flesh out the workflow so folks can raise concerns sooner rather than later. |
Counterpoint: I am fond of the shifting left that flux encourages with its watchful polling of sources like git repos and image registries. Maybe it's a blocker to widespread community adoption to not have a way to put a human in the loop, but in some cases that human that pushes the deploy button is not the one that merged the app code. Flux is differentiated from many other imperative pipeline CD tools by this very nature which sets the "human in the loop" to be those that commit artifacts to the watched sources of truth and the policies/controls around them. Let's take the scenario of putting a gate up on a Friday afternoon, having the source evolve with new commits/pushes. Some developers finish their projects and update the source of truth, say with refs to new images. On Monday when the gate is lifted the accumulation of several code changes are released. If things perform unexpectedly, what is the last good state? The last commit before the gate went up, sure, but this is cross domain. I would much rather roll back a merge train that hit Monday morning with reverting commits. You will have better team engagement with the release if you remove the safety of a gate window. Promoting the suspend feature to a more first class implementation seems to strike the right balance IMHO. Disclaimer: Please forgive my rant, but to me flux means "minutes of time difference between my gitops source of truth and my target environment" and I love it that way. |
I think your argument is unrelated to what most of the people here ask for. The most valuable use case for the gating feature is to have "maintenance windows" on the environments that reconcile from a stable release branch, and not for having CD-style of deployment that you've described. Imagine you have dozens of environments that have to be updated across the globe within fixed timeframes. Do you want to wake up at night to press the merge button to update the environment? If not, then you have to play around with scheduling of "PR merge" events as CronJobs and creating different branches for different timezones in your version control platform. This will become messy and complex pretty fast. Gating feature is an elegant solution for this problem, and if you need to update the environment outside the maintenance window, an "override" flag should be there. |
Add proposal for introducing a gating mechanism to Flux that would offer a way for cluster admins and other teams involved in the release process to manually approve the rollout of changes onto clusters.
In addition, Flux could offer a way to define maintenance time windows and other time-based gates, to allow a better control of applications and infrastructure changes to critical system.
Note that this proposal is WIP, we need to cover more use-cases from #870