Cluster Group Upgrade and Managed Clusters status enhancement #5
base: main
Conversation
Adding @jc-rh @sabbir-47 @ijolliffe @vitus133
- **notApplied**: the policy has not been applied to enforce remediation.
- **nonCompliant**: the policy was applied to enforce remediation but has not become compliant.
- **compliant**: the policy was applied to enforce remediation and is compliant.
- **timeout**: the policy was applied to enforce remediation but did not become compliant within the timeout limits defined in the remediationStrategy.
Timeout value in the spec is for the whole CGU. Not sure if we need this on a per cluster basis.
Not sure if it is for the whole CGU, but if it is, we need to change that. The timeout defined in the CGU gets applied to each managed cluster.
I don't think we can do that. The overall timeout is the first-order feature so the operator knows the CGU would fit in the window. We can debate more on whether a per-cluster or per-batch timeout should be introduced. My initial thought is that it would become hard to manage alongside the overall timeout.
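To make that trade-off concrete, here is a minimal Go sketch of one way a per-batch budget could be derived from the single CGU-level timeout. This is purely illustrative of the idea being debated, not how TALM is implemented today, and the helper name is made up:

```go
package main

import (
	"fmt"
	"time"
)

// batchTimeout is a hypothetical helper: it splits whatever time is left of the
// overall CGU timeout evenly across the batches that have not run yet, so the
// operator-facing guarantee stays "the whole CGU fits in the window".
func batchTimeout(overall, elapsed time.Duration, remainingBatches int) time.Duration {
	if remainingBatches <= 0 {
		return 0
	}
	remaining := overall - elapsed
	if remaining <= 0 {
		return 0
	}
	return remaining / time.Duration(remainingBatches)
}

func main() {
	// 240 minutes for the whole CGU, 50 minutes already spent, 3 batches left.
	fmt.Println(batchTimeout(240*time.Minute, 50*time.Minute, 3)) // 63m20s
}
```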
## Motivation

1. Currently, the CGU status does not provide state per managed cluster. In case of failure to apply the ACM policies, the end user cannot identify which managed cluster failed. This enhancement proposes a change to the CGU status to include policy state per managed cluster.
Maybe clarify here. The user is currently able to identify which policies are non-compliant for a cluster. This enhancement, I believe, is making the failure(s) more visible in the CGU CR.
I mean it's not visible through the CGU status; you need to check the policies under the cluster NS and from there you can identify which policy is nonCompliant. I will clarify this more.
1. As an end user, I would like to enforce a set of ACM policies to a group of clusters and be able to track the policies' state, compliant/nonCompliant, per cluster.

1. As an end user, I would like to upgrade a group of clusters and be able to track the upgrade process state per cluster.
Can you expand on this story. I believe that currently the user can track the upgrade status using the ClusterVersion CR on the managed cluster.
Same, it's not through the CGU status. As a user I would have to have access to the spoke cluster in order to check the ClusterVersion.
I agree on this point with @serngawy, if a user is upgrading a set of clusters through CGU, CGU should pull the status of those clusters and publish it through its own API.
1. Enhance the CGU status to present the state of ACM policies per managed cluster.

1. Provide a managed clusters upgrade API to present the upgrade process state per managed cluster.
Why is a new API needed to present the upgrade state/progress? The first goal describes adding per cluster status to the CGU. Can upgrade state/progress not be included in that?
It's not possible to include that in the same CGU status. Plus, as I mentioned, using policy does not give the ability to have a progress state.
Why does policy not allow for a progress state? The TALM controller, when it sees that it is managing a ClusterVersion Policy (it already detects subscriptions and has special handling), can run specific logic to pull the upgrade status from the ClusterVersion CR on a periodic basis.
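As a rough illustration of that idea, the sketch below summarizes an upgrade state from the fields the controller would read off the spoke's ClusterVersion CR. The types are simplified local stand-ins, not the real openshift/api structs, and the state names are only placeholders:

```go
package main

import "fmt"

// Simplified stand-ins for the fields the controller would read from the
// spoke's ClusterVersion CR (update history and the progressing message).
type updateHistory struct {
	Version  string
	State    string // "Completed" or "Partial"
	Verified bool
}

type clusterVersionView struct {
	ProgressingMessage string
	History            []updateHistory // History[0] is the most recent update
}

// upgradeState turns the ClusterVersion view into per-cluster upgrade status
// fields of the kind proposed in this enhancement (message, state, verified).
func upgradeState(cv clusterVersionView) (message, state string, verified bool) {
	if len(cv.History) == 0 {
		return "no update history", "notStarted", false
	}
	latest := cv.History[0]
	if latest.State == "Completed" {
		return fmt.Sprintf("Cluster version is %s", latest.Version), "complete", latest.Verified
	}
	return cv.ProgressingMessage, "inProgress", latest.Verified
}

func main() {
	cv := clusterVersionView{
		History: []updateHistory{{Version: "4.10.9", State: "Completed", Verified: true}},
	}
	msg, state, verified := upgradeState(cv)
	fmt.Println(msg, state, verified) // Cluster version is 4.10.9 complete true
}
```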
Having two different controllers / APIs is confusing to me, and might confuse our users. It could have been better if it was an isolated change, but it also moves stuff from CGU to MCU, giving our users grief. Can we try and design the CGU API of our dreams first, and then discuss which underlying technology we use to fulfill it?
For example, we can extend the existing API in a non-disruptive way with a detailed-status field:
- name: spoke3
  policies:
    policy1-common: compliant
    policy2-group: in progress
    policy3-site: notApplied
  state: failed
  detailed-status:
    backup:
      message: succeeded in 0 minutes 12 seconds
    pre-caching:
      message: succeeded in 40 minutes 33 seconds
      per-cluster-spec: {if we ever decide to implement it}
      images-pre-cached: 502 of 502
      size-on-disk: 12.6 GiB
      free-space-percent: 32
    policies:
    - policy1-common:
        message: compliant
    - policy2-group:
        message: waiting object performance.openshift.io/v2/performanceprofiles to comply, node NotReady
We can implement it using one underlying technology today and switch to another one tomorrow if we find it worthwhile.
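To make the shape of that extension easier to discuss, here is a hedged sketch of what the detailed-status field could look like as Go API types. Every type and field name below is a placeholder for discussion, not an agreed API:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterDetailedStatus is a sketch of the per-cluster "detailed-status"
// extension proposed above. All names are placeholders.
type ClusterDetailedStatus struct {
	// Backup reports the backup step result for this cluster, if backup is enabled.
	Backup *StepStatus `json:"backup,omitempty"`
	// PreCaching reports the pre-caching step result for this cluster, if enabled.
	PreCaching *PreCachingStatus `json:"preCaching,omitempty"`
	// Policies lists per-policy messages, e.g. the violation message when nonCompliant.
	Policies []PolicyDetail `json:"policies,omitempty"`
}

// StepStatus carries a human-readable message and completion time for a step.
type StepStatus struct {
	Message        string       `json:"message,omitempty"`
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
}

// PreCachingStatus extends StepStatus with the pre-caching details from the example.
type PreCachingStatus struct {
	StepStatus       `json:",inline"`
	ImagesPreCached  string `json:"imagesPreCached,omitempty"`  // e.g. "502 of 502"
	SizeOnDisk       string `json:"sizeOnDisk,omitempty"`       // e.g. "12.6 GiB"
	FreeSpacePercent int    `json:"freeSpacePercent,omitempty"` // e.g. 32
}

// PolicyDetail holds a per-policy message, e.g. the current violation text.
type PolicyDetail struct {
	Name    string `json:"name"`
	Message string `json:"message,omitempty"`
}
```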
I agree with @vitus133, it looks cleaner to have it in the same place, i.e. in the CGU.
My concern about having the whole spec together in the same CR is that the status will differ based on the process and it might get very long.
policy1-common: notApplied
policy2-group: notApplied
policy3-site: notApplied
state: nonCompliant
It seems like this would be a really good place to have an indication of whether the cluster has been processed in a batch (eg "complete") or is still "pending". That way the user can see which clusters are done (even if failed) and which have not yet been run.
Makes sense, maybe we can make the states (pending, nonCompliant, compliant).
Since we will still have the current batch status field with the list of clusters, I feel we don't need to repeat it here. And a cluster is compliant if all policies are compliant. Do we really need a separate cluster status field?
state: InProgress
```
This enhancement proposes to change the CGU status API to the example below (cgu-upgrade-new). The CGU status contains lists of conditions, selected clusters, and canary clusters if defined.
I like the layout/format of the new status much better. It is much clearer to see what state everything is in. On the other hand, my understanding of the initial implementation is that the controller needed the information in the status to maintain state between reconcile loops. @jc-rh will the status content below be sufficient for the controller?
I would love to evolve the existing field to include what is being proposed here. However, our hands are tied by API compatibility. So we will probably keep the existing current batch status as is, and it would be more for internal use. Data added by this proposal can be in brand new fields.
- **nonCompliant**: the policy was applied to enforce remediation but has not become compliant.
- **compliant**: the policy was applied to enforce remediation and is compliant.
Maybe we should consider using something other than compliant/non-compliant. I am concerned that a user could be confused into thinking that this field tracks the "real time" compliance state of the cluster for that policy. If I'm understanding correctly this field is simply telling us whether the policy went compliant during the CGU handling for its batch. Maybe a state of [notApplied | success | timeout]?
Well yes, the idea is that it tracks the real-time policy state on the spoke cluster until it becomes compliant or times out.
Maybe add inProgress to the list to track the running state? So [notApplied | inProgress | success | timedout]?
I like "inProgress". We already have [notStarted | inProgress | completed] in the current batch status field. I am hesitant about introducing timedout on a per policy basis though. In theory, even after the batch timeout (which also varies), the policy can still become compliant.
As explained in the motivation section above, ACM policy cannot properly report the cluster upgrade state, and the CGU CR does not have a declarative API definition for the clusters upgrade state.
The ManagedClustersUpgrade CR (MCU) provides a declarative definition for the clusters upgrade API as well as the upgrade state per managed cluster.
The goals section talks about providing a per-cluster upgrade progress status. This introduction talks about that status but also a new API to initiate upgrades. These feel like two different enhancements. Based on the prior section (per cluster status) can the upgrade status be added to the proposed per-cluster CGU status without having to create a new (MCU) API.
It's not possible, Ian. The issue is that CGU only accepts policy to do actions, and policy cannot give a progress state; it's either compliant or nonCompliant, like a binary state.
Wouldn't a new API make it very hard to satisfy the other requirement where we want to have upgrade and non-upgrade changes in one sequence? I would rather implement what Ian suggested above for now and put it into the violation message field, along with policy status. Meanwhile, we need to push for this ACM policy enhancement (https://issues.redhat.com/browse/ACM-1413) hard. Once it's implemented, we can get the info we need in a generic way.
- clusterID: 84a94c75-08b6-4dfe-9138-654d94acc87
  clusterUpgradeStatus:
    message: Cluster version is 4.10.9
    state: complete
    verified: true
Can this section be added to the proposed CGU per-cluster status field?
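For discussion, a small sketch of how the block above could be expressed as a Go status type, whether it stays in the MCU or is folded into the per-cluster CGU status as asked here; field and state names are placeholders:

```go
package v1alpha1

// ClusterUpgradeStatus sketches the per-cluster upgrade fields shown above.
// Field names are placeholders for discussion, not an agreed API.
type ClusterUpgradeStatus struct {
	// Message mirrors the ClusterVersion progress/completion message,
	// e.g. "Cluster version is 4.10.9".
	Message string `json:"message,omitempty"`
	// State is one of notStarted, inProgress, complete, failed (assumed set).
	State string `json:"state,omitempty"`
	// Verified reports whether the release image signature was verified.
	Verified bool `json:"verified,omitempty"`
}
```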
A great start - thanks for putting this together
## Summary

Cluster Group Upgrade (CGU) and Managed Cluster Upgrade (MCU) provide aggregator APIs for managed cluster upgrades and configuration changes. CGU uses ACM policy APIs to apply cluster configuration changes and presents the status of the policies per managed cluster. MCU uses ACM ManifestWork APIs to upgrade OCP clusters and presents the status of the upgrade process per managed cluster.
Is the purpose of this enhancement to better leverage and expose existing status messages from elsewhere? The second sentence mentions an API change - could you expand on this sentence - will the APIs change or have they changed?
Perhaps also highlight that CGU exists and MCU is proposed?
### User Stories

1. As an end user, I would like to enforce a set of ACM policies to a group of clusters and be able to track the policies' state, compliant/nonCompliant, per cluster.
Is the user story to also expand the granularity of the status visible beyond compliant/non-compliant?
it's probably a good idea to have a placeholder in the data structure for the violation message when policy is non-compliant
### Non-Goals
These should be focused on things we explicitly want to exclude from scope. It seems these are defining behaviours. Food for thought.
#### Removing a deprecated feature

### Upgrade / Downgrade Strategy
This section is needed - implications on upgrades need to be factored into the design
Based on our discussion today, we think this should detail the user experience if they're running 4.11 on 4.12
If I previously did an upgrade from 4.11 to 4.12 using the existing mechanisms, how would my behaviors differ if I'm going from 4.12 to the release where this change has occurred?
### Risks and Mitigations [optional]
This section is needed.
Signed-off-by: melserngawy <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@sudomakeinstall2 @sabbir-47 @vitus133 would you review the enhancement?
@serngawy: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
This enhancement is certainly good and necessary but there are a lot of API changes. Do we have a plan for how to go about implementing these changes?
policy3-site: compliant
state: complete
clusters:
- name: spoke2
@jc-rh In the original story it was proposed to keep `Enforce start time` and `Enforce complete time` for each cluster? Do we want them to be added here?
Those can be added in the future.
##### 2- batches

The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The new proposed CGU status does not require that, as all clusters with indices are stored under the clusters list field.
The iteration of the batches can be determined by the maxConcurrency number defined under the remediationStrategy and the selected clusters list.
I am confused by this. I can't figure out how we determine the current batch number. Can we have more explanation of what the selected clusters list is?
The idea is, for 10 clusters and maxConcurrency equal to 4, the first 4 clusters (1-4) in the clusters list will be the first batch, then the next 4 clusters (5-8) will be the next batch, and so on. Will clarify it more.
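A small Go sketch of that batching rule, assuming the clusters keep the order they have in the status list (the helper names are made up for illustration):

```go
package main

import "fmt"

// batchFor returns the batch a cluster falls into under the rule described
// above: with maxConcurrency = 4, indices 0-3 form batch 0, 4-7 form batch 1, etc.
func batchFor(clusterIndex, maxConcurrency int) int {
	return clusterIndex / maxConcurrency
}

// batches splits the ordered selected-clusters list into consecutive batches
// of at most maxConcurrency clusters each.
func batches(clusters []string, maxConcurrency int) [][]string {
	var out [][]string
	for start := 0; start < len(clusters); start += maxConcurrency {
		end := start + maxConcurrency
		if end > len(clusters) {
			end = len(clusters)
		}
		out = append(out, clusters[start:end])
	}
	return out
}

func main() {
	clusters := []string{"spoke1", "spoke2", "spoke3", "spoke4", "spoke5",
		"spoke6", "spoke7", "spoke8", "spoke9", "spoke10"}
	fmt.Println(batches(clusters, 4)) // [[spoke1..spoke4] [spoke5..spoke8] [spoke9 spoke10]]
	fmt.Println(batchFor(6, 4))       // spoke7 (index 6) is in batch 1
}
```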
That's fair, but my question is how do we know which one is active? The current batch number?
It's based on the cluster state: if it is inProgress the batch is still active; when it is failed/timeout we move to the next one.
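Sketching that rule in Go, assuming the state names used elsewhere in this proposal (complete, failed, timeout, inProgress) plus a hypothetical pending state for clusters not yet started:

```go
package main

import "fmt"

// terminal reports whether a cluster state means its batch no longer needs to
// wait on it (complete, failed, or timeout in the proposal above).
func terminal(state string) bool {
	return state == "complete" || state == "failed" || state == "timeout"
}

// activeBatchIndex applies the rule described above: the active batch is the
// first one that still contains a cluster in a non-terminal state (pending or
// inProgress); if every cluster is terminal, the CGU is done and -1 is returned.
func activeBatchIndex(batches [][]string, stateOf map[string]string) int {
	for i, batch := range batches {
		for _, cluster := range batch {
			if !terminal(stateOf[cluster]) {
				return i
			}
		}
	}
	return -1
}

func main() {
	batches := [][]string{{"spoke1", "spoke2"}, {"spoke3", "spoke4"}}
	states := map[string]string{
		"spoke1": "complete", "spoke2": "failed",
		"spoke3": "inProgress", "spoke4": "pending",
	}
	fmt.Println(activeBatchIndex(batches, states)) // 1
}
```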
The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The new proposed CGU status does not require that, as all clusters with indices are stored under the clusters list field.
The iteration of the batches can be determined by the maxConcurrency number defined under the remediationStrategy and the selected clusters list.

#### Deprecate backup & precache status fields
This belongs to the other enhancement. Better to have two separate PRs.
- **notApplied**: the policy has not been applied to enforce remediation.
- **nonCompliant**: the policy was applied to enforce remediation but has not become compliant.
- **compliant**: the policy was applied to enforce remediation and is compliant.
- **timeout**: the policy was applied to enforce remediation but did not become compliant within the timeout limits defined in the remediationStrategy.
isn't this the same thing as nonCompliant when the cluster status is failed/timedout?
nonCompliant means it is still trying to enforce the policy, while timeout means TALM stopped trying to enforce the policy.
The cluster state has 3 possible states:
- **complete**: if all the policies have a compliant state on the cluster.
- **inProgress**: if the policies are in compliant/nonCompliant or notApplied state.
- **failed**: if at least 1 policy has a timeout state.
I prefer timedout
Okay, we can make it timeout. Do we have any other failure reasons?
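Putting the derivation from the quoted list above into a small Go sketch, using the per-policy states it names (the function name is made up for illustration):

```go
package main

import "fmt"

// clusterState derives the cluster state as described above: complete if every
// policy is compliant, failed if at least one policy timed out, otherwise
// inProgress. State names follow the quoted proposal.
func clusterState(policyStates map[string]string) string {
	allCompliant := true
	for _, s := range policyStates {
		switch s {
		case "timeout":
			return "failed"
		case "compliant":
			// keep checking the remaining policies
		default: // notApplied or nonCompliant
			allCompliant = false
		}
	}
	if allCompliant {
		return "complete"
	}
	return "inProgress"
}

func main() {
	fmt.Println(clusterState(map[string]string{"policy1-common": "compliant", "policy2-group": "nonCompliant"})) // inProgress
	fmt.Println(clusterState(map[string]string{"policy1-common": "compliant", "policy2-group": "compliant"}))    // complete
	fmt.Println(clusterState(map[string]string{"policy1-common": "timeout", "policy2-group": "notApplied"}))     // failed
}
```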
##### 2- batches

The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The new proposed CGU status does not require that, as all clusters with indices are stored under the clusters list field.
I don't think it makes sense to force the controller to go through the whole list of clusters on each reconcile just to figure out which ones it should be remediating. Plus, the current batch status field is more for internal use, therefore the requirements are different. Mixing user-oriented requirements and the internal requirements (e.g. policy index) in the same field would be difficult to implement, especially for API compatibility reasons.
I don't think there is an issue with backward compatibility; we still do batching. Can you elaborate more on why it's difficult to implement?
Signed-off-by: melserngawy [email protected]