Cluster Group Upgrade and Managed Clusters status enhancement #5

Open · wants to merge 1 commit into base: main

Conversation


@serngawy commented Jul 6, 2022

Signed-off-by: melserngawy [email protected]


serngawy commented Jul 6, 2022

Adding @jc-rh @sabbir-47 @ijolliffe @vitus133, please give it a review.

- **notApplied**: the policy has not yet been applied to enforce remediation.
- **nonCompliant**: the policy has been applied to enforce remediation but has not yet become compliant.
- **compliant**: the policy has been applied to enforce remediation and is compliant.
- **timeout**: the policy has been applied to enforce remediation but did not become compliant within the timeout defined in the remediationStrategy.
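
For illustration, a minimal sketch of how these four states could be modeled as string constants in the controller's Go API types; the constant names and package are assumptions, not part of the proposal:

```go
package v1alpha1 // hypothetical API package, for illustration only

// Per-policy remediation states as listed above.
const (
	PolicyStateNotApplied   = "notApplied"   // not yet applied to enforce remediation
	PolicyStateNonCompliant = "nonCompliant" // applied, but not yet compliant
	PolicyStateCompliant    = "compliant"    // applied and compliant
	PolicyStateTimeout      = "timeout"      // applied, but did not become compliant within the remediationStrategy timeout
)
```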
Member:
Timeout value in the spec is for the whole CGU. Not sure if we need this on a per cluster basis.

Author:

Not sure if it is for the whole CGU, but if it is, we need to change that. The timeout defined in the CGU gets applied to each managed cluster.

Member:

I don't think we can do that. The overall timeout is a first-order feature so the operator knows the work will fit in the maintenance window. We can debate more on whether a per-cluster or per-batch timeout should be introduced. My initial thought is that it would become hard to manage alongside the overall timeout.


## Motivation

1. Currently, the CGU status does not provide state per managed cluster. If applying the ACM policies fails, the end user cannot identify which managed cluster failed. This enhancement proposes a change to the CGU status to include policy state per managed cluster.
Contributor:

Maybe clarify here. The user is currently able to identify which policies are non-compliant for a cluster. This enhancement, I believe, is making the failure(s) more visible in the CGU CR.

Author:

I mean it is not visible through the CGU status; you need to check the policies under the cluster namespace, and from there you can identify which policy is nonCompliant. I will clarify this more.


1. As an end user, I would like to enforce a set of ACM policies on a group of clusters and be able to track the policies' state (compliant/nonCompliant) per cluster.

1. As an end user, I would like to upgrade a group of clusters and be able to track the upgrade process state per cluster.
Contributor:

Can you expand on this story? I believe that currently the user can track the upgrade status using the ClusterVersion CR on the managed cluster.

Author:

Same; it is not through the CGU status. As a user, I would have to have access to the spoke cluster in order to check the ClusterVersion.

Contributor:

I agree with @serngawy on this point: if a user is upgrading a set of clusters through CGU, CGU should pull the status of those clusters and publish it through its own API.


1. Enhance the CGU status to present the state of ACM policies per managed cluster.

1. Provide a managed clusters upgrade API to present the upgrade process state per managed cluster.
Contributor:

Why is a new API needed to present the upgrade state/progress? The first goal describes adding per cluster status to the CGU. Can upgrade state/progress not be included in that?

Author:

It is not possible to include that in the same CGU status. Plus, as I mentioned, using policy does not give the ability to have a progress state.

Contributor:

Why does policy not allow for a progress state? The TALM controller, when it sees that it is managing a ClusterVersion Policy (it already detects subscriptions and has special handling), can run specific logic to pull the upgrade status from the ClusterVersion CR on a periodic basis.

Contributor:

Having two different controllers / APIs is confusing to me, and might confuse our users. It could have been better if it was an isolated change, but it also moves stuff from CGU to MCU, giving our users grief. Can we try and design the CGU API of our dreams first, and then discuss which underlying technology we use to fulfill it?
For example, we can extend the existing API in a non-disruptive way with detailed-status:

```yaml
- name: spoke3
  policies:
    policy1-common: compliant
    policy2-group: in progress
    policy3-site: notApplied
  state: failed
  detailed-status:
    backup:
      message: succeeded in 0 minutes 12 seconds
    pre-caching:
      message: succeeded in 40 minutes 33 seconds
      per-cluster-spec: {if we ever decide to implement it}
      images-pre-cached: 502 of 502
      size-on-disk: 12.6 GiB
      free-space-percent: 32
    policies:
    - policy1-common:
      message: compliant
    - policy2-group:
      message: waiting object performance.openshift.io/v2/performanceprofiles to comply, node NotReady
```

We can implement it using one underlying technology today and switch to another one tomorrow if we find it worthwhile.

Contributor:

I agree with @vitus133; it looks cleaner to keep it in the same place, i.e. in the CGU.

Author:

My concern about having all the spec together in the same CR is that the status will differ based on the process and might become very long.

```yaml
policy1-common: notApplied
policy2-group: notApplied
policy3-site: notApplied
state: nonCompliant
```
Contributor:

It seems like this would be a really good place to have an indication of whether the cluster has been processed in a batch (eg "complete") or is still "pending". That way the user can see which clusters are done (even if failed) and which have not yet been run.

Author:

Makes sense; maybe we can make the states (pending, nonCompliant, compliant).

Member (@jc-rh, Jul 13, 2022):

Since we will still have the current batch status field with the list of clusters, I feel we don't need to repeat it here. And a cluster is compliant if all policies are compliant. Do we really need a separate cluster status field?

```yaml
state: InProgress
```

This enhancement proposes to change the CGU status API to be as in the example below (cgu-upgrade-new). The CGU status contains lists of conditions, selected clusters, and canary clusters if defined.
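
For orientation only, a rough Go sketch of the shape that description implies, using just the fields visible in the snippets quoted in this review (cluster name, per-policy states, aggregated state); the type and field names are illustrative assumptions, not the proposal's actual API:

```go
package v1alpha1 // hypothetical API package, for illustration only

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterState is one entry of the proposed per-cluster status list.
type ClusterState struct {
	// Name of the managed cluster, e.g. "spoke2".
	Name string `json:"name"`
	// Policies maps each managed policy to its per-cluster state,
	// e.g. "policy1-common": "compliant".
	Policies map[string]string `json:"policies,omitempty"`
	// State is the aggregated state of the cluster, e.g. "complete" or "inProgress".
	State string `json:"state,omitempty"`
}

// ClusterGroupUpgradeStatus sketches the proposed status block: conditions,
// the selected clusters, and canary clusters if defined.
type ClusterGroupUpgradeStatus struct {
	Conditions []metav1.Condition `json:"conditions,omitempty"`
	Clusters   []ClusterState     `json:"clusters,omitempty"`
	Canaries   []string           `json:"canaries,omitempty"`
}
```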
Contributor:

I like the layout/format of the new status much better. It is much clearer to see what state everything is in. On the other hand, my understanding of the initial implementation is that the controller needed the information in the status to maintain state between reconcile loops. @jc-rh will the status content below be sufficient for the controller?

Member:

I would love to evolve the existing field to include what is being proposed here. However, our hands are tied by api compatibility. So we will probably keep this existing current batch status as is and it would be more for internal use. Data added by this proposal can be in brand new fields.

Comment on lines 142 to 143
- **nonCompliant**: the policy has been applied to enforce remediation but has not yet become compliant.
- **compliant**: the policy has been applied to enforce remediation and is compliant.
Contributor:

Maybe we should consider using something other than compliant/non-compliant. I am concerned that a user could be confused into thinking that this field tracks the "real time" compliance state of the cluster for that policy. If I'm understanding correctly this field is simply telling us whether the policy went compliant during the CGU handling for its batch. Maybe a state of [notApplied | success | timeout]?

Author:

Well, yes; the idea is that it tracks the real-time policy state on the spoke cluster until it becomes compliant or times out.

Contributor:

Maybe add inProgress to the list to track the running state? So [notApplied | inProgress | success | timedout]?

Member:

I like "inProgress". We already have [notStarted | inProgress | completed] in the current batch status field. I am hesitant about introducing timedout on a per policy basis though. In theory, even after the batch timeout (which also varies), the policy can still become compliant.

Comment on lines +212 to +66
As explained in the motivation section above, ACM policy cannot properly report cluster upgrade state, and the CGU CR does not have a declarative API definition for driving cluster upgrades and exposing their state.
The ManagedClustersUpgrade (MCU) CR provides a declarative definition for the cluster upgrade API as well as the upgrade state per managed cluster.
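
A minimal Go sketch of how such a per-cluster upgrade status could be modeled, assuming the field names shown in the snippet quoted further down (clusterID, clusterUpgradeStatus with message/state/verified); these are illustrative assumptions, not the MCU API:

```go
package v1alpha1 // hypothetical API package, for illustration only

// UpgradeStatus reports the upgrade progress of a single managed cluster.
// The field names follow the status snippet quoted later in this review;
// the state value shown is an example, not an exhaustive list.
type UpgradeStatus struct {
	State    string `json:"state,omitempty"` // e.g. "complete"
	Message  string `json:"message,omitempty"`
	Verified bool   `json:"verified,omitempty"`
}

// ClusterUpgrade ties the upgrade status to a specific managed cluster.
type ClusterUpgrade struct {
	ClusterID            string        `json:"clusterID,omitempty"`
	ClusterUpgradeStatus UpgradeStatus `json:"clusterUpgradeStatus,omitempty"`
}

// ManagedClustersUpgradeStatus would be the status block of the proposed MCU CR.
type ManagedClustersUpgradeStatus struct {
	Clusters []ClusterUpgrade `json:"clusters,omitempty"`
}
```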
Contributor:

The goals section talks about providing a per-cluster upgrade progress status. This introduction talks about that status but also about a new API to initiate upgrades. These feel like two different enhancements. Based on the prior section (per-cluster status), can the upgrade status be added to the proposed per-cluster CGU status without having to create a new (MCU) API?

Author:

It is not possible, Ian. The issue is that CGU only accepts policies to do actions, and a policy cannot give a progress state; it is either compliant or nonCompliant, like a binary state.

Member:

Wouldn't a new API make it very hard to satisfy the other requirement where we want to have upgrade and non-upgrade changes in one sequence? I would rather implement what Ian suggested above for now and put it into the violation message field, along with policy status. Meanwhile, we need to push for this ACM policy enhancement (https://issues.redhat.com/browse/ACM-1413) hard. Once it's implemented, we can get the info we need in a generic way.

Comment on lines 291 to 152
```yaml
- clusterID: 84a94c75-08b6-4dfe-9138-654d94acc87
  clusterUpgradeStatus:
    message: Cluster version is 4.10.9
    state: complete
    verified: true
```
Contributor:

Can this section be added to the proposed CGU per-cluster status field?

@ijolliffe left a comment:

A great start - thanks for putting this together

## Summary

Cluster Group Upgrade (CGU) and Managed Cluster Upgrade (MCU) provide aggregator APIs for managed cluster upgrades and configuration changes. CGU uses ACM policy APIs to apply cluster configuration changes and present the status of the policies per managed cluster. MCU uses ACM ManifestWork APIs to upgrade OCP clusters and present the status of the upgrade process per managed cluster.

Comment:

Is the purpose of this enhancement to better leverage and expose existing status messages from elsewhere? The second sentence mentions an API change; could you expand on it? Will the APIs change, or have they changed?

Perhaps also highlight that CGU exists and MCU is proposed?

### User Stories

1. As an end user, I would like to enforce a set of ACM policies on a group of clusters and be able to track the policies' state (compliant/nonCompliant) per cluster.

Comment:

Is the user story also to expand the granularity of the status visible beyond compliant/non-compliant?

Member:

It's probably a good idea to have a placeholder in the data structure for the violation message when a policy is non-compliant.



### Non-Goals

Comment:

These should be focused on things we explicitly want to exclude from scope. It seems these are defining behaviours. Food for thought.

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

Comment:

This section is needed; implications for upgrades need to be factored into the design.

Member:

Based on our discussion today, we think this should detail the user experience if they're running 4.11 on 4.12

If I previously did an upgrade from 4.11 to 4.12 using the existing mechanisms, how would my behaviors differ if I'm going from 4.12 to the release where this change has occurred?



### Risks and Mitigations [optional]

Comment:

This section is needed.


openshift-ci bot commented Aug 15, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign imiller0 for approval by writing /assign @imiller0 in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


serngawy commented Aug 15, 2022

@sudomakeinstall2 @sabbir-47 @vitus133 would you review the enhancement?


openshift-ci bot commented Aug 15, 2022

@serngawy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | d04615b | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sudomakeinstall2 left a comment:

This enhancement is certainly good and necessary but there are a lot of API changes. Do we have a plan for how to go about implementing these changes?

```yaml
policy3-site: compliant
state: complete
clusters:
- name: spoke2
```

Comment:

@jc-rh In the original story it was proposed to keep the enforce start time and enforce complete time for each cluster. Do we want them to be added here?

Member:

Those can be added in the future.

##### 2- batches

The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The newly proposed CGU status does not require that, as all clusters with their indices are stored under the clusters list field.
The iteration of the batches can be determined by the maxConcurrency number defined under the remediationStrategy and the selected clusters list.

Comment:

I am confused by this. I can't figure out how we determine the current batch number. Can we have more explanation of what the selected clusters list is?

Author:

The idea is that for 10 clusters and maxConcurrency equal to 4, the first 4 clusters (1-4) in the cluster list are the first batch, then the next 4 clusters (5-8) are the next batch, and so on. Will clarify it more.
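
A minimal sketch of that slicing, as a hypothetical helper (not code from TALM or this proposal):

```go
package talm // hypothetical package, for illustration only

// batches splits the selected clusters into batches of at most maxConcurrency
// clusters, preserving the order of the clusters list. With 10 clusters and
// maxConcurrency = 4 this yields batches of 4, 4, and 2 clusters.
func batches(clusters []string, maxConcurrency int) [][]string {
	if maxConcurrency < 1 {
		maxConcurrency = 1
	}
	var out [][]string
	for start := 0; start < len(clusters); start += maxConcurrency {
		end := start + maxConcurrency
		if end > len(clusters) {
			end = len(clusters)
		}
		out = append(out, clusters[start:end])
	}
	return out
}
```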

Comment:

That's fair, but my question is how do we know which one is active? The current batch number?

Author:

It is based on the cluster state: if it is inProgress then the batch is still active; when it is failed/timeout we move to the next one.
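
A sketch of that check under the same assumptions (hypothetical helper; the state names follow the discussion above):

```go
package talm // hypothetical package, for illustration only

// currentBatch returns the index of the first batch that still contains a
// cluster in a non-terminal state; if every batch is finished it returns
// len(batches), i.e. nothing is left to remediate.
func currentBatch(batches [][]string, stateOf func(cluster string) string) int {
	for i, batch := range batches {
		for _, cluster := range batch {
			switch stateOf(cluster) {
			case "complete", "failed", "timeout":
				// terminal states do not keep the batch active
			default:
				return i // e.g. "inProgress" or "pending" keeps this batch active
			}
		}
	}
	return len(batches)
}
```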

The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The newly proposed CGU status does not require that, as all clusters with their indices are stored under the clusters list field.
The iteration of the batches can be determined by the maxConcurrency number defined under the remediationStrategy and the selected clusters list.

#### Deprecate backup & precache status fields
Member:

This belongs to the other enhancement. Better to have two separate PRs.

- **notApplied**: the policy has not yet been applied to enforce remediation.
- **nonCompliant**: the policy has been applied to enforce remediation but has not yet become compliant.
- **compliant**: the policy has been applied to enforce remediation and is compliant.
- **timeout**: the policy has been applied to enforce remediation but did not become compliant within the timeout defined in the remediationStrategy.
Member:

Isn't this the same thing as nonCompliant when the cluster status is failed/timedout?

Author:

nonCompliant means TALM is still trying to enforce the policy, while timeout means TALM has stopped trying to enforce the policy.

The cluster state has 3 possible states:
- **complete**: all the policies have a compliant state on the cluster.
- **inProgress**: the policies are in a mix of compliant, nonCompliant, or notApplied states.
- **failed**: at least one policy has the timeout state.
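
A minimal sketch of that aggregation rule (hypothetical helper; state strings follow the lists above):

```go
package talm // hypothetical package, for illustration only

// clusterState derives the per-cluster state from its per-policy states:
// failed if any policy timed out, complete if every policy is compliant,
// otherwise inProgress.
func clusterState(policyStates []string) string {
	allCompliant := len(policyStates) > 0
	for _, s := range policyStates {
		switch s {
		case "timeout":
			return "failed"
		case "compliant":
			// keeps allCompliant as-is
		default: // "nonCompliant" or "notApplied"
			allCompliant = false
		}
	}
	if allCompliant {
		return "complete"
	}
	return "inProgress"
}
```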
Member:

I prefer timedout

Author:

Okay, we can make it timeout. Will we have any other failure reason?


##### 2- batches

The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The newly proposed CGU status does not require that, as all clusters with their indices are stored under the clusters list field.
Member:

I don't think it makes sense to force the controller to go through the whole list of clusters on each reconcile, just to figure out which ones it should be remediating. Plus, the current batch status field is more for internal use, therefore the requirements are different. Mixing user oriented requirements and the internal requirements (e.g. policy index) in the same field would be difficult to implement, especially for api compatibility reasons.

Author:

I don't think there is an issue with backward compatibility; we still do batching. Can you elaborate more on why it is difficult to implement?
