Cluster Group Upgrade and Managed Clusters status enhancement #5
base: main
Conversation
Adding @jc-rh @sabbir-47 @ijolliffe @vitus133
- **notApplied**: the policy has not been applied to enforce remediation.
- **nonCompliant**: the policy was applied to enforce remediation but has not become compliant.
- **compliant**: the policy was applied to enforce remediation and is compliant.
- **timeout**: the policy was applied to enforce remediation but did not become compliant within the timeout limits defined in the remediationStrategy.
Timeout value in the spec is for the whole CGU. Not sure if we need this on a per cluster basis.
Not sure if it is for the whole CGU, but if it is, we need to change that. The timeout defined in the CGU gets applied to each managed cluster.
I don't think we can do that. The overall timeout is the first-order feature so the operator knows the CGU would fit in the window. We can debate more on whether a per-cluster or per-batch timeout should be introduced. My initial thought is that it would become hard to manage alongside the overall timeout.
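To make that trade-off concrete, here is a minimal Go sketch of one way a per-batch budget could be derived from the single CGU-level timeout. This is purely illustrative of the idea being debated, not how TALM is implemented today, and the helper name is made up:

```go
package main

import (
	"fmt"
	"time"
)

// batchTimeout is a hypothetical helper: it splits whatever time is left of the
// overall CGU timeout evenly across the batches that have not run yet, so the
// operator-facing guarantee stays "the whole CGU fits in the window".
func batchTimeout(overall, elapsed time.Duration, remainingBatches int) time.Duration {
	if remainingBatches <= 0 {
		return 0
	}
	remaining := overall - elapsed
	if remaining <= 0 {
		return 0
	}
	return remaining / time.Duration(remainingBatches)
}

func main() {
	// 240 minutes for the whole CGU, 50 minutes already spent, 3 batches left.
	fmt.Println(batchTimeout(240*time.Minute, 50*time.Minute, 3)) // 63m20s
}
```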
## Motivation

1. Currently, the CGU status does not provide state per managed cluster. In case of failure to apply the ACM policies, the end user cannot identify which managed cluster failed. This enhancement proposes a change to the CGU status to include policy state per managed cluster.
Maybe clarify here. The user is currently able to identify which policies are non-compliant for a cluster. This enhancement, I believe, is making the failure(s) more visible in the CGU CR.
I mean it's not visible through the CGU status; you need to check the policies under the cluster NS and from there you can identify which policy is nonCompliant. I will clarify this more.
1. As an end user, I would like to enforce a set of ACM policies to a group of clusters and be able to track the policies' state, compliant/nonCompliant, per cluster.

1. As an end user, I would like to upgrade a group of clusters and be able to track the upgrade process state per cluster.
Can you expand on this story. I believe that currently the user can track the upgrade status using the ClusterVersion CR on the managed cluster.
Same, it's not through the CGU status. As a user I would have to have access to the spoke cluster in order to check the ClusterVersion.
I agree on this point with @serngawy, if a user is upgrading a set of clusters through CGU, CGU should pull the status of those clusters and publish it through its own API.
1. Enhance the CGU status to present the state of ACM policies per managed cluster.

1. Provide a managed clusters upgrade API to present the upgrade process state per managed cluster.
Why is a new API needed to present the upgrade state/progress? The first goal describes adding per cluster status to the CGU. Can upgrade state/progress not be included in that?
It's not possible to include that in the same CGU status. Plus, as I mentioned, using policy does not give the ability to have a progress state.
Why does policy not allow for a progress state? The TALM controller, when it sees that it is managing a ClusterVersion Policy (it already detects subscriptions and has special handling), can run specific logic to pull the upgrade status from the ClusterVersion CR on a periodic basis.
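As a rough illustration of that idea, the sketch below summarizes an upgrade state from the fields the controller would read off the spoke's ClusterVersion CR. The types are simplified local stand-ins, not the real openshift/api structs, and the state names are only placeholders:

```go
package main

import "fmt"

// Simplified stand-ins for the fields the controller would read from the
// spoke's ClusterVersion CR (update history and the progressing message).
type updateHistory struct {
	Version  string
	State    string // "Completed" or "Partial"
	Verified bool
}

type clusterVersionView struct {
	ProgressingMessage string
	History            []updateHistory // History[0] is the most recent update
}

// upgradeState turns the ClusterVersion view into per-cluster upgrade status
// fields of the kind proposed in this enhancement (message, state, verified).
func upgradeState(cv clusterVersionView) (message, state string, verified bool) {
	if len(cv.History) == 0 {
		return "no update history", "notStarted", false
	}
	latest := cv.History[0]
	if latest.State == "Completed" {
		return fmt.Sprintf("Cluster version is %s", latest.Version), "complete", latest.Verified
	}
	return cv.ProgressingMessage, "inProgress", latest.Verified
}

func main() {
	cv := clusterVersionView{
		History: []updateHistory{{Version: "4.10.9", State: "Completed", Verified: true}},
	}
	msg, state, verified := upgradeState(cv)
	fmt.Println(msg, state, verified) // Cluster version is 4.10.9 complete true
}
```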
Having two different controllers / APIs is confusing to me, and might confuse our users. It could have been better if it was an isolated change, but it also moves stuff from CGU to MCU, giving our users grief. Can we try and design the CGU API of our dreams first, and then discuss which underlying technology we use to fulfill it?
For example, we can extend the existing API in a non-disruptive way with a detailed-status field:
- name: spoke3
  policies:
    policy1-common: compliant
    policy2-group: in progress
    policy3-site: notApplied
  state: failed
  detailed-status:
    backup:
      message: succeeded in 0 minutes 12 seconds
    pre-caching:
      message: succeeded in 40 minutes 33 seconds
      per-cluster-spec: {if we ever decide to implement it}
      images-pre-cached: 502 of 502
      size-on-disk: 12.6 GiB
      free-space-percent: 32
    policies:
    - policy1-common:
        message: compliant
    - policy2-group:
        message: waiting object performance.openshift.io/v2/performanceprofiles to comply, node NotReady
We can implement it using one underlying technology today and switch to another one tomorrow if we find it worthwhile.
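To make the shape of that extension easier to discuss, here is a hedged sketch of what the detailed-status field could look like as Go API types. Every type and field name below is a placeholder for discussion, not an agreed API:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterDetailedStatus is a sketch of the per-cluster "detailed-status"
// extension proposed above. All names are placeholders.
type ClusterDetailedStatus struct {
	// Backup reports the backup step result for this cluster, if backup is enabled.
	Backup *StepStatus `json:"backup,omitempty"`
	// PreCaching reports the pre-caching step result for this cluster, if enabled.
	PreCaching *PreCachingStatus `json:"preCaching,omitempty"`
	// Policies lists per-policy messages, e.g. the violation message when nonCompliant.
	Policies []PolicyDetail `json:"policies,omitempty"`
}

// StepStatus carries a human-readable message and completion time for a step.
type StepStatus struct {
	Message        string       `json:"message,omitempty"`
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
}

// PreCachingStatus extends StepStatus with the pre-caching details from the example.
type PreCachingStatus struct {
	StepStatus       `json:",inline"`
	ImagesPreCached  string `json:"imagesPreCached,omitempty"`  // e.g. "502 of 502"
	SizeOnDisk       string `json:"sizeOnDisk,omitempty"`       // e.g. "12.6 GiB"
	FreeSpacePercent int    `json:"freeSpacePercent,omitempty"` // e.g. 32
}

// PolicyDetail holds a per-policy message, e.g. the current violation text.
type PolicyDetail struct {
	Name    string `json:"name"`
	Message string `json:"message,omitempty"`
}
```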
I agree with @vitus133, it looks cleaner to have it in the same place, i.e. in the CGU.
My concern about having the whole spec together in the same CR is that the status will differ based on the process and it might get very long.
policy1-common: notApplied
policy2-group: notApplied
policy3-site: notApplied
state: nonCompliant
It seems like this would be a really good place to have an indication of whether the cluster has been processed in a batch (eg "complete") or is still "pending". That way the user can see which clusters are done (even if failed) and which have not yet been run.
Makes sense, maybe we can make the states (pending, nonCompliant, compliant).
Since we will still have the current batch status field with the list of clusters, I feel we don't need to repeat it here. And a cluster is compliant if all policies are compliant. Do we really need a separate cluster status field?
state: InProgress
```
This enhancement proposes to change the CGU status API to the example below (cgu-upgrade-new). The CGU status contains lists of conditions, selected clusters, and canary clusters if defined.
I like the layout/format of the new status much better. It is much clearer to see what state everything is in. On the other hand, my understanding of the initial implementation is that the controller needed the information in the status to maintain state between reconcile loops. @jc-rh will the status content below be sufficient for the controller?
I would love to evolve the existing field to include what is being proposed here. However, our hands are tied by API compatibility. So we will probably keep the existing current batch status as is, and it would be more for internal use. Data added by this proposal can be in brand new fields.
- **nonCompliant**: the policy was applied to enforce remediation but has not become compliant.
- **compliant**: the policy was applied to enforce remediation and is compliant.
Maybe we should consider using something other than compliant/non-compliant. I am concerned that a user could be confused into thinking that this field tracks the "real time" compliance state of the cluster for that policy. If I'm understanding correctly this field is simply telling us whether the policy went compliant during the CGU handling for its batch. Maybe a state of [notApplied | success | timeout]?
Well yes, the idea is that it tracks the real-time policy state on the spoke cluster until it becomes compliant or times out.
Maybe add inProgress to the list to track the running state? So [notApplied | inProgress | success | timedout]?
I like "inProgress". We already have [notStarted | inProgress | completed] in the current batch status field. I am hesitant about introducing timedout on a per policy basis though. In theory, even after the batch timeout (which also varies), the policy can still become compliant.
As explained in the motivation section above, ACM policy cannot properly report the cluster upgrade state, and the CGU CR does not have a declarative API definition for the clusters upgrade state.
The ManagedClustersUpgrade CR (MCU) provides a declarative definition for the clusters upgrade API as well as the upgrade state per managed cluster.
The goals section talks about providing a per-cluster upgrade progress status. This introduction talks about that status but also a new API to initiate upgrades. These feel like two different enhancements. Based on the prior section (per cluster status) can the upgrade status be added to the proposed per-cluster CGU status without having to create a new (MCU) API.
It's not possible, Ian. The issue is that CGU only accepts policy to do actions, and policy cannot give a progress state; it's either compliant or nonCompliant, like a binary state.
Wouldn't a new API make it very hard to satisfy the other requirement where we want to have upgrade and non-upgrade changes in one sequence? I would rather implement what Ian suggested above for now and put it into the violation message field, along with policy status. Meanwhile, we need to push for this ACM policy enhancement (https://issues.redhat.com/browse/ACM-1413) hard. Once it's implemented, we can get the info we need in a generic way.
- clusterID: 84a94c75-08b6-4dfe-9138-654d94acc87
  clusterUpgradeStatus:
    message: Cluster version is 4.10.9
    state: complete
    verified: true
Can this section be added to the proposed CGU per-cluster status field?
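For discussion, a small sketch of how the block above could be expressed as a Go status type, whether it stays in the MCU or is folded into the per-cluster CGU status as asked here; field and state names are placeholders:

```go
package v1alpha1

// ClusterUpgradeStatus sketches the per-cluster upgrade fields shown above.
// Field names are placeholders for discussion, not an agreed API.
type ClusterUpgradeStatus struct {
	// Message mirrors the ClusterVersion progress/completion message,
	// e.g. "Cluster version is 4.10.9".
	Message string `json:"message,omitempty"`
	// State is one of notStarted, inProgress, complete, failed (assumed set).
	State string `json:"state,omitempty"`
	// Verified reports whether the release image signature was verified.
	Verified bool `json:"verified,omitempty"`
}
```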
A great start - thanks for putting this together
## Summary

Cluster Group Upgrade (CGU) and Managed Cluster Upgrade (MCU) provide aggregator APIs for managed cluster upgrades and configuration changes. CGU uses ACM policy APIs to apply cluster configuration changes and presents the status of the policies per managed cluster. MCU uses ACM ManifestWork APIs to upgrade OCP clusters and presents the status of the upgrade process per managed cluster.
Is the purpose of this enhancement to better leverage and expose existing status messages from elsewhere? The second sentence mentions an API change - could you expand on this sentence - will the APIs change or have they changed?
Perhaps also highlight that CGU exists and MCU is proposed?
### User Stories

1. As an end user, I would like to enforce a set of ACM policies to a group of clusters and be able to track the policies' state, compliant/nonCompliant, per cluster.
Is the user story to also expand the granularity of the status visible beyond compliant/non-compliant?
it's probably a good idea to have a placeholder in the data structure for the violation message when policy is non-compliant
### Non-Goals
These should be focused on things we explicitly want to exclude from scope. It seems these are defining behaviours. Food for thought.
#### Removing a deprecated feature

### Upgrade / Downgrade Strategy
This section is needed - implications on upgrades need to be factored into the design
Based on our discussion today, we think this should detail the user experience if they're running 4.11 on 4.12
If I previously did an upgrade from 4.11 to 4.12 using the existing mechanisms, how would my behaviors differ if I'm going from 4.12 to the release where this change has occurred?
### Risks and Mitigations [optional]
This section is needed.
Signed-off-by: melserngawy <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@sudomakeinstall2 @sabbir-47 @vitus133 would you review the enhancement?
@serngawy: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
This enhancement is certainly good and necessary but there are a lot of API changes. Do we have a plan for how to go about implementing these changes?
policy3-site: compliant
state: complete
clusters:
- name: spoke2
@jc-rh In the original story it was proposed to keep `Enforce start time` and `Enforce complete time` for each cluster? Do we want them to be added here?
Those can be added in the future.
##### 2- batches

The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The new proposed CGU status does not require that, as all clusters with indices are stored under the clusters list field.
The iteration of the batches can be determined by the maxConcurrency number defined under the remediationStrategy and the selected clusters list.
I am confused by this. I can't figure out how we determine the current batch number. Can we have more explanation of what the selected clusters list is?
The idea is, for 10 clusters and maxConcurrency equal to 4, the first 4 clusters (1-4) in the clusters list will be the first batch, then the next 4 clusters (5-8) will be the next batch, and so on. Will clarify it more.
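A small Go sketch of that batching rule, assuming the clusters keep the order they have in the status list (the helper names are made up for illustration):

```go
package main

import "fmt"

// batchFor returns the batch a cluster falls into under the rule described
// above: with maxConcurrency = 4, indices 0-3 form batch 0, 4-7 form batch 1, etc.
func batchFor(clusterIndex, maxConcurrency int) int {
	return clusterIndex / maxConcurrency
}

// batches splits the ordered selected-clusters list into consecutive batches
// of at most maxConcurrency clusters each.
func batches(clusters []string, maxConcurrency int) [][]string {
	var out [][]string
	for start := 0; start < len(clusters); start += maxConcurrency {
		end := start + maxConcurrency
		if end > len(clusters) {
			end = len(clusters)
		}
		out = append(out, clusters[start:end])
	}
	return out
}

func main() {
	clusters := []string{"spoke1", "spoke2", "spoke3", "spoke4", "spoke5",
		"spoke6", "spoke7", "spoke8", "spoke9", "spoke10"}
	fmt.Println(batches(clusters, 4)) // [[spoke1..spoke4] [spoke5..spoke8] [spoke9 spoke10]]
	fmt.Println(batchFor(6, 4))       // spoke7 (index 6) is in batch 1
}
```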
That's fair, but my question is how do we know which one is active? The current batch number?
It's based on the cluster state: if it is inProgress the batch is still active; when it is failed/timeout we move to the next one.
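Sketching that rule in Go, assuming the state names used elsewhere in this proposal (complete, failed, timeout, inProgress) plus a hypothetical pending state for clusters not yet started:

```go
package main

import "fmt"

// terminal reports whether a cluster state means its batch no longer needs to
// wait on it (complete, failed, or timeout in the proposal above).
func terminal(state string) bool {
	return state == "complete" || state == "failed" || state == "timeout"
}

// activeBatchIndex applies the rule described above: the active batch is the
// first one that still contains a cluster in a non-terminal state (pending or
// inProgress); if every cluster is terminal, the CGU is done and -1 is returned.
func activeBatchIndex(batches [][]string, stateOf map[string]string) int {
	for i, batch := range batches {
		for _, cluster := range batch {
			if !terminal(stateOf[cluster]) {
				return i
			}
		}
	}
	return -1
}

func main() {
	batches := [][]string{{"spoke1", "spoke2"}, {"spoke3", "spoke4"}}
	states := map[string]string{
		"spoke1": "complete", "spoke2": "failed",
		"spoke3": "inProgress", "spoke4": "pending",
	}
	fmt.Println(activeBatchIndex(batches, states)) // 1
}
```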
The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The new proposed CGU status does not require that, as all clusters with indices are stored under the clusters list field.
The iteration of the batches can be determined by the maxConcurrency number defined under the remediationStrategy and the selected clusters list.

#### Deprecate backup & precache status fields
This belongs to the other enhancement. Better to have two separate PRs.
- **notApplied**: the policy has not been applied to enforce remediation.
- **nonCompliant**: the policy was applied to enforce remediation but has not become compliant.
- **compliant**: the policy was applied to enforce remediation and is compliant.
- **timeout**: the policy was applied to enforce remediation but did not become compliant within the timeout limits defined in the remediationStrategy.
isn't this the same thing as nonCompliant when the cluster status is failed/timedout?
nonCompliant means it is still trying to enforce the policy, while timeout means TALM stopped trying to enforce the policy.
The cluster state has 3 possible states:
- **complete**: if all the policies have a compliant state on the cluster.
- **inProgress**: if the policies are in compliant/nonCompliant or notApplied state.
- **failed**: if at least 1 policy has a timeout state.
I prefer timedout
Okay, we can make it timeout. Do we have any other failure reasons?
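Putting the derivation from the quoted list above into a small Go sketch, using the per-policy states it names (the function name is made up for illustration):

```go
package main

import "fmt"

// clusterState derives the cluster state as described above: complete if every
// policy is compliant, failed if at least one policy timed out, otherwise
// inProgress. State names follow the quoted proposal.
func clusterState(policyStates map[string]string) string {
	allCompliant := true
	for _, s := range policyStates {
		switch s {
		case "timeout":
			return "failed"
		case "compliant":
			// keep checking the remaining policies
		default: // notApplied or nonCompliant
			allCompliant = false
		}
	}
	if allCompliant {
		return "complete"
	}
	return "inProgress"
}

func main() {
	fmt.Println(clusterState(map[string]string{"policy1-common": "compliant", "policy2-group": "nonCompliant"})) // inProgress
	fmt.Println(clusterState(map[string]string{"policy1-common": "compliant", "policy2-group": "compliant"}))    // complete
	fmt.Println(clusterState(map[string]string{"policy1-common": "timeout", "policy2-group": "notApplied"}))     // failed
}
```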
##### 2- batches

The CGU status field stores the current cluster batch number in order to iterate to the next batch after success/failure of the running batch. The new proposed CGU status does not require that, as all clusters with indices are stored under the clusters list field.
I don't think it makes sense to force the controller to go through the whole list of clusters on each reconcile just to figure out which ones it should be remediating. Plus, the current batch status field is more for internal use, therefore the requirements are different. Mixing user-oriented requirements and the internal requirements (e.g. policy index) in the same field would be difficult to implement, especially for API compatibility reasons.
I don't think there is an issue with backward compatibility; we still do batching. Can you elaborate more on why it's difficult to implement?
Signed-off-by: melserngawy [email protected]