Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster Group Upgrade and Managed Clusters status enhancement #5

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
319 changes: 319 additions & 0 deletions enhancements/clusterGroupUpgradeStatus.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,319 @@
---
title: Managed Clusters status enhancement

authors:
- @serngawy

reviewers:
- TBD

approvers:
- @ijolliffe

api-approvers:
- TBD

creation-date: 2022-06-28

last-updated: 2022-06-28

see-also:
- TBD

tracking-link:
- TBD
---


# Cluster Group Upgrade (CGU) status enhancement


## Summary

TALM provides Cluster Group upgrade (CGU) API to enforce group of ACM policies to selected clusters. CGU status field present the policies that will be enforced with the created placement rules and placement binding while the selected clusters state does not exist. The enhancement proposes a changes to the CGU status to include policies state per managed cluster.

## Motivation

1. Currently the CGU status field does not provide state per managed cluster. In case of failure to apply an ACM policy, the end user cannot identify which managed cluster failed through the CGU API. As well as, the CGU status field store the created placement rules and placemet binding info while they are meaning less to the end user.
The enhancement proposes changes to the CGU status to include policy state per managed cluster and remove the meaningless status info fields.


### User Stories

1. As an end user, I would like to enforce a set of ACM policies to a group of clusters and be able to track the policies state compliant/NonCompliant per cluster.

### Goals

1. Enhance the CGU status to present the state of ACM policies per managed clusters and remove unnecessary status data


### Non-Goals

1. CGU status does not present the policies components state for the managed clusters.

1. CGU status does not provide a long life aggregator for the policies state with the selected clusters, its only present the policies state per cluster during the CGU policy enforcement process.

## Proposal

#### ClusterGroupUpgrade (CGU) status

Currently the CGU status has repetition and internal implementation data that is not useful for the CGU API description.
Below is a CGU example (cgu-upgrade-complete) applying 2 policies (cluster-version-policy and pao-sub-policy) into a group of clusters (spoke1,2,3).
The CGU example below has the status->managedPoliciesForUpgrade data repeated under status->managedPoliciesNs as well as the status->copiedPolicies data is repeated under status->safeResourceNames.
The status->remediationPlan is the same data defined by end user under spec->clusters. The Selected Clusters names list is repeated under the status remediationPlan, backup and precache.
Finally, the CGU status presents only the current remediation clusters under status->status->currentBatchRemediationProgress once it moves to the next batch, previous batch data will be replaced by the new one under status->status->currentBatchRemediationProgress.

```
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-upgrade-complete
namespace: default
spec:
clusters:
- spoke1
- spoke2
- spoke3
enable: true
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-pao-sub-policy
remediationStrategy:
maxConcurrency: 1
timeout: 240
status:
conditions:
- message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant
reason: UpgradeNotCompleted
status: "False"
type: Ready
copiedPolicies:
- cgu-upgrade-complete-policy1-common-cluster-versi-kuttl
- cgu-upgrade-complete-policy2-common-pao-sub-polic-kuttl
managedPoliciesContent:
policy1-common-cluster-version-policy: "null"
policy2-common-pao-sub-policy: '[{"kind":"Subscription","name":"performance-addon-operator","namespace":"openshift-performance-addon-operator"}]'
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-pao-sub-policy
namespace: default
managedPoliciesNs:
policy1-common-cluster-version-policy: default
policy2-common-pao-sub-policy: default
placementBindings:
- cgu-upgrade-complete-policy1-common-cluster-version-policy-placement-kuttl
- cgu-upgrade-complete-policy2-common-pao-sub-policy-placement-kuttl
placementRules:
- cgu-upgrade-complete-policy1-common-cluster-version-policy-placement-kuttl
- cgu-upgrade-complete-policy2-common-pao-sub-policy-placement-kuttl
remediationPlan:
- - spoke1
- - spoke2
- - spoke3
safeResourceNames:
cgu-upgrade-complete-common-cluster-version-policy-config: cgu-upgrade-complete-common-cluster-version-policy-config-kuttl
cgu-upgrade-complete-common-pao-sub-policy-config: cgu-upgrade-complete-common-pao-sub-policy-config-kuttl
cgu-upgrade-complete-default-subscription-performance-addon-operator: cgu-upgrade-complete-default-subscription-performance-addon-operator-kuttl
cgu-upgrade-complete-policy1-common-cluster-version-policy: cgu-upgrade-complete-policy1-common-cluster-versi-kuttl
cgu-upgrade-complete-policy1-common-cluster-version-policy-placement: cgu-upgrade-complete-policy1-common-cluster-version-policy-placement-kuttl
cgu-upgrade-complete-policy2-common-pao-sub-policy: cgu-upgrade-complete-policy2-common-pao-sub-polic-kuttl
cgu-upgrade-complete-policy2-common-pao-sub-policy-placement: cgu-upgrade-complete-policy2-common-pao-sub-policy-placement-kuttl
status:
currentBatch: 1
currentBatchRemediationProgress:
spoke1:
policyIndex: 0
state: InProgress
backup:
clusters:
- "spoke2"
- "spoke1"
- "spoke3"
status:
- "spoke1": "Succeeded"
- "spoke2": "Succeeded"
- "spoke3": "Succeeded"
precaching:
clusters:
- "spoke2"
- "spoke1"
- "spoke3"
spec:
"platformImage": "quay.io/openshift-release-dev/ocp-release@sha256:b9ede044950f73730f00415a6fe8eb1b5afac34def872292fd0f9392c9b483f1"
status:
- "spoke1": "Active"
- "spoke2": "Active"
- "spoke3": "Active"

```

This enhancement proposes to change the CGU status APIs to be as the example below (cgu-upgrade-new). The CGU status contains lists of conditions, selected clusters and canary clusters if defined.
The cluster/canaryCluster list item present the cluster name, policies list and state.
The policies list has the managed policies names and its state on the cluster. The policy state has 4 possible state;
- **notApplied**: the policy does not apply to enforce remediation
- **nonCompliant**: the policy applied to enforce remediation but it did not get compliant.
- **compliant**: the policy applied to enforce remediation and it is compliant.
- **timeout**: the policy applied to enforce remediation but it does not become compliant during the timeout limits defined in the remediationStrategy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this the same thing as nonCompliant when the cluster status is failed/timedout?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nonCompliant mean it still trying to enforce the policy while timeout mean TALM stopped trying to enforce the policy


The cluster state has 2 possible state;
- **complete**: if all the policies has a compliant state on the cluster
- **inProgress**: if all the policies are in compliant/nonCompliant or notApplied state.
- **failed**: if at least 1 policy has timeout state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer timedout

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, we can make it timeout. do we will/have any other failure reason ?



```
apiVersion: cluster.open-cluster-management-extension.io/v1beta1
kind: ClusterGroupUpgrade
metadata:
name: cgu-upgrade-new
namespace: default
spec:
enable: true
clusters:
- spoke2
- spoke3
- spoke4
- spoke5
managedPolicies:
- policy1-common
- policy2-group
- policy3-site
remediationStrategy:
canaries:
- prod1
maxConcurrency: 2
timeout: 240
status:
conditions:
...
canaryClusters:
- name: spoke1
policies:
policy1-common: compliant
policy2-group: compliant
policy3-site: compliant
state: complete
clusters:
- name: spoke2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jc-rh In the original story it was proposed to keep Enforce start time and Enforce complete time for each cluster? Do we want them to be added here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those can be added in the future.

policies:
policy1-common: compliant
policy2-group: nonCompliant
policy3-site: notApplied
state: inProgress
- name: spoke3
policies:
policy1-common: compliant
policy2-group: timeout
policy3-site: notApplied
state: failed
- name: spoke4
policies:
policy1-common: notApplied
policy2-group: notApplied
policy3-site: notApplied
state: inProgress
- name: spoke5
policies:
policy1-common: notApplied
policy2-group: notApplied
policy3-site: notApplied
state: inProgress
```


### Implementation Details/Notes/Constraints [optional]

#### CGU status

##### 1- copied policies, placement rules and placement binding names

The CGU controller copies the ACM policies defined under the CGU->managedPolicies field then change its remediation to enforce and create a placement rule and placement binding accordingly to the select clusters.
While storing the copied policies, placement rules and placement binding names in the CGU status is required by the current implementation of the reconcile process however its not mandatory. Having the copied policies names consist of the CGU name and policy index will make it unique for the reconciling loop.
Moreover, based on the [selective policy enhamcement](https://github.com/open-cluster-management-io/enhancements/pull/57) there is no needs to copies policies for future integration with ACM coming releases.

##### 2- batches

The CGU status field store the current cluster batch number in order to iterate to the next batch after success/failed of the running batch. The new proposed CGU status does not require that as all clusters with indies are stored under clusters list field.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes sense to force the controller to go through the whole list of clusters on each reconcile, just to figure out which ones it should be remediating. Plus, the current batch status field is more for internal use, therefore the requirements are different. Mixing user oriented requirements and the internal requirements (e.g. policy index) in the same field would be difficult to implement, especially for api compatibility reasons.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is issue with back compatibility we still do batching . can u elaborate more why its difficult to implement ?

The iteration of the batches can be determined by the maxConcurrency number defined under the remediationStrategy and the selected clusters list.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused by this. I can't figure out how we determine the current batch number? Can we have more explanation on what is the selected clusters list?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea is for 10 clusters and maxConcurrency equal to 4, the first 4 clusters (1-4) in the cluster list will be the first batch then the next 4 cluster (5-8) will be the next batch and so on. Will clarify it more

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, but my question is how do we know which one is active? The current batch number?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it based on the cluster state if it is inProgress then still active when it is failed/timeout we move to the next one


#### Deprecate backup & precache status fields
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This belongs to the other enhancement. Better to have two separate PRs.


The backup and precaching both having there own status under the CGU status field as below

```
status:
conditions:
...
backup:
clusters:
- "test-spoke2"
- "test-spoke1"
status:
- "test-spoke1": "Succeeded"
- "test-spoke2": "Succeeded"
precaching:
clusters:
- "test-spoke2"
- "test-spoke1"
spec:
"platformImage": "quay.io/openshift-release-dev/ocp-release@sha256:b9ede044950f73730f00415a6fe8eb1b5afac34def872292fd0f9392c9b483f1"
status:
- "test-spoke1": "Active"
- "test-spoke2": "Active"
```

As we explained in the proposal section the cluster list is repeated for backup and precache status. Moreover, having the cluster upgrade and subscription upgrade defined by policies while precache and backup are defined as boolean flag made the CGU APIs neither transparent nor declarative for consumers APIs.
This enhancement propose to deprecate the precache and backup field under the CGU API for the target TALM release as they will be presented in the new MCU APIs. For more info regards MCU check [managedClusterUpgrade.md]()

### Workflow Description

#### Variation [optional]


### API Extensions

### Risks and Mitigations [optional]


### Drawbacks [optional]

### Open Questions [optional]

### Test Plan

### Graduation Criteria
**Note:** *Section not required until targeted at a release.*

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA
**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

#### Removing a deprecated feature

The precache and backup fields will be stated as deprecated under the CGU APIs for the target TALM release.

### Upgrade / Downgrade Strategy

The proposed changes to the CGU status APIs will be supported for the target TALM release and it must maintain back compatibility with the previous CGU status APIs.


### Version Skew Strategy

### Operational Aspects of API Extensions

#### Failure Modes

#### Support Procedures

## Implementation History


## Alternatives

The proposed changes to the CGU status can be defined under a new APIs (ex; PolicyGroupUpgrade/PolicyGroupApply) with its own controller and group APIs name.

## Infrastructure Needed [optional]

Loading