---
title: Managed Clusters status enhancement

authors:
  - "@serngawy"

reviewers:
  - TBD

approvers:
  - "@ijolliffe"

api-approvers:
  - TBD

creation-date: 2022-06-28

last-updated: 2022-06-28

see-also:
  - TBD

tracking-link:
  - TBD
---

# Cluster Group Upgrade (CGU) status enhancement

## Summary

TALM provides the ClusterGroupUpgrade (CGU) API to enforce a group of ACM policies on selected clusters. The CGU status field presents the policies that will be enforced, along with the created placement rules and placement bindings, while the state of the selected clusters is missing. This enhancement proposes changes to the CGU status to include the policy state per managed cluster.

## Motivation

Currently the CGU status field does not provide a state per managed cluster. If an ACM policy fails to apply, the end user cannot identify through the CGU API which managed cluster failed. In addition, the CGU status field stores the created placement rule and placement binding names, which are meaningless to the end user.
This enhancement proposes changes to the CGU status to include the policy state per managed cluster and to remove the meaningless status fields.

### User Stories

1. As an end user, I would like to enforce a set of ACM policies on a group of clusters and be able to track each policy's state (Compliant/NonCompliant) per cluster.

### Goals

1. Enhance the CGU status to present the state of the ACM policies per managed cluster and remove unnecessary status data.

### Non-Goals

1. The CGU status does not present the state of the policies' constituent objects on the managed clusters.

1. The CGU status does not provide a long-lived aggregator for the policy states on the selected clusters; it only presents the policy state per cluster during the CGU policy enforcement process.

## Proposal

#### ClusterGroupUpgrade (CGU) status

Currently the CGU status contains repeated data and internal implementation details that are not useful to consumers of the CGU API.
Below is a CGU example (cgu-upgrade-complete) applying 2 policies (cluster-version-policy and pao-sub-policy) to a group of clusters (spoke1,2,3).
In this example, the status->managedPoliciesForUpgrade data is repeated under status->managedPoliciesNs, and the status->copiedPolicies data is repeated under status->safeResourceNames.
The status->remediationPlan is the same data defined by the end user under spec->clusters, and the selected cluster names are repeated again under the backup and precache status.
Finally, the CGU status presents only the current remediation batch under status->status->currentBatchRemediationProgress; once remediation moves to the next batch, the previous batch data is replaced by the new batch under status->status->currentBatchRemediationProgress.

```
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-upgrade-complete
  namespace: default
spec:
  clusters:
  - spoke1
  - spoke2
  - spoke3
  enable: true
  managedPolicies:
  - policy1-common-cluster-version-policy
  - policy2-common-pao-sub-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
status:
  conditions:
  - message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant
    reason: UpgradeNotCompleted
    status: "False"
    type: Ready
  copiedPolicies:
  - cgu-upgrade-complete-policy1-common-cluster-versi-kuttl
  - cgu-upgrade-complete-policy2-common-pao-sub-polic-kuttl
  managedPoliciesContent:
    policy1-common-cluster-version-policy: "null"
    policy2-common-pao-sub-policy: '[{"kind":"Subscription","name":"performance-addon-operator","namespace":"openshift-performance-addon-operator"}]'
  managedPoliciesForUpgrade:
  - name: policy1-common-cluster-version-policy
    namespace: default
  - name: policy2-common-pao-sub-policy
    namespace: default
  managedPoliciesNs:
    policy1-common-cluster-version-policy: default
    policy2-common-pao-sub-policy: default
  placementBindings:
  - cgu-upgrade-complete-policy1-common-cluster-version-policy-placement-kuttl
  - cgu-upgrade-complete-policy2-common-pao-sub-policy-placement-kuttl
  placementRules:
  - cgu-upgrade-complete-policy1-common-cluster-version-policy-placement-kuttl
  - cgu-upgrade-complete-policy2-common-pao-sub-policy-placement-kuttl
  remediationPlan:
  - - spoke1
  - - spoke2
  - - spoke3
  safeResourceNames:
    cgu-upgrade-complete-common-cluster-version-policy-config: cgu-upgrade-complete-common-cluster-version-policy-config-kuttl
    cgu-upgrade-complete-common-pao-sub-policy-config: cgu-upgrade-complete-common-pao-sub-policy-config-kuttl
    cgu-upgrade-complete-default-subscription-performance-addon-operator: cgu-upgrade-complete-default-subscription-performance-addon-operator-kuttl
    cgu-upgrade-complete-policy1-common-cluster-version-policy: cgu-upgrade-complete-policy1-common-cluster-versi-kuttl
    cgu-upgrade-complete-policy1-common-cluster-version-policy-placement: cgu-upgrade-complete-policy1-common-cluster-version-policy-placement-kuttl
    cgu-upgrade-complete-policy2-common-pao-sub-policy: cgu-upgrade-complete-policy2-common-pao-sub-polic-kuttl
    cgu-upgrade-complete-policy2-common-pao-sub-policy-placement: cgu-upgrade-complete-policy2-common-pao-sub-policy-placement-kuttl
  status:
    currentBatch: 1
    currentBatchRemediationProgress:
      spoke1:
        policyIndex: 0
        state: InProgress
  backup:
    clusters:
    - "spoke2"
    - "spoke1"
    - "spoke3"
    status:
    - "spoke1": "Succeeded"
    - "spoke2": "Succeeded"
    - "spoke3": "Succeeded"
  precaching:
    clusters:
    - "spoke2"
    - "spoke1"
    - "spoke3"
    spec:
      "platformImage": "quay.io/openshift-release-dev/ocp-release@sha256:b9ede044950f73730f00415a6fe8eb1b5afac34def872292fd0f9392c9b483f1"
    status:
    - "spoke1": "Active"
    - "spoke2": "Active"
    - "spoke3": "Active"
```

This enhancement proposes to change the CGU status API as in the example below (cgu-upgrade-new). The CGU status contains a list of conditions, the selected clusters, and the canary clusters if defined.
Each cluster/canaryCluster list item presents the cluster name, a policies list, and a state.
The policies list maps each managed policy name to its state on the cluster. A policy has 4 possible states:
- **notApplied**: the policy has not been applied to enforce remediation.
- **nonCompliant**: the policy has been applied to enforce remediation but has not become compliant yet; TALM is still trying to enforce it.
- **compliant**: the policy has been applied to enforce remediation and is compliant.
- **timeout**: the policy has been applied to enforce remediation but did not become compliant within the timeout defined in the remediationStrategy; TALM has stopped trying to enforce it.

The cluster state has 3 possible values:
- **complete**: all the policies are in the compliant state on the cluster.
- **inProgress**: the policies are in the compliant, nonCompliant, or notApplied states.
- **failed**: at least 1 policy is in the timeout state.

```
apiVersion: cluster.open-cluster-management-extension.io/v1beta1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-upgrade-new
  namespace: default
spec:
  enable: true
  clusters:
  - spoke2
  - spoke3
  - spoke4
  - spoke5
  managedPolicies:
  - policy1-common
  - policy2-group
  - policy3-site
  remediationStrategy:
    canaries:
    - spoke1
    maxConcurrency: 2
    timeout: 240
status:
  conditions:
    ...
  canaryClusters:
  - name: spoke1
    policies:
      policy1-common: compliant
      policy2-group: compliant
      policy3-site: compliant
    state: complete
  clusters:
  - name: spoke2
    policies:
      policy1-common: compliant
      policy2-group: nonCompliant
      policy3-site: notApplied
    state: inProgress
  - name: spoke3
    policies:
      policy1-common: compliant
      policy2-group: timeout
      policy3-site: notApplied
    state: failed
  - name: spoke4
    policies:
      policy1-common: notApplied
      policy2-group: notApplied
      policy3-site: notApplied
    state: inProgress
  - name: spoke5
    policies:
      policy1-common: notApplied
      policy2-group: notApplied
      policy3-site: notApplied
    state: inProgress
```
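
For illustration, the sketch below shows how the per-cluster status item and the cluster state derivation rules above could be expressed in Go. All type, field, and function names here are assumptions made for this sketch, not names taken from the TALM codebase:

```
package main

import "fmt"

// PolicyState mirrors the four per-policy states described above.
type PolicyState string

const (
	NotApplied   PolicyState = "notApplied"
	NonCompliant PolicyState = "nonCompliant"
	Compliant    PolicyState = "compliant"
	Timeout      PolicyState = "timeout"
)

// ClusterStatus is one item of the proposed status.clusters list: the
// cluster name, its policies with their states, and the derived state.
type ClusterStatus struct {
	Name     string                 `json:"name"`
	Policies map[string]PolicyState `json:"policies"`
	State    string                 `json:"state"`
}

// deriveClusterState applies the three rules above: any policy in timeout
// means failed, all policies compliant means complete, anything else is
// still inProgress.
func deriveClusterState(policies map[string]PolicyState) string {
	allCompliant := true
	for _, s := range policies {
		if s == Timeout {
			return "failed"
		}
		if s != Compliant {
			allCompliant = false
		}
	}
	if allCompliant {
		return "complete"
	}
	return "inProgress"
}

func main() {
	spoke3 := ClusterStatus{
		Name: "spoke3",
		Policies: map[string]PolicyState{
			"policy1-common": Compliant,
			"policy2-group":  Timeout,
			"policy3-site":   NotApplied,
		},
	}
	spoke3.State = deriveClusterState(spoke3.Policies)
	fmt.Println(spoke3.Name, spoke3.State) // prints: spoke3 failed
}
```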

### Implementation Details/Notes/Constraints [optional]

#### CGU status

##### 1- copied policies, placement rules and placement binding names

The CGU controller copies the ACM policies defined under the CGU managedPolicies field, changes their remediation action to enforce, and creates a placement rule and placement binding for the selected clusters accordingly.
Storing the copied policy, placement rule, and placement binding names in the CGU status is required by the current implementation of the reconcile process, but it is not mandatory: deriving the copied policy names from the CGU name and the policy index keeps them unique for the reconcile loop, as sketched below.
Moreover, based on the [selective policy enhancement](https://github.com/open-cluster-management-io/enhancements/pull/57), there will be no need to copy policies when integrating with upcoming ACM releases.
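
A minimal sketch of that naming scheme follows; the exact format and the function name are assumptions, not the current TALM implementation (a real implementation would also truncate to the Kubernetes object name limits):

```
package main

import "fmt"

// copiedPolicyName derives a deterministic, reconcile-safe name for the
// enforce-copy of a managed policy from the CGU name and the policy's
// index under spec.managedPolicies.
func copiedPolicyName(cguName string, policyIndex int) string {
	return fmt.Sprintf("%s-policy%d", cguName, policyIndex)
}

func main() {
	// Yields "cgu-upgrade-new-policy0", "cgu-upgrade-new-policy1", ...
	for i := 0; i < 3; i++ {
		fmt.Println(copiedPolicyName("cgu-upgrade-new", i))
	}
}
```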

##### 2- batches

The CGU status field stores the current cluster batch number in order to iterate to the next batch after the running batch succeeds or fails. The new proposed CGU status does not require that, as all clusters, in order, are stored under the clusters list field.
The batch iteration can be determined from the maxConcurrency number defined under the remediationStrategy and the selected clusters list: for 10 clusters and a maxConcurrency of 4, clusters 1-4 in the list form the first batch, clusters 5-8 the next batch, and so on. A batch remains active while its clusters are in the inProgress state; once each of its clusters is complete or failed, remediation moves on to the next batch, as sketched below.
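
The sketch below shows that batch derivation, assuming the ordered cluster list from the proposed status; the function and variable names are illustrative only:

```
package main

import "fmt"

// batches splits the selected clusters into remediation batches of at most
// maxConcurrency clusters each, preserving list order. Batch i is simply
// clusters[i*maxConcurrency : (i+1)*maxConcurrency], so no batch counter
// needs to be persisted in the status.
func batches(clusters []string, maxConcurrency int) [][]string {
	if maxConcurrency < 1 {
		maxConcurrency = 1
	}
	var out [][]string
	for start := 0; start < len(clusters); start += maxConcurrency {
		end := start + maxConcurrency
		if end > len(clusters) {
			end = len(clusters)
		}
		out = append(out, clusters[start:end])
	}
	return out
}

func main() {
	clusters := []string{"spoke2", "spoke3", "spoke4", "spoke5"}
	for i, batch := range batches(clusters, 2) {
		// prints: batch 0 [spoke2 spoke3], then batch 1 [spoke4 spoke5]
		fmt.Println("batch", i, batch)
	}
}
```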

#### Deprecate backup & precache status fields

The backup and precaching features each have their own status under the CGU status field, as shown below:
|
||
``` | ||
status: | ||
conditions: | ||
... | ||
backup: | ||
clusters: | ||
- "test-spoke2" | ||
- "test-spoke1" | ||
status: | ||
- "test-spoke1": "Succeeded" | ||
- "test-spoke2": "Succeeded" | ||
precaching: | ||
clusters: | ||
- "test-spoke2" | ||
- "test-spoke1" | ||
spec: | ||
"platformImage": "quay.io/openshift-release-dev/ocp-release@sha256:b9ede044950f73730f00415a6fe8eb1b5afac34def872292fd0f9392c9b483f1" | ||
status: | ||
- "test-spoke1": "Active" | ||
- "test-spoke2": "Active" | ||
``` | ||

As explained in the proposal section, the cluster list is repeated for the backup and precache status. Moreover, having the cluster upgrade and subscription upgrade defined by policies, while precache and backup are defined as boolean flags, makes the CGU API neither transparent nor declarative for API consumers.
This enhancement proposes to deprecate the precache and backup fields under the CGU API for the target TALM release, as they will be presented in the new MCU API. For more info regarding MCU, check [managedClusterUpgrade.md]()

### Workflow Description

#### Variation [optional]

### API Extensions

### Risks and Mitigations [optional]

### Drawbacks [optional]

### Open Questions [optional]
### Test Plan

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

#### Removing a deprecated feature

The precache and backup fields will be marked as deprecated under the CGU API for the target TALM release.

### Upgrade / Downgrade Strategy

The proposed changes to the CGU status API will be supported in the target TALM release and must maintain backward compatibility with the previous CGU status API.

### Version Skew Strategy

### Operational Aspects of API Extensions

#### Failure Modes

#### Support Procedures

## Implementation History

## Alternatives

The proposed changes to the CGU status could instead be defined under a new API (e.g., PolicyGroupUpgrade/PolicyGroupApply) with its own controller and API group name.

## Infrastructure Needed [optional]