
Enhancement for Cluster Validation Tool #14

Open · wants to merge 8 commits into base: main

Conversation

imiller0
Contributor

First draft of enhancement for a cluster validation tool and spec for reference configuration format. Developed in collaboration with @pixelsoccupied

openshift-ci bot requested review from jc-rh and vitus133 on July 13, 2023 18:34
@openshift-ci
Contributor

openshift-ci bot commented Jul 13, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from imiller0. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jc-rh
Member

jc-rh commented Jul 13, 2023

Should this tool support another mode that runs against the policies on the hub? Running against the actual cluster feels like inventing another tool just to verify an existing tool?

@imiller0
Contributor Author

/retest

Signed-off-by: Ian Miller <[email protected]>
Added content regarding graduation criteria.

Signed-off-by: Ian Miller <[email protected]>
@imiller0
Contributor Author

> Should this tool support another mode that runs against the policies on the hub? Running against the actual cluster feels like inventing another tool just to verify an existing tool?

For the initial MVP we focused on the lowest common denominator for input (the collection of CRs). Your point is a good one that we could also have an optional feature which extracts those CRs from Policy objects (on the hub, in git, etc.). Users could also use jsonpath/yaml processors to extract the contained CRs from policies and run them through this tool. For now I think we should defer a Policy input format until a later enhancement.

On the overlap question, the main difference is that the Policies capture what users have defined for their deployment. This tool allows verification of their work against the known reference configuration. Both Policy and this tool provide validation, but only this tool provides verification against the known reference.

@imiller0 imiller0 requested a review from MarSik July 14, 2023 12:18

#### Reference CR Annotations

Annotations attached to the CRs are used for providing additional data used during validation. In
Member


I do not like this much, because the annotations make it to the cluster when the validated config is taken as a boilerplate, edited, and applied directly.

I would prefer a way that removes the annotations if I ever try a simple `oc apply -f` using the validation pattern. Some other way (side-car files?) would also allow me to quickly create a new validated reference from a cluster dump (by just copying the meta files in).

Now don't get me wrong, it might be the only viable option at the moment, but I am already overloaded by too many annotations when analysing bug reports.

Contributor Author


The motivating reason for putting the annotation into the CRs was to keep this data as close as possible (i.e. in the same file) to the CR itself. This aids ongoing maintenance of the reference configuration by not having to update/edit additional content beyond the active CRs.

I agree that this annotation is likely to make it to clusters, but I'm not sure that's necessarily a bad thing. For example, when present it becomes a way to identify (programmatically) the use case intended for the cluster, and it can be used to trigger automated rules checking during analysis of support archives.

If, as noted above, we have a top-level yaml which identifies the name and version of the reference configuration, we could consider removing the "part-of" field here (for all CRs it should match the top-level name). In that case a reference configuration that was completely missing this annotation (e.g. one initially sourced from a captured dump of a cluster) would default to component-name: "" (meaning "baseline") and required: true, which seem like the correct semantics for that reference.
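For concreteness, a sketch of how such an annotation might look on a reference CR. The annotation keys and the example CR below are hypothetical illustrations of the metadata discussed here, not the spec's final names:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile
  annotations:
    # Hypothetical annotation names illustrating the validation metadata
    # discussed above; the enhancement text defines the actual keys.
    example.com/part-of: some-reference-config
    example.com/component-name: ""   # "" means the baseline component
    example.com/required: "true"     # this CR must be present on the cluster
```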


#### Example Reference Configuration CR
User variable content is handled by golang formatted templating within the reference configuration
Member


Are you reinventing kustomize here? Is it a good idea to integrate templating vs. use external templating and let the diff tool do just the diffing?

Member

@MarSik commented Jul 14, 2023


I believe the reference configuration should be human readable.

Contributor Author


The use of templating here is intended to give enough structure and functionality to capture a range of complexity in reference configurations. The templating allows the reference to communicate expected user variation, rules around acceptable/verified values, and optional content (which, if included, must fall within acceptable bounds as well). Templating languages have several advantages:

  • There is already common use of them within the domain of k8s/ocp clusters
  • The languages are well understood, documented, and supported by various libraries
  • They are feature rich and support extension when needed

We definitely aren't intending to reinvent kustomize, helm, or any other deployment-oriented tool flow. In this case the templating is used within the validation tool to pull selected, expected user-variable content from the user's input CRs directly into a "rendered" version of the reference configuration. This results in a CR that can be diff'd against the original input CR using standard diff tools/techniques. This happens within the context of the validation tool, which is able to analyze both the input CRs and the reference configuration in order to do the correlation (deciding which files get compared against one another). Once correlated, the rendering is done (handling all user-variable content) and the result is diff'd. The actual template processing step is fairly trivial due to the available library support.

I agree that the reference configuration should be as human readable as possible, but it should also allow:

  • enough structure to capture variations, validations, groups of optional content (with required values within)
  • flexibility to address future reference configurations without updating/modifying the validation tool
  • other tools/toolchains to be built around the reference (by using a standardized format for the notations)

{{- maxListLength 1 .spec.hugepages.pages }} #3-1 defined in Go
{{- range .spec.hugepages.pages }}
- size: {{block "validatesize" .}} {{ .size }} {{end}} #3-2 can be defined later during runtime or falls back to .size
count: {{ .count }} {{ if eq "float64" (printf "%T" .count) }}is a float64 {{ else }} is not a float64{{ end }}
Member


This is coding already and should never be exposed like this.

Contributor Author


I agree. This is not a good example of what we expect in the reference. It was included to show what can be done if/when required, but this level of "coding" is not the mainline use or expectation. I'll update the enhancement to remove this and add a better example here.
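For illustration, the simpler, mainline kind of templating in mind might look something like this (a sketch only, assuming a PerformanceProfile-style CR; the actual replacement example lands in the updated enhancement text):

```yaml
# Reference CR fragment: the template pulls the user's own CPU sets out of
# the input CR into the rendered reference, so whatever values the user
# chose compare as equal and never show up as drift.
spec:
  cpu:
    isolated: {{ .spec.cpu.isolated }}
    reserved: {{ .spec.cpu.reserved }}
  realTimeKernel:
    enabled: true   # fixed expected value; a mismatch is reported as drift
```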

1. Can consume input configuration CRs from live cluster
1. Can consume input configuration CRs from a support archive
1. Can consume input from local filesystem (files, directories, etc)
1. Will suppress diffs (not report as a drift) for runtime variable fields (status, managed
Member


Should the list of runtime fields be made configurable / extensible?

Contributor Author


First, thanks for the really good review comments!

Initially I think we should follow the behavior of kubectl/oc diff in terms of fields skipped during comparison. If needed, yes, we can extend the list and/or make it customizable (e.g. as a phase 2 enhancement).
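For reference, these are the kinds of server-populated fields such a comparison would skip, mirroring `kubectl diff` behavior (illustrative values; the exact field list for this tool is not settled here):

```yaml
metadata:
  creationTimestamp: "2023-07-13T18:34:00Z"  # set by the API server
  generation: 2                              # incremented on spec changes
  resourceVersion: "123456"                  # changes on every update
  uid: 4d2e1a7c-0000-0000-0000-000000000000  # assigned at creation
  managedFields: []                          # server-side apply bookkeeping
status: {}                                   # entire status stanza is runtime state
```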


### Reference Configuration Specification

The reference configuration consists of a set of CRs (yaml files with one CR per file) with
Member

@MarSik commented Jul 14, 2023


Will we somehow encode the version of the reference config into the configuration itself? Just so we can always be sure what we got, even if it is just a zip file from somewhere? Relying purely on git branches would probably cause a lot of mess.

In other words, I am looking for a file like /etc/redhat-release in RHEL.

Contributor Author


Yes, I think this would be quite valuable. It would likely be helpful to have a consistent "entry point" into the reference configuration which has a version and perhaps other metadata (e.g. the reference configuration name).
Something like:

referenceConfig.yaml

name: "Some Reference Config"
version: x.y.z

@jc-rh
Member

jc-rh commented Jul 14, 2023

> Should this tool support another mode that runs against the policies on the hub? Running against the actual cluster feels like inventing another tool just to verify an existing tool?

> For the initial MVP we focused on the lowest common denominator for input (the collection of CRs). Your point is a good one that we could also have an optional feature which extracts those CRs from Policy objects (on the hub, in git, etc.). Users could also use jsonpath/yaml processors to extract the contained CRs from policies and run them through this tool. For now I think we should defer a Policy input format until a later enhancement.

> On the overlap question, the main difference is that the Policies capture what users have defined for their deployment. This tool allows verification of their work against the known reference configuration. Both Policy and this tool provide validation, but only this tool provides verification against the known reference.

Thanks for the clarification. I guess this MVP is driven by the need to validate a cluster before troubleshooting? Still, shouldn't the workflow be to validate the git repo or hub first? If that passes, check the compliance status. If there is any gap in our toolchain today for that kind of validation, shouldn't we address that rather than invent a new tool (which looks like moderate complexity at least)? It would be nice to have a real-time indicator that tells us whether a cluster is still on the reference config or not, maybe a DU config policy set? Now I wonder why we don't do it with policy gen today :)

Clarify part of the description and update the example templating with a
more relevant example.

Signed-off-by: Ian Miller <[email protected]>
@imiller0
Contributor Author

> Thanks for the clarification. I guess this MVP is driven by the need to validate a cluster before troubleshooting? Still, shouldn't the workflow be to validate the git repo or hub first? If that passes, check the compliance status. If there is any gap in our toolchain today for that kind of validation, shouldn't we address that rather than invent a new tool (which looks like moderate complexity at least)? It would be nice to have a real-time indicator that tells us whether a cluster is still on the reference config or not, maybe a DU config policy set? Now I wonder why we don't do it with policy gen today :)

One of the goals of the MVP is to assess the compliance of a cluster across various stages of the lifecycle. By focusing on the actual configuration CRs as input, it can be used to evaluate CRs from git, a live cluster, CRs from the support archive (must-gather), etc. This enhancement is intended to be generic enough to be usable across various use cases, deployments, and environments which may or may not have support for Policy.

Comment on lines +159 to +160
1. Extra content – the input configuration contains content not included in the reference
configuration. The tool displays this as informational output. This is not a drift.


Assuming that "content" means CR fields and their values rather than the whole CR objects, I don't see how the proposed implementation of comparing reference vs. rendered CRs would accomplish this. Perhaps the rendering step is meant to collect this information? If so, would it make sense to explicitly mention it there?


Or perhaps it is just me and the rendering step is meant to do more than just "Expected user variable content is pulled from the input CR, validated, and inserted into the rendered reference CR"?

Contributor Author


You are correct that the "extra content" here refers to fields in the user's CR that do not exist in the reference CR. When doing the diff, the "left hand side" of the diff is based on the reference CR. The rendering step pulls "expected user variable" content from the user's CR into the reference CR to generate this left hand side. The right hand side is always the user's CR. So, by pulling fields from the user's CR into the rendered CR, the "expected user variable" fields will be identical and thus not report as a difference.

Any content that is in the user CR but not in the reference (and not pulled over into the reference during rendering) is noted as being present but is not considered "drift".
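A small worked example may help (hypothetical fields and values):

```yaml
# Reference CR (template): isolated is expected user-variable content.
spec:
  cpu:
    isolated: {{ .spec.cpu.isolated }}
  realTimeKernel:
    enabled: true
---
# User CR (right hand side of the diff):
spec:
  cpu:
    isolated: "2-31"
  realTimeKernel:
    enabled: false       # differs from the fixed reference value
  extraField: abc        # not present in the reference at all
---
# Rendered reference (left hand side): isolated was pulled from the user CR,
# so the diff reports only realTimeKernel.enabled as drift, while extraField
# is noted as informational "extra content".
spec:
  cpu:
    isolated: "2-31"
  realTimeKernel:
    enabled: true
```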


I am not sure what I was thinking yesterday. It seems perfectly clear today. I guess I got lost in the terms. Thanks for the clarification and sorry for the noise.

configuration (see below for structure/contents of the reference) and the right hand side will be a
collection of the user’s configuration CRs. The logical flow of the tool will be:
1. User invokes the tool with the two inputs: `validationTool ./referenceConfig/ ./userConfig/`
1. For each CR in `./userConfig`


In case the input config source is a live cluster, is the tool really supposed to go through all CRs on the cluster, or would there be some boundaries? Would it perhaps make sense to limit the search to objects with an apiVersion and kind mentioned (included) in the reference configuration?

Contributor Author


Hi @jholecek-rh. Yes, it is poorly described here. When running against a live cluster, the set of CRs for "userConfig" will be based on what is included in the reference configuration. I've updated the text to clarify. (Thank you!)
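As a sketch of that scoping (hypothetical file layout and kinds): if the reference configuration contained, say, a PerformanceProfile and a Tuned CR, a live-cluster run would query only those apiVersion/kind pairs rather than walking every object on the cluster.

```yaml
# referenceConfig/ (hypothetical layout)
#   performance-profile.yaml   # kind: PerformanceProfile
#   tuned.yaml                 # kind: Tuned
#
# Against a live cluster the tool would fetch only the matching kinds,
# roughly the equivalent of:
#   oc get performanceprofiles.performance.openshift.io -A -o yaml
#   oc get tuneds.tuned.openshift.io -A -o yaml
```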


Thanks. The updated text makes it clear.

@leo8a

leo8a commented Sep 8, 2023

Wondering if such a tool should have a verb to validate the current firmware configuration (via Redfish or other interfaces) of the nodes running the cluster?

I'm aware that this may greatly increase the tool's complexity, especially with some HW vendors codifying features using different codes and differing BMC interface implementations (e.g. Supermicro Redfish endpoints)...

Just asking, given the impact that running firmware configurations has on the performance of OCP cluster nodes, particularly for latency-sensitive workloads.
