
Enhancement for Cluster Validation Tool #14

Open · wants to merge 8 commits into base: main

Conversation

imiller0
Contributor

First draft of enhancement for a cluster validation tool and spec for reference configuration format. Developed in collaboration with @pixelsoccupied

openshift-ci bot requested review from jc-rh and vitus133 on July 13, 2023 18:34
@openshift-ci
Contributor

openshift-ci bot commented Jul 13, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from imiller0. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jc-rh
Member

jc-rh commented Jul 13, 2023

Should this tool support another mode that runs against the policies on the hub? Running against the actual cluster feels like inventing another tool just to verify an existing tool?

@imiller0
Contributor Author

/retest

Signed-off-by: Ian Miller <[email protected]>
Added content regarding graduation criteria.

Signed-off-by: Ian Miller <[email protected]>
@imiller0
Contributor Author

> Should this tool support another mode that runs against the policies on the hub? Running against the actual cluster feels like inventing another tool just to verify an existing tool?

For the initial MVP we focused on the lowest common denominator for input (the collection of CRs). Your point is a good one that we could also have an optional feature which extracts those CRs from Policy objects (on the hub, in git, etc.). Users could also use jsonpath/yaml processors to extract the contained CRs from policies and run them through this tool. For now I think we should defer a Policy input format until a later enhancement.

On the overlap question, the main difference is that the Policies capture what users have defined for their deployment. This tool allows verification of their work against the known reference configuration. Both Policy and this tool provide validation, but only this tool provides verification against the known reference.

@imiller0 imiller0 requested a review from MarSik July 14, 2023 12:18

#### Reference CR Annotations

Annotations attached to the CRs are used for providing additional data used during validation. In
Member


I do not like this much, because the annotations make it to the cluster when the validated config is taken as a boilerplate, edited, and applied directly.

I would prefer a way that removes the annotations if I ever try a simple `oc apply -f` using the validation pattern. Some other way (side-car files?) would also allow me to quickly create a new validated reference from a cluster dump (by just copying the meta files in).

Now don't get me wrong, it might be the only viable option at the moment, but I am already overloaded by too many annotations when analysing bug reports.

Contributor Author


The motivating reason for putting the annotation into the CRs was to keep this data as close as possible (i.e. in the same file) to the CR itself. This aids ongoing maintenance of the reference configuration by not having to update/edit additional content beyond the active CRs.

I agree that this annotation is likely to make it to clusters, but I'm not sure that's necessarily a bad thing. For example, when present it becomes a way to identify (programmatically) the use case intended for the cluster, and it can be used to trigger automated rules checking during analysis of support archives.

If, as noted above, we have a top-level yaml which identifies the name and version of the reference configuration, we could consider removing the "part-of" field here (for all CRs it should match the top-level name). In that case a reference configuration that was completely missing this annotation (e.g. one initially sourced from a captured dump of a cluster) would default to component-name: "" (meaning "baseline") and required: true, which seem like the correct semantics for that reference.
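For concreteness, a sketch of how such an annotation might look on a reference CR. The annotation keys and the example CR below are hypothetical illustrations of the metadata discussed here, not the spec's final names:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile
  annotations:
    # Hypothetical annotation names illustrating the validation metadata
    # discussed above; the enhancement text defines the actual keys.
    example.com/part-of: some-reference-config
    example.com/component-name: ""   # "" means the baseline component
    example.com/required: "true"     # this CR must be present on the cluster
```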


#### Example Reference Configuration CR
User variable content is handled by golang formatted templating within the reference configuration
Member


Are you reinventing kustomize here? Is it a good idea to integrate templating vs. use external templating and let the diff tool do just the diffing?

Member

@MarSik commented Jul 14, 2023


I believe the reference configuration should be human readable.

Contributor Author


The use of templating here is intended to give enough structure and functionality to capture a range of complexity in reference configurations. The templating allows the reference to communicate expected user variation, rules around acceptable/verified values, and optional content (which, if included, must fall within acceptable bounds as well). Templating languages have several advantages:

  • There is already common use of them within the domain of k8s/ocp clusters
  • The languages are well understood, documented, and supported by various libraries
  • They are feature rich and support extension when needed

We definitely aren't intending to reinvent kustomize, helm, or any other deployment-oriented tool flow. In this case the templating is used within the validation tool to pull selected, expected user-variable content from the user's input CRs directly into a "rendered" version of the reference configuration. This results in a CR that can be diff'd against the original input CR using standard diff tools/techniques. This happens within the context of the validation tool, which is able to analyze both the input CRs and the reference configuration in order to do the correlation (deciding which files get compared against one another). Once correlated, the rendering is done (handling all user-variable content) and the result is diff'd. The actual template processing step is fairly trivial due to the available library support.

I agree that the reference configuration should be as human readable as possible, but it should also allow:

  • enough structure to capture variations, validations, groups of optional content (with required values within)
  • flexibility to address future reference configurations without updating/modifying the validation tool
  • other tools/toolchains to be built around the reference (by using a standardized format for the notations)

{{- maxListLength 1 .spec.hugepages.pages }} #3-1 defined in Go
{{- range .spec.hugepages.pages }}
- size: {{block "validatesize" .}} {{ .size }} {{end}} #3-2 can be defined later during runtime or falls back to .size
count: {{ .count }} {{ if eq "float64" (printf "%T" .count) }}is a float64 {{ else }} is not a float64{{ end }}
Member


This is coding already and should never be exposed like this.

Contributor Author


I agree. This is not a good example of what we expect in the reference. It was included to show what can be done if/when required, but this level of "coding" is not the mainline use or expectation. I'll update the enhancement to remove this and add a better example here.
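For illustration, the simpler, mainline kind of templating in mind might look something like this (a sketch only, assuming a PerformanceProfile-style CR; the actual replacement example lands in the updated enhancement text):

```yaml
# Reference CR fragment: the template pulls the user's own CPU sets out of
# the input CR into the rendered reference, so whatever values the user
# chose compare as equal and never show up as drift.
spec:
  cpu:
    isolated: {{ .spec.cpu.isolated }}
    reserved: {{ .spec.cpu.reserved }}
  realTimeKernel:
    enabled: true   # fixed expected value; a mismatch is reported as drift
```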

1. Can consume input configuration CRs from live cluster
1. Can consume input configuration CRs from a support archive
1. Can consume input from local filesystem (files, directories, etc)
1. Will suppress diffs (not report as a drift) for runtime variable fields (status, managed
Member


Should the list of runtime fields be made configurable / extensible?

Contributor Author


First, thanks for the really good review comments!

Initially I think we should follow the behavior of kubectl/oc diff in terms of fields skipped during comparison. If needed, yes, we can extend the list and/or make it customizable (e.g. as a phase 2 enhancement).
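For reference, these are the kinds of server-populated fields such a comparison would skip, mirroring `kubectl diff` behavior (illustrative values; the exact field list for this tool is not settled here):

```yaml
metadata:
  creationTimestamp: "2023-07-13T18:34:00Z"  # set by the API server
  generation: 2                              # incremented on spec changes
  resourceVersion: "123456"                  # changes on every update
  uid: 4d2e1a7c-0000-0000-0000-000000000000  # assigned at creation
  managedFields: []                          # server-side apply bookkeeping
status: {}                                   # entire status stanza is runtime state
```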


### Reference Configuration Specification

The reference configuration consists of a set of CRs (yaml files with one CR per file) with
Member

@MarSik commented Jul 14, 2023


Will we somehow encode the version of the reference config into the configuration itself? Just so we can always be sure what we got, even if it is just a zip file from somewhere? Relying purely on git branches would probably cause a lot of mess.

In other words, I am looking for a file like /etc/redhat-release in RHEL.

Contributor Author


Yes, I think this would be quite valuable. It would likely be helpful to have a consistent "entry point" into the reference configuration which has a version and perhaps other metadata (e.g. the reference configuration name).
Something like:

referenceConfig.yaml

name: "Some Reference Config"
version: x.y.z

@jc-rh
Member

jc-rh commented Jul 14, 2023

> Should this tool support another mode that runs against the policies on the hub? Running against the actual cluster feels like inventing another tool just to verify an existing tool?

> For the initial MVP we focused on the lowest common denominator for input (the collection of CRs). Your point is a good one that we could also have an optional feature which extracts those CRs from Policy objects (on the hub, in git, etc.). Users could also use jsonpath/yaml processors to extract the contained CRs from policies and run them through this tool. For now I think we should defer a Policy input format until a later enhancement.

> On the overlap question, the main difference is that the Policies capture what users have defined for their deployment. This tool allows verification of their work against the known reference configuration. Both Policy and this tool provide validation, but only this tool provides verification against the known reference.

Thanks for the clarification. I guess this MVP is driven by the need to validate a cluster before troubleshooting? Still, shouldn't the workflow be to validate the git repo or hub first? If that passes, check the compliance status. If there is any gap in our toolchain today for that kind of validation, shouldn't we address that rather than invent a new tool (which looks like moderate complexity at least)? It would be nice to have a real-time indicator that tells us whether a cluster is still on the reference config or not, maybe a DU config policy set? Now I wonder why we don't do it with policy gen today :)

Clarify part of the description and update the example templating with a
more relevant example.

Signed-off-by: Ian Miller <[email protected]>
@imiller0
Contributor Author

> Thanks for the clarification. I guess this MVP is driven by the need to validate a cluster before troubleshooting? Still, shouldn't the workflow be to validate the git repo or hub first? If that passes, check the compliance status. If there is any gap in our toolchain today for that kind of validation, shouldn't we address that rather than invent a new tool (which looks like moderate complexity at least)? It would be nice to have a real-time indicator that tells us whether a cluster is still on the reference config or not, maybe a DU config policy set? Now I wonder why we don't do it with policy gen today :)

One of the goals of the MVP is to assess the compliance of a cluster across various stages of the lifecycle. By focusing on the actual configuration CRs as input, it can be used to evaluate CRs from git, a live cluster, CRs from the support archive (must-gather), etc. This enhancement is intended to be generic enough to be usable across various use cases, deployments, and environments which may or may not have support for Policy.

Comment on lines +159 to +160
1. Extra content – the input configuration contains content not included in the reference
configuration. The tool displays this as informational output. This is not a drift.


Assuming that "content" means CR fields and their values rather than the whole CR objects, I don't see how the proposed implementation of comparing reference vs. rendered CRs would accomplish this. Perhaps the rendering step is meant to collect this information? If so, would it make sense to explicitly mention it there?


Or perhaps it is just me and the rendering step is meant to do more than just "Expected user variable content is pulled from the input CR, validated, and inserted into the rendered reference CR"?

Contributor Author


You are correct that the "extra content" here refers to fields in the user's CR that do not exist in the reference CR. When doing the diff, the "left hand side" of the diff is based on the reference CR. The rendering step pulls "expected user variable" content from the user's CR into the reference CR to generate this left hand side. The right hand side is always the user's CR. So, by pulling fields from the user's CR into the rendered CR, the "expected user variable" fields will be identical and thus not report as a difference.

Any content that is in the user CR but not in the reference (and not pulled over into the reference during rendering) is noted as being present but is not considered "drift".
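A small worked example may help (hypothetical fields and values):

```yaml
# Reference CR (template): isolated is expected user-variable content.
spec:
  cpu:
    isolated: {{ .spec.cpu.isolated }}
  realTimeKernel:
    enabled: true
---
# User CR (right hand side of the diff):
spec:
  cpu:
    isolated: "2-31"
  realTimeKernel:
    enabled: false       # differs from the fixed reference value
  extraField: abc        # not present in the reference at all
---
# Rendered reference (left hand side): isolated was pulled from the user CR,
# so the diff reports only realTimeKernel.enabled as drift, while extraField
# is noted as informational "extra content".
spec:
  cpu:
    isolated: "2-31"
  realTimeKernel:
    enabled: true
```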


I am not sure what I was thinking yesterday. It seems perfectly clear today. I guess I got lost in the terms. Thanks for the clarification and sorry for the noise.

configuration (see below for structure/contents of the reference) and the right hand side will be a
collection of the user’s configuration CRs. The logical flow of the tool will be:
1. User invokes the tool with the two inputs: `validationTool ./referenceConfig/ ./userConfig/`
1. For each CR in `./userConfig`


In case the input config source is a live cluster, is the tool really supposed to go through all CRs on the cluster, or would there be some boundaries? Would it perhaps make sense to limit the search to objects with an apiVersion and kind mentioned (included) in the reference configuration?

Contributor Author


Hi @jholecek-rh. Yes, it is poorly described here. When running against a live cluster, the set of CRs for "userConfig" will be based on what is included in the reference configuration. I've updated the text to clarify. (Thank you!)
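As a sketch of that scoping (hypothetical file layout and kinds): if the reference configuration contained, say, a PerformanceProfile and a Tuned CR, a live-cluster run would query only those apiVersion/kind pairs rather than walking every object on the cluster.

```yaml
# referenceConfig/ (hypothetical layout)
#   performance-profile.yaml   # kind: PerformanceProfile
#   tuned.yaml                 # kind: Tuned
#
# Against a live cluster the tool would fetch only the matching kinds,
# roughly the equivalent of:
#   oc get performanceprofiles.performance.openshift.io -A -o yaml
#   oc get tuneds.tuned.openshift.io -A -o yaml
```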


Thanks. The updated text makes it clear.

@leo8a

leo8a commented Sep 8, 2023

Wondering if such a tool should have a verb to validate the current firmware configuration (via Redfish or other interfaces) of the nodes running the cluster?

I'm aware that this may greatly increase the tool's complexity, especially with some HW vendors codifying features using different codes and differing BMC interface implementations (e.g. Supermicro Redfish endpoints)...

Just asking, given the impact that running firmware configurations has on the performance of OCP cluster nodes, particularly for latency-sensitive workloads.
