Enhancement for Cluster Validation Tool #14
Conversation
First draft of enhancement for a cluster validation tool and spec for reference configuration format. Developed in collaboration with @pixelsoccupied Signed-off-by: Ian Miller <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED
Should this tool support another mode that runs against the policies on the hub? Running against the actual cluster feels like inventing another tool just to verify an existing one.
/retest
Signed-off-by: Ian Miller <[email protected]>
Added content regarding graduation criteria. Signed-off-by: Ian Miller <[email protected]>
For the initial MVP we focused on the lowest common denominator for input (the collection of CRs). Your point is a good one that we could have an optional feature which extracts those CRs from Policy objects (on the hub, in git, etc) as well. Users could also use jsonpath/yaml processors to extract the contained CRs from policies and run them through this tool. For now I think we should defer a Policy input format until a later enhancement. On the overlap question, the main difference is that the Policies capture what users have defined for their deployment. This tool allows verification of their work against the known reference configuration. Both Policy and this tool provide validation, but only this tool provides verification against the known reference.
#### Reference CR Annotations
Annotations attached to the CRs are used for providing additional data used during validation. In
I do not like this much, because the annotations make it to the cluster when the validated config is taken as boilerplate, edited, and applied directly.
I would prefer an approach that keeps the annotations out of a plain `oc apply -f` of the validation pattern. Some other mechanism (side-car files?) would also let me quickly create a new validated reference from a cluster dump, by just copying the meta files in.
Don't get me wrong, it might be the only viable option at the moment, but I am already overloaded by too many annotations when analysing bug reports.
The motivating reason for putting the annotation into the CRs was to keep this data as close as possible (ie the same file) to the CR itself. This aids ongoing maintenance of the reference configuration by not requiring updates/edits to additional content beyond the active CRs.
I agree that this annotation is likely to make it to clusters, but I'm not sure that's necessarily a bad thing. For example, when present it becomes a way to programmatically identify the use case intended for the cluster, and can be used to trigger automated rules checking during analysis of support archives.
If, as noted above, we have a top-level yaml which identifies the name and version of the reference configuration, we could consider removing the "part-of" field here (for all CRs it should match the top-level name). In that case a reference configuration that was completely missing this annotation (eg initially sourced from a captured dump of a cluster) would default to `component-name: ""` (meaning "baseline") and `required: true`, which seem like the correct semantics for that reference.
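To make the trade-off discussed above concrete, a reference CR carrying such an annotation might look roughly like the sketch below. The annotation key, JSON layout, and field names are hypothetical placeholders for the scheme under discussion, not a settled format:

```yaml
# Sketch only: the annotation key and its field names are illustrative.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile
  annotations:
    # component-name: "" would mean "baseline"; required defaults to true
    ref.example.com/reference-metadata: |
      {"part-of": "some-reference-config", "component-name": "workload-tuning", "required": false}
spec: {}
```

A cluster dump whose CRs lack the annotation entirely would then default to the baseline/required semantics described above.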
#### Example Reference Configuration CR
User variable content is handled by golang formatted templating within the reference configuration |
Are you reinventing kustomize here? Is it a good idea to integrate templating, versus using external templating and letting the diff tool do just the diffing?
I believe the reference configuration should be human readable.
The use of templating here is intended to give enough structure and functionality to capture a range of complexity in reference configurations. The templating allows the reference to communicate expected user variation, rules around acceptable/verified values, and optional content (which if included must fall within acceptable bounds as well). Templating languages have a couple of advantages:
- There is already common use of them within the domain of k8s/ocp clusters
- The languages are well understood, documented, and supported by various libraries.
- They are feature rich and support extension when needed
We definitely aren't intending to reinvent kustomize, helm, or any other deployment-oriented tool flow. In this case the templating would be used within the validation tool to pull select expected user variable content from the user's input CRs directly into a "rendered" version of the reference configuration. This results in a CR that can be diff'd against the original input CR using standard diff tools/techniques.
This happens within the context of the validation tool, which is able to analyze both the input CRs and the reference configuration in order to do the correlation (which files get compared against one another). Once correlated, the rendering is done (handling all user variable content) and then diff'd. The actual step of template processing is fairly trivial due to the library support available.
I agree that the reference configuration should be as human readable as possible, but that it should also allow:
- enough structure to capture variations, validations, groups of optional content (with required values within)
- flexibility to address future reference configurations without updating/modifying the validation tool
- other tools/toolchains to be built around the reference (by using a standardized format for the notations)
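As a sketch of the rendering step described above, the following minimal Go program uses the standard `text/template` package to pull expected user-variable values from a parsed user CR into a reference template. The function name and data shapes here are illustrative, not the tool's actual API:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderReference executes a reference-CR template against the user's
// parsed CR, so that expected user-variable fields end up identical on
// both sides of the subsequent diff. Illustrative sketch only.
func renderReference(ref string, userCR map[string]any) (string, error) {
	tmpl, err := template.New("ref").Parse(ref)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, userCR); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Hypothetical fragment of a reference CR with user-variable fields.
	ref := "isolated: {{ .spec.cpu.isolated }}\nreserved: {{ .spec.cpu.reserved }}\n"
	user := map[string]any{
		"spec": map[string]any{
			"cpu": map[string]any{"isolated": "2-31", "reserved": "0-1"},
		},
	}
	out, err := renderReference(ref, user)
	if err != nil {
		panic(err)
	}
	fmt.Print(out) // the rendered reference now matches the user's values
}
```

Because the rendered reference carries the user's own values for expected-variable fields, a plain diff against the user CR reports no drift for them.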
```
{{- maxListLength 1 .spec.hugepages.pages }} #3-1 defined in Go
{{- range .spec.hugepages.pages }}
- size: {{block "validatesize" .}} {{ .size }} {{end}} #3-2 can be defined later during runtime or falls back to .size
  count: {{ .count }} {{ if eq "float64" (printf "%T" .count) }}is a float64{{ else }}is not a float64{{ end }}
```
This is coding already and should never be exposed like this.
I agree. This is not a good example of what we expect in the reference. It was included to show what can be done if/when required but this level of "coding" is not the mainline use or expectation. I'll update the enhancement to remove this and add a better example here.
1. Can consume input configuration CRs from a live cluster
1. Can consume input configuration CRs from a support archive
1. Can consume input from local filesystem (files, directories, etc)
1. Will suppress diffs (not report as a drift) for runtime variable fields (status, managed
Should the list of runtime fields be made configurable / extensible?
First, thanks for the really good review comments!
Initially I think we should follow the behavior of `kubectl`/`oc diff` in terms of fields skipped during comparison. If needed, yes, we can extend the list and/or make it customizable (eg as a phase 2 enhancement).
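A minimal sketch of that suppression, assuming a hard-coded skip list modeled loosely on the fields `kubectl diff` effectively ignores (the real tool's list and traversal may well differ):

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed default skip list; illustrative, not the tool's actual list.
var runtimeFields = []string{
	"status",
	"metadata.resourceVersion",
	"metadata.generation",
	"metadata.managedFields",
	"metadata.creationTimestamp",
	"metadata.uid",
}

// pruneRuntimeFields deletes each dotted path from the unstructured CR
// so runtime-variable fields never surface as drift in the diff.
func pruneRuntimeFields(cr map[string]any) {
	for _, f := range runtimeFields {
		deleteAtPath(cr, strings.Split(f, "."))
	}
}

func deleteAtPath(obj map[string]any, path []string) {
	if len(path) == 1 {
		delete(obj, path[0])
		return
	}
	if child, ok := obj[path[0]].(map[string]any); ok {
		deleteAtPath(child, path[1:])
	}
}

func main() {
	cr := map[string]any{
		"kind":     "PerformanceProfile",
		"metadata": map[string]any{"name": "pp", "resourceVersion": "42"},
		"status":   map[string]any{"conditions": []any{}},
	}
	pruneRuntimeFields(cr)
	fmt.Println(cr) // status and metadata.resourceVersion are gone
}
```

Making `runtimeFields` user-extensible would then be the phase-2 customization mentioned above.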
### Reference Configuration Specification
The reference configuration consists of a set of CRs (yaml files with one CR per file) with |
Will we somehow encode the version of the reference config into the configuration itself? Just so we can always be sure what we got, even if it is just a zip file from somewhere? Relying purely on git branches would probably cause a lot of mess.
In other words, I am looking for a file like /etc/redhat-release in RHEL.
Yes, I think this would be quite valuable. It would likely be helpful to have a consistent "entry point" into the reference configuration which has a version and perhaps other metadata (eg reference configuration name).
Something like:
```yaml
# referenceConfig.yaml
name: "Some Reference Config"
version: x.y.z
```
Thanks for the clarification. I guess this MVP is driven by the need to validate a cluster before troubleshooting? Still, shouldn't the workflow be to validate the git repo or hub first? If that passes, check the compliance status. If there is any gap in our toolchain today for that kind of validation, shouldn't we address that rather than inventing a new tool (which looks like moderate complexity at least)? It would be nice to have a real-time indicator that tells us whether a cluster is still on the reference config or not, maybe a du config policy set? Now I wonder why we don't do it with policy gen today :)
Clarify part of the description and update the example templating with a more relevant example. Signed-off-by: Ian Miller <[email protected]>
One of the goals of the MVP is to assess the compliance of a cluster across various stages of the lifecycle. By focusing on the actual configuration CRs as input, it can be used to evaluate CRs from git, a live cluster, CRs from the support archive (must-gather), etc. This enhancement is intended to be generic enough to be usable across various use cases, deployments, and environments which may or may not have support for Policy.
Signed-off-by: Ian Miller <[email protected]>
1. Extra content – the input configuration contains content not included in the reference
configuration. The tool displays this as informational output. This is not a drift.
Assuming that "content" means CR fields and their values rather than the whole CR objects, I don't see how the proposed implementation of comparing reference vs. rendered CRs would accomplish this. Perhaps the rendering step is meant to collect this information? If so, would it make sense to explicitly mention it there?
Or perhaps it is just me and the rendering step is meant to do more than just "Expected user variable content is pulled from the input CR, validated, and inserted into the rendered reference CR"?
You are correct that the "extra content" here refers to fields in the user's CR that do not exist in the reference CR. When doing the diff, the "left hand side" of the diff is based on the reference CR. The rendering step pulls "expected user variable" content from the user's CR into the reference CR to generate this left hand side. The right hand side is always the user's CR. So, by pulling fields from the user's CR into the rendered CR, the "expected user variable" fields will be identical and thus not report as a difference.
Any content that is in the user CR but not in the reference (and not pulled over into the reference during rendering) is noted as being present but is not considered "drift".
I am not sure what I was thinking yesterday. It seems perfectly clear today. I guess I got lost in the terms. Thanks for the clarification and sorry for the noise.
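The informational "extra content" pass described in that exchange could be sketched as a walk over the user CR that collects dotted paths absent from the rendered reference. Function and variable names here are illustrative, not the tool's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// extraFields returns dotted paths present in the user CR but absent
// from the rendered reference CR. These would be reported as
// informational output, not drift.
func extraFields(ref, user map[string]any, prefix string) []string {
	var extras []string
	for k, uv := range user {
		path := k
		if prefix != "" {
			path = prefix + "." + k
		}
		rv, ok := ref[k]
		if !ok {
			extras = append(extras, path)
			continue
		}
		// Recurse only when both sides are nested objects.
		um, uok := uv.(map[string]any)
		rm, rok := rv.(map[string]any)
		if uok && rok {
			extras = append(extras, extraFields(rm, um, path)...)
		}
	}
	sort.Strings(extras)
	return extras
}

func main() {
	rendered := map[string]any{"spec": map[string]any{"cpu": map[string]any{"isolated": "2-31"}}}
	user := map[string]any{"spec": map[string]any{"cpu": map[string]any{"isolated": "2-31", "offlined": "32-47"}}}
	fmt.Println(extraFields(rendered, user, "")) // [spec.cpu.offlined]
}
```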
configuration (see below for structure/contents of the reference) and the right hand side will be a
collection of the user's configuration CRs. The logical flow of the tool will be:
1. User invokes the tool with the two inputs: `validationTool ./referenceConfig/ ./userConfig/`
1. For each CR in `./userConfig`
In case the input config source is a live cluster, is the tool really supposed to go through all CRs on the cluster, or would there be some boundaries? Would it perhaps make sense to limit the search to objects with apiVersion and kind mentioned (included) in the reference configuration?
Hi @jholecek-rh. Yes, it is poorly described here. When running against a live cluster the set of CRs for "userConfig" will be based on what is included in the reference configuration. I've updated the text to clarify. (thank you!)
Thanks. The updated text makes it clear.
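That correlation might be sketched as a lookup keyed on apiVersion and kind, so a live-cluster run only fetches and compares object types that appear in the reference. The struct and function names below are illustrative, not the tool's API:

```go
package main

import "fmt"

// CR is a minimal stand-in for an unstructured Kubernetes object.
type CR struct {
	APIVersion, Kind, Name string
}

// correlate groups user CRs under the reference CR sharing their
// apiVersion/kind; user CRs with no matching reference are ignored.
func correlate(reference, user []CR) map[CR][]CR {
	wanted := make(map[string]CR)
	for _, r := range reference {
		wanted[r.APIVersion+"/"+r.Kind] = r
	}
	matches := make(map[CR][]CR)
	for _, u := range user {
		if r, ok := wanted[u.APIVersion+"/"+u.Kind]; ok {
			matches[r] = append(matches[r], u)
		}
	}
	return matches
}

func main() {
	ref := []CR{{"performance.openshift.io/v2", "PerformanceProfile", "reference-pp"}}
	user := []CR{
		{"performance.openshift.io/v2", "PerformanceProfile", "my-pp"},
		{"v1", "ConfigMap", "unrelated"}, // not in the reference: skipped
	}
	for r, us := range correlate(ref, user) {
		for _, u := range us {
			fmt.Printf("%s <- %s\n", r.Name, u.Name)
		}
	}
}
```

In the live-cluster case, the `wanted` set would also bound which API resources the tool queries at all.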
Signed-off-by: Ian Miller <[email protected]>
Wondering if such a tool should have a verb to validate the current firmware configuration (via Redfish or other interfaces) of the nodes running the cluster? I'm aware that this may greatly increase the tool's complexity, especially with some HW vendors codifying features using different codes, and varying BMC interface implementations (e.g. Supermicro Redfish endpoints)... Just asking given the impact of running firmware configurations on the performance of OCP cluster nodes, particularly for latency-sensitive workloads.