Generate features directly from TBD Cilium configuration schema #74

Open
briantopping opened this issue Dec 7, 2021 · 8 comments
Labels: area/networking (Networking related) · kind/enhancement (Enhancement, improvement, extension) · lifecycle/rotten (Nobody worked on this for 12 months, final aging stage) · priority/4 (Priority: lower number equals higher priority)

Comments

@briantopping
Member

briantopping commented Dec 7, 2021

How to categorize this issue?
/area networking
/kind enhancement
/priority 4
/assign @briantopping

What would you like to be added:
Generate configuration option passthrough from some type of configuration schema such as https://github.com/cilium/cilium/blob/master/api/v1/openapi.yaml. Some build pattern needs to be defined so older versions of the configuration may be properly matched.

Why is this needed:
Cilium and gardener-extension-networking-cilium face a real challenge in keeping configuration matched between the two. Cilium configuration is in regular flux, and the Gardener extension often cannot pass through newly defined features without changes to the extension container.

@briantopping briantopping added the kind/enhancement Enhancement, improvement, extension label Dec 7, 2021
@gardener-robot gardener-robot added area/networking Networking related priority/4 Priority (lower number equals higher priority) labels Dec 7, 2021
@briantopping briantopping changed the title Generate features directly from Cilium OpenAPI Generate features directly from TBD Cilium configuration schema Dec 7, 2021
@briantopping
Member Author

briantopping commented Dec 7, 2021

Some implementation ideas, in order of increasing disruption to the Cilium team. I'm kind of partial to the flags validator idea, second in the list:

  1. We could just send all flags unvalidated to Cilium and let it crash if it doesn’t like them. This means removing the existing flag-parsing logic from the provider.
  2. Some form of flags validator on the Cilium entry point would act like a "dry run" that only parses the flags. If the parse failed, the existing Cilium parser would reject it with output to stderr, and that would be passed through the provider to the user (see the sketch after this list).
  3. Various structured exports from Cilium are possible:
    1. Use something like https://github.com/octago/sflags. The struct arguments would be reusable in the provider by pulling the annotated structure across from Cilium into the provider and recompiling it.
    2. If the flags structure in Cilium were generated from a well-known schema, that schema could also generate the structure in the provider. The shared content would then be more abstract than just an annotated Go structure.
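
To make option 2 concrete, here is a rough sketch of the kind of dry-run validator I have in mind, assuming a pflag-based entry point like cilium-agent's; the registered flags below are an illustrative subset, not the real flag set:

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/pflag"
)

// validateFlags parses the arguments against the agent's flag set without
// starting the agent. A non-nil error means the configuration would be
// rejected, and the provider could surface that error to the user.
func validateFlags(args []string) error {
	fs := pflag.NewFlagSet("cilium-agent", pflag.ContinueOnError)
	// Illustrative subset of flags; the real entry point would register all of them.
	fs.String("tunnel", "vxlan", "tunnel mode")
	fs.Bool("enable-ipv6", false, "enable IPv6 support")
	// Only parse; do not run anything.
	return fs.Parse(args)
}

func main() {
	if err := validateFlags(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("flags parse cleanly")
}
```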

@rfranzke
Member

rfranzke commented Dec 8, 2021

High-level comment: I'm unsure whether that's ("Generate features directly from TBD Cilium configuration schema") really a goal we should strive for. Typically, we consciously decide which of the configuration options are exposed via the APIs. This is because not all of the configuration switches (1) make sense for end-users to configure or, worse, (2) could potentially harm the system's robustness. I don't know how critical this is for Cilium and whether this also applies here, but for the Shoot API, for example, we - by far - do not expose all the configuration options for the Kubernetes components.
Obviously, this has the downside that exposing each configuration flag means extra effort; OTOH, it also means that the risk for end-users is minimized.

@briantopping
Member Author

Thanks Rafael, your additional concerns are very helpful to consider.

The primary use case I'm seeing is users can't configure something they need without code changes to the provider to pass them through. A savvy user would have to know how to push a PR and then wait for a release if they had an organizational requirement in that regard. In less ideal situations, they would be running from a local fork or simply choose the Calico provider for lack of options. It becomes something of a self-fulfilling outcome that the Cilium provider never gets a following with advanced users -- exactly who it's best for.

I had started a PR for accepting a ConfigMap as providerConfig, but your response helped me realize that this should coexist with the current config code rather than replace it. This gives users a backdoor if they need it, and it could be removed at a later date once config mapping from the provider to Cilium is more fully featured. Does that sound agreeable?
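
Rough sketch of what I mean, purely for illustration and not a proposal for the actual API shape; the field name and types are made up:

```go
package cilium

// NetworkConfig keeps today's validated options and adds an optional
// escape hatch for raw Cilium settings (illustrative only).
type NetworkConfig struct {
	// Existing, validated options stay as they are, e.g.:
	Hubble *Hubble `json:"hubble,omitempty"`

	// ConfigOverrides would be copied verbatim into the cilium-config
	// ConfigMap, overriding any generated values on key collisions.
	ConfigOverrides map[string]string `json:"configOverrides,omitempty"`
}

type Hubble struct {
	Enabled bool `json:"enabled"`
}
```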

@rfranzke
Member

rfranzke commented Dec 9, 2021

I can't judge since I'm not very familiar with Cilium (actually, not at all), hence I'm not sure how "risky" this is w.r.t. user experience and the damage users could do to themselves. OTOH, what you propose sounds similar to what is possible with CoreDNS today, though this depends on a very stable "API", as end-users cannot control the CoreDNS/Cilium versions themselves.
Anyway, I don't want to either block or push this topic; I just added a few observations when reading the issue description. It's probably better if the @gardener/gardener-extension-networking-cilium-maintainers folks add their opinion and give some guidance here.

@ScheererJ
Member

Just to add my two cents:
@briantopping could you please provide a couple of examples where you would have needed such functionality?
In general, I agree with what Rafael already wrote. It is rather a non-goal for us to expose each and every little option/switch, as this usually makes it more complicated for most users/operators to understand what they actually need to set. However, if there is a useful scenario, we can expose almost everything on a case-by-case basis.
Rafael's CoreDNS example is good in that it also shows the limits of this approach. You cannot configure everything you might want to with it, but it may serve well for certain scenarios.

@briantopping
Member Author

Hello @ScheererJ, thanks for your feedback. The first time I needed this functionality, I went through the PR process in #46. This time, it is for a feature flag that might not even make it to production, in support of cilium/cilium#9207. As it stands, I do not know how to support this flag without code changes, so I started asking myself where improvements could be made. Maybe I am missing something.

I completely agree it's a non-goal to manually support every switch; they change over time in a project moving as fast as Cilium (or may never reach production, as above). The original proposal was to make support automatic. The discussion has been valuable in helping me realize that validation may be harmful. Your questions expose requirements for securing a globally delegated deployment, for which RBAC on the config may be the only practical solution.

The current implementation of the provider does no validation of individual switches; it just blindly passes them through to Cilium. Is there intent to add validation to avoid harmful combinations of configuration? Such an endeavor seems computationally infeasible unless the number of allowed switches were extremely small. And then we're back to the original problem of supporting smaller environments with a tiny attack surface and a dependency on bleeding-edge switches.

My k8s CNI experience is with the spec and Cilium. I assume Calico must have extensive permissions like Cilium for route maintenance. It seems plausible that any CNI instance could advertise 0.0.0.0/0 and MITM for any node that accepts the advertisement, leading to potential flow compromise of the entire garden (or more through BGP peering). Can the Calico provider currently protect against a rogue seed like that? Cilium's dependence on BPF maps should also be evaluated to ensure a rogue Cilium container could not be specified for a seed in a manner that the seed can modify root cluster maps.

There are just so many ways that this could all go wrong that I believe configurations should be trusted via RBAC and GitOps methodology, and Gardener should treat this as a documentation problem rather than one of implementation. Promotion of any CNI configuration should require sign-off by global org security rather than depend on trust of the delegated local org. Smaller environments that can't handle this burden may be left vulnerable if they open up their services prematurely; I just don't think it can be handled by any validator that the Gardener team could keep current and flexible over time.

@ScheererJ
Member

The Cilium issue you referenced is indeed intriguing. As you mentioned, we are at least affected with our CoreDNS deployment, even though we should mostly have a somewhat working solution due to having at least two replicas. However, one replica will likely handle all the load, as UDP is much more common for DNS than TCP. Therefore, in general I am all for resolving the issue, helping you verify that the proposed solution from Cilium works, and getting the fix into a Cilium release. (Unfortunately, I did not understand from the issue which feature flag to set. Maybe I skimmed through it too quickly.)
Anyway, in the past the provider config was more or less free-form, which also meant that what you are asking for would have been easier to achieve. With #40 strict validation of the provider config was enabled. The validation may not be complete, but it is a good first step to ensure the provider config has valid content.
What is not yet clear to me is whether you want to pass the Cilium options only to certain shoot clusters or to all of them. If you simply want a passthrough for all of them, I guess the ControllerDeployment values could be used as parameters. The other open question for me is the following: do you simply want to set these feature flags in the cilium config map, or somewhere else?

@briantopping
Member Author

I found this problem through RFC 2136 support for external DNS. You point out that DNS requests are typically UDP, but dynamic updates are TCP, so it's difficult to support rfc2136 without this fix. I have some code around to pick up once the other issues are worked out.

I wasn't familiar with #40, but I believe it will be undisturbed by this proposal. A ConfigMap is valid as long as its data entries can be parsed as YAML; a rough check along those lines is sketched below. That sidesteps the intent of validated Cilium configuration, but it will work. If the consensus is that we should not step away from validating the Cilium config at the individual-option level, it seems like we revert back to the issues originally raised above. I'm sure there are many reasons I'm not aware of why #40 was created in the first place.
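
For illustration only, a minimal sketch of such a check; the function name is made up and this is not part of any existing code:

```go
package cilium

import (
	"fmt"

	"gopkg.in/yaml.v2"
	corev1 "k8s.io/api/core/v1"
)

// validateConfigMapData checks that every data entry of the passed-through
// ConfigMap parses as YAML, which is roughly the level of validation a raw
// passthrough could offer.
func validateConfigMapData(cm *corev1.ConfigMap) error {
	for key, value := range cm.Data {
		var parsed interface{}
		if err := yaml.Unmarshal([]byte(value), &parsed); err != nil {
			return fmt.Errorf("data key %q is not valid YAML: %w", key, err)
		}
	}
	return nil
}
```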

Regarding the config option source scopes: if I understand your question correctly, and given the security considerations, it seems like the local config should be overwritten by KVs from the global scope. That would allow a garden admin to override values from an RBAC scope that the less-trusted local user does not have access to. It seems like both local and global scopes should be supported, and that shouldn't be a problem as long as the providerConfig in each scope allows the type to be set as desired. Using a map[string]string as the backing store would be sufficient to normalize KVs from both source types (see the sketch below).
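
To make the precedence concrete, here is a minimal sketch of the merge I have in mind, assuming both scopes are first normalized to map[string]string; the function and parameter names are illustrative only:

```go
package cilium

// mergeScopedOptions applies shoot-local key/values first and globally
// managed key/values (e.g. from ControllerDeployment values) last, so the
// garden admin's settings always win on key collisions.
func mergeScopedOptions(local, global map[string]string) map[string]string {
	merged := make(map[string]string, len(local)+len(global))
	for k, v := range local {
		merged[k] = v
	}
	for k, v := range global {
		merged[k] = v // global scope takes precedence
	}
	return merged
}
```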

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 12, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 22, 2022