
CNF-13731: Cert Manager HTTP01 Proxy #1773


Open · wants to merge 5 commits into master from enhancement_cert_manager_http01_proxy

Conversation

@sebrandon1 (Member) commented Mar 28, 2025

Draft for adding an enhancement for the Cert Manager HTTP01 proxy.

Based on #1682 for inspiration. Thanks @swghosh for pointing me to that.

cc @mvazquezc

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 28, 2025
@openshift-ci-robot commented Mar 28, 2025

@sebrandon1: This pull request references CNF-13731 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.19.0" version, but no target version was set.

In response to this:

Draft for adding an enhancement for the Cert Manager HTTP01 proxy.

Based on #1682 for inspiration. Thanks @swghosh for pointing me to that.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from ashcrow and celebdor March 28, 2025 19:40
@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 28, 2025
openshift-ci bot commented Mar 28, 2025

Hi @sebrandon1. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2025

### API Extensions

A new CR type may be created and can be applied to clusters. This new type will be stored in the [openshift/api](https://github.com/openshift/api) repo.


Is a new CR required?

Member Author

This is something I discussed with @mvazquezc as a possibility: a new CR that you would apply to get the proxy applied to the nodes.

Contributor

What group will this new CRD be created under? All new APIs will need to be tied to a TPNU feature-gate and start as v1alpha1.

What is this API going to look like?

Member Author

I started a rough attempt at creating a new type here under v1alpha1, but haven't turned it into a PR yet because I figured the enhancement would need to come first.

Should I create a PR there as well?

Contributor

@sebrandon1 I think it is reasonable to link to a PR in the openshift/api repo as part of the enhancement for review purposes.

That being said, if you do link to a PR, please share the YAML representation of how users would interact with the API as part of this section.

Member Author

PR opened here: openshift/api#2318

And I will add an example CR in this section.

@sebrandon1 (Member Author) commented May 12, 2025

The CR has been added. I modified the spec section per a comment from @imiller0 a while ago, which basically makes it an empty section.

EDIT: To clarify, because the proxy is either on or off, just the existence of the CR would act as the enable/disable switch.
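For illustration, a minimal sketch of what such a presence-only CR could look like (the API group and version here are assumptions based on the discussion and the openshift/api#2318 draft, not a final API; the kind and metadata follow the example quoted later in this thread):

```yaml
# Hypothetical sketch only: the group/version is an assumption, not the final API.
apiVersion: proxy.openshift.io/v1alpha1
kind: HTTP01ChallengeProxy
metadata:
  name: example-http01challengeproxy
  namespace: default
spec: {}   # intentionally empty; creating the CR enables the proxy, deleting it disables it
```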

Contributor

The example CR looks like a namespaced resource.

Does this mean that I can have multiple proxies running? What happens in that case?

Regarding empty spec, I'm a little skeptical of this. There is nothing that users may want to be able to configure for the challenge proxy? Nothing that we would require some user input for?

If there truly is nothing to include in a spec, is there a better location for an enable/disable-like API to be added?


### Drawbacks

1. **Dependency on nftables**: The solution relies on `nftables`, which may not be available or enabled in all environments.


nftables is available on control plane (and worker) nodes.

Contributor

Is it always available? What is the expected behavior if it is not available but someone attempts to use this feature?

Member Author

That might be a better question for @imiller0 or @mvazquezc: what happens if nftables doesn't exist? I can update the enhancement proposal with whatever gets discussed.

@sebrandon1 sebrandon1 force-pushed the enhancement_cert_manager_http01_proxy branch from faf4430 to b77ce7e Compare April 28, 2025 20:38
@everettraven (Contributor)

It looks like the markdownlint job is, correctly, failing due to missing sections:

enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "### Goals"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "## Proposal"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "#### Hypershift / Hosted Control Planes"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "#### Standalone Clusters"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "#### Single-node Deployments or MicroShift"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "## Alternatives (Not Implemented)"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "## Test Plan"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "## Graduation Criteria"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "### Dev Preview -> Tech Preview"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "### Tech Preview -> GA"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "### Removing a deprecated feature"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "## Upgrade / Downgrade Strategy"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "## Version Skew Strategy"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "## Operational Aspects of API Extensions"
enhancements/cert-manager-proxy/http01_challenge_proxy.md missing "## Support Procedures"

I'm not sure what the policy is on removing certain sections altogether, but at the very least I would expect to still see the following sections:

  • Hypershift / Hosted Control Planes
  • Standalone Clusters
  • Single-node Deployments or MicroShift
  • Test Plan
  • Graduation Criteria
  • Dev Preview -> Tech Preview
  • Tech Preview -> GA
  • Upgrade / Downgrade Strategy
  • Version Skew Strategy
  • Operational Aspects of API Extensions
  • Support Procedures

If you need a more explicit template to follow for this EP, there is one with some good general advice for what should be in each section here: https://github.com/openshift/enhancements/blob/master/guidelines/enhancement_template.md

@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2025
@sebrandon1 sebrandon1 force-pushed the enhancement_cert_manager_http01_proxy branch from ae29d68 to 21b454b Compare May 12, 2025 17:41
openshift-ci bot commented May 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented May 12, 2025

@sebrandon1: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | bc39b89 | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.



### Implementation Details/Notes/Constraints

- The proxy will be deployed as a DaemonSet to ensure it runs on all nodes which may host the API VIP in the cluster.
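As a rough, hypothetical sketch of what such a DaemonSet could look like (the namespace, labels, image, and node selector below are placeholders for illustration, not part of the proposal):

```yaml
# Hypothetical sketch only; namespace, labels, and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: http01-challenge-proxy
  namespace: openshift-http01-challenge-proxy   # placeholder namespace
spec:
  selector:
    matchLabels:
      app: http01-challenge-proxy
  template:
    metadata:
      labels:
        app: http01-challenge-proxy
    spec:
      hostNetwork: true                          # proxy listens on the node's port 8888
      nodeSelector:
        node-role.kubernetes.io/master: ""       # nodes that may host the API VIP
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
        - name: proxy
          image: example.com/http01-challenge-proxy:latest   # placeholder image
          ports:
            - containerPort: 8888
              hostPort: 8888
```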
Contributor

What will deploy the DaemonSet? Will this be a default component on OpenShift clusters? Does it need to be a part of the core payload or can it be a "layered product"?

Member Author

@mvazquezc can correct me if I'm wrong but this will be an optional feature that Telco customers can enable.

Contributor

What approaches have you considered for allowing this to be an opt-in feature?

### Implementation Details/Notes/Constraints

- The proxy will be deployed as a DaemonSet to ensure it runs on all nodes which may host the API VIP in the cluster.
- The nftables rules will be added to the nodes. The proxy will listen on port 8888 and redirect traffic to the OpenShift Ingress Routers.
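To make the node-level change concrete, here is a hedged sketch of how the redirect rule might be delivered as an nftables ruleset through the MCO (shown in Butane form for readability; the file path, table name, and Butane version are assumptions, and both wiring the file into the nftables service and matching only the API VIP address are omitted):

```yaml
# Hypothetical sketch only; the PoC's actual MCO templates may differ.
variant: openshift
version: 4.14.0                       # illustrative Butane spec version
metadata:
  name: 99-master-http01-challenge-proxy
  labels:
    machineconfiguration.openshift.io/role: master
storage:
  files:
    - path: /etc/nftables/http01-challenge-proxy.nft   # assumed path
      mode: 0644
      overwrite: true
      contents:
        inline: |
          # Redirect inbound HTTP-01 challenge traffic (port 80) to the proxy on port 8888.
          # A production rule would likely also match the API VIP destination address.
          table ip http01_challenge_proxy {
            chain prerouting {
              type nat hook prerouting priority dstnat; policy accept;
              tcp dport 80 redirect to :8888
            }
          }
```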
Contributor

How does this happen? What are the potential impacts of making modifications to node configurations?

Member

That's done via code. In OCP versions >4.17 adding nftables rules won't require node reboots if the MCO gets configured appropriately. In our PoC code you can see how the nftables rules are created here. Configuration templates are here.

In terms of potential impacts: it could cause disruption if there are other services listening on the port we configured (since this is supposed to run on control-plane nodes, the impact should be limited, as we don't expect anything we don't control to be running there).

Contributor

That's done via code.

Please explain, in sentences, the steps that will be done via code to do this. What interactions with other components will be needed? What considerations are in place for different scenarios?

The enhancement should be the source of record for decisions being made. Code changes, even PoC code, may leave future readers of the enhancement unable to follow the historical decisions as easily. Readers of enhancements shouldn't have to dig through code anyway to get an understanding of what is happening with the system as part of this proposal.


- The proxy will be deployed as a DaemonSet to ensure it runs on all nodes which may host the API VIP in the cluster.
- The nftables rules will be added to the nodes. The proxy will listen on port 8888 and redirect traffic to the OpenShift Ingress Routers.
- The implementation relies on `nftables` for traffic redirection, which must be supported and enabled on the cluster nodes.
Contributor

What should happen if nftables is not enabled on the cluster nodes?

Member

AFAIK nftables is required for OCP nodes to run. OVN-K requires it (maybe other CNIs don't), but I still don't think it can be disabled.

Contributor

Any references that state it cannot be disabled? Isn't nftables a systemd service that runs on the nodes? Can someone modify the configurations that MCO applies to mark it as disabled?

Maybe we should consult someone from the MCO team to verify whether or not nftables can be disabled?

Contributor

Maybe @yuqi-zhang ?

Contributor

The MCO doesn't directly manage nftables in any way on the nodes. I believe it's being shipped (and mostly configured) in RHCOS directly.

As a user, you can technically do whatever you want, including disabling nftables via a MachineConfig, but that would be an explicit user-driven decision and not something we manage.

message: "HTTP01ChallengeProxy is ready"
```

### Implementation Details/Notes/Constraints
Contributor

I left some comments with some further questions on how this is being done, but I'd like to see a more descriptive section here that explains how certain aspects are going to be implemented.

For example, does this need to be a core component of the OpenShift payload that is shipped with every cluster? If so, you may want to consider the cluster operator approach: https://github.com/openshift/enhancements/blob/master/dev-guide/operators.md

If not, could this be implemented as an operator installable through OLM?

Should multiple instances of a proxy be allowed to run? If so, what are some issues that may be encountered when running multiple instances? How do you plan to mitigate these issues?

Member

Answering only the last part, will let other chime in for the first part.

No, multiple instances of a proxy are not allowed to run. There would be port collisions, since the port used (8888) is listened on in the host network namespace. This can be mitigated by checking whether the port is in use when initializing the proxy. During upgrades, etc., the idea is to use the recreate strategy so the existing proxy gets deleted before the new one gets created.

Contributor

How do you plan to ensure that only a single instance of the proxy runs? Is this one proxy per node? One proxy per cluster?

If there is risk of port collision, would it make sense to allow configuration of the port from the user perspective?


- **Monitoring**: Logs and metrics will be exposed to help administrators monitor the proxy's behavior and troubleshoot issues.
- **Resource Usage**: The proxy's resource requirements will be minimal, as it only handles HTTP01 challenge traffic.
- **Failure Recovery**: Health checks will ensure that the proxy is running correctly, and failed pods will be automatically restarted.
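For example (values, limits, and the health endpoint below are illustrative assumptions, not part of the proposal), the proxy container in the pod template could carry modest resource requests and a liveness probe along these lines:

```yaml
# Fragment of a pod template; all values are illustrative assumptions.
containers:
  - name: proxy
    image: example.com/http01-challenge-proxy:latest   # placeholder image
    resources:
      requests:
        cpu: 10m
        memory: 32Mi
      limits:
        memory: 64Mi
    livenessProbe:
      httpGet:
        path: /healthz        # hypothetical health endpoint
        port: 8888
      periodSeconds: 10
      failureThreshold: 3     # kubelet restarts the container after repeated failures
```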
Contributor

What happens if the proxy enters a CrashLoopBackoff? If there is a negative impact to the cluster, how can an admin/user disable or remove the proxy?


## Upgrade / Downgrade Strategy

Updated versions of the proxy can be applied to the cluster similarly to the initial deployment.
Contributor

Beyond the proxy itself, how do you plan to handle typical upgrade situations where there may be older and newer instances of the proxy running at the same time due to a rolling upgrade? Are there any special considerations needed for this scenario?

What happens if an upgrade fails midway through? What steps would a user need to take to get their cluster back to the previous state?

Member

We can't have multiple instances of the proxy running; they listen on port :8888 in the host network. We can control that with a recreate policy for the deployment/daemonset. In terms of new/old versions, the only thing that may change (and I doubt it) is the nftables rule that redirects traffic from port 80 to the proxy port 8888, but we should still ensure backwards compatibility in the proxy code.
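A hedged sketch of what that could look like, depending on whether the proxy ends up as a Deployment or a DaemonSet (field values are illustrative):

```yaml
# Deployment variant: Recreate deletes the old pod before creating the new one,
# so two proxies never contend for host port 8888.
spec:
  strategy:
    type: Recreate
---
# DaemonSet variant: there is no Recreate strategy, but a rolling update with
# maxSurge: 0 and maxUnavailable: 1 gives the same delete-before-create behavior per node.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
```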


## Version Skew Strategy

Any changes to the proxy's behavior will be documented to ensure compatibility with older cluster versions.
Contributor

Do there need to be any special considerations for the interactions between the component that deploys the proxy and updates node configurations, and the component that rolls out node configuration changes?

During upgrades there will always be version skew between components, so capturing how your component handles this is helpful.


## Support Procedures

Support for the proxy will be provided through standard OpenShift support channels. Administrators can refer to the deployment documentation and logs for troubleshooting.
Contributor

I'd like to see a bit more information here; "standard support channels" isn't all that helpful in representing how someone providing support can evaluate whether this component is behaving correctly, and what steps they can take to mitigate issues it may be causing on the system.

The template section has some good examples for things to take into consideration: https://github.com/openshift/enhancements/blob/master/guidelines/enhancement_template.md#support-procedures

name: example-http01challengeproxy
namespace: default
spec:
# Add fields here to specify the desired state of the HTTP01ChallengeProxy
Contributor

The motivation and non-goals state that only one endpoint, api.cluster.example.com, is in scope for this work. Thinking ahead, are there any other endpoints that users may want to proxy that we would want to support?

What other things might a user want to configure here, either now or long-term?

Member

That's the only user-facing endpoint exposed by the platform that isn't routed through the ingress controller; I don't think we need to support other endpoints.

Contributor

One thing I noticed is that the proposal specifies the proxy as hardcoded to port 8888. Why this port specifically? How should users handle a scenario where something else critical to their system needs to run on port 8888? Would it be worth making the port a configurable value?
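If the port were made configurable, the spec could grow a field along these lines (purely a sketch of the reviewer's suggestion; the group/version and field name are assumptions):

```yaml
# Hypothetical field, not part of the current proposal.
apiVersion: proxy.openshift.io/v1alpha1   # assumed group/version
kind: HTTP01ChallengeProxy
metadata:
  name: example-http01challengeproxy
  namespace: default
spec:
  listenPort: 8888   # hypothetical field; would default to 8888 if omitted
```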
