[Feature] User-specific routable Gefyra bridge ("user bridge") #733

Open
Schille opened this issue Oct 30, 2024 · 6 comments
Labels
enhancement 🎉 New feature or request v2

Comments

@Schille (Collaborator) commented Oct 30, 2024

Intro

Gefyra currently supports only the "global bridge". See https://gefyra.dev/docs/run_vs_bridge/#bridge-operation to learn more.
In short, a container within a running Pod (or multiple replicas) is replaced by a Gefyra component called Carrier. This allows Gefyra, within some constraints, to route traffic that was originally targeted at the specified Pod in the cluster to a local container.

[Diagram: gefyra-bridge-action.drawio]

This capability helps debug local containers using real traffic from the cluster rather than synthetic local traffic. However, the bridge is currently effective for all traffic directed to the bridged Pod, which is sometimes undesirable. It also means that only one bridge can exist per Pod, so only one user can bridge a Pod at a time. With this feature proposal, we aim to lift that limitation in a flexible yet robust way.

This feature addresses the following issues:

Remark: Remember that one of Gefyra's fundamental premises is not to interfere with the Kubernetes objects of your workloads. The proposed feature draft does not involve modifying existing deployment artifacts. Why? If something goes wrong (as things often do), we want Gefyra users to be able to restore the original state simply by deleting Pods. There may be residual objects or other additions, but they must never disrupt the operations of the development cluster; Gefyra aims to minimize this risk and treats any such disruption as a bug.

What is the new feature about?

Gefyra's bridge operation will support specific routing configurations to intercept only matching traffic, allowing all unmatched traffic to be served from within the cluster. Multiple users will be able to intercept different traffic simultaneously, receiving it on their local container (started by gefyra run ...) to serve it with local code.

[Diagram: gefyra-personal-bridge.drawio]

Departure

The main components involved in establishing a Gefyra bridge are:

  • Gefyra Client: Requests a Gefyra bridge that is globally effective.
  • Gefyra Operator: Acts on the bridge request by setting up the target Pod with Carrier, establishing a chain of reverse proxies into the client network (including Kubernetes objects), and reporting back the result of the operation.
  • Gefyra Stowaway (connection provider): Dynamically creates a reverse proxy for the specific bridge into the client network.
  • Gefyra Carrier (bridge provider): Replaces running containers and proxies incoming TCP/UDP traffic to the reverse proxy chain set up by the Operator.

Remark: Gefyra's cluster component architecture consists of different interfaces. The connection provider and bridge provider are two abstract concepts with defined interfaces. "Stowaway" and "Carrier" are the current concrete implementations of these interfaces. However, depending on the results of this implementation, I expect at least the latter to be replaced by a new component (perhaps Carrier2?). For consistency, I will continue to use these component names.

Overview

[Diagram: gefyra-personal-bridge1.drawio]

Carrier

[Diagram: gefyra-personal-bridge2.drawio]

Currently, Carrier is installed into 1 to N Pods. Each instance upstreams any incoming traffic ("port x") to a single target endpoint ("upstream-1"). This process does not involve traffic introspection: IP packets come in and are sent out as-is. This setup is simple and fast.
Carrier is based on the Nginx server and thus is configured using the stream directive: https://nginx.org/en/docs/stream/ngx_stream_core_module.html#stream
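
To make this concrete, here is a small sketch of the kind of single-upstream stream configuration Carrier effectively runs today; the rendering function and the upstream address are invented for illustration and are not part of Gefyra's actual code:

# Hypothetical sketch: render an nginx "stream" configuration that forwards all
# traffic arriving on one port to exactly one upstream, without introspection.
def render_stream_config(listen_port: int, upstream_host: str, upstream_port: int) -> str:
    return f"""
stream {{
    upstream upstream-1 {{
        server {upstream_host}:{upstream_port};
    }}
    server {{
        listen {listen_port};
        proxy_pass upstream-1;  # packets are forwarded as-is, no introspection
    }}
}}
"""

# e.g. forward everything on "port x" (8080) to an assumed Stowaway proxy route
print(render_stream_config(8080, "gefyra-stowaway-proxy.gefyra.svc.cluster.local", 10001))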

Feature draft

Stage 1: Installation & keep original Pods around to serve unmatched traffic

When a compatible service is bridged, we need the original workloads to keep serving any traffic that is not matched by a user bridge.
Consider the following example: a compatible workload <Y> is selected by a Kubernetes service object. This workload consists of 3 Pods.

[Diagram: gefyra-personal-bridge4.drawio]

Once a user bridge is requested, Gefyra's Operator replicates all essential components (most importantly, the Pods and the service) by cloning and modifying them. Each cloned Pod (e.g. <Y1'>) is modified on the fly so that it is selected by service <Y'>; the Pods <Y1'>, <Y2'> and <Y3'> must not be selected by service <Y>. Most other parameters, such as mounts, ports and probes, should remain unchanged.
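
As a rough illustration of this relabeling step (not actual Operator code; the function name, label key, and naming suffix are made up for the example):

# Hypothetical sketch: clone a Pod so that it is no longer selected by the
# original service <Y> but only by the shadow service <Y'>.
from typing import Dict
from kubernetes import client

def clone_pod_for_shadow_service(
    pod: client.V1Pod, selector_labels: Dict[str, str], suffix: str = "-gefyra"
) -> client.V1Pod:
    # drop the labels that service <Y> selects on, so the clone is not picked up by it
    labels = {
        k: v for k, v in (pod.metadata.labels or {}).items() if k not in selector_labels
    }
    # add a label that only the shadow service <Y'> selects (label key is an assumption)
    labels["gefyra.dev/shadow"] = "true"
    return client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"{pod.metadata.name}{suffix}",
            namespace=pod.metadata.namespace,
            labels=labels,
        ),
        # mounts, ports, probes, etc. remain unchanged; real code would also reset
        # runtime-only fields such as node_name and status before creating the clone
        spec=pod.spec,
    )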

[Diagram: gefyra-personal-bridge5.drawio]

The cloned workload infrastructure remains active as long as at least one Gefyra user bridge is active.

The Gefyra Operator installs Carrier into the target Pods (<Y1>, <Y2> and <Y3>) and dynamically configures them to send all unmatched traffic to the cloned infrastructure <Y'>. This setup ensures:

  • No existing workloads are modified (except for temporary image changes).
  • Common traffic can still be served, with just one additional hop.
  • If the cluster setup is interrupted, it can be easily restored by re-rolling out the source of the ReplicaSet <Y>.

[Diagram: gefyra-personal-bridge7.drawio]

Of course, if there is a different replication factor or another deployment scenario (e.g., a standalone Pod), the Gefyra Operator adapts accordingly. I hope the idea makes sense.

Stage 2: Add a local upstream & redirect matching traffic

The Carrier component will require significant changes as we shift from a “stream”-based proxy to a more advanced proxy ruleset, incorporating path and header matching for HTTP, along with routing rules for other protocols in the future.
Fortunately, the required changes in the Gefyra Operator are not as extensive as those in Carrier. Several interfaces already support creating different routes within the connection provider ("Stowaway") and bridge provider abstractions.
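
To make the matching idea concrete, here is a minimal sketch of how a bridge provider could pick an upstream based on HTTP path and header rules; the data structures and function are illustrative only and not part of Carrier:

# Illustrative only: route a request to a user's local upstream if its path and
# headers match that user's bridge rule, otherwise to the cloned workload <Y'>.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class MatchRule:
    """One user bridge: route matching requests to this user's local upstream."""
    upstream: str                                           # e.g. "10.200.0.5:8080" (client network)
    path_prefix: Optional[str] = None                       # e.g. "/api/objects"
    headers: Dict[str, str] = field(default_factory=dict)   # e.g. {"owner": "john"}

def select_upstream(path: str, headers: Dict[str, str],
                    rules: List[MatchRule], default_upstream: str) -> str:
    for rule in rules:
        path_ok = rule.path_prefix is None or path.startswith(rule.path_prefix)
        headers_ok = all(headers.get(k) == v for k, v in rule.headers.items())
        if path_ok and headers_ok:
            return rule.upstream
    # unmatched traffic is served from within the cluster
    return default_upstream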

[Diagram: gefyra-personal-bridge8.drawio]

Interface reference for connection providers (Stowaway):

@abstractmethod
def add_destination(
    self,
    peer_id: str,
    destination_ip: str,
    destination_port: int,
    parameters: Optional[Dict[Any, Any]] = None,
) -> str:
    """
    Add a destination route to this connection provider proxy, returns
    the service URL
    """
    raise NotImplementedError

@abstractmethod
def remove_destination(
    self, peer_id: str, destination_ip: str, destination_port: int
):
    """
    Remove a destination route from this connection provider proxy
    """
    raise NotImplementedError

@abstractmethod
def destination_exists(
    self, peer_id: str, destination_ip: str, destination_port: int
) -> bool:
    """
    Returns True if the destination exists, otherwise False
    """
    raise NotImplementedError

@abstractmethod
def get_destination(
    self, peer_id: str, destination_ip: str, destination_port: int
) -> str:
    """
    Returns the service URL for the destination
    """
    raise NotImplementedError

Interface reference for bridge providers (Carrier, Carrier2):

@abstractmethod
def add_proxy_route(
    self,
    container_port: int,
    destination_host: str,
    destination_port: int,
    parameters: Optional[Dict[Any, Any]] = None,
):
    """
    Add a new proxy_route to the bridge provider
    """
    raise NotImplementedError

@abstractmethod
def remove_proxy_route(
    self, container_port: int, destination_host: str, destination_port: int
):
    """
    Remove a bridge from the bridge provider
    :param proxy_route: the proxy_route to be removed in the form of IP:PORT
    """
    raise NotImplementedError
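
A hedged sketch of how the Operator could wire a user bridge through these two abstractions; the function, the connection-URL handling, and the concrete values are invented for illustration (the providerParameter example follows the CRD comments shown below):

# Illustrative only: establish a user bridge via the connection provider
# (Stowaway) and the bridge provider (Carrier/Carrier2) abstractions.
def establish_user_bridge(connection_provider, bridge_provider, peer_id: str,
                          client_ip: str, client_port: int, container_port: int):
    # 1. create (or look up) the reverse-proxy route into the client network
    if not connection_provider.destination_exists(peer_id, client_ip, client_port):
        service_url = connection_provider.add_destination(peer_id, client_ip, client_port)
    else:
        service_url = connection_provider.get_destination(peer_id, client_ip, client_port)

    # 2. tell the bridge provider to send matching traffic to that route
    host, port = service_url.rsplit(":", 1)
    bridge_provider.add_proxy_route(
        container_port=container_port,
        destination_host=host,
        destination_port=int(port),
        parameters={"type": "http", "url": "/api"},
    )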

Rules

The GefyraBridge CRD already supports an arbitrary set of additional configuration parameters for the bridge provider.

# provider specific parameters for this bridge
# a carrier example: {"type": "stream"} to proxy all traffic on TCP level
# to the destination (much like a nitro-speed global bridge)
# another carrier example: {"type": "http", "url": "/api"} to proxy all
# traffic on HTTP level to the destination
# sync_down_directories
"providerParameter": k8s.client.V1JSONSchemaProps(
    type="object", x_kubernetes_preserve_unknown_fields=True
),

For HTTP traffic, the routing parameters appear to be quite obvious:

  • path matching (e.g. /api/objects/5)
  • header matching (e.g. owner: john)

Each user bridge adds a new entry to the upstream servers for Carrier, along with an additional (verified) matching rule. The operator's validating webhook should implement matching rule validation to catch common mistakes (e.g., a rule already applied by another user or a rule that never fires due to another bridge capturing all traffic). If a matching rule is invalid, the creation of the GefyraBridge is halted immediately.
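
For example, two user bridges on the same Pod could pass provider parameters along these lines; the "type" and "url" keys follow the CRD comments above, while the "headers" key is an assumed extension:

# Hypothetical providerParameter payloads for two concurrent user bridges.
bridge_john = {"type": "http", "url": "/api", "headers": {"owner": "john"}}
bridge_jane = {"type": "http", "url": "/api/reports"}

# The validating webhook would reject overlapping or unreachable rules, e.g. a
# second bridge repeating john's rule, or a catch-all {"type": "stream"} bridge
# submitted while HTTP user bridges are active on the same Pod.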

Stage 3: Remove a user bridge

Removing a bridge is a two-phase process:

  1. The deletion request from the Gefyra Client prompts the Operator to initiate the bridge removal.
  2. Both the bridge provider and the connection provider are called upon to delete their respective routing configurations.

Remove the last user bridge & clean up

If Stage 3 removes the last active bridge for a Pod, the uninstallation procedure is triggered. This process includes resetting the patched Pods (<Y1>, <Y2> and <Y3>) to their original configuration and removing the cloned infrastructure (Pod <Y1'>, <Y2'>, <Y3'> and service <Y'>).
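
A hedged sketch of how this removal and clean-up could look against the provider interfaces above; the function and the clean-up stub are invented and not actual Operator code:

# Illustrative only: tear down a user bridge in two phases and trigger the
# clean-up once the last bridge for a Pod is gone.
def uninstall_carrier_and_cloned_infrastructure():
    # placeholder: reset the patched Pods <Y1..Y3> to their original images and
    # delete the cloned Pods <Y1'..Y3'> and service <Y'>
    ...

def remove_user_bridge(connection_provider, bridge_provider, peer_id: str,
                       client_ip: str, client_port: int, container_port: int,
                       remaining_bridges: int):
    service_url = connection_provider.get_destination(peer_id, client_ip, client_port)
    host, port = service_url.rsplit(":", 1)
    # phase 1: remove the routing configuration from the bridge provider (Carrier)
    bridge_provider.remove_proxy_route(container_port, host, int(port))
    # phase 2: remove the reverse-proxy route from the connection provider (Stowaway)
    connection_provider.remove_destination(peer_id, client_ip, client_port)

    if remaining_bridges == 0:
        uninstall_carrier_and_cloned_infrastructure()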

Closing remarks

  • I would like to retain the current implementation of the "global bridge," as it is a high-performance solution. Therefore, we should add a new flag, gefyra bridge ... --global, to enable the global bridge with its current behavior.
  • Setting up the first user bridge for a Pod/Deployment will be more time-intensive than setting up subsequent user bridges (since Stage 1 is triggered), so we should handle this appropriately.
  • This concept is likely not applicable to RWO StatefulSets, as we cannot simply clone a StatefulSet Pod with an RWO mount; this limitation should be conveyed with an error message.
  • I currently don’t have a solution for HTTPS traffic interception by Carrier. One possibility could be to add a parameter in the bridge request to direct Carrier to the PKI, or to support the creation of custom bridge provider images. (The UID/GID issues remain as well.) With a custom image, we could integrate specific bridge parameters and designate this image for a particular Pod.

This feature is currently in the ideation phase. I would appreciate any external feedback on how to make this as useful and robust as possible. If you want to talk to me about this (or Gefyra in general), please find me on our Discord server: https://discord.gg/Gb2MSRpChJ

I am also looking for a co-sponsor of this feature. If you or your team want to support this development, please contact me.

@liquidiert (Collaborator) commented:

First off: great RFC, @Schille!
Just one quick question: what happens to the shadow infrastructure when the original changes while a bridge is active? I'm sure there's already handling for this case when using a global bridge, but what is the appropriate procedure here?

@Schille (Collaborator, Author) commented Oct 31, 2024

@liquidiert Gotcha! A rollout of the original workloads would render the bridge useless since Gefyra's patch would be reset. The Operator should reconcile all bridges, detect that situation, and take appropriate action (patching again so the user bridges work again), or declare the existing user bridges stale and remove them.

@liquidiert (Collaborator) commented:

@Schille that sounds like a good reconciliation tactic, thanks!

@Schille Schille changed the title [WIP][Feature] User-specific routable Gefyra bridge ("user bridge") [Feature] User-specific routable Gefyra bridge ("user bridge") Oct 31, 2024
@Schille Schille added enhancement 🎉 New feature or request v2 labels Oct 31, 2024
@crkurz commented Nov 4, 2024

This looks terrific, @Schille! Thanks a lot!

Please allow me to add some questions:

  1. Do we need to call out that multiple users can bridge multiple services?
  2. Nit: Terminology: does it make sense to change "and the removal of the phantom infrastructure..." to "and the removal of the cloned infrastructure"? (just to avoid an extra name)
  3. Are there any limitations which apply to infra cloning? Things a pod/service configuration must or must not have, e.g. node or other affinity, or special session handling/routing? Should I try to get Anton's/Rohit's thoughts here, e.g. around special handling for WebSockets with their need for cross-user session handling?
  4. Are there chances for any impact on validity of server certificates due to the traffic redirection?
  5. How long do we expect setup (or tear-down) of the cloned infra to take, and for how much of this time do we expect the regular service to be non-responsive? In case this takes a bit more time, do we need an option to preserve the cloned infra even after removal of the last bridge, or even an option to explicitly install it independently of bridge setup?

Again, great feature! Thank you, @Schille

@Schille (Collaborator, Author) commented Nov 7, 2024

@crkurz Thank you.

To your questions:

  1. You should already be able to bridge multiple services simultaneously. If that's unclear, we must add that bit to the docs.
  2. You are right. I changed it.
  3. I don't see more limitations than mentioned. Since we'll clone the pods with all attributes (except for the selector-relevant labels) I don't expect affinity issues. But the more people who join the party, the better it is: I would welcome it if you would take up Anton/Rohit's thoughts on this.
  4. Yes, that's not 100% clear as of now. We must find a solution to tell Carrier which certificates to use to introspect SSL traffic and decide on the route.
  5. That depends. Small apps - short setup time. Java - huge setup time. =) I thought about that too, and I am tempted to agree to a concept that represents the bare installation of a bridge without actually having a single user to match traffic.

@Schille Schille pinned this issue Nov 18, 2024
@crkurz commented Dec 17, 2024

Hi @Schille,
I had another pass and would like to share some additional points:

  1. Do we need to spell out what "compatible service" means?
  2. With the first user bridge we duplicate Services; what kind of limitations/issues can we have, e.g. outdated files? Because now these are long-running just like the original pods, unlike local containers. Do we need to apply things like key rotation to these Gefyra-created pods?
  3. What is our plan to deal with HPA? With these shared bigger cluster installations, HPA is something we may need to handle. Not in alpha1, but at some point.
  4. Does it make sense to change some image titles? I tend to misinterpret some of them.
    “Installing Personal Bridge 1” -> “Installing Personal Bridge - Setup before”
    “Installing Personal Bridge 2” -> “Installing Personal Bridge - Replicate existing Service”
    “Installing Personal Bridge 3” -> “Installing Personal Bridge - Install Carrier image for default route”
  5. I think we need some discussion of a plan for handling HTTPS?
  6. At some point we need to detail how to avoid conflicts between the gefyra-bridge setup (and its shadow deployment) and auto-deployments from a CI/CD pipeline, and how to restore the whole gefyra-bridge setup (and its shadow deployment) in case everything was overwritten by some redeployment.
