[Feature] User-specific routable Gefyra bridge ("user bridge") #733

Open
Schille opened this issue Oct 30, 2024 · 6 comments
Labels
enhancement 🎉 New feature or request v2

Comments

@Schille (Collaborator) commented Oct 30, 2024

Intro

Gefyra currently supports only the "global bridge". See https://gefyra.dev/docs/run_vs_bridge/#bridge-operation to learn more.
In short, a container within a running Pod (or multiple replicas) is replaced by a Gefyra component called Carrier. This allows Gefyra, within some constraints, to route traffic that was originally targeted at the specified Pod in the cluster to a local container.

[Diagram: gefyra-bridge-action.drawio]

This capability helps debug local containers using real traffic from the cluster rather than synthetic local traffic. However, the bridge is currently effective for all traffic directed to the bridged Pod, which is sometimes undesirable. It also means that only one bridge can exist per Pod, so only one user can bridge a Pod at a time. With this feature proposal, we aim to lift that limitation in a flexible yet robust way.

This feature addresses the following issues:

Remark: Remember that one of Gefyra's fundamental premises is not to interfere with the Kubernetes objects of your workloads. The proposed feature draft does not involve modifying existing deployment artifacts. Why? If something goes wrong (as things often do), we want Gefyra users to be able to restore the original state simply by deleting Pods. There may be residual objects or other additions, but they must never disrupt the operations of the development cluster; Gefyra aims to minimize this risk and treats any such disruption as a bug.

What is the new feature about?

Gefyra's bridge operation will support specific routing configurations to intercept only matching traffic, allowing all unmatched traffic to be served from within the cluster. Multiple users will be able to intercept different traffic simultaneously, receiving it on their local container (started by gefyra run ...) to serve it with local code.

[Diagram: gefyra-personal-bridge.drawio]

Departure

The main components involved in establishing a Gefyra bridge are:

  • Gefyra Client: Requests a Gefyra bridge that is globally effective.
  • Gefyra Operator: Acts on the bridge request by setting up the target Pod with Carrier, establishing a chain of reverse proxies into the client network (including Kubernetes objects), and reporting back the result of the operation.
  • Gefyra Stowaway (connection provider): Dynamically creates a reverse proxy for the specific bridge into the client network.
  • Gefyra Carrier (bridge provider): Replaces running containers and proxies incoming TCP/UDP traffic to the reverse proxy chain set up by the Operator.

Remark: Gefyra's cluster component architecture consists of different interfaces. The connection provider and bridge provider are two abstract concepts with defined interfaces. "Stowaway" and "Carrier" are the current concrete implementations of these interfaces. However, depending on the results of this implementation, I expect at least the latter to be replaced by a new component (perhaps Carrier2?). For consistency, I will continue to use these component names.

Overview

[Diagram: gefyra-personal-bridge1.drawio]

Carrier

[Diagram: gefyra-personal-bridge2.drawio]

Currently, Carrier is installed into 1 to N Pods. Each instance upstreams any incoming traffic ("port x") to a single target endpoint ("upstream-1"). This process does not involve traffic introspection: IP packets come in and are sent out as-is. This setup is simple and fast.
Carrier is based on the Nginx server and thus is configured using the stream directive: https://nginx.org/en/docs/stream/ngx_stream_core_module.html#stream
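
To make this concrete, here is a small sketch of the kind of single-upstream stream configuration Carrier effectively runs today; the rendering function and the upstream address are invented for illustration and are not part of Gefyra's actual code:

# Hypothetical sketch: render an nginx "stream" configuration that forwards all
# traffic arriving on one port to exactly one upstream, without introspection.
def render_stream_config(listen_port: int, upstream_host: str, upstream_port: int) -> str:
    return f"""
stream {{
    upstream upstream-1 {{
        server {upstream_host}:{upstream_port};
    }}
    server {{
        listen {listen_port};
        proxy_pass upstream-1;  # packets are forwarded as-is, no introspection
    }}
}}
"""

# e.g. forward everything on "port x" (8080) to an assumed Stowaway proxy route
print(render_stream_config(8080, "gefyra-stowaway-proxy.gefyra.svc.cluster.local", 10001))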

Feature draft

Stage 1: Installation & keep original Pods around to serve unmatched traffic

When a compatible service is bridged, we need the original workloads to keep serving any traffic that is not matched by a user bridge.
Consider the following example: a compatible workload <Y> is selected by a Kubernetes service object. This workload consists of 3 Pods.

[Diagram: gefyra-personal-bridge4.drawio]

Once a user bridge is requested, Gefyra's Operator replicates all essential components (most importantly, the Pods and the service) by cloning and modifying them. Each cloned Pod (e.g. <Y1'>) is modified on the fly so that it is selected by service <Y'>; the Pods <Y1'>, <Y2'> and <Y3'> must not be selected by service <Y>. Most other parameters, such as mounts, ports and probes, should remain unchanged.
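
As a rough illustration of this relabeling step (not actual Operator code; the function name, label key, and naming suffix are made up for the example):

# Hypothetical sketch: clone a Pod so that it is no longer selected by the
# original service <Y> but only by the shadow service <Y'>.
from typing import Dict
from kubernetes import client

def clone_pod_for_shadow_service(
    pod: client.V1Pod, selector_labels: Dict[str, str], suffix: str = "-gefyra"
) -> client.V1Pod:
    # drop the labels that service <Y> selects on, so the clone is not picked up by it
    labels = {
        k: v for k, v in (pod.metadata.labels or {}).items() if k not in selector_labels
    }
    # add a label that only the shadow service <Y'> selects (label key is an assumption)
    labels["gefyra.dev/shadow"] = "true"
    return client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"{pod.metadata.name}{suffix}",
            namespace=pod.metadata.namespace,
            labels=labels,
        ),
        # mounts, ports, probes, etc. remain unchanged; real code would also reset
        # runtime-only fields such as node_name and status before creating the clone
        spec=pod.spec,
    )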

[Diagram: gefyra-personal-bridge5.drawio]

The cloned workload infrastructure remains active as long as at least one Gefyra user bridge is active.

The Gefyra Operator installs Carrier into the target Pods (<Y1>, <Y2> and <Y3>) and dynamically configures them to send all unmatched traffic to the cloned infrastructure <Y'>. This setup ensures:

  • No existing workloads are modified (except for temporary image changes).
  • Common traffic can still be served, with just one additional hop.
  • If the cluster setup is interrupted, it can be easily restored by re-rolling out the source of the ReplicaSet <Y>.

[Diagram: gefyra-personal-bridge7.drawio]

Of course, if there is a different replication factor or another deployment scenario (e.g., a standalone Pod), the Gefyra Operator adapts accordingly. I hope the idea makes sense.

Stage 2: Add a local upstream & redirect matching traffic

The Carrier component will require significant changes as we shift from a “stream”-based proxy to a more advanced proxy ruleset, incorporating path and header matching for HTTP, along with routing rules for other protocols in the future.
Fortunately, the required changes in the Gefyra Operator are not as extensive as those in Carrier. Several interfaces already support creating different routes within the connection provider ("Stowaway") and bridge provider abstractions.
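
To make the matching idea concrete, here is a minimal sketch of how a bridge provider could pick an upstream based on HTTP path and header rules; the data structures and function are illustrative only and not part of Carrier:

# Illustrative only: route a request to a user's local upstream if its path and
# headers match that user's bridge rule, otherwise to the cloned workload <Y'>.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class MatchRule:
    """One user bridge: route matching requests to this user's local upstream."""
    upstream: str                                           # e.g. "10.200.0.5:8080" (client network)
    path_prefix: Optional[str] = None                       # e.g. "/api/objects"
    headers: Dict[str, str] = field(default_factory=dict)   # e.g. {"owner": "john"}

def select_upstream(path: str, headers: Dict[str, str],
                    rules: List[MatchRule], default_upstream: str) -> str:
    for rule in rules:
        path_ok = rule.path_prefix is None or path.startswith(rule.path_prefix)
        headers_ok = all(headers.get(k) == v for k, v in rule.headers.items())
        if path_ok and headers_ok:
            return rule.upstream
    # unmatched traffic is served from within the cluster
    return default_upstream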

[Diagram: gefyra-personal-bridge8.drawio]

Interface reference for connection providers (Stowaway):

@abstractmethod
def add_destination(
    self,
    peer_id: str,
    destination_ip: str,
    destination_port: int,
    parameters: Optional[Dict[Any, Any]] = None,
) -> str:
    """
    Add a destination route to this connection provider proxy, returns
    the service URL
    """
    raise NotImplementedError

@abstractmethod
def remove_destination(
    self, peer_id: str, destination_ip: str, destination_port: int
):
    """
    Remove a destination route from this connection provider proxy
    """
    raise NotImplementedError

@abstractmethod
def destination_exists(
    self, peer_id: str, destination_ip: str, destination_port: int
) -> bool:
    """
    Returns True if the destination exists, otherwise False
    """
    raise NotImplementedError

@abstractmethod
def get_destination(
    self, peer_id: str, destination_ip: str, destination_port: int
) -> str:
    """
    Returns the service URL for the destination
    """
    raise NotImplementedError

Interface reference for bridge providers (Carrier, Carrier2):

@abstractmethod
def add_proxy_route(
    self,
    container_port: int,
    destination_host: str,
    destination_port: int,
    parameters: Optional[Dict[Any, Any]] = None,
):
    """
    Add a new proxy_route to the bridge provider
    """
    raise NotImplementedError

@abstractmethod
def remove_proxy_route(
    self, container_port: int, destination_host: str, destination_port: int
):
    """
    Remove a bridge from the bridge provider
    :param proxy_route: the proxy_route to be removed in the form of IP:PORT
    """
    raise NotImplementedError
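
A hedged sketch of how the Operator could wire a user bridge through these two abstractions; the function, the connection-URL handling, and the concrete values are invented for illustration (the providerParameter example follows the CRD comments shown below):

# Illustrative only: establish a user bridge via the connection provider
# (Stowaway) and the bridge provider (Carrier/Carrier2) abstractions.
def establish_user_bridge(connection_provider, bridge_provider, peer_id: str,
                          client_ip: str, client_port: int, container_port: int):
    # 1. create (or look up) the reverse-proxy route into the client network
    if not connection_provider.destination_exists(peer_id, client_ip, client_port):
        service_url = connection_provider.add_destination(peer_id, client_ip, client_port)
    else:
        service_url = connection_provider.get_destination(peer_id, client_ip, client_port)

    # 2. tell the bridge provider to send matching traffic to that route
    host, port = service_url.rsplit(":", 1)
    bridge_provider.add_proxy_route(
        container_port=container_port,
        destination_host=host,
        destination_port=int(port),
        parameters={"type": "http", "url": "/api"},
    )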

Rules

The GefyraBridge CRD already supports an arbitrary set of additional configuration parameters for the bridge provider.

# provider specific parameters for this bridge
# a carrier example: {"type": "stream"} to proxy all traffic on TCP level
# to the destination (much like a nitro-speed global bridge)
# another carrier example: {"type": "http", "url": "/api"} to proxy all
# traffic on HTTP level to the destination
# sync_down_directories
"providerParameter": k8s.client.V1JSONSchemaProps(
    type="object", x_kubernetes_preserve_unknown_fields=True
),

For HTTP traffic, the routing parameters appear to be quite obvious:

  • path matching (e.g. /api/objects/5)
  • header matching (e.g. owner: john)

Each user bridge adds a new entry to the upstream servers for Carrier, along with an additional (verified) matching rule. The operator's validating webhook should implement matching rule validation to catch common mistakes (e.g., a rule already applied by another user or a rule that never fires due to another bridge capturing all traffic). If a matching rule is invalid, the creation of the GefyraBridge is halted immediately.
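
For example, two user bridges on the same Pod could pass provider parameters along these lines; the "type" and "url" keys follow the CRD comments above, while the "headers" key is an assumed extension:

# Hypothetical providerParameter payloads for two concurrent user bridges.
bridge_john = {"type": "http", "url": "/api", "headers": {"owner": "john"}}
bridge_jane = {"type": "http", "url": "/api/reports"}

# The validating webhook would reject overlapping or unreachable rules, e.g. a
# second bridge repeating john's rule, or a catch-all {"type": "stream"} bridge
# submitted while HTTP user bridges are active on the same Pod.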

Stage 3: Remove a user bridge

Removing a bridge is a two-phase process:

  1. The deletion request from the Gefyra Client prompts the Operator to initiate the bridge removal.
  2. Both the bridge provider and the connection provider are called upon to delete their respective routing configurations.

Remove the last user bridge & clean up

If Stage 3 removes the last active bridge for a Pod, the uninstallation procedure is triggered. This process includes resetting the patched Pods (<Y1>, <Y2> and <Y3>) to their original configuration and removing the cloned infrastructure (Pod <Y1'>, <Y2'>, <Y3'> and service <Y'>).
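
A hedged sketch of how this removal and clean-up could look against the provider interfaces above; the function and the clean-up stub are invented and not actual Operator code:

# Illustrative only: tear down a user bridge in two phases and trigger the
# clean-up once the last bridge for a Pod is gone.
def uninstall_carrier_and_cloned_infrastructure():
    # placeholder: reset the patched Pods <Y1..Y3> to their original images and
    # delete the cloned Pods <Y1'..Y3'> and service <Y'>
    ...

def remove_user_bridge(connection_provider, bridge_provider, peer_id: str,
                       client_ip: str, client_port: int, container_port: int,
                       remaining_bridges: int):
    service_url = connection_provider.get_destination(peer_id, client_ip, client_port)
    host, port = service_url.rsplit(":", 1)
    # phase 1: remove the routing configuration from the bridge provider (Carrier)
    bridge_provider.remove_proxy_route(container_port, host, int(port))
    # phase 2: remove the reverse-proxy route from the connection provider (Stowaway)
    connection_provider.remove_destination(peer_id, client_ip, client_port)

    if remaining_bridges == 0:
        uninstall_carrier_and_cloned_infrastructure()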

Closing remarks

  • I would like to retain the current implementation of the "global bridge," as it is a high-performance solution. Therefore, we should add a new flag, gefyra bridge ... --global, to enable the global bridge with its current behavior.
  • Setting up the first user bridge for a Pod/Deployment will be more time-intensive than setting up subsequent user bridges (since Stage 1 is triggered), so we should handle this appropriately.
  • This concept is likely not applicable to RWO StatefulSets, as we cannot simply clone a StatefulSet Pod with an RWO mount; this limitation should be conveyed with an error message.
  • I currently don’t have a solution for HTTPS traffic interception by Carrier. One possibility could be to add a parameter in the bridge request to direct Carrier to the PKI, or to support the creation of custom bridge provider images. (The UID/GID issues remain as well.) With a custom image, we could integrate specific bridge parameters and designate this image for a particular Pod.

This feature is currently in the ideation phase. I would appreciate any external feedback on how to make this as useful and robust as possible. If you want to talk to me about this (or Gefyra in general), please find me on our Discord server: https://discord.gg/Gb2MSRpChJ

I am also looking for a co-sponsor of this feature. If you or your team want to support this development, please contact me.

@liquidiert (Collaborator) commented:

First off: great RFC, @Schille!
Just one quick question: what happens to the shadow infrastructure when the original changes while a bridge is active? I'm sure there's already handling for this case when using a global bridge, but what is the appropriate procedure here?

@Schille (Collaborator, Author) commented Oct 31, 2024

@liquidiert Gotcha! A rollout of the original workloads would render the bridge useless since Gefyra's patch would be reset. The Operator should reconcile all bridges, detect that situation, and take appropriate action (patching again so the user bridges work again), or declare the existing user bridges stale and remove them.

@liquidiert (Collaborator) commented:

@Schille that sounds like a good reconciliation tactic, thanks!

@Schille Schille changed the title [WIP][Feature] User-specific routable Gefyra bridge ("user bridge") [Feature] User-specific routable Gefyra bridge ("user bridge") Oct 31, 2024
@Schille Schille added enhancement 🎉 New feature or request v2 labels Oct 31, 2024
@crkurz commented Nov 4, 2024

This looks terrific, @Schille! Thanks a lot!

Please allow me to add some questions:

  1. Do we need to call out that multiple users can bridge multiple services?
  2. Nit: Terminology: does it make sense to change "and the removal of the phantom infrastructure..." to "and the removal of the cloned infrastructure"? (just to avoid an extra name)
  3. Are there any limitations which apply to infra cloning? Things a pod/service configuration must or must not have, e.g. node or other affinity, or special session handling/routing? Should I try to get Anton's/Rohit's thoughts here, e.g. around special handling for WebSockets with their need for cross-user session handling?
  4. Are there chances for any impact on validity of server certificates due to the traffic redirection?
  5. How long do we expect setup (or tear-down) of the cloned infra to take, and for how much of this time do we expect the regular service to be non-responsive? In case this takes a bit more time, do we need an option to preserve the cloned infra even after removal of the last bridge, or even an option to explicitly install it independently of bridge setup?

Again, great feature! Thank you, @Schille

@Schille (Collaborator, Author) commented Nov 7, 2024

@crkurz Thank you.

To your questions:

  1. You should already be able to bridge multiple services simultaneously. If that's unclear, we must add that bit to the docs.
  2. You are right. I changed it.
  3. I don't see more limitations than mentioned. Since we'll clone the pods with all attributes (except for the selector-relevant labels) I don't expect affinity issues. But the more people who join the party, the better it is: I would welcome it if you would take up Anton/Rohit's thoughts on this.
  4. Yes, that's not 100% clear as of now. We must find a solution to tell Carrier which certificates to use to introspect SSL traffic and decide on the route.
  5. That depends. Small apps - short setup time. Java - huge setup time. =) I thought about that too, and I am tempted to agree to a concept that represents the bare installation of a bridge without actually having a single user to match traffic.

@Schille Schille pinned this issue Nov 18, 2024
@crkurz commented Dec 17, 2024

Hi @Schille,
I had another pass and would like to share some additional points:

  1. Do we need to spell out what "compatible service" means?
  2. With the first user bridge we duplicate Services; what kind of limitations/issues can we have, e.g. outdated files? Because now these are long-running just like the original pods, unlike local containers. Do we need to apply things like key rotation to these Gefyra-created pods?
  3. What is our plan to deal with HPA? With these shared bigger cluster installations, HPA is something we may need to handle. Not in alpha1, but at some point.
  4. Does it make sense to change some image titles? I tend to misinterpret some of them.
    “Installing Personal Bridge 1” -> “Installing Personal Bridge - Setup before”
    “Installing Personal Bridge 2” -> “Installing Personal Bridge - Replicate existing Service”
    “Installing Personal Bridge 3” -> “Installing Personal Bridge - Install Carrier image for default route”
  5. I think we need some discussion of a plan for handling HTTPS?
  6. At some point we need to detail how to avoid conflicts between the gefyra-bridge setup (and its shadow deployment) and auto-deployments from a CI/CD pipeline, and how to restore the whole gefyra-bridge setup (and its shadow deployment) in case everything was overwritten by some redeployment.
