Allow tunnel server upgrade without disconnecting user environments #233

royra · 2023-09-20T07:01:39Z

Currently, when deploying the tunnel server, user environments will be briefly disconnected while the CTA (agent) reconnects to the new instance. This can cause incoming requests to the environments to fail with 502 "environment not found" errors.

Suggested solution: a cooperative rollout flow, compatible with Kubernetes rolling updates (although it's quite generic and can be used with other orchestration infra).

The tunnel server will handle SIGTERM by starting a graceful shutdown flow. It will notify its connected clients to reconnect (see below). It will then wait until all its client connections have ended or a configurable timeout has passed, and then exit.
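
To make the server side of this flow concrete, here is a minimal sketch in Python. The class `DrainingServer`, its methods, and the per-client `notify` callback are all hypothetical names for illustration; they are not the tunnel server's actual API. It only models the bookkeeping: notify every connected client, then wait for them to disconnect or for the timeout to expire.

```python
import signal
import threading


class DrainingServer:
    """Hypothetical model of the tunnel server's client bookkeeping."""

    def __init__(self, drain_timeout_s: float):
        self.drain_timeout_s = drain_timeout_s
        self._lock = threading.Lock()
        self._clients = {}  # client id -> callback that asks the client to reconnect
        self._drained = threading.Event()
        self._drained.set()  # no clients yet, so nothing to wait for

    def client_connected(self, client_id, notify):
        with self._lock:
            self._clients[client_id] = notify
            self._drained.clear()

    def client_disconnected(self, client_id):
        with self._lock:
            self._clients.pop(client_id, None)
            if not self._clients:
                self._drained.set()

    def shutdown(self) -> bool:
        """Notify all clients, then wait for them to leave or for the timeout.

        Returns True if every client disconnected in time, False on timeout.
        """
        with self._lock:
            callbacks = list(self._clients.values())
        # Call outside the lock so a callback may disconnect synchronously.
        for notify in callbacks:
            notify()
        return self._drained.wait(self.drain_timeout_s)


def install_sigterm_handler(server: DrainingServer, on_done):
    """Run the drain in a worker thread when SIGTERM arrives (main thread only)."""
    def handle(signum, frame):
        threading.Thread(target=lambda: on_done(server.shutdown())).start()
    signal.signal(signal.SIGTERM, handle)
```

In a Kubernetes rollout the timeout would presumably need to be shorter than the pod's termination grace period, otherwise the process is killed before the drain completes.
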
When the CTA (agent) is notified of the pending tunnel server shutdown, it will:
- Create a new SSH client connection to its configured tunnel server URL. The new connection will be routed to the new tunnel server instance by the infra (e.g., K8s).
- Once the new SSH client connection is established, re-establish all existing forwards on it. This will cause new requests to come in through the new SSH connection.
- Allow existing TCP forward connections on the old SSH connection to complete. This assumes they are short-lived HTTP requests. Long-lived connections (e.g., WebSockets) will eventually be terminated from the remote side (when the tunnel server timeout expires), but are assumed to be designed to recover from disconnections.
- Once all the old TCP connections are closed, the old SSH connection will be closed.
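
The agent-side steps above can be sketched as a small state model. This is an illustration under assumptions, not the CTA's implementation: `SshConnection`, `Agent`, and the `connect` factory are hypothetical, and real SSH traffic is replaced by counters so the handover logic is visible on its own.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare connections by identity, not by field values
class SshConnection:
    """Hypothetical stand-in for one SSH client connection."""
    name: str
    forwards: set = field(default_factory=set)
    open_tcp: int = 0      # in-flight forwarded TCP connections
    closed: bool = False


class Agent:
    def __init__(self, connect):
        self._connect = connect      # hypothetical factory: dials the tunnel server URL
        self.active = connect()
        self.draining = []           # old connections waiting for in-flight requests

    def add_forward(self, forward):
        self.active.forwards.add(forward)

    def on_server_shutdown_notice(self):
        old = self.active
        new = self._connect()                 # routed to the new instance by the infra
        new.forwards = set(old.forwards)      # re-establish all existing forwards
        self.active = new                     # new requests arrive on the new connection
        self.draining.append(old)
        self._close_if_idle(old)

    def on_tcp_closed(self, conn):
        conn.open_tcp -= 1
        self._close_if_idle(conn)

    def _close_if_idle(self, conn):
        # Close an old connection only after its last forwarded TCP connection ends.
        if conn in self.draining and conn.open_tcp == 0:
            conn.closed = True
            self.draining.remove(conn)
```

The sketch leaves out the timeout case: a long-lived connection on the old SSH connection would keep it in `draining` until the remote side terminates it.
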

Currently there is no simple way for the SSH server to notify its clients of an event. An application-level "server events" channel can be created by having the CTA initiate a specific "control" command session (exec) on its client connection and treat the end of that session as the signal. Alternatively, instead of using the SSH connection, the CTA can accept an HTTP request on its own API endpoint. However, this requires the tunnel server to identify the specific tunnel for each connected CTA.

Yshayy · 2023-09-20T07:34:55Z

I think we have a problem once there are two tunnel servers (old and new) and:

- The new one is receiving all incoming requests (due to k8s service behavior), but has not yet established a connection with the CTA.

In this case, either the routing layer should be aware of which CTAs have connected, or the new tunnel server should pass the traffic to the old instance.
This solution can support multiple instances, but requires some mechanism for each tunnel server to be aware of the other tunnel servers.
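
As a rough illustration of the second option (passing traffic to whichever instance holds the tunnel), the routing decision could look like the sketch below. The function name, the `"local"` marker, and the idea that each server somehow learns its peers' connected environments are assumptions; the thread does not specify that mechanism.

```python
from typing import Dict, Optional, Set


def pick_upstream(
    env_id: str,
    local_envs: Set[str],
    peer_envs: Dict[str, Set[str]],
) -> Optional[str]:
    """Decide where a request for env_id should go.

    local_envs: environments whose CTA is connected to this instance.
    peer_envs:  hypothetical view of other instances, keyed by peer address.
    Returns "local", a peer address, or None when no instance has the tunnel.
    """
    if env_id in local_envs:
        return "local"
    # Sorted so the choice is deterministic if several peers report the tunnel.
    for peer, envs in sorted(peer_envs.items()):
        if env_id in envs:
            return peer
    return None  # the caller would answer 502 "environment not found"
```
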

If we want to support only the upgrade and HA cases, an alternative solution is a modified blue-green deployment of two instances with a swap mechanism:

- The CTA connects to two instances, using two different external URLs for SSH (discovery can be done using a DNS SRV record).
- Both tunnel servers produce the same external URLs for traffic.
- A Kubernetes service forwards traffic only to the active deployment.
- During an upgrade, we upgrade the other (inactive) deployment.
- After the upgrade, we switch the active deployment (using service + labels).
- Kubernetes forwards external traffic to the new active deployment.

During this time, the old deployment is still alive and forwards traffic to the CTA.
The trick here is that we're not shutting down the active deployment, only switching traffic.
But this solution is more specific to the case of HA during an upgrade.

royra · 2023-09-20T07:58:56Z

You're right, it won't work as I suggested.

I like the blue/green idea, but I think there's a way to do it without two URLs at the CTA.

By extracting the stunnel/sslh to a separate deployment, we can define different k8s services for the SSH and HTTP endpoints. Normally they will point to the same deployment. When upgrading:

- create the new deployment, and wait for it to be healthy
- point the SSH service to it
- send a SIGTERM to the old deployment's tunnel server (not sure how to do that nicely, but you can script it)
- wait for CTAs to switch
- update the HTTP service to point to the new deployment
- delete the old deployment

royra · 2023-09-20T07:59:37Z

This sounds a bit painful tho and if it's covered by the distributed tunnel server solution, maybe it's best to wait for that.

Yshayy · 2023-09-20T08:22:58Z

We can use DNS SRV records so the CTA needs to know only one URL; this can be optional, for simplicity.
In practice, it will query DNS for SRV records and, if any are found, connect to the tunnel servers they list.
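
A minimal sketch of that optional discovery step, assuming the SRV lookup itself has already been done by some resolver library. `SrvRecord` and `tunnel_endpoints` are hypothetical names. The ordering is a simplification: lower priority first, then higher weight first, whereas RFC 2782 specifies weighted random selection within a priority.

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    """The four data fields of a DNS SRV record."""
    priority: int
    weight: int
    port: int
    target: str


def tunnel_endpoints(records: List[SrvRecord], fallback: str) -> List[str]:
    """Return the host:port endpoints the CTA should connect to.

    With no SRV records, fall back to the single configured address,
    which keeps SRV discovery optional.
    """
    if not records:
        return [fallback]
    ordered = sorted(records, key=lambda r: (r.priority, -r.weight))
    # SRV targets are fully qualified and may end with a trailing dot.
    return [f"{r.target.rstrip('.')}:{r.port}" for r in ordered]
```

For the blue-green flow, the CTA would connect to every endpoint returned rather than only the first.
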

The deployment itself shouldn't be too difficult with K8s:
it's two deployments with different labels and one service whose selector switches between blue and green.
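
To illustrate the swap, the sketch below builds the merge-patch body that would flip such a service's selector from one deployment to the other. The label key `slot` and the values `blue` and `green` are assumptions chosen for the example, not names from the project.

```python
import json

ACTIVE_LABEL = "slot"  # hypothetical label key carried by both deployments' pods


def swap_selector_patch(service: dict) -> dict:
    """Build a merge patch that points the service at the other deployment."""
    current = service["spec"]["selector"][ACTIVE_LABEL]
    target = "green" if current == "blue" else "blue"
    # Only the slot key is patched; a merge patch keeps other selector keys.
    return {"spec": {"selector": {ACTIVE_LABEL: target}}}


service = {"spec": {"selector": {"app": "tunnel-server", "slot": "blue"}}}
print(json.dumps(swap_selector_patch(service)))
# prints {"spec": {"selector": {"slot": "green"}}}
```

A patch like this could be applied with `kubectl patch service <name> --type merge -p '<patch>'`. The old deployment keeps running after the swap, so tunnels still connected to it stay usable until the CTAs have moved.
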

Yshayy · 2023-09-20T08:24:07Z

The different services approach (stunnel/sslh) with a single tunnel server deployment is a bit tricky because there are multiple CTAs here, so there isn't a single switch.

royra added the labels bug, enhancement, need spec, and points: 5 (very high complexity) · Sep 20, 2023
