Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better reconnection for nonvnc on challenge restart #478

Open
zardus opened this issue Jul 23, 2024 · 7 comments
Open

Better reconnection for nonvnc on challenge restart #478

zardus opened this issue Jul 23, 2024 · 7 comments

Comments

@zardus
Copy link
Contributor

zardus commented Jul 23, 2024

We generate a new VNC password when we restart a challenge, and as a result, novnc can't automatically reconnect across challenge starts. This annoyance will get more noticeable as we streamline the UX (e.g., #473). We should probably patch novnc to refresh the page when a reconnection fails (which will also fix the DOS issue).

@ConnorNelson
Copy link
Member

ConnorNelson commented Jul 24, 2024

This should have already fixed DOS:

reconnectPatch = pkgs.writeText "reconnect_patch.diff" ''
--- a/share/webapps/novnc/app/ui.js
+++ b/share/webapps/novnc/app/ui.js
@@ -1,3 +1,3 @@
- if (UI.getSetting('reconnect', false) === true && !UI.inhibitReconnect) {
+ else if (UI.getSetting('reconnect', false) === true && !UI.inhibitReconnect) {
'';
novnc = pkgs.novnc.overrideAttrs (oldAttrs: {
postInstall = (oldAttrs.postInstall or "") + ''
patch -p1 -d $out < ${reconnectPatch}
'';
});

I think what you are proposing does not make sense as a "noVNC patch". Instead what we should do is put javascript on the top level page that decides where the iframe will point at. Somehow, we need the ability for that javascript to be informed that it should change the iframe location.

For example:

  • I receive an event that a new workspace has started and initialized; this JS knows the user is interested in the Desktop (because we are on /workspace/desktop)
  • Start a sweet spinning "loading" animation
  • Tell the backend to start the desktop service
  • When that has finished, kill the "loading" animation and set the iframe location

We should do the same thing for all top-level /workspace/* routes.

The tricky thing is receiving that event. We could for instance poll every 30s to see if I have a new workspace (e.g. container id changed and is in an initialized state). But that's gonna be a decent amount of increases requests and also be slow to do what we want: we want the new desktop to switch over as fast as possible. What we probably need to figure out is websockets or SSE. The critical thing is making sure this plays nice by not holding an available flask worker hostage and not holding db thread connections hostage. I believe CTFd actually has some support for this already that we've disabled in the docker-compose:
https://github.com/CTFd/CTFd/blob/bbe1fce4f27a290056614241d74d821625aef602/CTFd/api/v1/notifications.py#L151

@spencerpogo
Copy link
Contributor

I would lean towards SSE as it has way less overhead: it's a normal HTTP request that stays open and streams a text-based protocol. Websocket requires an upgrade handshake between the client and server and every message has to be XORed.

I took a look at the CTFd notification logic. A couple important files are the route the clients connect to and the events_manager classes. Here's my understanding of how it works. Redis has a built in pub/sub mechanism. The event manager's job is to spawn a gevent task (basically a thread) to subscribe to redis. When an HTTP request comes through to the SSE endpoint, it returns a generator which is how flask streams stuff. The generator hooks into the event manager, encodes the events in the SSE format, and forwards them to the client.

It seems like this would lock up a flask worker, but I'd have to test to be sure. Maybe the only solution is to add more flask workers.

@ConnorNelson
Copy link
Member

I think I also lean towards SSE, especially since we don't have a need for bidirectional communication, and probably even more importantly, CTFd already has logic for this!

I see a gevent.spawn(_listen) here:
https://github.com/CTFd/CTFd/blob/8ef0cdd916eab4226956d9d2d70adeccdc89f441/CTFd/utils/events/__init__.py#L100

My surface level understanding is that CTFd is using gevent by default, and we continue to use this in the DOJO, a "lightweight thread", as you mention. We specify 8 workers, but each worker actually also spins up real (linux) threads:

root@ctfd:/opt/CTFd# ps aux | grep gunicorn | grep -v grep | wc -l
9
root@ctfd:/opt/CTFd# ps aux -T | grep gunicorn | grep -v grep | wc -l
57

I think that 9th process is a a supervisor process that manages the 8 workers, maybe?
The number of threads fluctuates a bit.
My best guess would be that these gevent threads are backed by real threads to some extent; and that regardless by using spawn, we aren't completely taking over the worker's ability to process more requests, but this should probably be researched/tested to confirm.

@spencerpogo
Copy link
Contributor

listen is only called once in init_events. This single lightweight thread that is spawned is responsible for forwarding messages from redis to the entries of the clients dict. The clients dict entries are populated by the generator returned by the request handler.

The question is where flask iterates the generator from.

@ConnorNelson
Copy link
Member

I think I see what you're saying.

I think it should be iterating here in the /events endpoint:
https://github.com/CTFd/CTFd/blob/8ef0cdd916eab4226956d9d2d70adeccdc89f441/CTFd/events/__init__.py#L16

Which pulls from:
https://github.com/CTFd/CTFd/blob/8ef0cdd916eab4226956d9d2d70adeccdc89f441/CTFd/utils/events/__init__.py#L111-L112

I think gevent should be taking care of making sure the worker process is context switching when it gets blocked on that .get().

I can see that gevent is willing to monkey patch SimpleQueues (https://www.gevent.org/api/gevent.monkey.html#gevent.monkey.patch_queue), but this is a regular Queue. Not sure how it allows that to function, but I can't imagine something so core isn't supported, it must be context switching somehow (maybe simply with the assistance of regular threads).

@ConnorNelson
Copy link
Member

Just throwing it out there, but I think this idea will also have nice implications for #426, we can get some sweet interfacing if we can cleanly stream solves/awards/etc.

@JensHouses
Copy link
Contributor

we could also use jquery:
working:
$("iframe").contents().find("#noVNC_status").attr("class")
-> "noVNC_status_normal"
vs
not working:
$("iframe").contents().find("#noVNC_status").attr("class")
-> "noVNC_status_error noVNC_open"

we could check every 2 seconds on the client if the status is "noVNC_status_error noVNC_open" and if so trigger a reload of the page or get the new iframe src from a specific endpoint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants