Note: Rockfish is a compute cluster running at the Johns Hopkins University in Baltimore. Its documentation can be found at: https://www.arch.jhu.edu/guide/
Note: This documentation is rockfish-specific, but the concepts should be applicable to any other compute cluster. So feel free to read on anyways.
Note: All the used ssh commands will use my
clsp-mayer
username. You need to replace these with your username in all occurences.
We want to arrive to a final setup, where the backend will be running inside of an interactive job on a GPU machine (for example gpu13
), the frontend will be running on your laptop, and the frontend will access the backend through a two-step SSH tunnel that we set up (your laptop ->
login node ->
the interactive job). This setup is captured in the following diagram:
If you have the backend server already installed, jump right to the Setting up the SSH tunnel section. Otherwise read from start to the end, or navigate to a specific section:
- Installing the backend server
- Setting up the SSH tunnel
- Setting up frontend
- Managing conda environments
For the backend we only need Python 3.10
, which we can obtain from the anaconda
module that's installed on Rockfish. We can do all of the following tasks from a login node, no need to start any interactive job.
Connect to Rockfish:
Note: Rockfish has three login nodes (
login01
,login02
,login03
) and we cannot choose which we connect to. We get one assigned automatically when we connect tologin.rockfish.jhu.edu
.
Pick a folder, say ~/jsalt/demo
and clone the source code:
git clone [email protected]:JSALT2024/demo.git ~/jsalt/demo
Now we need to set up the python virtual environment in the backend folder (i.e. we need to create the ~/jsalt/demo/.venv
folder). For that we will use the anaconda
module that's installed on rockfish.
From within the login node, activate the anaconda
module and create a conda environment for the demo with Python 3.10
(not to use the environment via conda, but just to get a stable access to a Python 3.10
interpreter). See the Managing conda environments section on how to do that.
Warning: Do not use the
python3
command that is available by default. That command uses a different python installation for each machine, because it's the python that's intrinsic to that machine's operating system.
After you create the conda environment and activate it, you should see the following prompt:
(signllava-demo) [clsp-mayer@login01 ~]$
Now you can create the .venv
folder in the cloned repository:
# enter the backend folder
cd ~/jsalt/demo/backend
# create the virtual environment in the `.venv` folder
python3 -m venv .venv
After this point, you no longer need to use conda. You can deactivate the environment by running conda deactivate
. All the following commands will use the python3
executable that has been created inside the .venv
folder and therefore will use the conda environment's python without needing to activate anything.
If you ever need to use the backend's python with all of its packages yourself, just type in the full python executable path:
.venv/bin/python3
Or possibly activate the virtual environment:
source .venv/bin/activate
Before we actually download the model checkpoints for LLaMa and other models, you might want to create a symlink for the backend/checkpoints
folder and the ~/.cache/huggingface
folder. This is because your home-folder quota might be relatively low and it's better to download these models to some other filesystem designed for large file storage.
I pointed the symlinks to the scratch4
filesystem via the scratch4
symlink I had in my home folder. You need to check and update these commands to use them for your setup (use your username in a few places):
# remove the real folders if they exist (especially the huggingface cache)
rm -rf ~/.cache/huggingface
rm -rf ~/jsalt/demo/backend/checkpoints
# create the target folders
mkdir -p ~/scratch4/clsp-mayer/huggingface_cache
mkdir -p ~/scratch4/clsp-mayer/demo_backend_checkpoints
# create the symlinks
ln -s ~/scratch4/clsp-mayer/huggingface_cache ~/.cache/huggingface
ln -s ~/scratch4/clsp-mayer/demo_backend_checkpoints ~/jsalt/demo/backend/checkpoints
The following tasks will be more CPU intensive and very last ones will actually test the machine learning models so we need to have GPU access.
This command depends very much on your account rights on Rockfish, but this is the command I used to develop the system. It should work for you as well for at least 6 months after the workshop:
srun -N 1-1 -n 12 --gpus 1 -p a100 --time=3-00:00:00 --pty bash
It does the following:
- starts an interactive
bash
session (i.e.srun --pty bash
) - on a single machine
-N 1-1
- with 12 CPU cores
-n 12
- with one GPU card
--gpus 1
- on the A100 partition
-p a100
- with time limit increased to 3 days
--time=3-00:00:00
After the job starts you should have a shell inside a GPU node, something like this:
[clsp-mayer@gpu13 ~]$
Continue with the rest of the tutorial inside this job.
Note: If you're adapting this to your own cluster, see the hardware requirements file to see how many resources you need for the job.
Now, just like in the backend server's README complete the backend installation by running the make
commands:
make install-requirements # basically just `pip install`
make install-all-models # many `git clone` and `wget` for models and ckpts
Create the .env
file based on the .env.example
and fill out your Huggingface token. Set the .env
file to be private to you, so that nobody can steal your token. Then make sure you have pull access to LLaMa and Sign2Vec from your huggingface account (if not, you will get a message when you test models and you can debug from there). Then test the models to perform all huggingface pulls AND to test inference:
# setup the .env
cp .env.example .env
chmod 600 .env
# fill out your HF token
nano .env
# tests all models (plus triggers huggingface downloads)
make test-all-models
The models tests should finish without crashing. If an exception appears, or a huggingface complaint about repository access appears, you need to resolve the issue.
You can run each test individually to debug the source of the issue:
# for example Sign2Vec
.venv/bin/python3 -m app.debug.test_s2v
Once all tests pass, you can safely start the backend server, knowing it is unlikely it will crash during runtime.
Congratulations! Now you can start the server like this:
make start
The uvicorn
server will be started on port 1817
, waiting for incomming HTTP requests:
INFO: Will watch for changes in these directories: ['/home/clsp-mayer/jsalt/demo/backend']
INFO: Uvicorn running on http://127.0.0.1:1817 (Press CTRL+C to quit)
INFO: Started reloader process [1236247] using WatchFiles
You kill it by pressing Ctrl + C
.
Note: On my laptop, running
make start
watches for changes in the.py
files and when one occurs it reloads the server. On rockfish (despite the logs stating the oposie), this behavior seems to be broken, so you need to manually kill and start the server whenever you make a change to the backend code.
Running the backend server is only the first part of the setup. Next we need to tunnel the port 1817 via SSH from your laptop, through a login node, into the running job. This is so that typing http://localhost:1817 into your browser actually accesses the running backend server.
Start by running the laptop -> login node
port forward first, because we don't have control over which login we connect to when we are outside the Rocfish cluster. Run this command in a new terminal from your laptop:
# we are at:
# jirka@my-laptop:~$
ssh -L 1817:localhost:1817 [email protected]
After typing in the password you will get a shell on some login node. Remember the node name, we will use it later. I got connected to login01
.
Now we need to start the other half of the tunnel - from the running job, into the login01
node. But this time, we open the tunnel from the other end (see the first diagram, we open the tunnel from the tip of the arrow, not the base), so there's a sligtly different command to be used. Run this command from the interactive GPU job that you used to setup the backend server before:
# we are at:
# [clsp-mayer@gpu13 backend]$
# replace login01 with the specific head node number the first tunnel openned to
ssh -NR 1817:localhost:1817 login01 &
The last ampersand (&
) makes sure the command runs in the background, so that we still have access to the shell to run the backend server.
Now start the backend server:
# we are at:
# [clsp-mayer@gpu13 backend]$
make start
Now if you open the browser on your laptop and open http://localhost:1817, you should see the welcome page of the backend server.
If this works, you have your backend set up and you can move on to set up the frontend on your laptop.
Let me put the diagram here again:
The two halves of the SSH tunnel can be started independently, but they need to meet at the same login node in order to work.
The first half tunnels your browser HTTP request into the login node, where it appears as-if the browser was actually running on the login node itself (that's why there are those localhost
s everywhere - each jump between processes happens inside a single machine). The second half tunnels this request further into the gpu13
machine where the backend server acually runs. So the end result is a setup where the browser thinks the backend server is running on your laptop, and the backend server thinks it's getting requests from a browser running on the gpu13
machine (i.e. the localhost).
When you try to start one of the commands and you get an error that the port is already occupied, chances are you are already running that command in some terminal somewhere. On your local laptop, just go through all terminals and kill that other process.
The other clash can happen inside the interactive GPU job. If that happens, try running the ps
command to see all the processes you've started:
PID TTY TIME CMD
55017 pts/0 00:00:00 bash
1270375 pts/0 00:00:00 ssh
1270376 pts/0 00:00:00 ps
You can see the ssh
command running with the process ID 1270375
. It's running in the background because we started the ssh
with the ampersand (&
) at the end. You need to kill that command manually:
kill 1270375
Note: Stopping the slurm job will send a cascade of interrupt/kill commands to all child processes, which will also terminate the
ssh
process. Don't worry about polluting thegpu13
machine with free-roaming processes.
Running frontend from your laptop requires only installing Node.js
and then starting the frontend hosting server via the npm run start
command. All of this is described in the Frontend README file.
Once that's set up, the frontend application is configured to access the localhost:1817
endpoint whenever it itself is running on localhost:1234
, therefore if the SSH tunnel is in place, everything should be working.
Check that the SSH tunnel did not become frozen or stuck and restart it if so. Sometimes temporary network outages cause the tunnel to break (at least the laptop-to-login half). If you can access the backend via browser, the frontend should as well.
Before using conda, you need to first activate the anaconda
module. This must be done from the login node - it does not work inside a running job.
module load anaconda
Note: This command sets the
$PATH
variable to something like/data/apps/...centos.../..gcc../...anaconda.../bin
, where these cluster-wide apps are installed.
You can verify that the conda
command works:
conda -V
You can list the existing conda environments:
conda info --envs
Note: There are a few pre-defined environments, such as
jupyter
orant-1.6.0
. We don't care about these. Our own environments have paths that lead into our home directory into the/home/clsp-mayer/.conda/envs
folder.
We want to create an empty conda environment, where we only specify the python version. We don't need any pip packages or any other thing - those will be installed into the .venv
folder we create later.
conda create --name signllava-demo python=3.10
Listing the environments again should show the environment being present.
You can activate the new conda environment like this:
conda activate signllava-demo
This changes the prompt to the following:
(signllava-demo) [clsp-mayer@login01 ~]$
You can verify that the python3
command now actually points to our conda environment's python executable by running:
which python3
You should get a response like:
~/.conda/envs/signllava-demo/bin/python3
You can see that it points inside our signllava-demo
environment.
Now we are ready to use the python3
command to create our virtual environment for the backend server.
Later, when you want to uninstall the whole setup, you can remove the conda environment by running:
conda remove --name signllava-demo --all