
Support for installing client sites on HPC systems #2595

Open
dirkpetersen opened this issue May 23, 2024 · 8 comments
@dirkpetersen (Contributor) commented May 23, 2024:

Is your feature request related to a problem? Please describe.

Some organizations have all their GPUs allocated in HPC systems and find it difficult to dedicate GPU servers to NVFlare. Currently, use on HPC systems is undocumented.

Describe the solution you'd like
In an ideal world, a client would be installed on a virtual machine that then submits jobs to an HPC system, to avoid the GPU being allocated for long periods without being used.

Describe alternatives you've considered
I currently use this workaround and describe some of the issues with running on HPC systems (Slurm in this case):
https://github.com/dirkpetersen/nvflare-cancer#install-a-client-on-hpc

@YuanTingHsieh (Collaborator) commented:

Hi @dirkpetersen, thanks for bringing this topic up!

One workaround, as you said, is to run an NVFlare client process (client_train.py) directly on a SLURM cluster node.
This is the client monitoring process (CP); it usually just waits for jobs to come in and spawns a job process (CJ) to handle each job.
As you observed, the CP only waits for jobs and does not use the GPU.
It is the CJ that might use the GPU, so ideally we want to run the CP outside of the GPU node and only start the process that must run on the GPU node when we need to.

We do have several ways to achieve that in an HPC cluster.

  1. Use 2.4.1:
  • Run an NVFlare client process (client_train.py) on the HPC login node (the node that can submit jobs)
  • Prepare a training script (reads the input from a file and writes the result to a file) to be run on a Slurm GPU node
  • Write a custom "SlurmExecutor", extending Executor (https://github.com/NVIDIA/NVFlare/blob/main/nvflare/apis/executor.py), whose "execute" method does the following: (1) writes the input weights out to the file system, (2) submits a Slurm job ("sbatch XXX") with the path to the input weights, (3) waits for the Slurm job to write out its result, (4) reads the result back in and returns it (a rough sketch follows at the end of this comment)
  • You will need to handle the Shareable -> model weights -> Shareable serialization
  • We do have a bunch of executors already implemented that you can take a look at
  • Configure the SlurmExecutor in the NVFlare job config (config_fed_client)
  2. Use the main branch:

As we are seeing more interest, we might write up a whole tutorial/reference implementation for this later.
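For what it's worth, here is a rough, untested sketch of what such a SlurmExecutor could look like. The pickle-based file exchange, the shared work_dir path, and the train.sbatch script are just placeholders for whatever exchange format and batch script you end up using:

# Rough, untested sketch of a custom SlurmExecutor (paths, pickle format
# and the train.sbatch script are placeholders)
import os
import pickle
import subprocess
import time

from nvflare.apis.dxo import DXO, DataKind, from_shareable
from nvflare.apis.executor import Executor
from nvflare.apis.fl_constant import ReturnCode
from nvflare.apis.fl_context import FLContext
from nvflare.apis.shareable import Shareable, make_reply
from nvflare.apis.signal import Signal


class SlurmExecutor(Executor):
    def __init__(self, work_dir="/shared/nvflare_exchange", poll_interval=10.0):
        super().__init__()
        self.work_dir = work_dir
        self.poll_interval = poll_interval

    def execute(self, task_name: str, shareable: Shareable, fl_ctx: FLContext, abort_signal: Signal) -> Shareable:
        # (1) write the incoming global weights to the shared file system
        dxo = from_shareable(shareable)
        in_path = os.path.join(self.work_dir, "global_weights.pkl")
        out_path = os.path.join(self.work_dir, "local_result.pkl")
        with open(in_path, "wb") as f:
            pickle.dump(dxo.data, f)

        # (2) submit the Slurm job; train.sbatch reads in_path and writes out_path
        subprocess.run(["sbatch", "train.sbatch", in_path, out_path], check=True)

        # (3) wait for the Slurm job to write out its result
        while not os.path.exists(out_path):
            if abort_signal.triggered:
                return make_reply(ReturnCode.TASK_ABORTED)
            time.sleep(self.poll_interval)

        # (4) read the result back in and return it as a Shareable
        with open(out_path, "rb") as f:
            new_weights = pickle.load(f)
        return DXO(data_kind=DataKind.WEIGHTS, data=new_weights).to_shareable()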

@dirkpetersen (Contributor, Author) commented:

Awesome, this is very helpful and I will try that!

@dirkpetersen (Contributor, Author) commented:

@YuanTingHsieh, it seems that both options require client_train.py to run on the HPC login node. I think that is a reasonable assumption, at least for many life-sciences HPC systems, which tend to have beefy login nodes and tolerant HPC admins. In other disciplines login nodes are guarded more strictly, and running an agent there may not be allowed. For those, it would be better to have client_train.py run on a system adjacent to the HPC cluster and submit the job via ssh and sbatch (for example using paramiko, as sketched below). Perhaps this is a lower priority right now, as I understand FL use cases are focusing on life sciences?
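Just to illustrate the idea, the adjacent machine could submit jobs roughly like this (hostname, username, key file, and batch script path are all made up):

# Rough illustration: submit a Slurm job from an adjacent machine via ssh
# (hostname, username, key file and batch script are made-up examples)
import os
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(
    "hpc-login.example.org",
    username="dp",
    key_filename=os.path.expanduser("~/.ssh/id_ed25519"),
)

stdin, stdout, stderr = ssh.exec_command("sbatch ~/nvflare/train.sbatch")
print(stdout.read().decode())  # e.g. "Submitted batch job 123456"
ssh.close()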

@YuanTingHsieh (Collaborator) commented:

@dirkpetersen thanks for the discussion!

Yes, as you said, if you have a mechanism to submit the job via ssh and sbatch from machine A to your HPC system, then you can run the NVFlare client (client_train.py) on that machine A.

Note that you can also start the NVFlare client using the "start.sh" script in our startup kits; as you can see, we add some restart mechanisms inside that script as well.

@dirkpetersen (Contributor, Author) commented:

Hi @YuanTingHsieh,

I am finally looking at this in more detail and wonder whether, for option 2 (ipcagent/ipcexchanger), the client needs to listen on a port, or whether all communication goes through the NVFlare server. If it has to listen on a port, the challenge is that in multi-user HPC environments you cannot guarantee that any particular port is free, so one has to write workarounds like the one below and then abuse the Slurm job queue metadata as a message queue by storing the port there.

Thanks

"""
This script will create a TCP socket with port 0. The kernel will
assign an unused userspace port. That port number is then printed
out. A calling application can then bind to that port number.
"""

# https://docs.python.org/3/library/socket.html
import socket

# Prepare a TCP socket (AF_INET family and SOCK_STREAM type)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Bind the socket giving a tuple in the form of (IPaddr, port)
# 0th tuple element (IPaddr):
#   '' represents INADDR_ANY, which is used to bind to all interfaces
# 1st tuple element (port):
#    If port is 0 ... the OS default behavior will be used (assigns high port)
s.bind(('', 0))

# getsockname returns the address in the tuple form of (IPaddr, port)
s_name = s.getsockname()

# We print just the port, the 1st element of the tuple
print(s_name[1])

s.close()
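For completeness, the "store the port in the Slurm job queue metadata" part could look something like the following; abusing the job's comment field is just one arbitrary choice of field:

# Hypothetical companion step inside the Slurm job: grab a free port and
# publish it via the job's comment field, so another process can read it
# back with e.g. `squeue -o "%k"`.
import os
import socket
import subprocess

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 0))                      # port 0: kernel assigns a free port
port = s.getsockname()[1]
s.close()

subprocess.run(
    ["scontrol", "update", f"JobId={os.environ['SLURM_JOB_ID']}", f"Comment={port}"],
    check=True,
)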

@YuanTingHsieh (Collaborator) commented:

Hi @dirkpetersen, thanks for checking back!

We have an easier API now.

You can write your client-side training script following the examples in: https://github.com/NVIDIA/NVFlare/tree/main/examples/hello-world/ml-to-fl

Then you just use this job template: https://github.com/NVIDIA/NVFlare/tree/main/job_templates/sag_np_cell_pipe

You can just change the script from "python3 -u custom/{app_script} {app_config}" to the SLURM submit command, something like "sbatch xxx".

Don't worry about the free port: the NVFlare client (Client Control Process/Client Parent Process) running on this "main/control machine" (the one that can submit jobs with sbatch) will find a free port, and the client job process running the client-side training script just needs to be able to connect to that port.
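For reference, the client-side training script basically follows the Client API pattern in those ml-to-fl examples, roughly like this (just a sketch; see the np example in that folder for the real code):

# Minimal sketch of the Client API pattern from the ml-to-fl examples
# (the "training" step here is a dummy placeholder)
import nvflare.client as flare
from nvflare.app_common.abstract.fl_model import FLModel, ParamsType

flare.init()
while flare.is_running():
    input_model = flare.receive()                         # global weights from the server
    params = input_model.params
    new_params = {k: v + 1 for k, v in params.items()}    # dummy local "training"
    flare.send(FLModel(params=new_params, params_type=ParamsType.FULL))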

@dirkpetersen (Contributor, Author) commented Nov 1, 2024:

Thanks @YuanTingHsieh, this was very helpful and is indeed a lot easier. I decided to use FilePipe instead of CellPipe because CellPipe cannot work with the Simulator / POC (I think), since they only listen on localhost and the Slurm node would not know which host to talk to. Almost all Slurm clusters have a fast shared file system, so FilePipe is probably the default usage pattern for HPC people, even if it has perceived bottlenecks.

Instead of sbatch --wrap="python3 -u custom/{app_script} {app_config}" I used srun python3 -u custom/{app_script} {app_config}, because unlike sbatch, srun executes synchronously and simply waits until the Python script for that specific job is finished. Of course you can run hundreds of srun sessions in parallel.

As the sag_np template uses FilePipe, I ran this in the cloned NVFlare git repo:

dp@box:~/NVFlare$ nvflare job create -w sag_np -j ./myjobs/slurm_job_fp -sd examples/hello-world/ml-to-fl/np/src/

In ./myjobs/slurm_job_fp/app/config/config_fed_client.conf I replaced app_script = "cifar10.py" with app_script = "train_full.py", and script = "python3 -u custom... with script = "srun python3 -u custom..., so the relevant lines now look roughly like the excerpt below.
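Excerpt from my config_fed_client.conf after the edits (everything else left as generated by the template; this is an approximation from memory):

# changed lines in config_fed_client.conf (all other fields unchanged)
app_script = "train_full.py"
...
    script = "srun python3 -u custom/{app_script} {app_config}"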

Then I added a few debugging statements to ./myjobs/slurm_job_fp/app/custom/train_full.py

and tried the simulator first:

nvflare simulator -w ~/temp/nv-work -n 2 ./myjobs/slurm_job_fp

The simulator started 10-ish Slurm jobs that never finish. Here is the output

Test-FilePipe-Slurm-Simulator.txt

Then I tried POC. Since the POC default workdir is /tmp/nvflare/poc, we need to change it to a workdir on the shared file system so that the compute node executing train_full.py can also reach the workspace. In a separate second terminal on the Slurm login node I run:

export NVFLARE_POC_WORKSPACE=~/temp/nv-work
nvflare poc prepare -n 2
nvflare poc start

In a third terminal on the same login node I ask squeue to show all my jobs, refreshing every second:

squeue --me -i 1

Back in my first terminal I run:

dp@box:~/NVFlare$ nvflare job submit -j ./myjobs/slurm_job_fp
trying to connect to the server
job: '39be4acb-80b3-4e9b-b0e4-31f206578092 was submitted

In the squeue terminal I see that 2 jobs are starting and they seem to finish normally. Here is the output of the POC process:

Test-FilePipe-Slurm-POC.txt

Is this working the way you would expect?

@YuanTingHsieh (Collaborator) commented:

@dirkpetersen thanks for sharing your detailed steps and results!

From the logs you provided, both the simulator and POC seem to be running fine.
As you can see, the server is getting results back from each client and keeps proceeding to the next round without any errors.

Do you mind sharing your final job folder (slurm_job_fp)?

@YuanTingHsieh self-assigned this Nov 4, 2024