[Q&A] Improve turnaround time for quick computations #2389
Replies: 5 comments 4 replies
-
Thanks for raising the issue, we will look into this.
-
@nickgautier thanks for the interest. In an NVFlare system, each site has a monitoring process; let's call them the "client parent" (CP) and "server parent" (SP). Right now, when NVFlare starts a job, we start a client job process at each site and a server job process on the server side, and the corresponding communication mechanism needs to be set up on each of those processes as well, so that takes some time. From your use case, it looks like you just want to calculate something quickly, so one option is to run the job code directly in the parent process; this way we can avoid the overhead of starting additional processes. Note that doing this makes the system more vulnerable, since the job execution code is not isolated and could crash the site if it has an error. Another thing we can check is the setup overhead for each job, which will require more details of your workload (see the timing sketch below). Does your job work like this:
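To put a number on the per-job setup overhead mentioned above, something like the following can time the submit-to-finish path end to end. This is a minimal sketch assuming the FLARE API (nvflare.fuel.flare_api) available in 2.4; the admin user name, startup-kit path, and job folder below are placeholders, not values from this thread.

```python
import time

from nvflare.fuel.flare_api.flare_api import new_secure_session

# Placeholders: use the admin user and startup kit from your own provisioned project.
sess = new_secure_session(
    username="admin@nvidia.com",
    startup_kit_location="/path/to/admin/startup",
)
try:
    t0 = time.time()
    job_id = sess.submit_job("/path/to/job_folder")  # folder containing meta.json and the app
    print(f"submitted job {job_id} after {time.time() - t0:.2f}s")

    # Block until the job completes, then report the full submit-to-done latency.
    sess.monitor_job(job_id)
    print(f"job finished after {time.time() - t0:.2f}s total")
finally:
    sess.close()
```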
-
@yhwen @yanchengnv maybe you can take a look when you get a chance
-
Closing due to inactivity; feel free to re-open.
-
The job submit/start/end lifecycle has overhead, as you observed. Once a job is started, it keeps running until it's done and doesn't incur much additional overhead. So to speed up your overall throughput, please consider using one long-running job for all your computation needs. Flare 2.5 (to be released in September) has a feature that lets you send custom commands to a running job. That way, you can start one job, let it run "forever", and use custom commands to ask it to do whatever computation you need; the computation will be performed immediately.
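As a rough illustration of the long-running-job idea with 2.4, here is a minimal client-side sketch assuming the NVFlare Client API (nvflare.client). The payload layout inside FLModel.params is made up for illustration, and the 2.5 custom-command feature is not shown since its exact interface isn't described in this thread.

```python
# client.py -- runs once per site inside a single long-lived job.
import nvflare.client as flare
from nvflare.app_common.abstract.fl_model import FLModel, ParamsType

flare.init()

# Stay inside the one job for its whole lifetime; each received task is one quick
# computation, so the per-job process startup cost is only paid once.
while flare.is_running():
    task = flare.receive()
    if task is None:
        break

    # Hypothetical payload: assume the server puts the numbers to aggregate in task.params.
    values = task.params.get("values", []) if task.params else []
    local_result = {"sum": float(sum(values)), "count": len(values)}

    # Return the local result; the server-side workflow aggregates across sites.
    flare.send(FLModel(params=local_result, params_type=ParamsType.FULL))
```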
-
Python version (python3 -V): 3.11
NVFlare version (python3 -m pip list | grep "nvflare"): 2.4.0
NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, git branch): No response
Operating system: Docker (slimmed-down image)
Have you successfully run any of the following examples?
Please describe your question
Hello!
We've noticed that, under optimal circumstances, a full computation takes about 10 seconds from submission to results.
We have use cases involving statistical operations that are very quick, so our users expect performance typically below 1 second per computation.
We would be keen for any insight into optimizing for this use case.
Thank you!