FedScale Runtime is a scalable and extensible deployment and evaluation platform. It simplifies and standardizes deployment and experimental setups, and enables model evaluation under practical settings through user-friendly APIs. It evolved from our prior system, the Oort project, which has been shown to scale well and can emulate FL training of thousands of clients in each round.
During benchmarking, FedScale runs on a distributed set of GPUs/CPUs organized in the Parameter-Server (PS) architecture. We have evaluated it using up to 68 GPUs to simulate FL aggregation of 1,300 participants in each round. Each training experiment is fairly time-consuming, as each GPU has to run multiple clients (1300/68 ≈ 19 in our case) in each round.
Note: Due to the high computational load on each GPU, we recommend limiting each GPU to simulating around 20 clients; i.e., if the number of participants in each round is K, it is better to use at least K/20 GPUs.
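As a quick sanity check on cluster sizing, the hypothetical helper below (not part of FedScale) computes the recommended minimum GPU count from this rule:

```python
import math

def min_gpus(participants_per_round: int, clients_per_gpu: int = 20) -> int:
    """Recommended minimum GPU count: ceil(K / clients_per_gpu)."""
    return math.ceil(participants_per_round / clients_per_gpu)

# e.g., 1300 participants per round -> ceil(1300 / 20) = 65 GPUs
print(min_gpus(1300))
```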
The following are estimated prices on Google Cloud for training the ShuffleNet model on the OpenImage dataset using FedScale (may become inaccurate over time):
| Setting | Time to Target Accuracy | Time to Converge |
| --- | --- | --- |
| YoGi | 53 GPU hours (~$97) | 121 GPU hours (~$230) |
Note: Please ensure that these paths are consistent across all nodes so that the FedScale simulator can find the right path.
- Coordinator node: Make sure that the coordinator (master node) has `ssh` access to the other worker nodes.
- All nodes: Follow this to install all necessary libraries, and then download the datasets following this.
The following is an example of submitting a training job; the user submits jobs on the master node.
- `fedscale driver submit [conf.yml]` (or `python docker/driver.py submit [conf.yml]`) will submit a job with the parameters specified in `conf.yml` on both the PS and worker nodes. We provide example `conf.yml` files in `FedScale/benchmark/configs` for each dataset. Comments in our examples will help you quickly understand how to specify these parameters; a sketch of generating such a file appears after this list.
- `fedscale driver stop [job_name]` (or `python docker/driver.py stop [job_name]`) will terminate the running `job_name` (specified in the yml) on the used nodes.
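The sketch below generates a minimal `conf.yml` programmatically with PyYAML. The field names used here (`ps_ip`, `worker_ips`, `job_conf`, `job_name`, `log_path`) are illustrative assumptions; consult the commented examples in `FedScale/benchmark/configs` for the authoritative schema.

```python
# Sketch: generate a minimal conf.yml with PyYAML (pip install pyyaml).
# All field names below are assumptions for illustration; check the
# commented examples in FedScale/benchmark/configs for the real schema.
import yaml

conf = {
    "ps_ip": "10.0.0.1",                     # assumed: coordinator (PS) address
    "worker_ips": ["10.0.0.2", "10.0.0.3"],  # assumed: worker node addresses
    "job_conf": [                            # assumed: per-job parameters
        {"job_name": "openimage_shufflenet"},
        {"log_path": "./FedScale/benchmark/logs"},
    ],
}

with open("conf.yml", "w") as f:
    yaml.safe_dump(conf, f, sort_keys=False)

# Submit it from the master node:
#   fedscale driver submit conf.yml
```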
Note: FedScale supports 20+ datasets and 70+ models.
We have integrated TensorBoard for visualizing experiment results. To track an experiment with `[log_path]` (e.g., `./FedScale/benchmark/logs/cifar10/0209_141336`), run `tensorboard --logdir=[log_path] --bind_all`, and all the results will be available at `http://[ip_of_coordinator]:6006/`.
We also support the WandB Dashboard for job parameter logging, visualization of training/testing results, system metrics collection, and model weight checkpointing. To use WandB, add `wandb_token: "YOUR_WANDB_TOKEN"` under `job_conf` of the job config. All metrics and experiment results will be uploaded to the WandB account connected to your token. To enable model weight checkpointing, add `save_checkpoint: True` under `job_conf` of your job config.
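As a minimal sketch (assuming `job_conf` is a list of single-key mappings, as in the example configs), the snippet below enables both options in an existing `conf.yml`:

```python
# Sketch: enable WandB logging and model checkpointing in an existing
# conf.yml. Assumes job_conf is a list of single-key mappings, as in the
# FedScale example configs; verify against your own config layout.
import yaml

with open("conf.yml") as f:
    conf = yaml.safe_load(f)

conf.setdefault("job_conf", []).extend([
    {"wandb_token": "YOUR_WANDB_TOKEN"},  # uploads metrics to your WandB account
    {"save_checkpoint": True},            # enables model weight checkpointing
])

with open("conf.yml", "w") as f:
    yaml.safe_dump(conf, f, sort_keys=False)
```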
Meanwhile, all logs are dumped to `log_path` (specified in the config file) on each node. `testing_perf`, located on the master node under this path, can be loaded with `pickle` to check the time-to-accuracy performance. The user can also check `/benchmark/[job_name]_logging` to see whether the job is making progress.
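A minimal sketch of loading `testing_perf` with `pickle` on the master node; the example `log_path` follows the TensorBoard example above, and the exact structure of the unpickled object depends on your FedScale version:

```python
# Sketch: inspect the time-to-accuracy results dumped by the master node.
# The structure of the unpickled object varies across FedScale versions;
# print it first to see what is recorded.
import pickle

log_path = "./FedScale/benchmark/logs/cifar10/0209_141336"  # your [log_path]
with open(f"{log_path}/testing_perf", "rb") as f:
    testing_perf = pickle.load(f)

print(type(testing_perf))
print(testing_perf)
```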