Merge branch 'NVIDIA:main' into main
nvidianz authored Aug 12, 2024
2 parents 3704a06 + 6a7e145 commit 65a44e5
Showing 27 changed files with 432 additions and 114 deletions.
4 changes: 4 additions & 0 deletions docs/getting_started.rst
@@ -175,6 +175,10 @@ Using any text editor to edit the Dockerfile and paste the following:
.. literalinclude:: resources/Dockerfile
   :language: dockerfile

.. note::

   For NVFlare version 2.3, set ``PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:23.02-py3``.

We can then build the new container by running ``docker build`` in the directory containing
this Dockerfile, for example tagging it ``nvflare-pt``:
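The build command implied above would look like the following (a sketch only; the tag name and build context are up to you):

```shell
# Build the container image from the directory containing the Dockerfile,
# tagging it nvflare-pt as suggested in the text above
docker build -t nvflare-pt .
```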

1 change: 1 addition & 0 deletions docs/programming_guide.rst
@@ -35,6 +35,7 @@ Please refer to :ref:`application` for more details.
.. toctree::
   :maxdepth: 1

   programming_guide/fed_job_api
   programming_guide/workflows_and_controllers
   programming_guide/execution_api_type
   programming_guide/fl_model
2 changes: 1 addition & 1 deletion docs/programming_guide/component_configuration.rst
@@ -139,7 +139,7 @@ For example:
{
    "id": "shareable_generator",
    "path": "nvflare.app_opt.pt.fedopt.PTFedOptModelShareableGenerator",
    "args": {
        "device": "cpu",
        "source_model": "model",
3 changes: 2 additions & 1 deletion docs/programming_guide/execution_api_type.rst
@@ -35,7 +35,8 @@ The :ref:`client_api` provides the most straightforward way to write FL code,
and can easily be used to convert centralized code with minimal code changes.
The Client API uses the :class:`FLModel<nvflare.app_common.abstract.fl_model.FLModel>`
object for data transfer and supports common tasks such as train, validate, and submit_model.
Options for using decorators or PyTorch Lightning are also available.
For Client API executors, the in-process and external-process executors are provided for different use cases.

We recommend users start with the Client API, and to consider the other types
for more specific cases as required.
65 changes: 36 additions & 29 deletions docs/programming_guide/execution_api_type/client_api.rst
@@ -167,20 +167,26 @@ Client API communication patterns

We offer various implementations of Client APIs tailored to different scenarios, each linked with distinct communication patterns.

In-process Client API
---------------------

The in-process executor runs both the training script and the client executor within the same process.
The training script is launched once at the START_RUN event and keeps running until the END_RUN event.
Communication between them occurs through an efficient in-memory databus.

When the training process involves a single GPU or no GPUs, and the training script doesn't integrate third-party
training systems, the in-process executor is preferable (when available).

Sub-process Client API
----------------------

The LauncherExecutor, on the other hand, employs the SubprocessLauncher to execute the training script in a sub-process, so the client executor
and training script reside in separate processes. The ``launch_once`` option of the SubprocessLauncher controls
whether to launch the external script every time a task is received from the server, or to launch it once at the
START_RUN event and keep it running until the END_RUN event. Communication between them is facilitated by either CellPipe
(default) or FilePipe.

For scenarios involving multi-GPU training or the utilization of external training infrastructure, the Launcher executor may be more suitable.
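The effect of ``launch_once`` on process lifetime can be sketched in plain Python. This is illustrative only: it uses the standard library, not the NVFlare API, to show the launch-per-task vs. launch-once patterns.

```python
# Illustrative sketch of the launch_once semantics using only the standard
# library (this is NOT the NVFlare API, just the process-lifetime pattern).
import subprocess
import sys

tasks = ["round-0", "round-1", "round-2"]

# launch_once = False: a fresh script process is spawned for every task
for t in tasks:
    subprocess.run([sys.executable, "-c", f"print('training {t}')"], check=True)

# launch_once = True: one long-lived process serves all tasks over a pipe
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys\nfor line in sys.stdin: print('handled', line.strip())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
out, _ = proc.communicate("\n".join(tasks))
print(out)
```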


Choice of different Pipes
@@ -203,34 +209,35 @@ Configuration

Different configurations are available for each type of executor.

in-process executor configuration
---------------------------------

This configuration specifically caters to PyTorch applications, providing serialization and deserialization
(aka Decomposers) for commonly used PyTorch objects. For non-PyTorch applications, the generic
:class:`InProcessClientAPIExecutor<nvflare.app_common.executors.in_process_client_api_executor.InProcessClientAPIExecutor>` can be employed.

.. literalinclude:: ../../../job_templates/sag_pt_in_proc/config_fed_client.conf

subprocess launcher Executor configuration
------------------------------------------

In the config_fed_client in the FLARE app, in order to launch the training script, we use the
:class:`SubprocessLauncher<nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher>` component.
The defined ``script`` is invoked, and ``launch_once`` can be set to either
launch once for the whole job (``launch_once = True``) or launch a process for each task received from the server (``launch_once = False``).

``launch_once`` dictates how many times the training script is invoked during the overall training process.
When set to False, the executor essentially invokes ``python <training script>.py`` every round of training.
Typically, launch_once is set to True.

A corresponding :class:`ClientAPILauncherExecutor<nvflare.app_common.executors.client_api_launcher_executor.ClientAPILauncherExecutor>`
is used as the executor to handle the tasks and perform the data exchange using the pipe.
For the Pipe component, we provide implementations of :class:`FilePipe<nvflare.fuel.utils.pipe.file_pipe>`
and :class:`CellPipe<nvflare.fuel.utils.pipe.cell_pipe>`.

.. literalinclude:: ../../../job_templates/sag_pt/config_fed_client.conf

For example configurations, take a look at the :github_nvflare_link:`job_templates <job_templates>`
directory for templates using the launcher and Client API.

.. note::
   In the case that the user does not need to launch the process and instead
2 changes: 1 addition & 1 deletion docs/programming_guide/execution_api_type/executor.rst
@@ -93,7 +93,7 @@ processes to use.
"local_epochs": 5,
"steps_aggregation": 0,
"model_reader_writer": {
"path": "nvflare.app_opt.pt.model_reader_writer.PTModelReaderWriter"
}
}
}
252 changes: 252 additions & 0 deletions docs/programming_guide/fed_job_api.rst
@@ -0,0 +1,252 @@
.. _fed_job_api:

##########
FedJob API
##########

The FLARE :class:`FedJob<nvflare.job_config.fed_job.FedJob>` API allows users to Pythonically define and create job configurations.

Core Concepts
=============

* Use the :func:`to<nvflare.job_config.fed_job.FedJob.to>` routine to assign objects (e.g. controllers, executors, models, filters, components, etc.) to the server or clients.
* Export the job to a configuration with :func:`export_job<nvflare.job_config.fed_job.FedJob.export_job>`.
* Run the job in the simulator with :func:`simulator_run<nvflare.job_config.fed_job.FedJob.simulator_run>`.

Table overview of the :class:`FedJob<nvflare.job_config.fed_job.FedJob>` API:

.. list-table:: FedJob
   :widths: 25 35 50
   :header-rows: 1

   * - API
     - Description
     - API Doc Link
   * - to
     - Assign object to target.
     - :func:`to<nvflare.job_config.fed_job.FedJob.to>`
   * - to_server
     - Assign object to server.
     - :func:`to_server<nvflare.job_config.fed_job.FedJob.to_server>`
   * - to_clients
     - Assign object to all clients.
     - :func:`to_clients<nvflare.job_config.fed_job.FedJob.to_clients>`
   * - as_id
     - Return generated uuid of object. Object will be added as component if referenced.
     - :func:`as_id<nvflare.job_config.fed_job.FedJob.as_id>`
   * - simulator_run
     - Run the job with the simulator.
     - :func:`simulator_run<nvflare.job_config.fed_job.FedJob.simulator_run>`
   * - export_job
     - Export the job configuration.
     - :func:`export_job<nvflare.job_config.fed_job.FedJob.export_job>`


Here is an example of how to create a simple cifar10_fedavg job using the :class:`FedJob<nvflare.job_config.fed_job.FedJob>` API.
We assign a FedAvg controller and the initial PyTorch model to the server, and assign a ScriptExecutor for our training script to the clients.
Then we use the simulator to run the job:

.. code-block:: python

    from src.net import Net

    from nvflare import FedAvg, FedJob, ScriptExecutor

    if __name__ == "__main__":
        n_clients = 2
        num_rounds = 2
        train_script = "src/cifar10_fl.py"

        job = FedJob(name="cifar10_fedavg")

        # Define the controller workflow and send to server
        controller = FedAvg(
            num_clients=n_clients,
            num_rounds=num_rounds,
        )
        job.to_server(controller)

        # Define the initial global model and send to server
        job.to_server(Net())

        # Send executor to all clients
        executor = ScriptExecutor(
            task_script_path=train_script, task_script_args=""  # f"--batch_size 32 --data_path /tmp/data/site-{i}"
        )
        job.to_clients(executor)

        # job.export_job("/tmp/nvflare/jobs/job_config")
        job.simulator_run("/tmp/nvflare/jobs/workdir", n_clients=n_clients)

Initializing the FedJob
=======================

Initialize the :class:`FedJob<nvflare.job_config.fed_job.FedJob>` object with the following arguments:

* ``name`` (str): the job name.
* ``min_clients`` (int): the minimum number of clients required for the job; will be set in the ``meta.json``.
* ``mandatory_clients`` (List[str]): the clients required to run the job; will be set in the ``meta.json``.
* ``key_metric`` (str): the metric used for global model selection; will be used by the preconfigured :class:`IntimeModelSelector<nvflare.app_common.widgets.intime_model_selector.IntimeModelSelector>`.

Example:

.. code-block:: python

    job = FedJob(name="cifar10_fedavg", min_clients=2, mandatory_clients=["site-1", "site-2"], key_metric="accuracy")

Assigning objects with :func:`to<nvflare.job_config.fed_job.FedJob.to>`
=======================================================================

Assign objects with :func:`to<nvflare.job_config.fed_job.FedJob.to>` for a specific ``target``,
:func:`to_server<nvflare.job_config.fed_job.FedJob.to_server>` for the server, and
:func:`to_clients<nvflare.job_config.fed_job.FedJob.to_clients>` for all the clients.

These functions have the following parameters which are used depending on the type of object:

* ``obj`` (any): The object to be assigned. If no id is provided, the ``obj`` will be given a default id based on its type.
* ``target`` (str): (For :func:`to<nvflare.job_config.fed_job.FedJob.to>`) The target location of the object. Can be "server" or a client name, e.g. "site-1".
* ``tasks`` (List[str]): If the object is an Executor or Filter, an optional list of tasks that should be handled. Defaults to None, in which case all tasks will be handled using ``[*]``.
* ``gpu`` (int | List[int]): GPU index or list of GPU indices used for simulating the run on that target.
* ``filter_type`` (FilterType): The type of filter used. Either ``FilterType.TASK_RESULT`` or ``FilterType.TASK_DATA``.
* ``id`` (str): Optional user-defined id for the object. Defaults to None, in which case an id will be assigned automatically.

.. note::

   In order for the FedJob to use the values of arguments passed into the ``obj``, the arguments must be set as instance variables of the same name (or prefixed with ``_``) in the constructor.
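A minimal sketch of this constructor convention (``MyComponent`` is a hypothetical class, not part of NVFlare):

```python
# Hypothetical component illustrating the constructor convention: each
# argument is stored as an instance variable of the same name (or with a
# leading underscore), so FedJob can recover its value for the job config.
class MyComponent:
    def __init__(self, device: str = "cpu", lr: float = 0.01):
        self.device = device  # same name as the argument: recoverable
        self._lr = lr         # leading underscore: also recoverable

comp = MyComponent(device="cuda", lr=0.1)
print(comp.device, comp._lr)  # cuda 0.1
```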

Below we cover in-depth how different types of objects are handled when using :func:`to<nvflare.job_config.fed_job.FedJob.to>`:

Controller
----------

If the object is a :class:`Controller<nvflare.apis.impl.controller.Controller>` sent to the server, the controller is added to the server app workflows.

* If the ``key_metric`` is defined in the FedJob (see initialization), an :class:`IntimeModelSelector<nvflare.app_common.widgets.intime_model_selector.IntimeModelSelector>` widget will be added for best model selection.
* A :class:`ValidationJsonGenerator<nvflare.app_common.widgets.validation_json_generator.ValidationJsonGenerator>` is automatically added for creating json validation results.
* If PyTorch and TensorBoard are available, then :class:`TBAnalyticsReceiver<nvflare.app_common.pt.tb_receiver.TBAnalyticsReceiver>` is automatically added to receive analytics data and save it to TensorBoard. Other types of receivers can be added as components with :func:`to<nvflare.job_config.fed_job.FedJob.to>`.

Example:

.. code-block:: python

    controller = FedAvg(
        num_clients=n_clients,
        num_rounds=num_rounds,
    )
    job.to(controller, "server")

If the object is a :class:`Controller<nvflare.apis.impl.controller.Controller>` sent to a client, the controller is added to the client app components as a client-side controller.
The controller can then be used by the :class:`ClientControllerExecutor<nvflare.app_common.ccwf.client_controller_executor.ClientControllerExecutor>`.


Executor
--------

If the object is an :class:`Executor<nvflare.apis.executor.Executor>`, it must be sent to a client. The executor is added to the client app executors.

* The ``tasks`` parameter specifies the tasks that the executor is defined to handle.
* The ``gpu`` parameter specifies which GPUs to use for simulating the run on the target.
* If the object is a :class:`ScriptExecutor<nvflare.app_common.executors.script_executor.ScriptExecutor>`, the task_script_path will be added to the external scripts to be included in the custom directory.
* If the object is a :class:`ScriptLauncherExecutor<nvflare.app_common.executors.script_launcher_executor.ScriptLauncherExecutor>`, the launch_script will be launched in a subprocess. Corresponding :class:`SubprocessLauncher<nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher>`, :class:`CellPipe<nvflare.fuel.utils.pipe.cell_pipe.CellPipe>`, :class:`MetricRelay<nvflare.app_common.widgets.metric_relay.MetricRelay>`, and :class:`ExternalConfigurator<nvflare.app_common.widgets.external_configurator.ExternalConfigurator>` components will be automatically configured.
* The :class:`ConvertToFedEvent<nvflare.app_common.widgets.convert_to_fed_event.ConvertToFedEvent>` widget is automatically added to convert local events to federated events.

Example:

.. code-block:: python

    executor = ScriptExecutor(task_script_path="src/cifar10_fl.py", task_script_args="")
    job.to(executor, "site-1", tasks=["train"], gpu=0)

Script (str)
------------

If the object is a str, it is treated as an external script and will be included in the custom directory.

Example:

.. code-block:: python

    job.to("src/cifar10_fl.py", "site-1")

Filter
------

If the object is a :class:`Filter<nvflare.apis.filter.Filter>`, users must specify the ``filter_type``
as either FilterType.TASK_RESULT (flow from executor to controller) or FilterType.TASK_DATA (flow from controller to executor).

The filter will be added to task_data_filters and task_result_filters accordingly and applied to the specified ``tasks``.

Example:

.. code-block:: python

    pp_filter = PercentilePrivacy(percentile=10, gamma=0.01)
    job.to(pp_filter, "site-1", tasks=["train"], filter_type=FilterType.TASK_RESULT)

Model
-----
If the object is a common model type, a corresponding persistor will automatically be configured with the model.

For PyTorch models (``torch.nn.Module``) we add a :class:`PTFileModelPersistor<nvflare.app_opt.pt.file_model_persistor.PTFileModelPersistor>` and
:class:`PTFileModelLocator<nvflare.app_opt.pt.file_model_locator.PTFileModelLocator>`, and for TensorFlow models (``tf.keras.Model``) we add a :class:`TFModelPersistor<nvflare.app_opt.tf.model_persistor.TFModelPersistor>`.

Example:

.. code-block:: python

    job.to(Net(), "server")

For unsupported models, the model and persistor can be added as components.


Components
----------
For any object that does not fall under any of the previous types, it is added as a component with an ``id``.
The ``id`` can either be specified as a parameter, or it will be automatically assigned. Components may reference other components by id.

If an id generated by :func:`as_id<nvflare.job_config.fed_job.FedJob.as_id>` is referenced by another added object, the referenced object will also be added as a component.
In the example below, comp2 is assigned to the server. Since comp1 was referenced in comp2 with :func:`as_id<nvflare.job_config.fed_job.FedJob.as_id>`, comp1 will also be added as a component to the server.

Example:

.. code-block:: python

    comp1 = Component1()
    comp2 = Component2(sub_component_id=job.as_id(comp1))
    job.to(comp2, "server")

Running the Job
===============

Simulator
---------

Run the FedJob with the simulator with :func:`simulator_run<nvflare.job_config.fed_job.FedJob.simulator_run>` in the ``workspace`` with ``n_clients`` and ``threads``.
(Note: only set ``n_clients`` if you have not specified clients using :func:`to<nvflare.job_config.fed_job.FedJob.to>`)

Example:

.. code-block:: python

    job.simulator_run(workspace="/tmp/nvflare/jobs/workdir", n_clients=2, threads=2)

Export Configuration
--------------------
We can export the job configuration with :func:`export_job<nvflare.job_config.fed_job.FedJob.export_job>` to the ``job_root`` directory.

Example:

.. code-block:: python

    job.export_job(job_root="/tmp/nvflare/jobs/job_config")

Examples
========

To see examples of how the FedJob API can be used for different applications, refer to the :github_nvflare_link:`Getting Started <examples/getting_started>` examples.