Merge branch 'NVIDIA:main' into main
nvidianz authored Aug 12, 2024
2 parents 3704a06 + 6a7e145 commit 65a44e5
Showing 27 changed files with 432 additions and 114 deletions.
4 changes: 4 additions & 0 deletions docs/getting_started.rst
@@ -175,6 +175,10 @@ Using any text editor to edit the Dockerfile and paste the following:
.. literalinclude:: resources/Dockerfile
   :language: dockerfile

.. note::

   For NVFlare version 2.3, set ``PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:23.02-py3``.

We can then build the new container by running ``docker build`` in the directory containing
this Dockerfile, for example tagging it ``nvflare-pt``:
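The build command implied above would look like the following (a sketch only; the tag name and build context are up to you):

```shell
# Build the container image from the directory containing the Dockerfile,
# tagging it nvflare-pt as suggested in the text above
docker build -t nvflare-pt .
```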

1 change: 1 addition & 0 deletions docs/programming_guide.rst
@@ -35,6 +35,7 @@ Please refer to :ref:`application` for more details.
.. toctree::
   :maxdepth: 1

   programming_guide/fed_job_api
   programming_guide/workflows_and_controllers
   programming_guide/execution_api_type
   programming_guide/fl_model
2 changes: 1 addition & 1 deletion docs/programming_guide/component_configuration.rst
@@ -139,7 +139,7 @@ For example:
{
    "id": "shareable_generator",
    "path": "nvflare.app_opt.pt.fedopt.PTFedOptModelShareableGenerator",
    "args": {
        "device": "cpu",
        "source_model": "model",
3 changes: 2 additions & 1 deletion docs/programming_guide/execution_api_type.rst
@@ -35,7 +35,8 @@ The :ref:`client_api` provides the most straightforward way to write FL code,
and can easily be used to convert centralized code with minimal code changes.
The Client API uses the :class:`FLModel<nvflare.app_common.abstract.fl_model.FLModel>`
object for data transfer and supports common tasks such as train, validate, and submit_model.
Options for using decorators or PyTorch Lightning are also available.
For Client API executors, the in-process and external-process executors are provided for different use cases.

We recommend users start with the Client API, and to consider the other types
for more specific cases as required.
65 changes: 36 additions & 29 deletions docs/programming_guide/execution_api_type/client_api.rst
@@ -167,20 +167,26 @@ Client API communication patterns

We offer various implementations of Client APIs tailored to different scenarios, each linked with distinct communication patterns.

In-process Client API
---------------------

The in-process executor runs both the training script and the client executor within the same process.
The training script is launched once at the START_RUN event and keeps running until the END_RUN event.
Communication between them occurs through an efficient in-memory databus.

When the training process involves a single GPU or no GPUs, and the training script doesn't integrate third-party
training systems, the in-process executor is preferable (when available).

Sub-process Client API
----------------------

The LauncherExecutor, on the other hand, employs the SubprocessLauncher to execute the training script in a sub-process, so the client executor
and training script reside in separate processes. The ``launch_once`` option of the SubprocessLauncher controls
whether to launch the external script every time a task is received from the server, or to launch it once at the
START_RUN event and keep it running until the END_RUN event. Communication between them is facilitated by either CellPipe
(default) or FilePipe.

For scenarios involving multi-GPU training or the utilization of external training infrastructure, the Launcher executor may be more suitable.
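The effect of ``launch_once`` on process lifetime can be sketched in plain Python. This is illustrative only: it uses the standard library, not the NVFlare API, to show the launch-per-task vs. launch-once patterns.

```python
# Illustrative sketch of the launch_once semantics using only the standard
# library (this is NOT the NVFlare API, just the process-lifetime pattern).
import subprocess
import sys

tasks = ["round-0", "round-1", "round-2"]

# launch_once = False: a fresh script process is spawned for every task
for t in tasks:
    subprocess.run([sys.executable, "-c", f"print('training {t}')"], check=True)

# launch_once = True: one long-lived process serves all tasks over a pipe
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys\nfor line in sys.stdin: print('handled', line.strip())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
out, _ = proc.communicate("\n".join(tasks))
print(out)
```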


Choice of different Pipes
@@ -203,34 +209,35 @@ Configuration

Different configurations are available for each type of executor.

in-process executor configuration
---------------------------------

This configuration specifically caters to PyTorch applications, providing serialization and deserialization
(aka Decomposers) for commonly used PyTorch objects. For non-PyTorch applications, the generic
:class:`InProcessClientAPIExecutor<nvflare.app_common.executors.in_process_client_api_executor.InProcessClientAPIExecutor>` can be employed.

.. literalinclude:: ../../../job_templates/sag_pt_in_proc/config_fed_client.conf

subprocess launcher Executor configuration
------------------------------------------

In the config_fed_client in the FLARE app, in order to launch the training script, we use the
:class:`SubprocessLauncher<nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher>` component.
The defined ``script`` is invoked, and ``launch_once`` can be set to either
launch once for the whole job (``launch_once = True``) or launch a process for each task received from the server (``launch_once = False``).

``launch_once`` dictates how many times the training script is invoked during the overall training process.
When set to False, the executor essentially invokes ``python <training script>.py`` every round of training.
Typically, launch_once is set to True.

A corresponding :class:`ClientAPILauncherExecutor<nvflare.app_common.executors.client_api_launcher_executor.ClientAPILauncherExecutor>`
is used as the executor to handle the tasks and perform the data exchange using the pipe.
For the Pipe component, we provide implementations of :class:`FilePipe<nvflare.fuel.utils.pipe.file_pipe>`
and :class:`CellPipe<nvflare.fuel.utils.pipe.cell_pipe>`.

.. literalinclude:: ../../../job_templates/sag_pt/config_fed_client.conf

For example configurations, take a look at the :github_nvflare_link:`job_templates <job_templates>`
directory for templates using the launcher and Client API.

.. note::
   In the case that the user does not need to launch the process and instead
2 changes: 1 addition & 1 deletion docs/programming_guide/execution_api_type/executor.rst
@@ -93,7 +93,7 @@ processes to use.
"local_epochs": 5,
"steps_aggregation": 0,
"model_reader_writer": {
"path": "nvflare.app_opt.pt.model_reader_writer.PTModelReaderWriter"
}
}
}
252 changes: 252 additions & 0 deletions docs/programming_guide/fed_job_api.rst
@@ -0,0 +1,252 @@
.. _fed_job_api:

##########
FedJob API
##########

The FLARE :class:`FedJob<nvflare.job_config.fed_job.FedJob>` API allows users to Pythonically define and create job configurations.

Core Concepts
=============

* Use the :func:`to<nvflare.job_config.fed_job.FedJob.to>` routine to assign objects (e.g. controllers, executors, models, filters, components, etc.) to the server or clients.
* Export the job to a configuration with :func:`export_job<nvflare.job_config.fed_job.FedJob.export_job>`.
* Run the job in the simulator with :func:`simulator_run<nvflare.job_config.fed_job.FedJob.simulator_run>`.

Table overview of the :class:`FedJob<nvflare.job_config.fed_job.FedJob>` API:

.. list-table:: FedJob
   :widths: 25 35 50
   :header-rows: 1

   * - API
     - Description
     - API Doc Link
   * - to
     - Assign object to target.
     - :func:`to<nvflare.job_config.fed_job.FedJob.to>`
   * - to_server
     - Assign object to server.
     - :func:`to_server<nvflare.job_config.fed_job.FedJob.to_server>`
   * - to_clients
     - Assign object to all clients.
     - :func:`to_clients<nvflare.job_config.fed_job.FedJob.to_clients>`
   * - as_id
     - Return generated uuid of object. Object will be added as component if referenced.
     - :func:`as_id<nvflare.job_config.fed_job.FedJob.as_id>`
   * - simulator_run
     - Run the job with the simulator.
     - :func:`simulator_run<nvflare.job_config.fed_job.FedJob.simulator_run>`
   * - export_job
     - Export the job configuration.
     - :func:`export_job<nvflare.job_config.fed_job.FedJob.export_job>`


Here is an example of how to create a simple cifar10_fedavg job using the :class:`FedJob<nvflare.job_config.fed_job.FedJob>` API.
We assign a FedAvg controller and the initial PyTorch model to the server, and assign a ScriptExecutor for our training script to the clients.
Then we use the simulator to run the job:

.. code-block:: python

    from src.net import Net

    from nvflare import FedAvg, FedJob, ScriptExecutor

    if __name__ == "__main__":
        n_clients = 2
        num_rounds = 2
        train_script = "src/cifar10_fl.py"

        job = FedJob(name="cifar10_fedavg")

        # Define the controller workflow and send to server
        controller = FedAvg(
            num_clients=n_clients,
            num_rounds=num_rounds,
        )
        job.to_server(controller)

        # Define the initial global model and send to server
        job.to_server(Net())

        # Send executor to all clients
        executor = ScriptExecutor(
            task_script_path=train_script, task_script_args=""  # f"--batch_size 32 --data_path /tmp/data/site-{i}"
        )
        job.to_clients(executor)

        # job.export_job("/tmp/nvflare/jobs/job_config")
        job.simulator_run("/tmp/nvflare/jobs/workdir", n_clients=n_clients)

Initializing the FedJob
=======================

Initialize the :class:`FedJob<nvflare.job_config.fed_job.FedJob>` object with the following arguments:

* ``name`` (str): the job name.
* ``min_clients`` (int): the minimum number of clients required for the job; will be set in the ``meta.json``.
* ``mandatory_clients`` (List[str]): the clients required to run the job; will be set in the ``meta.json``.
* ``key_metric`` (str): the metric used for global model selection; will be used by the preconfigured :class:`IntimeModelSelector<nvflare.app_common.widgets.intime_model_selector.IntimeModelSelector>`.

Example:

.. code-block:: python

    job = FedJob(name="cifar10_fedavg", min_clients=2, mandatory_clients=["site-1", "site-2"], key_metric="accuracy")

Assigning objects with :func:`to<nvflare.job_config.fed_job.FedJob.to>`
=======================================================================

Assign objects with :func:`to<nvflare.job_config.fed_job.FedJob.to>` for a specific ``target``,
:func:`to_server<nvflare.job_config.fed_job.FedJob.to_server>` for the server, and
:func:`to_clients<nvflare.job_config.fed_job.FedJob.to_clients>` for all the clients.

These functions have the following parameters which are used depending on the type of object:

* ``obj`` (any): The object to be assigned. If no id is provided, the ``obj`` will be given a default id based on its type.
* ``target`` (str): (For :func:`to<nvflare.job_config.fed_job.FedJob.to>`) The target location of the object. Can be "server" or a client name, e.g. "site-1".
* ``tasks`` (List[str]): If the object is an Executor or Filter, an optional list of tasks that should be handled. Defaults to None, in which case all tasks will be handled using ``[*]``.
* ``gpu`` (int | List[int]): GPU index or list of GPU indices used for simulating the run on that target.
* ``filter_type`` (FilterType): The type of filter used. Either ``FilterType.TASK_RESULT`` or ``FilterType.TASK_DATA``.
* ``id`` (str): Optional user-defined id for the object. Defaults to None, in which case an id will be assigned automatically.

.. note::

   In order for the FedJob to use the values of arguments passed into the ``obj``, the arguments must be set as instance variables of the same name (or prefixed with ``_``) in the constructor.
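A minimal sketch of this constructor convention (``MyComponent`` is a hypothetical class, not part of NVFlare):

```python
# Hypothetical component illustrating the constructor convention: each
# argument is stored as an instance variable of the same name (or with a
# leading underscore), so FedJob can recover its value for the job config.
class MyComponent:
    def __init__(self, device: str = "cpu", lr: float = 0.01):
        self.device = device  # same name as the argument: recoverable
        self._lr = lr         # leading underscore: also recoverable

comp = MyComponent(device="cuda", lr=0.1)
print(comp.device, comp._lr)  # cuda 0.1
```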

Below we cover in-depth how different types of objects are handled when using :func:`to<nvflare.job_config.fed_job.FedJob.to>`:

Controller
----------

If the object is a :class:`Controller<nvflare.apis.impl.controller.Controller>` sent to the server, the controller is added to the server app workflows.

* If the ``key_metric`` is defined in the FedJob (see initialization), an :class:`IntimeModelSelector<nvflare.app_common.widgets.intime_model_selector.IntimeModelSelector>` widget will be added for best model selection.
* A :class:`ValidationJsonGenerator<nvflare.app_common.widgets.validation_json_generator.ValidationJsonGenerator>` is automatically added for creating json validation results.
* If PyTorch and TensorBoard are available, then :class:`TBAnalyticsReceiver<nvflare.app_common.pt.tb_receiver.TBAnalyticsReceiver>` is automatically added to receive analytics data and save it to TensorBoard. Other types of receivers can be added as components with :func:`to<nvflare.job_config.fed_job.FedJob.to>`.

Example:

.. code-block:: python

    controller = FedAvg(
        num_clients=n_clients,
        num_rounds=num_rounds,
    )
    job.to(controller, "server")

If the object is a :class:`Controller<nvflare.apis.impl.controller.Controller>` sent to a client, the controller is added to the client app components as a client-side controller.
The controller can then be used by the :class:`ClientControllerExecutor<nvflare.app_common.ccwf.client_controller_executor.ClientControllerExecutor>`.


Executor
--------

If the object is an :class:`Executor<nvflare.apis.executor.Executor>`, it must be sent to a client. The executor is added to the client app executors.

* The ``tasks`` parameter specifies the tasks that the executor is defined to handle.
* The ``gpu`` parameter specifies which GPUs to use for simulating the run on the target.
* If the object is a :class:`ScriptExecutor<nvflare.app_common.executors.script_executor.ScriptExecutor>`, the task_script_path will be added to the external scripts to be included in the custom directory.
* If the object is a :class:`ScriptLauncherExecutor<nvflare.app_common.executors.script_launcher_executor.ScriptLauncherExecutor>`, the launch_script will be launched in a subprocess. Corresponding :class:`SubprocessLauncher<nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher>`, :class:`CellPipe<nvflare.fuel.utils.pipe.cell_pipe.CellPipe>`, :class:`MetricRelay<nvflare.app_common.widgets.metric_relay.MetricRelay>`, and :class:`ExternalConfigurator<nvflare.app_common.widgets.external_configurator.ExternalConfigurator>` components will be automatically configured.
* The :class:`ConvertToFedEvent<nvflare.app_common.widgets.convert_to_fed_event.ConvertToFedEvent>` widget is automatically added to convert local events to federated events.

Example:

.. code-block:: python

    executor = ScriptExecutor(task_script_path="src/cifar10_fl.py", task_script_args="")
    job.to(executor, "site-1", tasks=["train"], gpu=0)

Script (str)
------------

If the object is a str, it is treated as an external script and will be included in the custom directory.

Example:

.. code-block:: python

    job.to("src/cifar10_fl.py", "site-1")

Filter
------

If the object is a :class:`Filter<nvflare.apis.filter.Filter>`, users must specify the ``filter_type``
as either FilterType.TASK_RESULT (flow from executor to controller) or FilterType.TASK_DATA (flow from controller to executor).

The filter will be added to task_data_filters and task_result_filters accordingly and applied to the specified ``tasks``.

Example:

.. code-block:: python

    pp_filter = PercentilePrivacy(percentile=10, gamma=0.01)
    job.to(pp_filter, "site-1", tasks=["train"], filter_type=FilterType.TASK_RESULT)

Model
-----
If the object is a common model type, a corresponding persistor will automatically be configured with the model.

For PyTorch models (``torch.nn.Module``) we add a :class:`PTFileModelPersistor<nvflare.app_opt.pt.file_model_persistor.PTFileModelPersistor>` and
:class:`PTFileModelLocator<nvflare.app_opt.pt.file_model_locator.PTFileModelLocator>`, and for TensorFlow models (``tf.keras.Model``) we add a :class:`TFModelPersistor<nvflare.app_opt.tf.model_persistor.TFModelPersistor>`.

Example:

.. code-block:: python

    job.to(Net(), "server")

For unsupported models, the model and persistor can be added as components.


Components
----------
For any object that does not fall under any of the previous types, it is added as a component with an ``id``.
The ``id`` can either be specified as a parameter, or it will be automatically assigned. Components may reference other components by id.

If an id generated by :func:`as_id<nvflare.job_config.fed_job.FedJob.as_id>` is referenced by another added object, the referenced object will also be added as a component.
In the example below, comp2 is assigned to the server. Since comp1 was referenced in comp2 with :func:`as_id<nvflare.job_config.fed_job.FedJob.as_id>`, comp1 will also be added as a component to the server.

Example:

.. code-block:: python

    comp1 = Component1()
    comp2 = Component2(sub_component_id=job.as_id(comp1))
    job.to(comp2, "server")

Running the Job
===============

Simulator
---------

Run the FedJob with the simulator with :func:`simulator_run<nvflare.job_config.fed_job.FedJob.simulator_run>` in the ``workspace`` with ``n_clients`` and ``threads``.
(Note: only set ``n_clients`` if you have not specified clients using :func:`to<nvflare.job_config.fed_job.FedJob.to>`)

Example:

.. code-block:: python

    job.simulator_run(workspace="/tmp/nvflare/jobs/workdir", n_clients=2, threads=2)

Export Configuration
--------------------
We can export the job configuration with :func:`export_job<nvflare.job_config.fed_job.FedJob.export_job>` to the ``job_root`` directory.

Example:

.. code-block:: python

    job.export_job(job_root="/tmp/nvflare/jobs/job_config")

Examples
========

To see examples of how the FedJob API can be used for different applications, refer to the :github_nvflare_link:`Getting Started <examples/getting_started>` examples.