add docs for Flower integration (NVIDIA#2862)
nvkevlu authored Aug 28, 2024
1 parent 9737f53 commit 8ef0ec4
Showing 13 changed files with 405 additions and 2 deletions.
Binary file added docs/resources/FLARE_as_flower_communicator.png
Binary file added docs/resources/flare_flower_communication.png
Binary file added docs/resources/system_architecture.png
1 change: 1 addition & 0 deletions docs/user_guide.rst
@@ -24,3 +24,4 @@ please refer to the :ref:`programming_guide`.
user_guide/confidential_computing
user_guide/hierarchy_unification_bridge
user_guide/federated_xgboost
user_guide/flower_integration
@@ -1,10 +1,13 @@
.. _reliable_xgboost_timeout:

############################################
Reliable Federated XGBoost Timeout Mechanism
############################################

NVFlare provides a tightly-coupled integration with XGBoost.
NVFlare implements the :class:`ReliableMessage<nvflare.apis.utils.reliable_message.ReliableMessage>`
mechanism to make XGBoost's server/client interactions more robust over
unstable internet connections.

An unstable internet connection is one where the connections between
the communication endpoints experience random disconnects/reconnects and unstable speed.
@@ -194,6 +194,8 @@ See :ref:`provisioning` for details.

Job Configuration
=================
.. _secure_xgboost_controller:

Controller
----------

23 changes: 23 additions & 0 deletions docs/user_guide/flower_integration.rst
@@ -0,0 +1,23 @@
####################################################
Integration of Flower Applications with NVIDIA FLARE
####################################################

`Flower <https://flower.ai>`_ is an open-source project that implements a unified approach
to federated learning, analytics, and evaluation. Flower offers a large set of
strategies and algorithms for FL application development and has a healthy FL research community.

FLARE, on the other hand, has been focusing on providing an enterprise-ready, robust runtime
environment for FL applications.

With the integration of Flower and FLARE, applications developed with the Flower framework
run easily in the FLARE runtime environment without needing any changes. All the user needs to do
is configure the Flower application into a FLARE job and submit the job to the FLARE system.


.. toctree::
:maxdepth: 1

flower_integration/flower_initial_integration
flower_integration/flare_multi_job_architecture
flower_integration/flower_detailed_design
flower_integration/flower_reliable_messaging
@@ -0,0 +1,23 @@
****************************
FLARE Multi-Job Architecture
****************************

To maximize the utilization of compute resources, FLARE supports multiple jobs running at the
same time, where each job is an independent FL experiment.

.. image:: ../../resources/system_architecture.png

As shown in the diagram above, there is a Server Control Process (SCP) on the Server host, and there is a
Client Control Process (CCP) on each client host. The SCP communicates with CCPs to manage jobs (schedule,
deploy, monitor, and abort jobs). When a job is scheduled by the SCP, the job is sent to the CCPs of all sites,
which create separate processes for the job. These processes form a “Job Network” for the job. This network
goes away when the job is finished.

The diagram shows 3 jobs (J1, J2, J3) in different colors on server and client(s). For example, all J1 processes
form the “job network” for Job 1.

By default, processes of the same job network are not connected directly. Instead, they only connect to the SCP,
and all messages between job processes are relayed through the SCP. However, if network policy permits, direct
P2P connections could be established automatically between the job processes to obtain maximum communication
speed. The underlying communication path is transparent to applications and only requires config changes to
enable direct communication.
26 changes: 26 additions & 0 deletions docs/user_guide/flower_integration/flower_detailed_design.rst
@@ -0,0 +1,26 @@
***************
Detailed Design
***************

Flower uses gRPC as the communication protocol. To use FLARE as the communicator, we route Flower's gRPC
messages through FLARE. To do so, we change the server-endpoint of each Flower client to a local gRPC
server (LGS) within the FLARE client.

.. image:: ../../resources/FLARE_as_flower_communicator.png

As shown in this diagram, there is a local gRPC server (LGS) for each site that serves as the
server-endpoint for the Flower client on the site. Similarly, there is a local gRPC client (LGC) on the
FLARE Server that interacts with the Flower Server. The message path between the Flower Client and the Flower
Server is as follows:

- The Flower client generates a gRPC message and sends it to the LGS in the FLARE Client.
- The FLARE Client forwards the message to the FLARE Server as a reliable FLARE message.
- The FLARE Server uses the LGC to send the message to the Flower Server.
- The Flower Server sends the response back to the LGC in the FLARE Server.
- The FLARE Server sends the response back to the FLARE Client.
- The FLARE Client sends the response back to the Flower Client via the LGS.
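
As a toy illustration only (the real components are gRPC services, and none of these names
are NVFlare or Flower APIs), the six-hop path above can be sketched as a chain of
forwarding functions, where each hop hands the request to the next component and the
response travels back along the same chain:

.. code-block:: python

    # Toy stand-ins for the components in the diagram; this sketch only
    # mimics the message path, not the actual gRPC transport.

    def flower_server(request: bytes) -> bytes:
        # The Flower Server processes the request and produces a response.
        return b"response-to-" + request

    def lgc_send(request: bytes) -> bytes:
        # The LGC on the FLARE Server forwards the message to the Flower Server.
        return flower_server(request)

    def flare_server_relay(request: bytes) -> bytes:
        # The FLARE Server relays the message to its LGC.
        return lgc_send(request)

    def flare_client_forward(request: bytes) -> bytes:
        # The FLARE Client sends a reliable FLARE message to the FLARE Server.
        return flare_server_relay(request)

    def lgs_handle(request: bytes) -> bytes:
        # The LGS inside the FLARE Client receives the Flower client's gRPC call.
        return flare_client_forward(request)

    # The Flower client sends a message to its local server endpoint (the LGS):
    reply = lgs_handle(b"fit-ins")
    print(reply)  # b'response-to-fit-ins'

The response retraces the hops in reverse order, which is why each function simply
returns the value of the next hop.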

Please note that the Flower Client could be running as a separate process or within the same process as the FLARE Client.

This enables users to directly deploy Flower ServerApps and ClientApps in the
NVFlare Runtime Environment. No code changes are necessary!
28 changes: 28 additions & 0 deletions docs/user_guide/flower_integration/flower_initial_integration.rst
@@ -0,0 +1,28 @@
*******************
Initial Integration
*******************

Architecturally, Flower uses client/server communication. Clients communicate with the server
via gRPC. FLARE uses the same architecture with the enhancement that multiple jobs can run at
the same time (each job requires one set of clients/server) without requiring multiple ports to
be open on the server host.

Since both frameworks follow the same communication architecture, it is fairly easy to make a
Flower application a FLARE job by using FLARE as the communicator for the Flower app, as shown below.

.. image:: ../../resources/FLARE_as_flower_communicator.png

In this approach, Flower Clients no longer directly interact with the Flower Server; instead, all
communication goes through FLARE.

The integration with FLARE-based communication has some unique benefits:

- Provisioning of startup kits, including certificates
- Deployment of custom code (apps)
- User authentication and authorization
- :class:`ReliableMessage<nvflare.apis.utils.reliable_message.ReliableMessage>` mechanism to counter connection stability issues
- Multiple communication schemes (gRPC, HTTP, TCP, Redis, etc.) are available
- P2P communication: anyone can talk to anyone else without needing topology changes
- Support of P2P communication encryption (on top of SSL)
- Multi-job system that allows multiple Flower apps to run at the same time without needing extra ports on the server host
- Use additional NVFlare features like experiment tracking
145 changes: 145 additions & 0 deletions docs/user_guide/flower_integration/flower_job_structure.rst
@@ -0,0 +1,145 @@
********************
Flower Job Structure
********************
Even though Flower programming is outside the scope of the FLARE/Flower integration, you need a good
understanding of the Flower Job Structure when submitting a Flower job to FLARE.

A Flower job is a regular FLARE job with special requirements for the ``custom`` directory, as shown below.

.. code-block:: none

    ├── flwr_pt
    │   ├── client.py      # <-- contains `ClientApp`
    │   ├── __init__.py    # <-- to register the python module
    │   ├── server.py      # <-- contains `ServerApp`
    │   └── task.py        # <-- task-specific code (model, data)
    └── pyproject.toml     # <-- Flower project file

Project Folder
==============
All Flower app code must be placed in a subfolder in the ``custom`` directory of the job. This subfolder is called
the project folder of the app. In this example, the project folder is named ``flwr_pt``. Typically, this folder
contains ``server.py``, ``client.py``, and the ``__init__.py``. Though you could organize them differently (see discussion
below), we recommend always including the ``__init__.py`` so that the project folder is guaranteed to be a valid Python
package, regardless of Python versions.
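
A quick way to see why ``__init__.py`` matters: with it present, the project folder imports
as a regular package. The sketch below (the file contents are placeholders, not real app
code) builds such a folder in a temporary directory and imports from it:

.. code-block:: python

    import importlib
    import os
    import sys
    import tempfile

    # Build a minimal project folder and confirm it imports as a package.
    root = tempfile.mkdtemp()
    pkg = os.path.join(root, "flwr_pt")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()  # marks the folder as a package
    with open(os.path.join(pkg, "server.py"), "w") as f:
        f.write("app = 'placeholder-server-app'\n")

    sys.path.insert(0, root)
    loaded = importlib.import_module("flwr_pt.server").app
    print(loaded)  # placeholder-server-app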

Pyproject.toml
==============
The ``pyproject.toml`` file exists in the job's ``custom`` folder. It is an important file that contains server and
client app definition and configuration information. Such information is used by the Flower system to find the
server app and the client app, and to pass app-specific configuration to the apps.

Here is an example of ``pyproject.toml``, taken from https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-flower/jobs/hello-flwr-pt/app/custom/pyproject.toml.

.. code-block:: toml

    [build-system]
    requires = ["hatchling"]
    build-backend = "hatchling.build"

    [project]
    name = "flwr_pt"
    version = "1.0.0"
    description = ""
    license = "Apache-2.0"
    dependencies = [
        "flwr[simulation]>=1.11.0,<2.0",
        "nvflare~=2.5.0rc",
        "torch==2.2.1",
        "torchvision==0.17.1",
    ]

    [tool.hatch.build.targets.wheel]
    packages = ["."]

    [tool.flwr.app]
    publisher = "nvidia"

    [tool.flwr.app.components]
    serverapp = "flwr_pt.server:app"
    clientapp = "flwr_pt.client:app"

    [tool.flwr.app.config]
    num-server-rounds = 3

    [tool.flwr.federations]
    default = "local-simulation"

    [tool.flwr.federations.local-simulation]
    options.num-supernodes = 2

.. note:: The information defined in ``pyproject.toml`` must match the code in the project folder!

Project Name
------------
The project name should match the name of the project folder, though this is not a requirement. In this example, it is ``flwr_pt``.

Serverapp Specification
-----------------------

This value is specified following this format:

.. code-block:: toml

    <server_app_module>:<server_app_var_name>

where:

- The <server_app_module> is the module that contains the server app code. This module is usually defined as ``server.py`` in the project folder (flwr_pt in this example).
- The <server_app_var_name> is the name of the variable that holds the ServerApp object in the <server_app_module>. This variable is usually defined as ``app``:

.. code-block:: python

    app = ServerApp(server_fn=server_fn)

Clientapp Specification
------------------------
This value is specified following this format:

.. code-block:: toml

    <client_app_module>:<client_app_var_name>

where:

- The <client_app_module> is the module that contains the client app code. This module is usually defined as ``client.py`` in the project folder (flwr_pt in this example).
- The <client_app_var_name> is the name of the variable that holds the ClientApp object in the <client_app_module>. This variable is usually defined as ``app``:

.. code-block:: python

    app = ClientApp(client_fn=client_fn)

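The ``module:var`` format is a standard Python object-reference convention. A minimal
resolver (a sketch to show the idea, not Flower's actual loading code; ``resolve_app`` is a
hypothetical name) would look like this:

.. code-block:: python

    import importlib

    def resolve_app(spec: str):
        """Resolve a '<module>:<var>' spec into the named object."""
        module_name, _, var_name = spec.partition(":")
        module = importlib.import_module(module_name)
        return getattr(module, var_name)

    # Demonstrated with a stdlib object, since the flwr_pt package is not
    # importable here:
    fn = resolve_app("os.path:join")
    print(fn("a", "b"))  # "a/b" on POSIX systems
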
App Configuration
-----------------
The pyproject.toml file can contain app config information, in the ``[tool.flwr.app.config]`` section. In this example,
it defines the number of rounds:

.. code-block:: toml

    [tool.flwr.app.config]
    num-server-rounds = 3

The content of this section is specific to the server app code. The ``server.py`` in the example shows how this is used:

.. code-block:: python

    def server_fn(context: Context):
        # Read from config
        num_rounds = context.run_config["num-server-rounds"]

        # Define config
        config = ServerConfig(num_rounds=num_rounds)

        return ServerAppComponents(strategy=strategy, config=config)

Supernode Count
---------------
If you run the Flower job with its simulation (not as a FLARE job), you need to specify how many clients (supernodes) to use
for the simulation in the ``[tool.flwr.federations.local-simulation]`` section, like this:

.. code-block:: toml

    options.num-supernodes = 2

However, this setting does not apply when the job is submitted as a FLARE job.
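
The dotted key maps to a nested table when the file is parsed. A small sketch using
Python's standard ``tomllib`` module (available in Python 3.11+) shows the resulting
structure:

.. code-block:: python

    import tomllib

    snippet = """
    [tool.flwr.federations.local-simulation]
    options.num-supernodes = 2
    """

    cfg = tomllib.loads(snippet)
    sim = cfg["tool"]["flwr"]["federations"]["local-simulation"]
    print(sim["options"]["num-supernodes"])  # 2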
19 changes: 19 additions & 0 deletions docs/user_guide/flower_integration/flower_reliable_messaging.rst
@@ -0,0 +1,19 @@
******************
Reliable Messaging
******************

The interaction between the FLARE Clients and Server is through reliable messaging.
First, the requester tries to send the request to the peer. If the send fails, it retries a moment later.
This process repeats until the request is sent successfully or the maximum amount of time has passed (which
causes the job to abort).

Second, once the request is sent, the requester waits for the response. Once the peer finishes processing, it
immediately sends the result (successful or not) to the requester. At the same time, the
requester repeatedly queries the peer for the result, until the result is received or the maximum amount
of time has passed (which causes the job to abort). The result can be received in one of the following ways:

- The result is received from the response message sent by the peer when it finishes the processing
- The result is received from the response to the query message of the requester

For details of :class:`ReliableMessage<nvflare.apis.utils.reliable_message.ReliableMessage>`,
see :ref:`ReliableMessage Timeout <reliable_xgboost_timeout>`.