add docs for Flower integration (NVIDIA#2862)
nvkevlu authored Aug 28, 2024
1 parent 9737f53 commit 8ef0ec4
Showing 13 changed files with 405 additions and 2 deletions.
Binary file added docs/resources/FLARE_as_flower_communicator.png
Binary file added docs/resources/flare_flower_communication.png
Binary file added docs/resources/system_architecture.png
1 change: 1 addition & 0 deletions docs/user_guide.rst
@@ -24,3 +24,4 @@ please refer to the :ref:`programming_guide`.
user_guide/confidential_computing
user_guide/hierarchy_unification_bridge
user_guide/federated_xgboost
user_guide/flower_integration
@@ -1,10 +1,13 @@
.. _reliable_xgboost_timeout:

############################################
Reliable Federated XGBoost Timeout Mechanism
############################################

NVFlare provides a tightly-coupled integration with XGBoost.
NVFlare implements the :class:`ReliableMessage<nvflare.apis.utils.reliable_message.ReliableMessage>`
mechanism to make XGBoost's server/client interactions more robust over
unstable internet connections.

An unstable internet connection is one where the connections between
the communication endpoints experience random disconnects/reconnects and unstable speed.
@@ -194,6 +194,8 @@ See :ref:`provisioning` for details.

Job Configuration
=================
.. _secure_xgboost_controller:

Controller
----------

23 changes: 23 additions & 0 deletions docs/user_guide/flower_integration.rst
@@ -0,0 +1,23 @@
####################################################
Integration of Flower Applications with NVIDIA FLARE
####################################################

`Flower <https://flower.ai>`_ is an open-source project that implements a unified approach
to federated learning, analytics, and evaluation. Flower offers a large set of
strategies and algorithms for FL application development and has a healthy FL research community.

FLARE, on the other hand, has been focusing on providing an enterprise-ready, robust runtime
environment for FL applications.

With the integration of Flower and FLARE, applications developed with the Flower framework
run easily in the FLARE runtime environment without needing any changes. All the user needs to do
is configure the Flower application into a FLARE job and submit the job to the FLARE system.


.. toctree::
:maxdepth: 1

flower_integration/flower_initial_integration
flower_integration/flare_multi_job_architecture
flower_integration/flower_detailed_design
flower_integration/flower_reliable_messaging
@@ -0,0 +1,23 @@
****************************
FLARE Multi-Job Architecture
****************************

To maximize the utilization of compute resources, FLARE supports multiple jobs running at the
same time, where each job is an independent FL experiment.

.. image:: ../../resources/system_architecture.png

As shown in the diagram above, there is a Server Control Process (SCP) on the Server host, and there is a
Client Control Process (CCP) on each client host. The SCP communicates with CCPs to manage jobs (schedule,
deploy, monitor, and abort jobs). When a job is scheduled by the SCP, the job is sent to the CCPs of all sites,
which create separate processes for the job. These processes form a “Job Network” for the job. This network
goes away when the job is finished.

The diagram shows 3 jobs (J1, J2, J3) in different colors on server and client(s). For example, all J1 processes
form the “job network” for Job 1.

By default, processes of the same job network are not connected directly. Instead, they only connect to the SCP,
and all messages between job processes are relayed through the SCP. However, if network policy permits, direct
P2P connections could be established automatically between the job processes to obtain maximum communication
speed. The underlying communication path is transparent to applications and only requires config changes to
enable direct communication.
26 changes: 26 additions & 0 deletions docs/user_guide/flower_integration/flower_detailed_design.rst
@@ -0,0 +1,26 @@
***************
Detailed Design
***************

Flower uses gRPC as the communication protocol. To use FLARE as the communicator, we route Flower's gRPC
messages through FLARE. To do so, we change the server-endpoint of each Flower client to a local gRPC
server (LGS) within the FLARE client.

.. image:: ../../resources/FLARE_as_flower_communicator.png

As shown in this diagram, there is a local gRPC server (LGS) for each site that serves as the
server-endpoint for the Flower client on the site. Similarly, there is a local gRPC client (LGC) on the
FLARE Server that interacts with the Flower Server. The message path between the Flower Client and the Flower
Server is as follows:

- The Flower client generates a gRPC message and sends it to the LGS in the FLARE Client.
- The FLARE Client forwards the message to the FLARE Server as a reliable FLARE message.
- The FLARE Server uses the LGC to send the message to the Flower Server.
- The Flower Server sends the response back to the LGC in the FLARE Server.
- The FLARE Server sends the response back to the FLARE Client.
- The FLARE Client sends the response back to the Flower Client via the LGS.
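
As a toy illustration only (the real components are gRPC services, and none of these names
are NVFlare or Flower APIs), the six-hop path above can be sketched as a chain of
forwarding functions, where each hop hands the request to the next component and the
response travels back along the same chain:

.. code-block:: python

    # Toy stand-ins for the components in the diagram; this sketch only
    # mimics the message path, not the actual gRPC transport.

    def flower_server(request: bytes) -> bytes:
        # The Flower Server processes the request and produces a response.
        return b"response-to-" + request

    def lgc_send(request: bytes) -> bytes:
        # The LGC on the FLARE Server forwards the message to the Flower Server.
        return flower_server(request)

    def flare_server_relay(request: bytes) -> bytes:
        # The FLARE Server relays the message to its LGC.
        return lgc_send(request)

    def flare_client_forward(request: bytes) -> bytes:
        # The FLARE Client sends a reliable FLARE message to the FLARE Server.
        return flare_server_relay(request)

    def lgs_handle(request: bytes) -> bytes:
        # The LGS inside the FLARE Client receives the Flower client's gRPC call.
        return flare_client_forward(request)

    # The Flower client sends a message to its local server endpoint (the LGS):
    reply = lgs_handle(b"fit-ins")
    print(reply)  # b'response-to-fit-ins'

The response retraces the hops in reverse order, which is why each function simply
returns the value of the next hop.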

Please note that the Flower Client could be running as a separate process or within the same process as the FLARE Client.

This enables users to directly deploy Flower ServerApps and ClientApps in the
NVFlare Runtime Environment. No code changes are necessary!
28 changes: 28 additions & 0 deletions docs/user_guide/flower_integration/flower_initial_integration.rst
@@ -0,0 +1,28 @@
*******************
Initial Integration
*******************

Architecturally, Flower uses client/server communication. Clients communicate with the server
via gRPC. FLARE uses the same architecture with the enhancement that multiple jobs can run at
the same time (each job requires one set of clients/server) without requiring multiple ports to
be open on the server host.

Since both frameworks follow the same communication architecture, it is fairly easy to make a
Flower application a FLARE job by using FLARE as the communicator for the Flower app, as shown below.

.. image:: ../../resources/FLARE_as_flower_communicator.png

In this approach, Flower Clients no longer directly interact with the Flower Server; instead, all
communication goes through FLARE.

The integration with FLARE-based communication has some unique benefits:

- Provisioning of startup kits, including certificates
- Deployment of custom code (apps)
- User authentication and authorization
- :class:`ReliableMessage<nvflare.apis.utils.reliable_message.ReliableMessage>` mechanism to counter connection stability issues
- Multiple communication schemes (gRPC, HTTP, TCP, Redis, etc.) are available
- P2P communication: anyone can talk to anyone else without needing topology changes
- Support of P2P communication encryption (on top of SSL)
- Multi-job system that allows multiple Flower apps to run at the same time without needing extra ports on the server host
- Use additional NVFlare features like experiment tracking
145 changes: 145 additions & 0 deletions docs/user_guide/flower_integration/flower_job_structure.rst
@@ -0,0 +1,145 @@
********************
Flower Job Structure
********************
Even though Flower programming is outside the scope of the FLARE/Flower integration, you need a good
understanding of the Flower Job Structure when submitting a Flower job to FLARE.

A Flower job is a regular FLARE job with special requirements for the ``custom`` directory, as shown below.

.. code-block:: none

    ├── flwr_pt
    │   ├── client.py      # <-- contains `ClientApp`
    │   ├── __init__.py    # <-- to register the python module
    │   ├── server.py      # <-- contains `ServerApp`
    │   └── task.py        # <-- task-specific code (model, data)
    └── pyproject.toml     # <-- Flower project file

Project Folder
==============
All Flower app code must be placed in a subfolder in the ``custom`` directory of the job. This subfolder is called
the project folder of the app. In this example, the project folder is named ``flwr_pt``. Typically, this folder
contains ``server.py``, ``client.py``, and the ``__init__.py``. Though you could organize them differently (see discussion
below), we recommend always including the ``__init__.py`` so that the project folder is guaranteed to be a valid Python
package, regardless of Python versions.
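
A quick way to see why ``__init__.py`` matters: with it present, the project folder imports
as a regular package. The sketch below (the file contents are placeholders, not real app
code) builds such a folder in a temporary directory and imports from it:

.. code-block:: python

    import importlib
    import os
    import sys
    import tempfile

    # Build a minimal project folder and confirm it imports as a package.
    root = tempfile.mkdtemp()
    pkg = os.path.join(root, "flwr_pt")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()  # marks the folder as a package
    with open(os.path.join(pkg, "server.py"), "w") as f:
        f.write("app = 'placeholder-server-app'\n")

    sys.path.insert(0, root)
    loaded = importlib.import_module("flwr_pt.server").app
    print(loaded)  # placeholder-server-app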

Pyproject.toml
==============
The ``pyproject.toml`` file exists in the job's ``custom`` folder. It is an important file that contains server and
client app definition and configuration information. Such information is used by the Flower system to find the
server app and the client app, and to pass app-specific configuration to the apps.

Here is an example of ``pyproject.toml``, taken from https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-flower/jobs/hello-flwr-pt/app/custom/pyproject.toml.

.. code-block:: toml

    [build-system]
    requires = ["hatchling"]
    build-backend = "hatchling.build"

    [project]
    name = "flwr_pt"
    version = "1.0.0"
    description = ""
    license = "Apache-2.0"
    dependencies = [
        "flwr[simulation]>=1.11.0,<2.0",
        "nvflare~=2.5.0rc",
        "torch==2.2.1",
        "torchvision==0.17.1",
    ]

    [tool.hatch.build.targets.wheel]
    packages = ["."]

    [tool.flwr.app]
    publisher = "nvidia"

    [tool.flwr.app.components]
    serverapp = "flwr_pt.server:app"
    clientapp = "flwr_pt.client:app"

    [tool.flwr.app.config]
    num-server-rounds = 3

    [tool.flwr.federations]
    default = "local-simulation"

    [tool.flwr.federations.local-simulation]
    options.num-supernodes = 2

.. note:: The information defined in ``pyproject.toml`` must match the code in the project folder!

Project Name
------------
The project name should match the name of the project folder, though this is not a requirement. In this example, it is ``flwr_pt``.

Serverapp Specification
-----------------------

This value is specified following this format:

.. code-block:: toml

    <server_app_module>:<server_app_var_name>

where:

- The <server_app_module> is the module that contains the server app code. This module is usually defined as ``server.py`` in the project folder (flwr_pt in this example).
- The <server_app_var_name> is the name of the variable that holds the ServerApp object in the <server_app_module>. This variable is usually defined as ``app``:

.. code-block:: python

    app = ServerApp(server_fn=server_fn)

Clientapp Specification
------------------------
This value is specified following this format:

.. code-block:: toml

    <client_app_module>:<client_app_var_name>

where:

- The <client_app_module> is the module that contains the client app code. This module is usually defined as ``client.py`` in the project folder (flwr_pt in this example).
- The <client_app_var_name> is the name of the variable that holds the ClientApp object in the <client_app_module>. This variable is usually defined as ``app``:

.. code-block:: python

    app = ClientApp(client_fn=client_fn)

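The ``module:var`` format is a standard Python object-reference convention. A minimal
resolver (a sketch to show the idea, not Flower's actual loading code; ``resolve_app`` is a
hypothetical name) would look like this:

.. code-block:: python

    import importlib

    def resolve_app(spec: str):
        """Resolve a '<module>:<var>' spec into the named object."""
        module_name, _, var_name = spec.partition(":")
        module = importlib.import_module(module_name)
        return getattr(module, var_name)

    # Demonstrated with a stdlib object, since the flwr_pt package is not
    # importable here:
    fn = resolve_app("os.path:join")
    print(fn("a", "b"))  # "a/b" on POSIX systems
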
App Configuration
-----------------
The pyproject.toml file can contain app config information, in the ``[tool.flwr.app.config]`` section. In this example,
it defines the number of rounds:

.. code-block:: toml

    [tool.flwr.app.config]
    num-server-rounds = 3

The content of this section is specific to the server app code. The ``server.py`` in the example shows how this is used:

.. code-block:: python

    def server_fn(context: Context):
        # Read from config
        num_rounds = context.run_config["num-server-rounds"]

        # Define config
        config = ServerConfig(num_rounds=num_rounds)

        return ServerAppComponents(strategy=strategy, config=config)

Supernode Count
---------------
If you run the Flower job with its simulation (not as a FLARE job), you need to specify how many clients (supernodes) to use
for the simulation in the ``[tool.flwr.federations.local-simulation]`` section, like this:

.. code-block:: toml

    options.num-supernodes = 2

However, this setting does not apply when the job is submitted as a FLARE job.
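
The dotted key maps to a nested table when the file is parsed. A small sketch using
Python's standard ``tomllib`` module (available in Python 3.11+) shows the resulting
structure:

.. code-block:: python

    import tomllib

    snippet = """
    [tool.flwr.federations.local-simulation]
    options.num-supernodes = 2
    """

    cfg = tomllib.loads(snippet)
    sim = cfg["tool"]["flwr"]["federations"]["local-simulation"]
    print(sim["options"]["num-supernodes"])  # 2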
19 changes: 19 additions & 0 deletions docs/user_guide/flower_integration/flower_reliable_messaging.rst
@@ -0,0 +1,19 @@
******************
Reliable Messaging
******************

The interaction between the FLARE Clients and Server is through reliable messaging.
First, the requester tries to send the request to the peer. If the send fails, it retries a moment later.
This process repeats until the request is sent successfully or the maximum amount of time has passed (which
causes the job to abort).

Second, once the request is sent, the requester waits for the response. Once the peer finishes processing, it
immediately sends the result (successful or not) to the requester. At the same time, the
requester repeatedly queries the peer for the result, until the result is received or the maximum amount
of time has passed (which causes the job to abort). The result can be received in one of the following ways:

- The result is received from the response message sent by the peer when it finishes the processing
- The result is received from the response to the query message of the requester

For details of :class:`ReliableMessage<nvflare.apis.utils.reliable_message.ReliableMessage>`,
see :ref:`ReliableMessage Timeout <reliable_xgboost_timeout>`.