add docs for Flower integration (NVIDIA#2862)
Showing 13 changed files with 405 additions and 2 deletions.
docs/user_guide/federated_xgboost/reliable_xgboost_timeout.rst (5 additions, 2 deletions)
####################################################
Integration of Flower Applications with NVIDIA FLARE
####################################################

`Flower <https://flower.ai>`_ is an open-source project that implements a unified approach
to federated learning, analytics, and evaluation. Flower has developed a large set of
strategies and algorithms for FL application development and a healthy FL research community.

FLARE, on the other hand, has focused on providing an enterprise-ready, robust runtime
environment for FL applications.

With the integration of Flower and FLARE, applications developed with the Flower framework
run easily in the FLARE runtime without any code changes. All the user needs to do
is configure the Flower application as a FLARE job and submit the job to the FLARE system.

.. toctree::
   :maxdepth: 1

   flower_integration/flower_initial_integration
   flower_integration/flare_multi_job_architecture
   flower_integration/flower_detailed_design
   flower_integration/flower_reliable_messaging
docs/user_guide/flower_integration/flare_multi_job_architecture.rst (23 additions, 0 deletions)
****************************
FLARE Multi-Job Architecture
****************************

To maximize the utilization of compute resources, FLARE supports multiple jobs running at the
same time, where each job is an independent FL experiment.

.. image:: ../../resources/system_architecture.png

As shown in the diagram above, there is a Server Control Process (SCP) on the server host and a
Client Control Process (CCP) on each client host. The SCP communicates with the CCPs to manage jobs:
scheduling, deploying, monitoring, and aborting them. When the SCP schedules a job, the job is sent
to the CCPs of all sites, which create separate processes for it. These processes form a “Job Network”
for the job, which goes away when the job finishes.

The diagram shows three jobs (J1, J2, J3) in different colors on the server and clients. For example,
all J1 processes form the “job network” for Job 1.

By default, processes of the same job network are not connected directly. Instead, they connect only to the SCP,
and all messages between job processes are relayed through the SCP. However, if network policy permits, direct
P2P connections can be established automatically between the job processes for maximum communication
speed. The underlying communication path is transparent to applications; only configuration changes are needed to
enable direct communication.
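The relay-by-default behavior can be sketched with a toy router in Python. All class and process names below are illustrative stand-ins, not actual NVFlare APIs; the real SCP does far more than this.

```python
from collections import defaultdict


class SCPRouter:
    """Toy stand-in for the Server Control Process: relays messages
    between processes that belong to the same job network."""

    def __init__(self):
        # job_id -> {process_name: inbox (list of (sender, message))}
        self.job_networks = defaultdict(dict)

    def register(self, job_id, process_name):
        """Add a job process to the job network identified by job_id."""
        self.job_networks[job_id][process_name] = []

    def relay(self, job_id, sender, recipient, message):
        """Deliver a message; processes never talk directly by default."""
        network = self.job_networks[job_id]
        if sender not in network or recipient not in network:
            raise KeyError("both endpoints must belong to the same job network")
        network[recipient].append((sender, message))


# Processes of job J1 register with the router, then exchange a message.
scp = SCPRouter()
scp.register("J1", "server_job_process")
scp.register("J1", "site-1_job_process")
scp.relay("J1", "site-1_job_process", "server_job_process", "weights-update")
```

Messages for different jobs land in different job networks, so concurrent jobs never interfere with each other.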
docs/user_guide/flower_integration/flower_detailed_design.rst (26 additions, 0 deletions)
***************
Detailed Design
***************

Flower uses gRPC as its communication protocol. To use FLARE as the communicator, we route Flower's gRPC
messages through FLARE. To do so, we change the server endpoint of each Flower client to a local gRPC
server (LGS) within the FLARE client.

.. image:: ../../resources/FLARE_as_flower_communicator.png

As shown in this diagram, there is a Local gRPC Server (LGS) on each site that serves as the
server endpoint for the Flower client on that site. Similarly, there is a Local gRPC Client (LGC) on the
FLARE Server that interacts with the Flower Server. The message path between the Flower Client and the Flower
Server is as follows:

- The Flower Client generates a gRPC message and sends it to the LGS in the FLARE Client.
- The FLARE Client forwards the message to the FLARE Server. This is a reliable FLARE message.
- The FLARE Server uses the LGC to send the message to the Flower Server.
- The Flower Server sends the response back to the LGC in the FLARE Server.
- The FLARE Server sends the response back to the FLARE Client.
- The FLARE Client sends the response back to the Flower Client via the LGS.

Note that the Flower Client can run as a separate process or within the same process as the FLARE Client.

This enables users to directly deploy Flower ServerApps and ClientApps within the
NVFlare runtime environment. No code changes are necessary!
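The six-hop path above can be sketched as a chain of toy classes. All names here are illustrative stand-ins, not actual NVFlare or Flower APIs; the point is only the shape of the round trip.

```python
class FlowerServer:
    """Stand-in for the real Flower Server: processes a message and replies."""

    def handle(self, msg):
        return f"response-to:{msg}"


class FlareServer:
    """Holds the Local gRPC Client (LGC) that talks to the Flower Server."""

    def __init__(self, flower_server):
        self.flower_server = flower_server

    def forward(self, msg):
        # Step 3: the LGC delivers the message; step 4: the Flower Server answers.
        return self.flower_server.handle(msg)


class FlareClient:
    """Hosts the Local gRPC Server (LGS) that the Flower Client connects to."""

    def __init__(self, flare_server):
        self.flare_server = flare_server

    def lgs_receive(self, msg):
        # Step 2: forward over a reliable FLARE message; steps 5-6 carry the reply back.
        return self.flare_server.forward(msg)


class FlowerClient:
    """Its server endpoint is the LGS, not the real Flower Server."""

    def __init__(self, flare_client):
        self.flare_client = flare_client

    def send(self, msg):
        # Step 1: the Flower Client sends its gRPC message to the LGS.
        return self.flare_client.lgs_receive(msg)


client = FlowerClient(FlareClient(FlareServer(FlowerServer())))
reply = client.send("fit-instruction")  # "response-to:fit-instruction"
```

The Flower code at either end only ever sees an ordinary gRPC endpoint; the FLARE hops in the middle are invisible to it.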
docs/user_guide/flower_integration/flower_initial_integration.rst (28 additions, 0 deletions)
*******************
Initial Integration
*******************

Architecturally, Flower uses client/server communication. Clients communicate with the server
via gRPC. FLARE uses the same architecture, with the enhancement that multiple jobs can run at
the same time (each job requires one set of clients and a server) without requiring multiple ports to
be open on the server host.

Since both frameworks follow the same communication architecture, it is fairly easy to make a
Flower application a FLARE job by using FLARE as the communicator for the Flower app, as shown below.

.. image:: ../../resources/FLARE_as_flower_communicator.png

In this approach, Flower Clients no longer interact directly with the Flower Server; instead, all
communication goes through FLARE.

The integration with FLARE-based communication has some unique benefits:

- Provisioning of startup kits, including certificates
- Deployment of custom code (apps)
- User authentication and authorization
- :class:`ReliableMessage<nvflare.apis.utils.reliable_message.ReliableMessage>` mechanism to counter connection stability issues
- Multiple communication schemes (gRPC, HTTP, TCP, Redis, etc.)
- P2P communication: anyone can talk to anyone else without topology changes
- Support for P2P communication encryption (on top of SSL)
- Multi-job system that allows multiple Flower apps to run at the same time without extra ports on the server host
- Additional NVFlare features like experiment tracking
docs/user_guide/flower_integration/flower_job_structure.rst (145 additions, 0 deletions)
********************
Flower Job Structure
********************

Even though Flower programming is out of the scope of the FLARE/Flower integration, you need a good
understanding of the Flower job structure when submitting to FLARE.

A Flower job is a regular FLARE job with special requirements for the ``custom`` directory, as shown below.

.. code-block:: none

    ├── flwr_pt
    │   ├── client.py        # <-- contains `ClientApp`
    │   ├── __init__.py      # <-- to register the python module
    │   ├── server.py        # <-- contains `ServerApp`
    │   └── task.py          # <-- task-specific code (model, data)
    └── pyproject.toml       # <-- Flower project file
Project Folder
==============

All Flower app code must be placed in a subfolder of the job's ``custom`` directory. This subfolder is called
the project folder of the app. In this example, the project folder is named ``flwr_pt``. Typically, this folder
contains ``server.py``, ``client.py``, and ``__init__.py``. Though you could organize them differently (see the
discussion below), we recommend always including ``__init__.py`` so that the project folder is guaranteed to be a
valid Python package, regardless of the Python version.
Pyproject.toml
==============

The ``pyproject.toml`` file lives in the job's ``custom`` folder. It is an important file that contains the server
and client app definitions and configuration information. This information is used by the Flower system to find the
server app and the client app, and to pass app-specific configuration to them.

Here is an example of ``pyproject.toml``, taken from https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-flower/jobs/hello-flwr-pt/app/custom/pyproject.toml:

.. code-block:: toml

    [build-system]
    requires = ["hatchling"]
    build-backend = "hatchling.build"

    [project]
    name = "flwr_pt"
    version = "1.0.0"
    description = ""
    license = "Apache-2.0"
    dependencies = [
        "flwr[simulation]>=1.11.0,<2.0",
        "nvflare~=2.5.0rc",
        "torch==2.2.1",
        "torchvision==0.17.1",
    ]

    [tool.hatch.build.targets.wheel]
    packages = ["."]

    [tool.flwr.app]
    publisher = "nvidia"

    [tool.flwr.app.components]
    serverapp = "flwr_pt.server:app"
    clientapp = "flwr_pt.client:app"

    [tool.flwr.app.config]
    num-server-rounds = 3

    [tool.flwr.federations]
    default = "local-simulation"

    [tool.flwr.federations.local-simulation]
    options.num-supernodes = 2

.. note:: The information defined in ``pyproject.toml`` must match the code in the project folder!
Project Name
------------

The project name should match the name of the project folder, though this is not a requirement. In this example,
it is ``flwr_pt``.

Serverapp Specification
-----------------------

This value is specified in the following format:

.. code-block:: toml

    <server_app_module>:<server_app_var_name>

where:

- The <server_app_module> is the module that contains the server app code. This module is usually defined as ``server.py`` in the project folder (``flwr_pt`` in this example).
- The <server_app_var_name> is the name of the variable that holds the ServerApp object in the <server_app_module>. This variable is usually defined as ``app``:

.. code-block:: python

    app = ServerApp(server_fn=server_fn)
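The ``<module>:<variable>`` spec format can be resolved with the standard ``importlib`` machinery. This generic sketch shows the mechanics; a stdlib module is used as the stand-in target, since ``flwr_pt`` only exists inside a job:

```python
import importlib


def resolve(spec: str):
    """Resolve a '<module>:<var>' spec string, like the serverapp and
    clientapp entries in pyproject.toml, into the named Python object."""
    module_name, _, var_name = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, var_name)


# Demonstrated with stdlib targets rather than "flwr_pt.server:app":
fn = resolve("os.path:join")
print(fn("a", "b"))  # joins path segments, e.g. "a/b" on POSIX

cls = resolve("collections:OrderedDict")
print(cls.__name__)  # OrderedDict
```

This is essentially the same ``module:attribute`` object-reference convention used by Python packaging entry points.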
Clientapp Specification
-----------------------

This value is specified in the following format:

.. code-block:: toml

    <client_app_module>:<client_app_var_name>

where:

- The <client_app_module> is the module that contains the client app code. This module is usually defined as ``client.py`` in the project folder (``flwr_pt`` in this example).
- The <client_app_var_name> is the name of the variable that holds the ClientApp object in the <client_app_module>. This variable is usually defined as ``app``:

.. code-block:: python

    app = ClientApp(client_fn=client_fn)
App Configuration
-----------------

The ``pyproject.toml`` file can contain app configuration information in the ``[tool.flwr.app.config]`` section.
In this example, it defines the number of rounds:

.. code-block:: toml

    [tool.flwr.app.config]
    num-server-rounds = 3

The content of this section is specific to the server app code. The ``server.py`` in the example shows how it is used:

.. code-block:: python

    from flwr.common import Context
    from flwr.server import ServerAppComponents, ServerConfig

    def server_fn(context: Context):
        # Read from config
        num_rounds = context.run_config["num-server-rounds"]

        # Define config
        config = ServerConfig(num_rounds=num_rounds)

        # `strategy` (e.g., a FedAvg instance) is defined elsewhere in server.py
        return ServerAppComponents(strategy=strategy, config=config)
Supernode Count
---------------

If you run the Flower job with its own simulation (not as a FLARE job), you need to specify how many clients
(supernodes) to use for the simulation in the ``[tool.flwr.federations.local-simulation]`` section, like this:

.. code-block:: toml

    options.num-supernodes = 2

This setting does not apply when the app is submitted as a FLARE job.
docs/user_guide/flower_integration/flower_reliable_messaging.rst (19 additions, 0 deletions)
******************
Reliable Messaging
******************

The interaction between FLARE clients and the server uses reliable messaging.
First, the requester tries to send the request to the peer. If the send fails, it retries a moment later.
This process repeats until the request is sent successfully or the maximum amount of time has passed (which
causes the job to abort).

Second, once the request is sent, the requester waits for the response. When the peer finishes processing, it
immediately sends the result (successful or not) to the requester. At the same time, the requester repeatedly
queries the peer for the result, until the result is received or the maximum amount of time has passed (which
causes the job to abort). The result can arrive in one of two ways:

- in the response message sent by the peer when it finishes processing
- in the response to one of the requester's query messages

For details of :class:`ReliableMessage<nvflare.apis.utils.reliable_message.ReliableMessage>`,
see :ref:`ReliableMessage Timeout <reliable_xgboost_timeout>`.
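The send-with-retry half of the protocol can be sketched as follows. The function and parameter names, as well as the timings, are illustrative only; they are not NVFlare's actual ReliableMessage API.

```python
import time


def reliable_send(send_fn, max_wait=5.0, retry_interval=0.01):
    """Keep retrying send_fn until it succeeds or max_wait elapses.

    Mirrors the behavior described above: retry on failure, and give up
    (aborting the job) once the maximum amount of time has passed.
    """
    deadline = time.monotonic() + max_wait
    while True:
        try:
            return send_fn()
        except ConnectionError:
            if time.monotonic() >= deadline:
                raise TimeoutError("request could not be sent; job would abort")
            time.sleep(retry_interval)


# A flaky peer that rejects the first two attempts, then accepts.
attempts = {"n": 0}

def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("peer unreachable")
    return "request-accepted"


result = reliable_send(flaky_send)
print(result, "after", attempts["n"], "attempts")
```

The result-query half of the protocol follows the same deadline pattern: poll the peer at intervals until either the result arrives or the deadline passes.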