
[WIP] Federated learning plugin. #10410

Closed

Conversation

@trivialfis (Member) commented Jun 12, 2024

The underlying logic is the same as #10231, with a couple of changes in the interface. This PR is still at an extremely early stage; I'm opening it for collaboration and discussion. The code is not ready for merge.

At the moment, vertical learning is blocked by an IPC issue in nvflare where the message size can change for the allgather-V call. I haven't tested horizontal learning yet.

Difference

A C interface

The original PR proposes a C++ interface. This seems convenient but has some drawbacks:

  • The cross-DLL problem. C++ is poorly suited to defining DLL boundaries; it is extremely difficult to build a robust loadable DLL. For instance, memory allocated by the plugin should never be freed by the host: the malloc and free calls (along with new and delete) must come from the same version of libc and the C++ runtime, built with the same C++ compiler. XGBoost is distributed as binaries on PyPI, which makes this requirement difficult to fulfill. The same rule applies to CUDA memory.
  • In addition, exceptions should not escape a DLL, meaning C++ exceptions thrown in the plugin must never reach XGBoost. One can look up the cross-DLL problem online for more information. Using a C interface is the simplest way to avoid these issues (see the sketch after this list).
  • Libraries loading each other is brittle: it can cause issues with static initialization and recursive loading. GRPC is running into issues with cyclic deps.
  • Removes the circular dependency.
  • The lifetime of the plugin is now tied to the communicator instead of a new global variable.
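To illustrate the ownership rule in the thread's own test language, here is a minimal Python/ctypes sketch. The library path matches the nvflare example later in this thread, but the exported symbol names are hypothetical, not the actual plugin API:

```python
import ctypes

# Load the plugin as a plain C shared library (path taken from the test
# config below; the symbol names are hypothetical, for illustration only).
lib = ctypes.CDLL("/tmp/libnvflare_proc.so")
lib.FederatedPluginCreate.restype = ctypes.c_void_p
lib.FederatedPluginClose.argtypes = [ctypes.c_void_p]

handle = lib.FederatedPluginCreate()
try:
    # Call the plugin's build/sync entry points here. Any buffer the
    # plugin returns stays owned by the plugin; the host never frees it.
    pass
finally:
    # The library that allocated the handle is the one that releases it,
    # so malloc/free (and new/delete) always pair up inside one runtime.
    lib.FederatedPluginClose(handle)
```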

Split up the histogram implementation

  • The original PR packs the federated learning logic into the existing histogram builder. I split it out into the plugin directory as a policy class. The policy class gives people working on federated learning a playground that doesn't affect the main XGBoost logic, and changes there will meet less resistance from reviewers (see the sketch below).
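For illustration only, a rough Python rendering of that split (the real code is C++; these class names are made up):

```python
class DefaultHistogramPolicy:
    """The existing in-core histogram path, left untouched."""

    def build_histogram(self, rows, gpairs):
        ...  # the current parallel build in main XGBoost


class FederatedHistogramPolicy:
    """Federated path: delegates to the encryption plugin and lives in
    the plugin directory, so changes here don't touch the main logic."""

    def __init__(self, plugin):
        self.plugin = plugin

    def build_histogram(self, rows, gpairs):
        return self.plugin.build(rows, gpairs)
```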

Remove secure data split

  • collective::IsFederated() will be the only source of information about whether training is performed in an encrypted space. If we don't remove the secure split in data, we end up with 2 × 2 × 2 = 8 different states to support: row-split/col-split × secure/insecure × federated/normal, which is difficult to maintain.
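The blow-up is easy to see by enumerating the combinations (plain Python, just to make the counting explicit):

```python
import itertools

# 2 splits x 2 security modes x 2 deployment modes = 8 states to maintain.
states = list(itertools.product(
    ["row-split", "col-split"],
    ["secure", "insecure"],
    ["federated", "normal"],
))
assert len(states) == 8
```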

Rebased to the new collective implementation

This PR uses the latest GRPC backend.

Interface

The plugin interface I have at the moment has two major steps for both horizontal and vertical:

  • Build
    Takes in raw data (vertical) or a locally built histogram (horizontal), and returns an encrypted/encoded histogram.
  • Sync
    Takes an encrypted histogram, and returns a decrypted one.

The plugin itself owns all the buffers, which are freed when the plugin is closed. In addition, minimal data copying is required from the interface's perspective; for instance, we can directly reuse the row partition as input for the plugin. Lastly, since the histogram procedure is split into a policy class, feel free to modify the plugin interface in the future as you see fit.
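To make the two steps concrete, here is a minimal sketch of that shape in Python (the names are illustrative; the real interface is a C API):

```python
class FederatedPluginSketch:
    """Illustrative two-step protocol; not the actual C symbols."""

    def build(self, payload: bytes) -> bytes:
        """Take raw data (vertical) or a local histogram (horizontal)
        and return an encrypted/encoded histogram. The returned buffer
        is owned by the plugin."""
        raise NotImplementedError

    def sync(self, encrypted: bytes) -> bytes:
        """Take the gathered encrypted histogram and return the
        decrypted histogram, again in a plugin-owned buffer."""
        raise NotImplementedError
```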

@trivialfis (Member, Author) commented Jun 13, 2024

Many existing federated tests have to be disabled with the plugin:

  • GPU-related tests
  • multi-class tests
  • column sampling and constraint tests
  • quantile-based objectives

Do we really want to merge it at this stage?

cc @rongou @ZiyueXu77 @nvidianz

@rongou (Contributor) commented Jun 13, 2024

Sounds like a no? :)

@ZiyueXu77 commented

I like the idea of limiting the data split to horizontal and vertical, while controlling the encryption with the plugin (pass-by vs. encrypt).

Some notes on secure vertical vs. our existing, well-established vertical (@rongou's implementation):

  • The sync logic/stage is different: existing vertical lets passive parties compute to the end (local best splits) and then sync, while secure vertical lets passive parties compute only up to the G/H histograms.
  • Because of this, the feature-histogram setup on each party is different: in existing vertical, each party holds only its own features, while secure vertical pre-allocates and syncs the slots so that every party has a global view of the feature set.
  • This would be the root cause of the column sampling issues, etc. We should also examine the efficiency difference.
  • Of course, another thing is that we do not have GPU support for the secure schemes.

@trivialfis (Member, Author) commented

> We shall also examine the efficiency difference

This is indeed concerning. The current implementation in this PR should be significantly slower than the existing one: histogram building is the bottleneck in XGBoost, and the PR switches it from the existing parallel build to a single-threaded build. We can optimize it, of course, but it would be great if we could refactor the two paths to share more code instead of maintaining two different histogram-build code paths.

@trivialfis trivialfis marked this pull request as ready for review June 14, 2024 17:12
@trivialfis trivialfis changed the title [WIP] Federated learning plugin. Federated learning plugin. Jun 14, 2024
@trivialfis (Member, Author) commented

@rongou Please help review when you are available. We will be targeting a feature branch instead of the master branch for this PR.

@trivialfis (Member, Author) commented

Brought back the tests by @rongou. Here are the configurations:

  • Federated but not encrypted:

    ```python
    communicator_env = {
        "dmlc_communicator": "federated",
        "federated_server_address": f"localhost:{port}",
        "federated_world_size": world_size,
        "federated_rank": rank,
    }
    ```

  • Federated and encrypted:

    ```python
    communicator_env = {
        "dmlc_communicator": "federated",
        "federated_server_address": f"localhost:{port}",
        "federated_world_size": world_size,
        "federated_rank": rank,
        "federated_plugin": {
            "name": "nvflare",
            "path": "/tmp/libnvflare_proc.so",
        },
    }
    ```

  • Federated and encrypted with the mock plugin:

    ```python
    communicator_env = {
        "dmlc_communicator": "federated",
        "federated_server_address": f"localhost:{port}",
        "federated_world_size": world_size,
        "federated_rank": rank,
        "federated_plugin": {"name": "mock"},
    }
    ```

Each of them has a column-split and a row-split variant.

I think this is getting out of hand now. We need a better way to manage them.
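One possible way to rein this in (a sketch only, not part of this PR): derive every variant from a single helper and enumerate the plugin configurations in the test harness:

```python
def make_env(port, world_size, rank, plugin=None):
    """Build a communicator config; `plugin` is None for plain federated
    training, or one of the plugin dicts from the examples above."""
    env = {
        "dmlc_communicator": "federated",
        "federated_server_address": f"localhost:{port}",
        "federated_world_size": world_size,
        "federated_rank": rank,
    }
    if plugin is not None:
        env["federated_plugin"] = plugin
    return env


# The three variants above, times the row-/column-split dimension.
PLUGINS = [
    None,                                                    # not encrypted
    {"name": "nvflare", "path": "/tmp/libnvflare_proc.so"},  # encrypted
    {"name": "mock"},                                        # encrypted, mock
]
```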

@trivialfis trivialfis closed this Jun 18, 2024
@trivialfis trivialfis reopened this Jul 2, 2024
@trivialfis trivialfis changed the title Federated learning plugin. [WIP] Federated learning plugin. Jul 2, 2024
@trivialfis trivialfis marked this pull request as draft July 2, 2024 16:49
@trivialfis trivialfis closed this Jul 2, 2024