
[WIP] Federated learning plugin. #10410

Closed

Conversation

@trivialfis (Member) commented Jun 12, 2024

The underlying logic is the same as #10231, with a couple of changes in the interface. This PR is still at an extremely early stage; I'm opening it for collaboration and discussion. The code is not ready for merge.

At the moment, vertical learning is blocked by an IPC issue in nvflare where the message size can change for the allgather-V call. I haven't tested horizontal learning yet.

Difference

A C interface

The original PR proposes a C++ interface. This seems convenient but has some drawbacks:

  • The cross-DLL problem. C++ is poorly suited to defining DLL boundaries; it is extremely difficult to build a robust loadable DLL. For instance, memory allocated by the plugin should never be freed by the host: the malloc and free calls (along with new and delete) must come from the same version of libc and the C++ runtime, built with the same C++ compiler. XGBoost is distributed as binaries on PyPI, which makes this requirement difficult to fulfill. The same rule applies to CUDA memory.
  • In addition, exceptions should not escape a DLL, meaning C++ exceptions thrown in the plugin must never reach XGBoost. One can look up the cross-DLL problem online for more information. Using a C interface is the simplest way to avoid these issues (see the sketch after this list).
  • Libraries loading each other is brittle: it can cause issues with static initialization and recursive loading. GRPC is running into issues with cyclic deps.
  • Removes the circular dependency.
  • The lifetime of the plugin is now tied to the communicator instead of a new global variable.
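To illustrate the ownership rule in the thread's own test language, here is a minimal Python/ctypes sketch. The library path matches the nvflare example later in this thread, but the exported symbol names are hypothetical, not the actual plugin API:

```python
import ctypes

# Load the plugin as a plain C shared library (path taken from the test
# config below; the symbol names are hypothetical, for illustration only).
lib = ctypes.CDLL("/tmp/libnvflare_proc.so")
lib.FederatedPluginCreate.restype = ctypes.c_void_p
lib.FederatedPluginClose.argtypes = [ctypes.c_void_p]

handle = lib.FederatedPluginCreate()
try:
    # Call the plugin's build/sync entry points here. Any buffer the
    # plugin returns stays owned by the plugin; the host never frees it.
    pass
finally:
    # The library that allocated the handle is the one that releases it,
    # so malloc/free (and new/delete) always pair up inside one runtime.
    lib.FederatedPluginClose(handle)
```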

Split up the histogram implementation

  • The original PR packs the federated learning logic into the existing histogram builder. I split it out into the plugin directory as a policy class. The policy class gives people working on federated learning a playground that doesn't affect the main XGBoost logic, and changes there will meet less resistance from reviewers (see the sketch below).
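For illustration only, a rough Python rendering of that split (the real code is C++; these class names are made up):

```python
class DefaultHistogramPolicy:
    """The existing in-core histogram path, left untouched."""

    def build_histogram(self, rows, gpairs):
        ...  # the current parallel build in main XGBoost


class FederatedHistogramPolicy:
    """Federated path: delegates to the encryption plugin and lives in
    the plugin directory, so changes here don't touch the main logic."""

    def __init__(self, plugin):
        self.plugin = plugin

    def build_histogram(self, rows, gpairs):
        return self.plugin.build(rows, gpairs)
```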

Remove secure data split

  • collective::IsFederated() will be the only source of information about whether training is performed in an encrypted space. If we don't remove the secure split in data, we end up with 2 × 2 × 2 = 8 different states to support: row-split/col-split × secure/insecure × federated/normal, which is difficult to maintain.
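The blow-up is easy to see by enumerating the combinations (plain Python, just to make the counting explicit):

```python
import itertools

# 2 splits x 2 security modes x 2 deployment modes = 8 states to maintain.
states = list(itertools.product(
    ["row-split", "col-split"],
    ["secure", "insecure"],
    ["federated", "normal"],
))
assert len(states) == 8
```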

Rebased to the new collective implementation

This PR uses the latest GRPC backend.

Interface

The plugin interface I have at the moment has two major steps for both horizontal and vertical:

  • Build
    Takes in raw data (vertical) or a locally built histogram (horizontal), and returns an encrypted/encoded histogram.
  • Sync
    Takes an encrypted histogram, and returns a decrypted one.

The plugin itself owns all the buffers, which are freed when the plugin is closed. In addition, minimal data copying is required from the interface's perspective; for instance, we can directly reuse the row partition as input for the plugin. Lastly, since the histogram procedure is split into a policy class, feel free to modify the plugin interface in the future as you see fit.
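To make the two steps concrete, here is a minimal sketch of that shape in Python (the names are illustrative; the real interface is a C API):

```python
class FederatedPluginSketch:
    """Illustrative two-step protocol; not the actual C symbols."""

    def build(self, payload: bytes) -> bytes:
        """Take raw data (vertical) or a local histogram (horizontal)
        and return an encrypted/encoded histogram. The returned buffer
        is owned by the plugin."""
        raise NotImplementedError

    def sync(self, encrypted: bytes) -> bytes:
        """Take the gathered encrypted histogram and return the
        decrypted histogram, again in a plugin-owned buffer."""
        raise NotImplementedError
```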

@trivialfis (Member, Author) commented Jun 13, 2024

Many existing federated tests have to be disabled with the plugin:

  • GPU-related tests
  • multi-class tests
  • column sampling and constraint tests
  • quantile-based objectives

Do we really want to merge it at this stage?

cc @rongou @ZiyueXu77 @nvidianz

@rongou (Contributor) commented Jun 13, 2024

Sounds like a no? :)

@ZiyueXu77 commented

I like the idea of limiting the data split to horizontal and vertical, while controlling the encryption with the plugin (pass-by vs. encrypt).

Some notes on secure vertical vs. our existing, well-established vertical (@rongou's implementation):

  • The sync logic/stage is different: existing vertical lets passive parties compute to the end (local best splits) and then sync, while secure vertical lets passive parties compute only up to the G/H histograms.
  • Because of this, the feature-histogram setup on each party is different: in existing vertical, each party holds only its own features, while secure vertical pre-allocates and syncs the slots so that every party has a global view of the feature set.
  • This would be the root cause of the column sampling issues, etc. We should also examine the efficiency difference.
  • Of course, another thing is that we do not have GPU support for the secure schemes.

@trivialfis (Member, Author) commented

> We shall also examine the efficiency difference

This is indeed concerning. The current implementation in this PR should be significantly slower than the existing one: histogram building is the bottleneck in XGBoost, and the PR switches it from the existing parallel build to a single-threaded build. We can optimize it, of course, but it would be great if we could refactor the two paths to share more code instead of maintaining two different histogram-build code paths.

@trivialfis trivialfis marked this pull request as ready for review June 14, 2024 17:12
@trivialfis trivialfis changed the title [WIP] Federated learning plugin. Federated learning plugin. Jun 14, 2024
@trivialfis (Member, Author) commented

@rongou Please help review when you are available. We will be targeting a feature branch instead of the master branch for this PR.

@trivialfis (Member, Author) commented

Brought back the tests by @rongou. Here are the configurations:

  • Federated but not encrypted:

    ```python
    communicator_env = {
        "dmlc_communicator": "federated",
        "federated_server_address": f"localhost:{port}",
        "federated_world_size": world_size,
        "federated_rank": rank,
    }
    ```

  • Federated and encrypted:

    ```python
    communicator_env = {
        "dmlc_communicator": "federated",
        "federated_server_address": f"localhost:{port}",
        "federated_world_size": world_size,
        "federated_rank": rank,
        "federated_plugin": {
            "name": "nvflare",
            "path": "/tmp/libnvflare_proc.so",
        },
    }
    ```

  • Federated and encrypted with the mock plugin:

    ```python
    communicator_env = {
        "dmlc_communicator": "federated",
        "federated_server_address": f"localhost:{port}",
        "federated_world_size": world_size,
        "federated_rank": rank,
        "federated_plugin": {"name": "mock"},
    }
    ```

Each of them has a column-split and a row-split variant.

I think this is getting out of hand now. We need a better way to manage them.
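One possible way to rein this in (a sketch only, not part of this PR): derive every variant from a single helper and enumerate the plugin configurations in the test harness:

```python
def make_env(port, world_size, rank, plugin=None):
    """Build a communicator config; `plugin` is None for plain federated
    training, or one of the plugin dicts from the examples above."""
    env = {
        "dmlc_communicator": "federated",
        "federated_server_address": f"localhost:{port}",
        "federated_world_size": world_size,
        "federated_rank": rank,
    }
    if plugin is not None:
        env["federated_plugin"] = plugin
    return env


# The three variants above, times the row-/column-split dimension.
PLUGINS = [
    None,                                                    # not encrypted
    {"name": "nvflare", "path": "/tmp/libnvflare_proc.so"},  # encrypted
    {"name": "mock"},                                        # encrypted, mock
]
```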

@trivialfis trivialfis closed this Jun 18, 2024
@trivialfis trivialfis reopened this Jul 2, 2024
@trivialfis trivialfis changed the title Federated learning plugin. [WIP] Federated learning plugin. Jul 2, 2024
@trivialfis trivialfis marked this pull request as draft July 2, 2024 16:49
@trivialfis trivialfis closed this Jul 2, 2024