[RFC]: C API #134

Merged 3 commits on Oct 18, 2024
127 changes: 127 additions & 0 deletions rfcs/20240806-c-api/README.md
@@ -0,0 +1,127 @@
# C API Design Document (RFC)


## Introduction

The oneCCL communication library's current API is defined in the [oneAPI
specification][ccl-spec]. However, it differs from the APIs of similar
collective communication libraries, such as [NCCL][nccl-spec] from Nvidia,
[RCCL][rccl-spec] from AMD, and [HCCL][hccl-spec] from Habana. This RFC asks
for feedback on aligning the oneCCL API more closely with those vendor
libraries, since doing so facilitates integration with frameworks and
upstreaming to open source projects.

One difference between oneCCL and other vendors' communication libraries is
that the others expose a C API, while oneCCL exposes a C++ API. This is because
oneCCL was designed to integrate with SYCL, which is based on C++. One of the
goals of oneCCL is to support hardware from different vendors, such as Intel
Data Center GPU Max Series, Intel Core and Intel Xeon processor families, Intel
Gaudi, and Nvidia or AMD GPUs, among others.

[ccl-spec]: https://uxlfoundation.github.io/oneAPI-spec/spec/elements/oneCCL/source/index.html
[hccl-spec]: https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/C_API.html
[nccl-spec]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api.html
[rccl-spec]: https://rocm.docs.amd.com/projects/rccl/en/latest/api-reference/api-library.html#api-library

## Proposal

The proposal is to define a C-like API that aligns with current APIs in other
communication libraries, while introducing a few changes, as described next:

1. Most APIs are C-based, like those of other communication libraries. C++ data
structures, such as `ccl::stream` and `ccl::comm`, are hidden behind handles
returned to the user.

2. The API is extended to support different types of streams or queues:

- `onecclResult_t onecclCreateStreamXPU(onecclStream_t* oneccl_stream, void* args)`:
`args` is a pointer to the vendor-specific stream or queue.
- `onecclResult_t onecclStreamCreateCPU(onecclStream_t* oneccl_stream, void* args)`:
this function is specific to the CPU.

- `onecclResult_t onecclStreamDestroy(onecclStream_t oneccl_stream)`

Once the `sycl::queue` is registered, it is hidden behind the `onecclStream_t`
handle.

3. Add functions that let users explicitly control the lifetime of objects,
instead of relying on C++ destructors:

- `onecclResult_t onecclCommFinalize(comm)`
- `onecclResult_t onecclCommDestroy(comm)`

4. Drop support for out-of-order SYCL queues. The current oneCCL library
supports out-of-order SYCL queues, but users of the library do not use this
feature. In general, collective operations are submitted to an in-order
queue. When out-of-order behavior is required, commands are submitted to a
different in-order queue, and the two queues are synchronized.

5. Drop support for SYCL buffers. Only [Unified Shared Memory][usm-example] is
supported.

[usm-example]: https://www.intel.com/content/www/us/en/developer/articles/code-sample/dpcpp-usm-code-sample.html
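
The handle-based design in items 1 to 3 can be illustrated with a minimal C sketch. This is not the oneCCL implementation: the `demo` prefix, the result codes, and the struct layout are all hypothetical, chosen only to mirror the shape of the proposed `onecclStreamCreateCPU` and `onecclStreamDestroy` functions. The idea is that the public header exposes a pointer to an incomplete type, while the real object (for example a wrapper around a `sycl::queue`) stays private to the library.

```c
#include <stdlib.h>

/* Hypothetical result codes; the RFC does not list the real enumerators. */
typedef enum {
    demoSuccess = 0,
    demoInvalidArgument = 1,
    demoSystemError = 2
} demoResult_t;

/* Opaque handle: users only ever see a pointer to an incomplete type. */
typedef struct demoStream *demoStream_t;

/* In a real library this definition would live in a private source file
 * (possibly C++), not in the public header. */
struct demoStream {
    void *native_queue; /* vendor-specific stream or queue, not owned */
};

/* Wrap a vendor-specific queue pointer behind the opaque handle. */
demoResult_t demoStreamCreate(demoStream_t *stream, void *args) {
    if (stream == NULL || args == NULL) return demoInvalidArgument;
    struct demoStream *s = malloc(sizeof *s);
    if (s == NULL) return demoSystemError;
    s->native_queue = args;
    *stream = s;
    return demoSuccess;
}

/* Explicit destruction takes the place of a C++ destructor. */
demoResult_t demoStreamDestroy(demoStream_t stream) {
    if (stream == NULL) return demoInvalidArgument;
    free(stream);
    return demoSuccess;
}
```

The same pattern would apply to communicators, with a finalize call preceding the destroy call when outstanding operations need to complete first.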

### APIs

The tables below contain the NCCL API, the corresponding new proposed oneCCL
API, and the current oneCCL API.

#### APIs related to communicator creation

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclGetUniqueId(id)`|`onecclResult_t onecclGetUniqueId(id)`|`ccl::create_main_kvs(); ccl::create_kvs(main_addr);`|
|`ncclResult_t ncclCommInitRank(comm, size, id, rank)`|`onecclResult_t onecclCommInitRank(comm, size, id, rank)` (1)|`comm ccl::create_communicator(size, rank, device, context, kvs)` `comms ccl::create_communicators(size, rank, device, context, kvs)`|
|`ncclResult_t ncclCommInitRankConfig(comm, size, id, rank, attr)`|`onecclResult_t onecclCommInitRankConfig(comm, size, id, rank, attr)`|`comm ccl::create_communicator(size, rank, device, context, kvs, attr)`|
|`ncclResult_t ncclCommInitAll(comms, ndev, dev_list)`|`onecclResult_t onecclCommInitAll(comms, ndev, dev_list)`| Not currently available. Working on adding support. |
|`ncclCommSplit` | Not implemented | Not implemented |
|`ncclResult_t ncclCommFinalize(comm)`|`onecclResult_t onecclCommFinalize(comm)`| N/A |
|`ncclResult_t ncclCommDestroy(comm)`|`onecclResult_t onecclCommDestroy(comm)`| Destructor |

(1) This assumes that each rank is associated with a device, which has been set before calling this function (as with `ncclCommInitRank`).
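
The calling sequence implied by the table (create one unique id, share it, then have every rank initialize its communicator) can be sketched in plain C. Everything below is a hypothetical stand-in: the `demo` functions, the 16-byte id, and the struct contents are assumptions for illustration, and all ranks are simulated in a single process rather than exchanging the id over a real out-of-band channel.

```c
#include <stdlib.h>
#include <string.h>

typedef enum {
    demoSuccess = 0,
    demoInvalidArgument = 1,
    demoSystemError = 2
} demoResult_t;

/* Hypothetical unique id: an opaque blob created once and shared by all ranks. */
typedef struct { char bytes[16]; } demoUniqueId;

typedef struct demoComm *demoComm_t;
struct demoComm { int size; int rank; demoUniqueId id; };

demoResult_t demoGetUniqueId(demoUniqueId *id) {
    if (id == NULL) return demoInvalidArgument;
    memset(id->bytes, 0, sizeof id->bytes);
    memcpy(id->bytes, "demo-id", 7);
    return demoSuccess;
}

demoResult_t demoCommInitRank(demoComm_t *comm, int size, demoUniqueId id, int rank) {
    if (comm == NULL || size <= 0 || rank < 0 || rank >= size)
        return demoInvalidArgument;
    struct demoComm *c = malloc(sizeof *c);
    if (c == NULL) return demoSystemError;
    c->size = size;
    c->rank = rank;
    c->id = id;
    *comm = c;
    return demoSuccess;
}

demoResult_t demoCommUserRank(demoComm_t comm, int *rank) {
    if (comm == NULL || rank == NULL) return demoInvalidArgument;
    *rank = comm->rank;
    return demoSuccess;
}

demoResult_t demoCommDestroy(demoComm_t comm) {
    if (comm == NULL) return demoInvalidArgument;
    free(comm);
    return demoSuccess;
}

/* Simulate `size` ranks in one process. Returns how many communicators
 * were created and reported the expected rank. */
int demo_create_all(int size) {
    demoUniqueId id;
    if (demoGetUniqueId(&id) != demoSuccess) return 0; /* one rank only */
    int ok = 0;
    for (int rank = 0; rank < size; ++rank) {          /* every rank */
        demoComm_t comm = NULL;
        int reported = -1;
        if (demoCommInitRank(&comm, size, id, rank) != demoSuccess) continue;
        if (demoCommUserRank(comm, &reported) == demoSuccess && reported == rank)
            ++ok;
        demoCommDestroy(comm);
    }
    return ok;
}
```

In a real multi-process job, the rank that creates the id would typically distribute it through a mechanism outside the library (for example MPI or a key-value store) before the other ranks initialize.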

#### APIs related to collective communication operations

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclAllgather(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclAllgather(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::allgather(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)` (2)|
|`ncclResult_t ncclAllreduce(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclAllreduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::allreduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclBroadcast(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclBroadcast(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::broadcast(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)` (3)|
|`ncclResult_t ncclReduce(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclReduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::reduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclReduceScatter(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclReduceScatter(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::reduce_scatter(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)`|
| N/A |`onecclAlltoall`, `onecclAlltoallv`. We could deprecate these.|`communicator::alltoall`, `communicator::alltoallv`|
| N/A |`onecclBarrier`. We could deprecate this and use Allreduce with 1 byte.|`ccl::event communicator::barrier`|

(2) Currently, oneCCL contains Allgatherv, but it will be deprecated in the
future.

(3) The current API is slightly different, but the next oneCCL release will
align Broadcast with the one shown here.
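
For readers less familiar with the collectives listed in the table, the following single-process sketch shows what allreduce (with a sum operation) and allgather compute. It is a reference for the semantics only, under the assumption that each simulated rank owns one buffer in an array; the `ref_` function names are hypothetical and nothing here reflects how oneCCL or NCCL implement these operations.

```c
#include <stddef.h>

/* bufs[r] is the buffer owned by simulated rank r (count elements each).
 * After the call, every rank holds the element-wise sum over all ranks. */
void ref_allreduce_sum(float **bufs, int nranks, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        float sum = 0.0f;
        for (int r = 0; r < nranks; ++r) sum += bufs[r][i];
        for (int r = 0; r < nranks; ++r) bufs[r][i] = sum;
    }
}

/* send[r] holds count elements contributed by rank r.
 * recv[r] must hold nranks * count elements; after the call, every rank
 * holds the contributions of all ranks concatenated in rank order. */
void ref_allgather(float **send, float **recv, int nranks, size_t count) {
    for (int dst = 0; dst < nranks; ++dst)
        for (int src = 0; src < nranks; ++src)
            for (size_t i = 0; i < count; ++i)
                recv[dst][(size_t)src * count + i] = send[src][i];
}
```

With two ranks holding `{1, 2}` and `{10, 20}`, the allreduce leaves `{11, 22}` on both ranks.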

#### Group APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclGroupStart()`|`onecclResult_t onecclGroupStart()`| N/A |
|`ncclResult_t ncclGroupEnd()` |`onecclResult_t onecclGroupEnd()` | N/A |

#### Point-to-Point APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclSend(sendbuf, count, datatype, peer, comm, stream)`|`onecclResult_t onecclSend(sendbuf, count, datatype, peer, comm, oneccl_stream)`|`ccl::event communicator::send(sendbuf, count, datatype, peer, comm, oneccl_stream)`|
|`ncclResult_t ncclRecv(…)`|`onecclResult_t onecclRecv(…)`|`communicator::recv`|

#### Other APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclCommCount(comm, size)`|`onecclResult_t onecclCommCount(comm, size)`|`size communicator::size()`|
|`ncclResult_t ncclCommCuDevice(comm, device)`|`onecclResult_t onecclCommGetDevice(comm, device)`|`device communicator::get_device()`|
|`ncclResult_t ncclCommUserRank(comm, rank)`|`onecclResult_t onecclCommUserRank(comm, rank)`|`rank communicator::rank()`|
|`ncclResult_t ncclGetVersion(version)`|`onecclResult_t onecclGetVersion(version)`|`version ccl::get_library_version()`|
|`ncclCommAbort` | `onecclCommAbort` | N/A |
|`ncclCommGetAsyncError`| `onecclCommGetAsyncError` | N/A |
|`ncclGetLastError` | `onecclGetLastError` | N/A |
|`ncclGetErrorString`| `onecclGetErrorString` | N/A |
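
Because every function in a C API of this style returns a result code, callers usually wrap calls in a checking macro that turns the code into a message. The sketch below shows one way this could look. The `demo` names, the result codes, and the message strings are hypothetical; the RFC only names `onecclGetErrorString` and does not define its output.

```c
#include <stdio.h>

typedef enum {
    demoSuccess = 0,
    demoInvalidArgument = 1,
    demoSystemError = 2
} demoResult_t;

/* Hypothetical counterpart of an error-string lookup function. */
const char *demoGetErrorString(demoResult_t r) {
    switch (r) {
    case demoSuccess:         return "no error";
    case demoInvalidArgument: return "invalid argument";
    case demoSystemError:     return "system error";
    }
    return "unknown result code";
}

/* Evaluate the call once; on failure, report the location and message,
 * then return the result code from the enclosing function. */
#define DEMO_CHECK(call)                                           \
    do {                                                           \
        demoResult_t demo_check_result_ = (call);                  \
        if (demo_check_result_ != demoSuccess) {                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    demoGetErrorString(demo_check_result_));       \
            return demo_check_result_;                             \
        }                                                          \
    } while (0)

/* Stand-in for any library call that may fail. */
static demoResult_t demo_step(int ok) {
    return ok ? demoSuccess : demoInvalidArgument;
}

/* Runs two steps and stops at the first failure. */
demoResult_t demo_pipeline(int first_ok, int second_ok, int *steps_done) {
    *steps_done = 0;
    DEMO_CHECK(demo_step(first_ok));
    ++*steps_done;
    DEMO_CHECK(demo_step(second_ok));
    ++*steps_done;
    return demoSuccess;
}
```

The `do { ... } while (0)` wrapper lets the macro be used like a single statement, including inside an unbraced `if`.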