Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC]: C API #134

Merged
merged 3 commits into from
Oct 18, 2024
Merged

[RFC]: C API #134

merged 3 commits into from
Oct 18, 2024

Conversation

rscohn2
Copy link
Member

@rscohn2 rscohn2 commented Aug 6, 2024

I formatted Gengbin's C APi proposal as an RFC.

@rscohn2 rscohn2 added the RFC label Aug 6, 2024
Copy link

@garzaran garzaran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.


| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`cudaError_t` |`onecclResult_t cudaSetDevice(device)(1)`| N/A |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the device? We should be much more specific here. I think we don't have to cover this API. My proposal is to handle all device handles in ...Config APIs, so for example. For example passing users device to onecclCommInit would look like:

onecclCommConfig_t config = ONECCL_COMM_CONFIG_INITIALIZER;
config.sycl.queue = &queue;

onecclCommInitConfig(rank, size, ..., &config);

Similar pattern could be applied to onecclStreamCreate, so we could get rid of onecclCreateSyclStream which looks like a workaround rather than an elegant solution:

onecclStream_t stream;
onecclStreamConfig_t stream_config = ONECCL_STREAM_SYCL_CONFIG_INITIALIZER;
// or onecclStreamConfig_t config = ONECCL_STREAM_CONFIG_INITIALIZER;
config.sycl.queue = &queue;

onecclStreamCreateConfig(&stream, &config);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed to use integer index for the devices temporarily, before SYCL releases API for device selection similar to cudaSetDevice.

For the streams, I think that onecclCreateStreamXpu(&onecclStream_t stream_ptr, void* args) should be enough for all communications backed we would like to support.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@JackAKirk
Copy link

JackAKirk commented Sep 30, 2024

Have you considered this issue intel/llvm#15251
which is described more fully here: oneapi-src/unified-runtime#2077

Are there not similar implementation issues in the level_zero backend due to the implicit device setting nature of SYCL?
I looked into the l0 adapter implementation: I think it is possible that l0 has similar issues, depending on ipc usage: I think it possible that if intel/llvm#15251 could be built for PVC you will see similar issues

I do not see how implementing a wrapper function to sycl that matches the api of cudaSetDevice such as suggested here: https://github.com/intel/llvm/
pull/15382
can solve such issues.

I think that the only current solution is to rely on the assumption that no more than one gpu device will be used per MPI rank and that environment variables such as CUDA_VISIBLE_DEVICES/ intel vendor equivalent, or ONEAPI_DEVICE_SECTOR are used: as detailed in intel/llvm#15251

@garzaran
Copy link

garzaran commented Oct 4, 2024

Have you considered this issue intel/llvm#15251 which is described more fully here: oneapi-src/unified-runtime#2077

Are there not similar implementation issues in the level_zero backend due to the implicit device setting nature of SYCL? I looked into the l0 adapter implementation: I think it is possible that l0 has similar issues, depending on ipc usage: I think it possible that if intel/llvm#15251 could be built for PVC you will see similar issues

I do not see how implementing a wrapper function to sycl that matches the api of cudaSetDevice such as suggested here: https://github.com/intel/llvm/ pull/15382 can solve such issues.

I think that the only current solution is to rely on the assumption that no more than one gpu device will be used per MPI rank and that environment variables such as CUDA_VISIBLE_DEVICES/ intel vendor equivalent, or ONEAPI_DEVICE_SECTOR are used: as detailed in intel/llvm#15251

We do need to support the case where one MPI rank will open more than one GPU device. It is hard to understand all the details in the ticket you refer, but it appears more like a memory leak, which I assume can/has to be fixed. In any case, in general a MPI rank can open all the GPU devices.
Take a look to this example from CUDA: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-3-multiple-devices-per-thread. Tensorflow uses/needs such a use-case where a single process opens all the devices.

Comment on lines 41 to 43
- `onecclResult_t onecclReleaseStream(oneccl_stream)`

`onecclResult_t onecclStreamDestroy(onecclStream_t oneccl_stream)`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So onecclStreamDestroy is the final version right?


`onecclResult_t onecclStreamDestroy(onecclStream_t oneccl_stream)`

Once the sycl::queue is registered, it is hidden behind the `cclStream_t`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cclStream_t -> onecclStream_t

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a typo. @zhenggb72, can you update this and we can approve and merge?

@JackAKirk
Copy link

Have you considered this issue intel/llvm#15251 which is described more fully here: oneapi-src/unified-runtime#2077
Are there not similar implementation issues in the level_zero backend due to the implicit device setting nature of SYCL? I looked into the l0 adapter implementation: I think it is possible that l0 has similar issues, depending on ipc usage: I think it possible that if intel/llvm#15251 could be built for PVC you will see similar issues
I do not see how implementing a wrapper function to sycl that matches the api of cudaSetDevice such as suggested here: https://github.com/intel/llvm/ pull/15382 can solve such issues.
I think that the only current solution is to rely on the assumption that no more than one gpu device will be used per MPI rank and that environment variables such as CUDA_VISIBLE_DEVICES/ intel vendor equivalent, or ONEAPI_DEVICE_SECTOR are used: as detailed in intel/llvm#15251

We do need to support the case where one MPI rank will open more than one GPU device. It is hard to understand all the details in the ticket you refer, but it appears more like a memory leak, which I assume can/has to be fixed. In any case, in general a MPI rank can open all the GPU devices. Take a look to this example from CUDA: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-3-multiple-devices-per-thread. Tensorflow uses/needs such a use-case where a single process opens all the devices.

Thanks @garzaran that link was useful. I've opened a OPENMPI issue with a cuda reproducer that represents what happens in DPC++ here: open-mpi/ompi#12848

@garzaran
Copy link

Looks good

@rscohn2 rscohn2 merged commit 7e4ff57 into uxlfoundation:rfcs Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants