> Commit summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management. Reviewed by: pietern. Differential Revision: D4633120.
# NVIDIA GPU support

Gloo includes several collective algorithm implementations that work directly with NVIDIA GPU buffers. These take advantage of overlapping host and GPU operations to decrease overall latency.

GPU-aware algorithms require CUDA 7 or newer for various CUDA and NCCL features.
## Serializing GPU device operations

Gloo leverages CUDA streams to sequence operations on a single GPU device without blocking other concurrent activity. Before calling any of the gloo collective functions that operate on GPU buffers, the calling code should either:

* Ensure the GPU buffer inputs are synchronized and valid, or
* Pass the associated `cudaStream_t`(s) to the gloo collective function so that it can serialize its usage of the inputs.

If no `cudaStream_t`(s) are passed to the gloo collective function, GPU buffer outputs are valid when the function returns. Otherwise, the calling code must synchronize with the streams before using the GPU buffer outputs, e.g., explicitly with `cudaStreamSynchronize()` or by inserting dependent operations into the stream.

See the CUDA documentation for additional information about using streams.
```cpp
void broadcastZeros(
    std::shared_ptr<::gloo::Context>& context,
    int rank,
    float* devicePtr,
    size_t count) {
  // Allocate a stream to serialize GPU device operations
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Zero the local GPU device buffer asynchronously.
  // Note: cudaMemsetAsync takes a byte count, not an element count.
  cudaMemsetAsync(devicePtr, 0, count * sizeof(float), stream);

  // Broadcast the buffer to participating machines
  gloo::CudaBroadcastOneToAll<float> broadcast(
      context, devicePtr, count, rank, stream);
  broadcast.run();

  // Wait until the broadcast is complete
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
}
```
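When a stream is passed, the caller can also avoid blocking in `cudaStreamSynchronize()` by enqueueing dependent operations on the same stream, as noted above. A sketch of that alternative, reusing the `context`, `rank`, `devicePtr`, `count`, and `stream` from the example and assuming a hypothetical pinned host buffer `hostPtr` (illustrative only, since it requires a CUDA device):

```cpp
// Broadcast as before, passing the stream so gloo serializes its work on it.
gloo::CudaBroadcastOneToAll<float> broadcast(
    context, devicePtr, count, rank, stream);
broadcast.run();

// Instead of calling cudaStreamSynchronize(stream), enqueue the consumer
// of the output on the same stream; stream ordering guarantees it runs
// only after the broadcast work enqueued on `stream` has finished.
cudaMemcpyAsync(
    hostPtr, devicePtr, count * sizeof(float),
    cudaMemcpyDeviceToHost, stream);
```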
## Synchronizing GPU memory allocation

Overlapping calls to `cudaMalloc()` or `cudaFree()` may result in deadlock. Gloo and any calling code must coordinate memory allocations. Calling code should:

1. Pass a shared `std::mutex` into `gloo::CudaShared::setMutex()` before calling any other gloo functions.
2. Always acquire the mutex before calling CUDA memory allocation functions.

```cpp
// Define a mutex to synchronize calls to cudaMalloc/cudaFree
std::mutex m;

// Share the mutex with gloo; call this before any other gloo functions,
// e.g. early in main()
void initGlooCudaMutex() {
  gloo::CudaShared::setMutex(&m);
}

// Always call cudaMalloc/cudaFree while holding the mutex
void* allocateCudaMemory(size_t bytes) {
  std::lock_guard<std::mutex> lock(m);
  void* ptr;
  cudaMalloc(&ptr, bytes);
  return ptr;
}
```