Skip to content

Commit

Permalink
CUDA documentation
Browse files Browse the repository at this point in the history
Summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management.

Reviewed By: pietern

Differential Revision: D4633120

fbshipit-source-id: d88e230c8dc82fe48cda0f401b61758fa4f07f2e
  • Loading branch information
andrewwdye authored and Pieter Noordhuis committed Mar 2, 2017
1 parent 50e73a8 commit 0c88194
Showing 1 changed file with 59 additions and 0 deletions.
59 changes: 59 additions & 0 deletions docs/cuda.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
# NVIDIA GPU support
Gloo includes several collective algorithm implementations that work directly with NVIDIA GPU buffers. These take advantage of overlapping host and GPU operations to decrease overall latency.

GPU-aware algorithms require CUDA 7 or newer for various CUDA and NCCL features.

## Serializing GPU device operations
Gloo leverages CUDA streams to sequence operations on a single GPU device without blocking other concurrent activity. Before calling any of the gloo collective functions that operate on GPU buffers, the calling code should
* Ensure the GPU buffer inputs are synchronized and valid, or
* Pass the associated `cudaStream_t`(s) to the gloo collective function so that it can serialize its usage of the inputs.

If no `cudaStream_t`(s) are passed to the gloo collective function, GPU buffer outputs are valid when the gloo collective function returns. Otherwise, the calling code must synchronize with the streams before using the GPU buffer outputs, i.e., explicitly with `cudaStreamSynchronize()` or inserting dependent operations in the stream.

See CUDA documentation for additional information about using streams.

```cpp
void broadcastZeros(
std::shared_ptr<::gloo::Context>& context,
int rank,
float* devicePtr,
size_t count) {
// Allocate a stream to serialize GPU device operations
cudaStream_t stream;
cudaStreamCreate(&stream);

// Zero local GPU device buffer asynchronously
cudaMemsetAsync(devicePtr, 0, count, stream);

// Broadcast the buffer to participating machines
gloo::CudaBroadcastOneToAll<float> broadcast(
context, devicePtr, count, rank, stream);
broadcast.run();

// Wait until the broadcast is complete
cudaStreamSynchronize(stream);

cudaStreamDestroy(stream);
}
```
## Synchronizing GPU memory allocation
Overlapping calls to `cudaMalloc()` or `cudaFree()` may result in deadlock. Gloo and any calling code must coordinate memory allocations. Calling code should
1. Pass a shared `std::mutex` into `gloo::CudaShared::setMutex()` before calling any other gloo functions.
2. Always acquire the mutex before calling CUDA memory allocation functions.
```cpp
// Define a mutex to synchronize calls to cudaMalloc/cudaFree
std::mutex m;
// Share the mutex with gloo
gloo::CudaShared::setMutex(&m);
// Always call cudaMalloc/cudaFree while holding the mutex
void* allocateCudaMemory(size_t bytes) {
std::lock_guard<std::mutex> lock(m);
void* ptr;
cudaMalloc(&ptr, bytes);
return ptr;
}
```

0 comments on commit 0c88194

Please sign in to comment.