Pipelining: Device to Device


Before v1.2.0, only separable kernels could be pipelined concurrently across multiple GPUs, so projects built around non-separable kernels could not benefit from that feature. Kernels that rely on hardware-dependent operations such as atomic functions, which may touch any global memory address at any time, cannot be partitioned across devices.
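For example, a histogram-style kernel like the one below (a hypothetical illustration, not part of the API) cannot have its global range split between two GPUs: any work-item may update any bin, and atomic updates are only coherent within a single device's global memory, so each device would end up with only a partial result.

__kernel void histogram(__global int * data, volatile __global int * bins)
{
    int id = get_global_id(0);
    // any work-item may increment any bin; correctness requires all
    // work-items to share one device's global memory
    atomic_inc(&bins[data[id]]);
}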

To sidestep the partitioning problem for non-partitionable kernels, version 1.2.0 of the Cekirdekler API adds a "device to device" pipelining feature. It binds a different device to each stage of a multi-stage pipeline and runs all stages concurrently. Stage concurrency is combined with double buffering, so inter-stage data movement latency is hidden behind the compute time of the stages (provided the compute time is greater). For example, if each stage computes for 10 ms and an inter-stage copy takes 4 ms, steady-state throughput is one result every 10 ms rather than every 14 ms.

This is how it works:

A pipeline is built from stages, so the stages must be created first. A pipeline stage is defined by its kernels, its input and output buffers (plus hidden buffers for sequential logic), and a device.

A pipeline stage that adds 1 to each array element:

// device that will run this stage
Hardware.ClDevices gpu1 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[0];
Pipeline.ClPipelineStage add = new Pipeline.ClPipelineStage();
add.addDevices(gpu1);

// input and output buffers of this stage
ClArray<float> inputArrayGPU1 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU1 = new ClArray<float>(1024);
add.addInputBuffers(inputArrayGPU1);
add.addOutputBuffers(outputArrayGPU1);

// kernel source, entry-point name, global range and local range
add.addKernels(@"
__kernel void vecAdd(__global float * input, __global float * output)
{
    int id = get_global_id(0);
    output[id] = input[id] + 1.0f;
}", "vecAdd", new int[] { 1024 /* global range */ }, new int[] { 256 /* local range */ });

A pipeline stage that multiplies each array element by 5:

// device that will run this stage (a second GPU)
Hardware.ClDevices gpu2 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[1];
Pipeline.ClPipelineStage mul = new Pipeline.ClPipelineStage();
mul.addDevices(gpu2);

// input and output buffers of this stage
ClArray<float> inputArrayGPU2 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU2 = new ClArray<float>(1024);
mul.addInputBuffers(inputArrayGPU2);
mul.addOutputBuffers(outputArrayGPU2);

// kernel source, entry-point name, global range and local range
mul.addKernels(@"
__kernel void vecMul(__global float * input, __global float * output)
{
    int id = get_global_id(0);
    output[id] = input[id] * 5.0f;
}", "vecMul", new int[] { 1024 /* global range */ }, new int[] { 256 /* local range */ });