
Pipelining: Device to Device

Hüseyin Tuğrul BÜYÜKIŞIK edited this page May 10, 2017 · 32 revisions

For versions before v1.2.0, only separable kernels could be pipelined concurrently across multiple GPUs, so projects built around multiple non-separable kernels could not benefit from that feature. In general, hardware-dependent in-kernel commands such as atomic functions, which may access any global memory address at any time, make partitioning impossible.

To bypass the difficulty of partitioning multiple non-partitionable kernels, version 1.2.0 of the Cekirdekler API adds a "device to device" pipelining feature. This technique binds a different device to each stage of a multi-stage pipeline and runs all stages concurrently. Stage concurrency is further expanded with double buffering, which hides the latency of inter-stage data movement behind the compute time of the stages (when the compute time is greater).
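As a rough sketch of the latency-hiding effect (illustrative numbers only, not measurements from the library): with double buffering, a stage's effective time per iteration is the maximum of its compute time and its inter-stage transfer time, instead of their sum.

```csharp
using System;

class LatencyHidingSketch
{
    static void Main()
    {
        // Hypothetical per-iteration timings for one pipeline stage.
        double computeMs = 5.0;   // kernel compute time
        double transferMs = 3.0;  // inter-stage copy time

        // Without double buffering, transfer and compute serialize (sum).
        double serialized = computeMs + transferMs;           // 8.0 ms

        // With double buffering, they overlap (max): the transfer is
        // hidden because compute time is greater.
        double overlapped = Math.Max(computeMs, transferMs);  // 5.0 ms

        Console.WriteLine($"serialized={serialized} ms, overlapped={overlapped} ms");
    }
}
```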

This is how it works:

Creating a pipeline requires the stages to be created first. A pipeline stage is created from kernels, input/output (and hidden, for sequential logic) buffers, and a device.

Pipeline stage for addition of "1" to array elements:

Hardware.ClDevices gpu1 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[0];
Pipeline.ClPipelineStage add = new Pipeline.ClPipelineStage();
add.addDevices(gpu1);
ClArray<float> inputArrayGPU1 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU1 = new ClArray<float>(1024);
add.addInputBuffers(inputArrayGPU1);
add.addOutputBuffers(outputArrayGPU1);
add.addKernels(@"
__kernel void vecAdd(__global float * input, __global float * output)
{ 
    int id=get_global_id(0); 
    output[id]=input[id]+1.0f; 
}", "vecAdd", new int[] { 1024/*global range*/ }, new int[] { 256/* local range*/ }); 

Pipeline stage for multiplication of array elements by 5:

Hardware.ClDevices gpu2 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[1];
Pipeline.ClPipelineStage mul = new Pipeline.ClPipelineStage();
mul.addDevices(gpu2);
ClArray<float> inputArrayGPU2 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU2 = new ClArray<float>(1024);
mul.addInputBuffers(inputArrayGPU2);
mul.addOutputBuffers(outputArrayGPU2);
mul.addKernels(@"
__kernel void vecMul(__global float * input, __global float * output)
{ 
    int id=get_global_id(0); 
    output[id]=input[id]*5.0f; 
}", "vecMul", new int[] { 1024/*global range*/ }, new int[] { 256/* local range*/ }); 

Building a pipeline from the add and mul stages:

add.prependToStage(mul); // prepends add stage to mul stage
var pipeline = add.makePipeline(); // creates a 2-stage pipeline out of add and mul.

Pushing data into the pipeline actually runs it:

pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult });

The pushed data travels one stage further through the pipeline at each pushData() call until it arrives at the target (result) array.

The pushData() method returns a boolean indicating whether a result is ready. If true, the first pushed input has been processed by all pipeline stages and has already been copied to the output array (or any result array given by client code), so it can be used directly:

if(pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("First result is ready!");
    // use the result here; before this point the output holds garbage for the first N iterations
}
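Putting this together, a typical driver loop might look like the following (a sketch assuming the add and mul stages built above; the loop bound and the batch-filling step are placeholders, and results lag the inputs by the pipeline depth):

```csharp
// Client-side source and target arrays, matching the first stage's input
// and the last stage's output in type and length (float, 1024 elements).
float[] arrayToGetData = new float[1024];
float[] arrayToReceiveResult = new float[1024];

for (int i = 0; i < 100; i++)
{
    // ... fill arrayToGetData with the i-th batch here ...

    bool resultReady = pipeline.pushData(new object[] { arrayToGetData },
                                         new object[] { arrayToReceiveResult });
    if (resultReady)
    {
        // arrayToReceiveResult now holds the result of a batch pushed
        // several iterations earlier (one pipeline depth ago).
    }
}
```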

arrayToGetData: given inside an array of objects; it must have exactly the same length and type as the input array of the first stage. If the first stage has multiple input arrays, the object array must contain a matching array for each of them.

arrayToReceiveResult: a target for copying results from the last stage's output arrays. As with the inputs, the output arrays must have the same type and length as this receiver array.

To get results sooner (throughput is unchanged), data need not be pushed and popped; instead, the input/output arrays of the first (and last) stages can be accessed directly. Omitting a pushData() parameter makes the pipeline skip flipping the double buffers at that end of the pipeline. This way, a result can be obtained up to 2 iterations earlier.

Examples:

M is the number of stages in the pipeline.

if(pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("Extra client arrays for both inputs and outputs of pipeline");
    Console.WriteLine("Needs M*2+1 iterations");  
    Console.WriteLine("arrayToGetData  and arrayToReceiveResult are used");  
}

if(pipeline.pushData(null, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("Extra client arrays for only outputs of pipeline");
    Console.WriteLine("Needs M*2 iterations"); 
    Console.WriteLine("input arrays of first stage are directly accessed");   
}

if(pipeline.pushData(new object[] { arrayToGetData }))
{
    Console.WriteLine("Extra client arrays for only inputs of pipeline");
    Console.WriteLine("Needs M*2 iterations");  
    Console.WriteLine("output arrays of last stage are directly accessed");
}

if(pipeline.pushData())
{
    Console.WriteLine("No extra client arrays for inputs or outputs of pipeline");
    Console.WriteLine("Needs M*2-1 iterations"); 
    Console.WriteLine("input arrays of first stage and output arrays of last stage are directly accessed"); 
}
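The fully direct-access mode can be sketched as follows (an assumption-laden sketch: it presumes ClArray elements can be read and written by index, and reuses inputArrayGPU1 and outputArrayGPU2 as the first stage's input and last stage's output from the examples above):

```csharp
for (int i = 0; i < 100; i++)
{
    // Write the next batch directly into the first stage's input array,
    // instead of passing a client array to pushData.
    for (int j = 0; j < 1024; j++)
        inputArrayGPU1[j] = j + i;

    // No client arrays given: the double buffers at both ends are not
    // flipped, so the first result arrives after M*2-1 iterations.
    if (pipeline.pushData())
    {
        // Read results directly from the last stage's output array.
        float firstElement = outputArrayGPU2[0];
    }
}
```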

Note: only horizontal (linear) pipelines are supported for now. Future versions will include multi-layered stages that run concurrently with each other.


Rules revisited:

  • Input arrays given as pushData parameters must match the first stage's input arrays in type, number, and order.
  • Output arrays given as pushData parameters must match the last stage's output arrays in type, number, and order.
  • If a pushData parameter is given a null value, the result for an input is computed 1 iteration sooner, but the corresponding input or output must then be read/written by direct access to the stage's arrays.
  • The "numberOfArrayElementsPerWorkItem" field of each input/output array per stage matters for copying the whole array to/from a device. A byte array with this field set to 12 is a proper input for a float3 array (4 bytes per float × 3 floats = 12 bytes per float3 element).
  • Inputs, hidden arrays (for sequential logic), and outputs must be added in the same order as the kernel function's arguments.
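To illustrate the element-width rule (a sketch; the field name "numberOfArrayElementsPerWorkItem" comes from the rule above, but the exact usage shown here is an assumption):

```csharp
// Hypothetical usage: a byte-typed ClArray backing float3 kernel elements.
// Each float3 element occupies 4 bytes per float * 3 floats = 12 bytes,
// so each work item must see 12 byte-elements of this array.
ClArray<byte> float3Bytes = new ClArray<byte>(1024 * 12); // 1024 float3 values
float3Bytes.numberOfArrayElementsPerWorkItem = 12;
```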