Device Pool and Task Pool

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Jun 13, 2017 · 20 revisions

Greedy Scheduling Of Kernels To Devices

v1.3.1: Task, TaskPool and DevicePool features were added.

Task

Not to be confused with an OpenCL task: this is just another way of executing kernels. Instead of computing a kernel directly, as in

data00.nextParam(data01).compute(...);

the operation can be saved for later:

ClTask task = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);

task() takes the same parameters as compute(), except for the number cruncher parameter. The ClNumberCruncher instance is needed only when the task is computed:

task.compute(numberCruncher);
task.compute(numberCruncher); // can be repeated, does not destroy itself

Here, a ClTask instance is a snapshot of the parameter and kernel configuration; it stays frozen even if the parameter fields are changed later. Multiple different tasks can therefore be created from the same parameters, with some fields changed each time. Choosing a different compute-id for tasks that share a kernel name makes each of them use a different kernel instance in the background, which improves many-kernel scheduling performance by reducing the number of clSetKernelArg() calls.

ClTask task = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);
data01.read=false;
ClTask taskNoRead = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);
data00.write=false;
ClTask taskNoReadWrite = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);

task.compute(numberCruncher);            // both arrays are read and written
taskNoRead.compute(numberCruncher);      // only the first array is read
taskNoReadWrite.compute(numberCruncher); // only the first array is read, only the second array is written
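
The "frozen" behaviour can be illustrated without the library. The following is only a minimal sketch: BufferSketch and TaskSketch are hypothetical stand-ins (not Cekirdekler types), and it assumes the task simply copies the read/write flags when it is created. How ClTask actually stores its configuration is not shown in this page.

```csharp
using System;

// Hypothetical stand-in for an array parameter; not a Cekirdekler type.
class BufferSketch
{
    public bool Read = true;   // copy host -> device before the kernel runs
    public bool Write = true;  // copy device -> host after the kernel runs
}

// Hypothetical stand-in for a task; copies the flags once, at creation.
class TaskSketch
{
    private readonly bool[] read;
    private readonly bool[] write;

    public TaskSketch(params BufferSketch[] buffers)
    {
        // Snapshot: later changes to the buffers do not affect this task.
        read = Array.ConvertAll(buffers, b => b.Read);
        write = Array.ConvertAll(buffers, b => b.Write);
    }

    // Returns one "argN:RW" entry per parameter, '-' meaning "flag is off".
    public string Describe()
    {
        var parts = new string[read.Length];
        for (int i = 0; i < read.Length; i++)
            parts[i] = "arg" + i + ":" + (read[i] ? "R" : "-") + (write[i] ? "W" : "-");
        return string.Join(" ", parts);
    }
}

static class FrozenTaskDemo
{
    // Mirrors the three-task example above with the stand-in types.
    public static string[] Run()
    {
        var a = new BufferSketch();
        var b = new BufferSketch();
        var task = new TaskSketch(a, b);
        b.Read = false;
        var taskNoRead = new TaskSketch(a, b);
        a.Write = false;
        var taskNoReadWrite = new TaskSketch(a, b);
        return new[] { task.Describe(), taskNoRead.Describe(), taskNoReadWrite.Describe() };
    }

    static void Main()
    {
        foreach (string line in Run())
            Console.WriteLine(line);
    }
}
```

Each stand-in task keeps the flags it saw at creation, so the first task still reports both arrays as read and written even after the flags were switched off for the later tasks.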

TaskPool and Device Pool

A ClTaskPool is built from ClTask instances. It groups tasks so that client code can easily repeat all of them, and it can be reset at any time to return to the first task to execute. A ClDevicePool instance also uses it to compute many non-separable kernels with a greedy work scheduler across multiple GPUs. The task pool acts as the source for the "producer" and the devices act as "consumers", so all kernels finish sooner than they would on a single GPU. The fine-grained control logic relies on the callback feature of OpenCL 1.2.

Because the single-producer, multiple-consumer algorithm is multi-threaded, a kernel may be issued by any device in the pool. Neither the order of kernel execution nor the order of device execution is guaranteed. A device can start computing as soon as it becomes available, and a task can start being computed as soon as it is fed to the pool; all of this can happen concurrently.
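
The greedy idea itself can be sketched in plain C#, independent of OpenCL. This is not the ClDevicePool implementation: GreedyPoolSketch is a hypothetical name, the "devices" are ordinary threads, kernel time is simulated with Thread.Sleep, and the queue is filled before the workers start instead of being fed concurrently.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class GreedyPoolSketch
{
    // Runs taskCount dummy work items on deviceCount worker threads.
    // Returns, for each work item, the index of the worker that executed it.
    public static int[] Run(int taskCount, int deviceCount)
    {
        // The shared queue plays the role of the task pool.
        var queue = new ConcurrentQueue<int>(Enumerable.Range(0, taskCount));
        var executedBy = new int[taskCount];

        Task[] workers = Enumerable.Range(0, deviceCount).Select(device =>
            Task.Factory.StartNew(() =>
            {
                // Greedy: whichever worker is idle takes the next task.
                while (queue.TryDequeue(out int taskId))
                {
                    // Stand-in for kernel time; higher indices act as slower devices.
                    Thread.Sleep(1 + device);
                    executedBy[taskId] = device; // each slot is written by one worker only
                }
            }, TaskCreationOptions.LongRunning)).ToArray();

        // Blocks until every worker has drained the queue, like a host-side finish().
        Task.WaitAll(workers);
        return executedBy;
    }

    static void Main()
    {
        int[] owner = Run(30, 2);
        for (int d = 0; d < 2; d++)
            Console.WriteLine("device " + d + " ran " + owner.Count(o => o == d) + " tasks");
    }
}
```

Which worker runs which task differs from run to run, which matches the lack of ordering guarantees described above. A faster worker tends to take more tasks, but the exact split is not deterministic.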

Preparing task pool:

int testSize = 1024*8; // size of each kernel
int testTiles = 30;   // number of different kernels, arrays (and tasks), takes 30-40 seconds to compile 30 kernels
            
ClArray<float>[] data00 = new ClArray<float>[testTiles];
ClArray<float>[] data01 = new ClArray<float>[testTiles];
for (int i = 0; i < testTiles; i++)
{
    data00[i] = new ClArray<float>(testSize); data00[i].write = false; data00[i].read = false;
    data01[i] = new ClArray<float>(testSize); data01[i].write = false; data01[i].read = false;
}


ClTaskPool taskPool = new ClTaskPool();
for (int i = 0; i < testTiles; i++)
{
    ClTask task = data00[i].nextParam(data01[i]).task(1, "test"+i.ToString(), testSize,64);
    taskPool.feed(task);
}
taskPool.reset();

Preparing device pool:

string copy = @"
   __kernel void test@@v(__global float *data,__global float *data2)
   {
        int id=get_global_id(0); 
        float f=2.0f+data2[id];
        for(int i=0;i<350;i++)
            f=sin(cos(f));
        data[id]+=f;
   } 
";

// "multiple kernel instances" feature has not been added yet
// so to produce multiple kernels doing same algorithm:
StringBuilder sbRes = new StringBuilder(@"");
for(int i=0;i< testTiles;i++)
{
     sbRes.AppendLine(copy.Replace("@@v",i.ToString()));
}

// only the compute-at-will mode is supported for now; true = fine-grained mode, which is also supported
ClDevicePool devicePool = new ClDevicePool(ClDevicePoolType.DEVICE_COMPUTE_AT_WILL, sbRes.ToString(), true, false);
devicePool.addDevice(ClPlatforms.all().gpus()[0]);
devicePool.addDevice(ClPlatforms.all().gpus()[1]);

Binding task pool to device pool:

devicePool.enqueueTaskPool(taskPool); // deep-clones the task pool instance, so multiple copies can be added
taskPool.feed(newTask);               // newTask: any additional ClTask, created like the tasks above
devicePool.enqueueTaskPool(taskPool); // this copy has 1 more task

Synchronizing on host:

devicePool.finish();