Skip to content

Device Pool and Task Pool

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Jun 15, 2017 · 20 revisions

Greedy Scheduling Of Kernels To Devices

v1.3.1: Task(not a future but can compute in a future time), TaskPool(producer) and DevicePool(consumers) features were added.

Task

Not to be confused with opencl-task, this is just another way of executing kernels. Instead of computing a kernel directly as

data00.nextParam(data01).compute(...);

operation can be saved for later as

ClTask task = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);

with same parameters as compute() except number cruncher parameter. ClNumberCruncher instance is used only when computing a task as

task.compute(numberCruncher);
task.compute(numberCruncher); // can be repeated, does not destroy itself

here, ClTask instance is an instance of parameter+kernel config frozen even if parameter fields are changed later. So multiple different tasks can be created from same parameters but with some of fields changed each time. Choosing different compute-id for each same kernel name, uses a different kernel instance in background, to increase many-kernel scheduling performance, with help of less clSetKernelArg() calls.

ClTask task = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);
data01.read=false;
ClTask taskNoRead = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);
data00.write=false;
ClTask taskNoReadWrite = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);

task.compute(numCruncher);           // read+write both arrays
taskNoRead.compute(numCruncher);     // read only first array
taskNoReadWrite.compute(numCruncher);// write only second array and read only first array

TaskPool and Device Pool

From ClTask instances, a ClTaskPool is built. ClTaskPool is used for grouping ClTask instances to be able to repeat all of them easily from client code and reset anytime it is needed to revert back to original position of task instance to execute. It is also used by ClDevicePool instance for computing many non-separable kernels with a greedy work scheduler between multiple GPUs. Task pool acts as a source for "producer", devices act as a "consumer", to finish all kernels quicker than a single GPU, using fine grained control logic supported by OpenCL 1.2 's callback feature.

Due to single-producer+multiple-consumer algorithm's multi-thread nature, a kernel may be issued by any device in the pool. There is no guaranteed order of kernel execution nor order of device execution. As soon as a device becomes available, it can start computing; as soon as a task is fed to pool, it can start being computed, all multi-threadedly possible.

Preparing task pool:

int testSize = 1024*8; // size of each kernel
int testTiles = 30;   // number of different kernels, arrays (and tasks), takes 30-40 seconds to compile 30 kernels
            
ClArray<float> []data00 = new ClArray<float>[testTiles];
ClArray<float>[] data01 = new ClArray<float>[testTiles]; 
for (int i = 0; i < testTiles; i++)
{
      data00[i] = new ClArray<float>(testSize); data00[i].write = false; data00[i].read = false;
      data01[i] = new ClArray<float>(testSize); data01[i].write = false; data01[i].read = false;
}


ClTaskPool taskPool = new ClTaskPool();
for (int i = 0; i < testTiles; i++)
{
    ClTask task = data00[i].nextParam(data01[i]).task(1, "test"+i.ToString(), testSize,64);
    taskPool.feed(task);
}
taskPool.reset();

preparing device pool:

string copy = @"
   __kernel void test@@v(__global float *data,__global float *data2)
   {
        int id=get_global_id(0); 
        float f=2.0f+data2[id];
        for(int i=0;i<350;i++)
            f=sin(cos(f));
        data[id]+=f;
   } 
";

// "multiple kernel instances" feature has not been added yet
// so to produce multiple kernels doing same algorithm:
StringBuilder sbRes = new StringBuilder(@"");
for(int i=0;i< testTiles;i++)
{
     sbRes.AppendLine(copy.Replace("@@v",i.ToString()));
}

// only compute-at-will mode is supported for now (and true = fine grained mode is also supported)
ClDevicePool devicePool = new ClDevicePool(ClDevicePoolType.DEVICE_COMPUTE_AT_WILL  , sbRes.ToString(), true,false);
devicePool.addDevice(ClPlatforms.all().gpus()[0]);
devicePool.addDevice(ClPlatforms.all().gpus()[1]);

binding task pool to device pool:

devicePool.enqueueTaskPool(taskPool); // deep clones taskpool instance so multiple copies can be added
taskPool.feed(newTask);
devicePool.enqueueTaskPool(taskPool); // has 1 more task

synchronizing on host:

devicePool.finish();

Multiple queue option is not working for now. This feature is being developed. All tasks(especially duplicaated kernel names) should have a different compute-id for future multi-queue versions of device pool, because different compute id creates a new kernel instance even for same kernel name, to be able to use different sets of parameters at the same time on multiple queues in same device.


Details

Device pool uses multiple command queues per device if fine grained control parameter is true(activated) in its constructor.

Not using fine grained control makes each task to be synchronized and run inside single command queue per device always.

Multi queue parameter is not working yet.

Device Pool Type:

  • Device pool works with only DEVICE_COMPUTE_AT_WILL option which greedily chooses idle or (less idler) GPUs to run

Task Type:

  • TASK_MESSAGE_DEVICE_SELECT_BEGIN and TASK_MESSAGE_DEVICE_SELECT_END bitfields define a task is beginning or ending of a "single-same-device runs this series of tasks" mode, these definition tasks are inclusive to this single device mode range. While a chosen device runs these series of tasks, other devices are issued other tasks or task ranges same as this range. These bitfields are useful for running tasks that use same initialized buffer in the same device. Order of execution of tasks are not ensured. They run on different queues if device pool constructor's fineGrainedQueueControlParameter parameter is true.

  • TASK_MESSAGE_GLOBAL_SYNCHRONIZATION_FIRST and TASK_MESSAGE_GLOBAL_SYNCHRONIZATION_LAST bitfields define a synchronization point for all devices. FIRST word means sync is taken first, then task is run. LAST word means, task is run first, then sync is applied to all devices. When FIRST is set on the task-20, all tasks such as 18,19 are ensured to be completed and all tasks 20,21,... are ensured to be started after sync. Synchronization is an order-preserving action. Without it, tasks are not ensured to run as same order they were fed into pool. Task 21 can finish before task 20. When LAST is set on the task-20, all tasks 18,19,20 are ensured to be completed before the sync and all tasks 21,22,23, .. are ensured to be started after sync. Task 20 can finish before task 19 and task 22 can finish before task 21.

  • TASK_MESSAGE_BROADCAST makes a task executed in all devices. This is useful for initializing all devices' buffers. Only works as read-only (not as array field set, but in kernel side, for defined behavior). Writes to host are not allowed. Kernel side can write to device-side buffers(without zero-copy, without streaming which is enabled by auto for CPUs and integrated GPUs).

  • TASK_MESSAGE_SERIAL_MODE_BEGIN and TASK_MESSAGE_SERIAL_MODE_END are same as TASK_MESSAGE_DEVICE_SELECT_BEGIN and TASK_MESSAGE_DEVICE_SELECT_END, but they also impose single command queue usage to ensure order of task execution for that defined inclusive range of tasks.

Serial mode or single device mode intervals must not overlap.