Device Pool and Task Pool
v1.3.1: Task (not a future, but it can be computed at a future time), TaskPool (producer) and DevicePool (consumers) features were added.
Task
Not to be confused with an OpenCL task, this is just another way of executing kernels. Instead of computing a kernel directly, as in
data00.nextParam(data01).compute(...);
the operation can be saved for later as
ClTask task = data00.nextParam(data01).task(1, "test", testGlobalSize, testLocalSize);
with the same parameters as compute(), except for the number-cruncher parameter. The ClNumberCruncher instance is used only when computing a task, as
task.compute(numberCruncher);
task.compute(numberCruncher); // can be repeated, does not destroy itself
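The numberCruncher above is an ordinary ClNumberCruncher; a minimal sketch of creating one, assuming the AcceleratorType-based constructor from the library's main examples (the kernel body is illustrative, not from this page):
ClNumberCruncher numberCruncher = new ClNumberCruncher(
    AcceleratorType.GPU, // use GPUs; a device-selection based constructor also exists
    @"__kernel void test(__global float *data, __global float *data2)
      { int id = get_global_id(0); data[id] += data2[id]; }");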
A ClTask instance is a frozen snapshot of the parameter + kernel configuration; it does not change even if the parameter fields are changed later. So multiple different tasks can be created from the same parameters, with some fields changed each time. Choosing a different compute-id for each use of the same kernel name binds a different kernel instance in the background, which increases many-kernel scheduling performance through fewer clSetKernelArg() calls.
ClTask task = data00.nextParam(data01).task(1, "test", testGlobalSize, testLocalSize);
data01.read = false;
ClTask taskNoRead = data00.nextParam(data01).task(1, "test", testGlobalSize, testLocalSize);
data00.write = false;
ClTask taskNoReadWrite = data00.nextParam(data01).task(1, "test", testGlobalSize, testLocalSize);

task.compute(numberCruncher);            // read+write both arrays
taskNoRead.compute(numberCruncher);      // read only first array
taskNoReadWrite.compute(numberCruncher); // write only second array and read only first array
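Since a different compute-id binds a separate kernel instance, two tasks can also target the same kernel name with independent argument sets; a small sketch of the idea using the same parameters as above:
ClTask taskA = data00.nextParam(data01).task(1, "test", testGlobalSize, testLocalSize);
ClTask taskB = data00.nextParam(data01).task(2, "test", testGlobalSize, testLocalSize);
// compute-ids 1 and 2 map to two separate "test" kernel instances in the background,
// so scheduling both needs fewer clSetKernelArg() calls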
TaskPool and Device Pool
From ClTask instances, a ClTaskPool is built. A ClTaskPool groups ClTask instances so that all of them can easily be repeated from client code, and it can be reset at any time to revert back to the first task to execute. It is also used by a ClDevicePool instance for computing many non-separable kernels with a greedy work scheduler across multiple GPUs. The task pool acts as the source for the "producer", the devices act as "consumers", and together they finish all kernels quicker than a single GPU, using fine-grained control logic built on OpenCL 1.2's callback feature.
Due to the multi-threaded nature of the single-producer + multiple-consumer algorithm, a kernel may be issued by any device in the pool. There is no guaranteed order of kernel execution, nor of device selection. As soon as a device becomes available, it can start computing; as soon as a task is fed to the pool, it can start being computed, with as much concurrency as possible. The steps below walk through the setup.
Preparing task pool:
int testSize = 1024 * 8; // size of each kernel
int testTiles = 30; // number of different kernels, arrays (and tasks); takes 30-40 seconds to compile 30 kernels
ClArray<float>[] data00 = new ClArray<float>[testTiles];
ClArray<float>[] data01 = new ClArray<float>[testTiles];
for (int i = 0; i < testTiles; i++)
{
    data00[i] = new ClArray<float>(testSize); data00[i].write = false; data00[i].read = false;
    data01[i] = new ClArray<float>(testSize); data01[i].write = false; data01[i].read = false;
}
ClTaskPool taskPool = new ClTaskPool();
for (int i = 0; i < testTiles; i++)
{
    ClTask task = data00[i].nextParam(data01[i]).task(1, "test" + i.ToString(), testSize, 64);
    taskPool.feed(task);
}
taskPool.reset();
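Here reset() rewinds the pool back to its first task, so the whole series is ready to be executed (or enqueued) from the beginning, as described above.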
Preparing device pool:
string copy = @"
__kernel void test@@v(__global float *data, __global float *data2)
{
    int id = get_global_id(0);
    float f = 2.0f + data2[id];
    for(int i = 0; i < 350; i++)
        f = sin(cos(f));
    data[id] += f;
}
";
// "multiple kernel instances" feature has not been added yet
// so to produce multiple kernels doing same algorithm:
StringBuilder sbRes = new StringBuilder(@"");
for(int i=0;i< testTiles;i++)
{
sbRes.AppendLine(copy.Replace("@@v",i.ToString()));
}
// only compute-at-will mode is supported for now (and true = fine-grained mode, which is also supported)
ClDevicePool devicePool = new ClDevicePool(ClDevicePoolType.DEVICE_COMPUTE_AT_WILL, sbRes.ToString(), true, false);
devicePool.addDevice(ClPlatforms.all().gpus()[0]); // first GPU in the system
devicePool.addDevice(ClPlatforms.all().gpus()[1]); // second GPU (assumes at least two GPUs are present)
Binding task pool to device pool:
devicePool.enqueueTaskPool(taskPool); // deep clones taskpool instance so multiple copies can be added
taskPool.feed(newTask); // newTask: some additional ClTask created as shown earlier
devicePool.enqueueTaskPool(taskPool); // has 1 more task
Synchronizing on host:
devicePool.finish();
The multiple-queue option is not working for now; this feature is still being developed. All tasks (especially those with duplicated kernel names) should have a different compute-id for future multi-queue versions of the device pool, because a different compute-id creates a new kernel instance even for the same kernel name, making it possible to use different sets of parameters at the same time on multiple queues in the same device.
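For example, the task pool preparation loop above could give every task its own compute-id (here i + 1, an arbitrary choice) so that each duplicated kernel name binds its own kernel instance, ready for future multi-queue versions:
for (int i = 0; i < testTiles; i++)
{
    // i + 1 makes the compute-id unique per task, so each "test" + i kernel
    // gets its own kernel instance and parameter set
    ClTask task = data00[i].nextParam(data01[i]).task(i + 1, "test" + i.ToString(), testSize, 64);
    taskPool.feed(task);
}
taskPool.reset();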
Details
Device Pool:
- Device pool currently works only with the DEVICE_COMPUTE_AT_WILL option, which greedily chooses idle (or the least busy) GPUs to run tasks.
Task:
- The TASK_MESSAGE_DEVICE_SELECT_BEGIN and TASK_MESSAGE_DEVICE_SELECT_END bitfields mark a task as the beginning or the end of a "a single device runs this series of tasks" mode; the marking tasks themselves are included in the single-device range. While the chosen device runs this series of tasks, the other devices are issued the remaining tasks or other similarly marked task ranges.
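A hypothetical sketch of marking such a range; how these bitfields are actually attached to a task is not shown on this page, so the comments below describe the intent rather than the confirmed API:
ClTask first = data00.nextParam(data01).task(1, "test", testSize, 64);
// mark 'first' with TASK_MESSAGE_DEVICE_SELECT_BEGIN (exact mechanism: see library source)
ClTask last = data00.nextParam(data01).task(2, "test", testSize, 64);
// mark 'last' with TASK_MESSAGE_DEVICE_SELECT_END (exact mechanism: see library source)
// every task fed between 'first' and 'last' (inclusive) then runs on one chosen device,
// while the other devices in the pool are issued the remaining tasks or ranges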