Device Pool and Task Pool
v1.3.1: Task, TaskPool and DevicePool features were added.
Task
Not to be confused with an OpenCL task, a ClTask is just another way of executing a kernel. Instead of computing a kernel directly as
data00.nextParam(data01).compute(...);
the operation can be saved for later as
ClTask task = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);
with the same parameters as compute() except the number cruncher parameter. The ClNumberCruncher instance is used only when computing the task, as
task.compute(numberCruncher);
Here, a ClTask instance is a frozen snapshot of the parameters and kernel configuration, and it stays unchanged even if parameter fields are changed later. Multiple different tasks can therefore be created from the same parameters, with some fields changed each time:
ClTask task = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);
data01.read=false;
ClTask taskNoRead = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);
data00.write=false;
ClTask taskNoReadWrite = data00.nextParam(data01).task(1, "test", testGlobalSize,testLocalSize);
task.compute(numberCruncher);            // read+write both arrays
taskNoRead.compute(numberCruncher);      // read only first array
taskNoReadWrite.compute(numberCruncher); // write only second array and read only first array
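The "frozen" behaviour described above can be pictured with a small stand-alone sketch. It does not use the library: `ParamSketch` and `FrozenTaskSketch` are hypothetical names invented for illustration, and the sketch only assumes that a task copies the read/write flags when it is created, which is what the text above implies.

```csharp
using System;

// Hypothetical stand-in for an array parameter with read/write flags.
public sealed class ParamSketch
{
    public bool Read = true;
    public bool Write = true;
}

// Hypothetical stand-in for a task: copies the flags at creation time.
public sealed class FrozenTaskSketch
{
    public readonly bool Read;
    public readonly bool Write;

    public FrozenTaskSketch(ParamSketch p)
    {
        Read = p.Read;   // value copy, so later changes to p are not seen
        Write = p.Write;
    }
}

public static class FreezeDemo
{
    public static void Main()
    {
        var p = new ParamSketch();
        var task = new FrozenTaskSketch(p);       // captures Read = true
        p.Read = false;
        var taskNoRead = new FrozenTaskSketch(p); // captures Read = false

        Console.WriteLine(task.Read);       // True
        Console.WriteLine(taskNoRead.Read); // False
    }
}
```

Because each task holds its own copy of the flags, changing the parameter afterwards affects only tasks created later.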
Task Pool and Device Pool
A ClTaskPool is built from ClTask instances. It groups tasks so that client code can easily repeat all of them, and it can be reset at any time to move back to the first task to execute. It is also used by a ClDevicePool instance to compute many non-separable kernels across multiple GPUs with a greedy work scheduler. The task pool acts as the source for the "producer" and the devices act as "consumers", so all kernels finish sooner than on a single GPU. The fine-grained control logic relies on the callback feature of OpenCL 1.2.
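The greedy producer/consumer idea can be sketched without any GPU. The following is only an illustration of the scheduling pattern, not the library's implementation: `GreedyPoolSketch` is a hypothetical name, threads stand in for devices, `Thread.Sleep` stands in for a kernel execution, and the OpenCL callback mechanism is not modelled.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class GreedyPoolSketch
{
    // Runs every work item on whichever simulated device is free first.
    // Returns how many items each device completed.
    public static int[] Run(int[] workCostsMs, int deviceCount)
    {
        var queue = new ConcurrentQueue<int>(workCostsMs); // plays the task pool
        var completed = new int[deviceCount];

        var devices = Enumerable.Range(0, deviceCount).Select(d => Task.Run(() =>
        {
            // Each consumer pulls the next item as soon as it is idle.
            while (queue.TryDequeue(out int costMs))
            {
                Thread.Sleep(costMs); // stands in for a kernel execution
                completed[d]++;       // only device d writes slot d
            }
        })).ToArray();

        Task.WaitAll(devices); // block until the queue has been drained
        return completed;
    }

    public static void Main()
    {
        int[] costs = Enumerable.Repeat(5, 30).ToArray(); // 30 equal work items
        int[] perDevice = Run(costs, 2);
        Console.WriteLine(perDevice.Sum()); // 30: every item ran exactly once
    }
}
```

With this pattern a faster consumer simply dequeues more items, so the load balances itself without any up-front partitioning of the work.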
Preparing task pool:
int testSize = 1024*8; // size of each kernel
int testTiles = 30; // number of different kernels, arrays (and tasks), takes 30-40 seconds to compile 30 kernels
ClArray<float>[] data00 = new ClArray<float>[testTiles];
ClArray<float>[] data01 = new ClArray<float>[testTiles];
for (int i = 0; i < testTiles; i++)
{
    data00[i] = new ClArray<float>(testSize); data00[i].write = false; data00[i].read = false;
    data01[i] = new ClArray<float>(testSize); data01[i].write = false; data01[i].read = false;
}
ClTaskPool taskPool = new ClTaskPool();
for (int i = 0; i < testTiles; i++)
{
    ClTask task = data00[i].nextParam(data01[i]).task(1, "test" + i.ToString(), testSize, 64);
    taskPool.feed(task);
}
taskPool.reset();
Preparing device pool:
string copy = @"
__kernel void test@@v(__global float *data, __global float *data2)
{
    int id = get_global_id(0);
    float f = 2.0f + data2[id];
    for (int i = 0; i < 350; i++)
        f = sin(cos(f));
    data[id] += f;
}
";
// the "multiple kernel instances" feature has not been added yet,
// so to produce multiple kernels that run the same algorithm,
// each copy gets a unique name (test0, test1, ...):
StringBuilder sbRes = new StringBuilder();
for (int i = 0; i < testTiles; i++)
{
    sbRes.AppendLine(copy.Replace("@@v", i.ToString()));
}
// only compute-at-will mode is supported for now (true = fine-grained mode, which is also supported)
ClDevicePool devicePool = new ClDevicePool(ClDevicePoolType.DEVICE_COMPUTE_AT_WILL, sbRes.ToString(), true, false);
// this example expects at least two GPUs to be present
devicePool.addDevice(ClPlatforms.all().gpus()[0]);
devicePool.addDevice(ClPlatforms.all().gpus()[1]);
Binding the task pool to the device pool:
devicePool.enqueueTaskPool(taskPool);
New tasks can be added asynchronously while the pool is being computed:
taskPool.feed(newTask);
Synchronizing:
devicePool.finish();