Buffer Handling: Elements Per Work Item
In real-world applications, a workitem accesses more than a single array element; processing pixels is a clear example. When converting an image to grayscale, every workitem needs to process at least one pixel, and each pixel generally consists of several channels.

In this example, an RGBA structure for each pixel is assumed to be given in a buffer:
```csharp
ClNumberCruncher cr = new ClNumberCruncher(
    AcceleratorType.GPU, @"
    __kernel void grayscale(__global uchar * rgba)
    {
        int i = get_global_id(0);
        // each workitem accesses 4 neighbouring bytes as a whole pixel
        uchar r = rgba[i * 4];
        uchar g = rgba[i * 4 + 1];
        uchar b = rgba[i * 4 + 2];
        uchar a = rgba[i * 4 + 3];
        int total = (r + g + b + 255) >> 2;
        r = g = b = total;
        rgba[i * 4]     = r;
        rgba[i * 4 + 1] = g;
        rgba[i * 4 + 2] = b;
        rgba[i * 4 + 3] = a + 2;
    }
");
ClArray<byte> rgba = new ClArray<byte>(1000);
rgba.numberOfElementsPerWorkItem = 4;
rgba.compute(cr, 1, "grayscale", 250, 25);
```
The important parts are the `numberOfElementsPerWorkItem` property and the ratio of buffer size (1000) to global range (250). The API schedules work in chunks of the local size (25 workitems) and copies the buffer between host and device (and back) in chunks of 25*4 = 100 elements. The developer is responsible for keeping these values within logical limits to avoid buffer overflows and workitem access violations. Different arrays may use different "elements per workitem" values; the API assumes a simple per-workitem stride, not a complex access pattern over random indices. Such complex write patterns are not applicable across multiple devices, but complex reads are not a problem.
Each workitem has to access contiguous array elements if they are to be written or partially read. When an array's `read` flag is set but `partialRead` is not, the API loads the whole array onto the device, so the developer only needs to check for array out-of-bounds accesses. If `partialRead` is also set, the developer must take grain-size based overflows into consideration too. The grain size for distributing work between devices is the local size multiplied by `numberOfElementsPerWorkItem`, so 25*4 = 100 elements in the example above, because the OpenCL kernel parameter type is the same as the C# (or C++) host-side array type.
If the type of the API array elements differs from the type of the OpenCL kernel parameter elements, the load balancer grain size becomes:

```
work_grain_size   = local_size * elements_per_work_item;
buffer_grain_size = work_grain_size * sizeof_(kernel_element) / sizeof_(host_element);
```
For example, if `cl_double` is 8 bytes and the developer uses a byte array on the host side but accesses it as `double` in the kernel, then the developer is responsible for calculating 8*(number_of_elements_per_work_item) and assigning it to the `array.numberOfElementsPerWorkItem` property.
If the kernel uses 3 doubles per workitem and the host side is given a byte array,

array.numberOfElementsPerWorkItem = 8 * 3;

should be used. This effectively copies 24 bytes per workitem, to be interpreted as three `double` variables in the OpenCL kernel. The alignment of the elements in the kernel also has to be matched from the host side for these different type interpretations.