
Buffer Handling: Elements Per Work Item


In real-world applications, a workitem accesses more than just a single array element. Processing pixels is a clear example of this: to turn an image into a grayscale version, every workitem needs to process at least one pixel, and each pixel generally has multiple channels.

In this example, an RGBA structure for each pixel is assumed to be given in a buffer:

            ClNumberCruncher cr = new ClNumberCruncher(
                AcceleratorType.GPU, @"
                    __kernel void grayscale(__global uchar * rgba)
                    {
                        int i=get_global_id(0);
                        // each workitem accesses 4 neighbouring bytes as a whole pixel
                        uchar r=rgba[i*4];
                        uchar g=rgba[i*4+1];
                        uchar b=rgba[i*4+2];
                        uchar a=rgba[i*4+3];
                        // grayscale value computed as (r+g+b+255)/4
                        int total=(r+g+b+255)>>2;
                        r=g=b=total;
                        rgba[i*4]=r;
                        rgba[i*4+1]=g;
                        rgba[i*4+2]=b;
                        // alpha is also written back (incremented by 2)
                        rgba[i*4+3]=a+2;
                    }
                ");

            ClArray<byte> rgba = new ClArray<byte>(1000);
            rgba.numberOfElementsPerWorkItem = 4;
            rgba.compute(cr, 1, "grayscale", 250, 25);

The important parts are the numberOfElementsPerWorkItem property and the ratio of buffer size (1000) to global range (250). The API partitions the work into chunks of local size (25 workitems) and copies matching buffer chunks of 25*4=100 elements from host to device and in the opposite direction. The developer is responsible for keeping these values within logical limits to avoid buffer overflows and workitem access violations. Multiple arrays may have different "elements per workitem" values; the API assumes this is just a simple stride per workitem, not a complex access pattern over random indices. Such complex write patterns are not applicable for multiple devices, but reads are not a problem.
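
To illustrate multiple arrays with different strides, here is a minimal sketch (the kernel and variable names are illustrative; it assumes the nextParam chaining used elsewhere in this wiki to bind several arrays to one kernel):

            ClNumberCruncher cr2 = new ClNumberCruncher(
                AcceleratorType.GPU, @"
                    __kernel void brighten(__global uchar * rgba, __global float * gain)
                    {
                        int i=get_global_id(0);
                        float g=gain[i];                    // 1 float per workitem
                        rgba[i*4]  =(uchar)(rgba[i*4]  *g); // 4 bytes per workitem
                        rgba[i*4+1]=(uchar)(rgba[i*4+1]*g);
                        rgba[i*4+2]=(uchar)(rgba[i*4+2]*g);
                    }
                ");

            ClArray<byte> pixels = new ClArray<byte>(1000);
            pixels.numberOfElementsPerWorkItem = 4; // stride of 4 bytes per workitem
            ClArray<float> gain = new ClArray<float>(250);
            gain.numberOfElementsPerWorkItem = 1;   // stride of 1 float per workitem
            pixels.nextParam(gain).compute(cr2, 2, "brighten", 250, 25);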

Each workitem has to access contiguous array elements if the array is to be written or partially read.

When the array's read flag is set but partialRead is not, the API loads the whole array onto the device, so the developer only needs to check for array out-of-bounds accesses. If partialRead is also set, then the developer needs to take grain-size based overflows into consideration too. The grain size for distributing work between devices is then the local size multiplied by numberOfElementsPerWorkItem, so 25*4=100 elements in the example above, because the OpenCL kernel parameter type is the same as the C# (or C++) array element type.
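
As a sketch of these flag combinations (assuming the boolean read, partialRead and write properties of ClArray used elsewhere in this wiki):

            ClArray<byte> img = new ClArray<byte>(1000);
            img.numberOfElementsPerWorkItem = 4;
            img.read = true;         // copy host to device before the kernel runs
            img.partialRead = false; // whole array is loaded; only out-of-bounds matters
            img.write = true;        // copy device to host after the kernel runs
            img.compute(cr, 3, "grayscale", 250, 25);

            // with img.partialRead = true, each device would copy only its own
            // 25*4=100 element grains, so grain-size overflows matter too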

If the type of the API array elements differs from the type of the OpenCL kernel parameter elements, then the load balancer grain size becomes:

     work_grain_size   = local_size * elements_per_work_item;
     buffer_grain_size = work_grain_size * sizeof(kernel_element) / sizeof(host_element);
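
Plugging in the numbers from the 3-doubles example below (local size 25, a host-side byte array accessed as double in the kernel; elements_per_work_item is counted in kernel elements here):

     work_grain_size   = 25 * 3     = 75 kernel elements (doubles)
     buffer_grain_size = 75 * 8 / 1 = 600 host elements (bytes) = 24 bytes per workitem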

For example, since cl_double is 8 bytes, if the developer uses a byte array on the host side but accesses it as double in the kernel, the developer is responsible for calculating 8*(number of doubles per workitem) and assigning it to the

            array.numberOfElementsPerWorkItem

property.

If the kernel uses 3 doubles per workitem and the host side is given a byte array,

            array.numberOfElementsPerWorkItem=8*3;

should be used. This effectively copies 24 bytes per workitem, to be interpreted as three double variables in the OpenCL kernel. The alignment of elements in the kernel also has to be supported from the host side for these different type interpretations.
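
Putting this together, a minimal sketch (the kernel name and buffer size are illustrative) could look like:

            ClNumberCruncher cr3 = new ClNumberCruncher(
                AcceleratorType.GPU, @"
                    __kernel void scale3(__global double * xyz)
                    {
                        int i=get_global_id(0);
                        // 3 doubles per workitem, backed by 24 bytes on the host side
                        xyz[i*3]  *=0.5;
                        xyz[i*3+1]*=0.5;
                        xyz[i*3+2]*=0.5;
                    }
                ");

            // 250 workitems * 24 bytes = 6000 bytes on the host side
            ClArray<byte> xyz = new ClArray<byte>(6000);
            xyz.numberOfElementsPerWorkItem = 8 * 3; // 24 bytes read as 3 doubles
            xyz.compute(cr3, 4, "scale3", 250, 25);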