
Buffer Handling


Behind the scenes, this library creates a duplicate buffer (but doesn't copy the data) for each array on each device and coordinates the necessary copy operations. This makes it extremely fast to copy all data between the host and all devices concurrently. Buffers are duplicated, but data is not. There are flags that can be set or unset to easily tune the automatic buffer copying algorithm working behind the scenes.
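For orientation, the flags referred to on this page (read, partialRead and write) are simple boolean properties of ClArray<T> that are set before calling compute(). The snippet below is only a sketch restating the default values documented further down this page; it does not introduce any extra API.

     // sketch: the copy-tuning flags and their default values
     Cekirdekler.ClArray<byte> array = new Cekirdekler.ClArray<byte>(1000);
     array.read = true;         // default: copy host data to devices before the kernel runs
     array.partialRead = false; // default: when true, each device reads only its own range ("streaming")
     array.write = true;        // default: copy each device's partial result back to the host array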


On the hello world page, a sample buffer was used with the number cruncher like this:

     Cekirdekler.ClArray<byte> array = new Cekirdekler.ClArray<byte>(1000);
     array.compute(numberCruncher, 1, "hello", 1000, 100); 

This usage implicitly copies the whole array to all devices (actually duplicates it), then computes on all devices, then gets the partial results back from all devices. Results are partial because this library was written with "embarrassingly parallel" algorithms in mind: each device generates its own partial result, and together they form the full array in the end.
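As an illustration, a kernel like the one below would be compatible with this call pattern: each work-item touches only its own element, so the partial results coming back from different devices assemble into the complete array. The kernel body here is hypothetical; the actual "hello" kernel on the hello world page may differ.

     // hypothetical "hello" kernel: each work-item writes only its own element,
     // so per-device partial results combine into the full 1000-element array
     Cekirdekler.ClNumberCruncher numberCruncher = new Cekirdekler.ClNumberCruncher(
         Cekirdekler.AcceleratorType.GPU, @"
             __kernel void hello(__global uchar * arr)
             {
                 int threadId = get_global_id(0);
                 arr[threadId] = (uchar)(threadId % 256);
             }
         ");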

To control what type of copy each array performs, one needs to alter some flags:

  • Whenever a "read" word is written in wiki, it is intended to tell a "read from C# side"-->"write to opencl buffer" will happen.

  • Whenever a "partial read" is written in wiki, it is intended to tell that the read operation will be partial so the C# side array will not be data-duplicated but dynamically distributed to all devices. A device may read only 128 elements while a faster device may read 16k elements at the same time so the pci-e bandwidth is conserved in case of "streaming"

  • Whenever a "write" word is written in wiki, it is intended to tell that an exact opposite of "partial read" will happen for an array.


     Cekirdekler.ClArray<byte> array = new Cekirdekler.ClArray<byte>(1000);
     array.read = false;
     array.compute(numberCruncher, 1, "hello", 1000, 100); 

Unsetting the read flag makes the array output-only, which means the kernel will generate some data on each device, and then each partial result will be copied back to the target array ("array" in this example) according to the distribution ranges.
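A kernel such as the one below would fit this output-only pattern: no host data is uploaded, each work-item only produces a value, and the per-device results are copied back into the host array. The kernel name and body are hypothetical, shown only for illustration.

     // hypothetical output-only kernel: data is generated on the device,
     // never read from the host side, then written back as partial results
     Cekirdekler.ClNumberCruncher numberCruncher = new Cekirdekler.ClNumberCruncher(
         Cekirdekler.AcceleratorType.GPU, @"
             __kernel void generate(__global uchar * arr)
             {
                 int threadId = get_global_id(0);
                 arr[threadId] = (uchar)7; // no prior host data is needed
             }
         ");
     Cekirdekler.ClArray<byte> array = new Cekirdekler.ClArray<byte>(1000);
     array.read = false;                                      // skip the host-to-device copy
     array.compute(numberCruncher, 1, "generate", 1000, 100);
     // every element of "array" holds 7 after the call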


     Cekirdekler.ClArray<byte> array = new Cekirdekler.ClArray<byte>(1000);
     array.read = false;
     array.write = false;
     array.compute(numberCruncher, 1, "hello", 1000, 100);

Unsetting both the read and write flags means this compute operation will use each device's own buffer for computation but will not copy anything. (This is fine for a single device but can be problematic for multiple devices, since the load balancer alters distribution percentages and offsets at each .compute() execution.)
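One plausible use of this mode, sketched below under the assumption that a single-device cruncher is used and that the device-side buffer keeps its contents between compute() calls (which the multi-device warning above implies), is a device-resident scratch array that never travels over PCI-e. The kernel name "useScratch" is made up for illustration.

     // hypothetical device-side scratch buffer: no host<->device copies at all
     Cekirdekler.ClArray<byte> scratch = new Cekirdekler.ClArray<byte>(1000);
     scratch.read = false;   // don't upload host data
     scratch.write = false;  // don't download results
     // the kernel (not shown) may use "scratch" purely as temporary device storage
     scratch.compute(numberCruncher, 1, "useScratch", 1000, 100);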


Default values for the read, partialRead and write flags are true, false and true, respectively.

     Cekirdekler.ClArray<byte> array = new Cekirdekler.ClArray<byte>(1000);
     array.partialRead = true;
     array.compute(numberCruncher, 1, "hello", 1000, 100); 

Here, setting the partialRead flag makes the cruncher copy the array to the devices in a "sharing" manner, implying that each work-item in the kernel code accesses only its own array elements, as in a "streaming"-data scenario. A simple example of this usage:

            Cekirdekler.ClNumberCruncher cr = new Cekirdekler.ClNumberCruncher(
                Cekirdekler.AcceleratorType.GPU, @"
                    __kernel void add3stream(__global uchar * arr)
                    {
                        int threadId=get_global_id(0);
                        arr[threadId]+=3;
                    }
                ");
            Cekirdekler.ClArray<byte> array = new Cekirdekler.ClArray<byte>(1000);
            for (int i = 0; i < 1000; i++)
                array[i] = 55;
            array.partialRead = true;
            array.compute(cr, 1, "add3stream", 1000, 100);
            Console.WriteLine(array[770]); 

This program writes the value 58 to the console. First, it distributes (does not duplicate) the 1000 elements across all devices (e.g., X elements to device-1, Y elements to device-2, 1000-X-Y elements to device-3), then adds 3 to each element, then gets the results back without duplicating any data. This is an optimization: if the array is 1000 bytes, only 1000 bytes in total are copied; perhaps 200 bytes for gpu-1, 400 bytes for gpu-2 and 400 bytes for gpu-3. When the write part is also taken into consideration, it is 1000 bytes for the read and 1000 bytes for the write, performed concurrently.