Buffer Handling
Behind the scenes, this library creates a duplicate buffer for each array on each device (without copying the data up front) and coordinates the necessary copy operations. This makes it fast to copy data concurrently between the host and all devices, because a device may be assigned any percentage of the work, and therefore data of any length, at each iteration. Buffers are duplicated but data is not. A few flags can be set or unset to tune the automatic buffer-copying algorithm.
Each host-side array is also stored in the number cruncher object, and whenever the same array is used again, its latest data partitions on all devices are retained for the compute operation.
Developers can access array elements, read and write them, copy to/from an equally sized array, and get a C# array copy of the backing array (managed or unmanaged).
ClArray<float> x = myArray; // ClArray or C# array or ClFloatArray(or its int,uint,double,... versions)
x[0]=5;
x[5]=x[20];
x.CopyTo(y,startIndex); // y and z can be C# array or ClArray or ClFloatArray(and other versions)
x.CopyFrom(z,startIndex);
var copy = x.ToArray(); // returns a copy as a C# array
On the hello world page, a sample buffer was used with the number cruncher like this:
ClArray<byte> array = new ClArray<byte>(1000);
array.compute(numberCruncher, 1, "hello", 1000, 100);
This usage implicitly copies the whole array to all devices (actually duplicates it), computes on all devices, then gets partial results back from each device. Results are partial because this library was written with "embarrassingly parallel" algorithms in mind: each device generates its own part of the result, and the parts together make up the full array.
To control which type of copy each array performs, one needs to alter some flags.
- Whenever the word "read" appears in this wiki, it means that a "read from the C# side" --> "write to the OpenCL buffer" will happen.
- Whenever "partial read" appears in this wiki, it means that the read is partial: the C#-side array is not duplicated on every device but dynamically distributed across them. One device may read only 128 elements while a faster device reads 16k elements at the same time, so PCIe bandwidth is conserved in "streaming" scenarios.
- Whenever the word "write" appears in this wiki, it means that the exact opposite of a "partial read" will happen for an array: each device copies its own part of the result from its OpenCL buffer back to the C#-side array.
ClArray<byte> array = new ClArray<byte>(1000);
array.read = false;
array.compute(numberCruncher, 1, "hello", 1000, 100);
Unsetting the read flag makes the array output-only: the kernel generates some data on each device, then each partial result is copied to the target array ("array" in this example) according to the distribution ranges.
ClArray<byte> array = new ClArray<byte>(1000);
array.read = false;
array.write = false;
array.compute(numberCruncher, 1, "hello", 1000, 100);
Unsetting both the read and write flags means this compute operation uses each device's own buffer but copies nothing. (This works for a single device but can be problematic with multiple devices, since the load balancer alters distribution percentages and offsets at each .compute() execution.)
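The multi-device problem can be illustrated with a small simulation in plain C#. This is only a sketch: the two int arrays stand in for per-device buffers, the ranges are made-up load balancer decisions, and none of the names below belong to the library.

```csharp
using System;

// Hypothetical simulation; buffers and ranges are stand-ins, not Cekirdekler types.
public static class NoCopyDemo
{
    // Runs the "kernel" (increment) on one device's private buffer over [offset, offset + count).
    static void Increment(int[] deviceBuffer, int offset, int count)
    {
        for (int i = offset; i < offset + count; i++) deviceBuffer[i]++;
    }

    // Returns the value device 0 holds for element 4 after two iterations without copies.
    public static int Run()
    {
        var device0 = new int[8]; // each device owns a full-size buffer
        var device1 = new int[8];

        // Iteration 1: the balancer assigns 4 + 4 elements.
        Increment(device0, 0, 4);
        Increment(device1, 4, 4);

        // Iteration 2: the balancer shifts to 6 + 2 elements; nothing was copied in between.
        Increment(device0, 0, 6);
        Increment(device1, 6, 2);

        return device0[4];
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // prints 1
    }
}
```

Element 4 was incremented once on device 1 and once on device 0, so neither buffer holds the expected value 2. With a single device the ranges never move between buffers, which is why that case is safe.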
Default values for the read, partialRead and write flags are true, false and true respectively.
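To make the cost of each flag combination concrete, here is a rough byte-count model in plain C#. It is a sketch under stated assumptions: `ReadMode` and `TransferModel` are hypothetical names, the 200/400/400 split is an example distribution, and the model only restates the copy rules described on this page (full read duplicates the array on every device, partial read and write move only each device's own range).

```csharp
using System;
using System.Linq;

// Hypothetical names for illustration only; not part of the Cekirdekler API.
public enum ReadMode { None, Full, Partial } // read=false / default read / partialRead=true

public static class TransferModel
{
    // Bytes copied host -> devices in one compute call.
    // shares[i] is the number of bytes the load balancer assigned to device i.
    public static int HostToDeviceBytes(ReadMode mode, int arrayBytes, int[] shares)
    {
        switch (mode)
        {
            case ReadMode.Full: return arrayBytes * shares.Length; // whole array on every device
            case ReadMode.Partial: return shares.Sum();            // each device gets its own range
            default: return 0;                                     // output-only array
        }
    }

    // Bytes copied devices -> host: each device returns only its own range.
    public static int DeviceToHostBytes(bool write, int[] shares)
    {
        return write ? shares.Sum() : 0;
    }

    public static void Main()
    {
        int[] shares = { 200, 400, 400 }; // assumed split of 1000 bytes over 3 devices
        Console.WriteLine(HostToDeviceBytes(ReadMode.Full, 1000, shares));    // prints 3000
        Console.WriteLine(HostToDeviceBytes(ReadMode.Partial, 1000, shares)); // prints 1000
        Console.WriteLine(DeviceToHostBytes(true, shares));                   // prints 1000
    }
}
```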
ClArray<byte> array = new ClArray<byte>(1000);
array.partialRead = true;
array.compute(numberCruncher, 1, "hello", 1000, 100);
Here, setting the partialRead flag makes the cruncher copy the array to the devices in a "sharing" manner, which assumes that each workitem in the kernel code accesses only its own array elements, as in a "streaming"-data scenario. A simple example of this usage:
Cekirdekler.ClNumberCruncher cr = new Cekirdekler.ClNumberCruncher(
    Cekirdekler.AcceleratorType.GPU, @"
    __kernel void add3stream(__global uchar * arr)
    {
        int threadId=get_global_id(0);
        arr[threadId]+=3;
    }
");
ClArray<byte> array = new ClArray<byte>(1000);
for (int i = 0; i < 1000; i++)
    array[i] = 55;
array.partialRead = true;
array.compute(cr, 1, "add3stream", 1000, 100);
Console.WriteLine(array[770]);
This program prints the value 58 to the console. First it distributes (does not duplicate) the 1000 elements across all devices (say X elements to device 1, Y elements to device 2 and 1000-X-Y elements to device 3), then adds 3 to each element, then gets the results back without duplicating any data. This is an optimization: if the array is 1000 bytes, only 1000 bytes are copied in total, for example 200 bytes for gpu-1, 400 bytes for gpu-2 and 400 bytes for gpu-3. When the write part is taken into consideration, it is 1000 bytes for read and 1000 bytes for write, concurrently.
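The same data flow can be mimicked on the host without any OpenCL device. The sketch below is a simulation, not the library's implementation: `PartialReadDemo`, `Add3Stream` and the 200/400/400 split are assumptions chosen to match the numbers above, and the loop runs the "devices" one after another instead of concurrently.

```csharp
using System;

// Hypothetical host-side simulation of partialRead + write; not Cekirdekler code.
public static class PartialReadDemo
{
    // Each "device" receives only its own range, adds 3, and returns only that range.
    public static byte[] Add3Stream(byte[] host, int[] shares)
    {
        var result = (byte[])host.Clone();
        int offset = 0;
        foreach (int count in shares)
        {
            var deviceBuffer = new byte[count];
            Array.Copy(host, offset, deviceBuffer, 0, count);     // partial read: host -> device
            for (int i = 0; i < count; i++) deviceBuffer[i] += 3; // kernel body of add3stream
            Array.Copy(deviceBuffer, 0, result, offset, count);   // write: device -> host
            offset += count;
        }
        return result;
    }

    public static void Main()
    {
        var array = new byte[1000];
        for (int i = 0; i < array.Length; i++) array[i] = 55;

        int[] shares = { 200, 400, 400 }; // assumed load balancer decision
        byte[] result = Add3Stream(array, shares);
        Console.WriteLine(result[770]); // prints 58
    }
}
```

Every element crosses the bus once in each direction, regardless of how the shares are split.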
v1.2.8 adds the ClArray.writeAll option to get the GPGPU result array as a whole (non-partial) instead of only the elements implied by the global range value and the number of elements per workitem. So working with just 2 threads and getting a 30MB result array is now possible. This works as-is only with a single GPU, because multiple GPUs writing to the same locations concurrently would be undefined behavior; with multiple GPUs, each GPU writes a different result array: GPU 1 writes the 1st array, GPU 2 writes the 2nd array, ...
For writeAll to work, the write field must also be set.
With v1.2.9+, ClArray objects can be made "read only" by setting the readOnly field to true, or "write only" by setting the writeOnly field to true.
This is an advantage with optimized drivers and with kernels that access arrays only to read or only to write. Setting either of these fields clears the opposite fields (setting writeOnly clears readOnly, read and partialRead, while setting readOnly clears writeOnly, write and writeAll) and locks them until the value is set to false again. Setting it back to false, however, only makes the array usable in another ClNumberCruncher object (and with the same devices).
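The clearing rule can be sketched as a tiny standalone class. This is a hypothetical model, not the library's source: `FlagModel` only reproduces which fields get cleared, it assumes writeAll defaults to false (the default is not stated here), and it does not model the locking behavior.

```csharp
using System;

// Hypothetical model of the clearing rule only; locking is not modeled.
public class FlagModel
{
    public bool read = true, partialRead = false, write = true;
    public bool writeAll = false; // assumed default
    private bool readOnlyValue, writeOnlyValue;

    public bool readOnly
    {
        get { return readOnlyValue; }
        set
        {
            readOnlyValue = value;
            // Setting readOnly clears writeOnly, write and writeAll.
            if (value) { writeOnlyValue = false; write = false; writeAll = false; }
        }
    }

    public bool writeOnly
    {
        get { return writeOnlyValue; }
        set
        {
            writeOnlyValue = value;
            // Setting writeOnly clears readOnly, read and partialRead.
            if (value) { readOnlyValue = false; read = false; partialRead = false; }
        }
    }

    public static void Main()
    {
        var a = new FlagModel();
        a.writeOnly = true;
        Console.WriteLine("{0} {1} {2}", a.read, a.partialRead, a.write); // prints False False True

        var b = new FlagModel();
        b.readOnly = true;
        Console.WriteLine("{0} {1} {2}", b.read, b.write, b.writeAll);    // prints True False False
    }
}
```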
Once a read-only or write-only array is used in a compute() method, the device(s) bound to that compute context create the necessary buffers accordingly (as read-only or write-only), so subsequent compute calls can't use this array for other types of access. If it was used as read-only, it can only be read on the same devices (those in the same ClNumberCruncher). Clearing the read-only or write-only bits makes it usable in other ClNumberCruncher objects (even if the devices are the same), because each ClNumberCruncher creates a new context for each device given to it.