Streaming Data
When instantiating the ClNumberCruncher class, there is an optional parameter named stream, which is true by default. It instructs all device workers to make fewer copies and to access the buffers given to the API more directly. This type of usage is faster when data is accessed only once and the compute-to-(global-)memory-access ratio is low. It is slower when the kernel accesses global memory many times and is cache-aware, because then the PCI-e bridge (or any other distant connection type) becomes the bottleneck.
This program fragment is a good candidate for streaming performance optimization:
ClNumberCruncher cr = new ClNumberCruncher(
    AcceleratorType.GPU, @"
    __kernel void addTen(__global char * a)
    {
        int i=get_global_id(0);
        a[i]+=10;
    }
    ",-1,-1,true); // last argument (stream): enables fewer copies per array (true by default)
ClArray<byte> a = new ClArray<byte>(2500);
a.compute(cr, 1, "addTen", 2500, 250);
The code block above picks CL_MEM_ALLOC_HOST_PTR for device buffers so that they reach the array through fewer data copies. If there are any RAM-sharing devices in the system (such as integrated GPUs without dedicated memory), this option is always applied to them; other devices, such as discrete GPUs, enable it only through the parameter value.
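To make the "accessed only once" pattern concrete, here is a minimal host-side sketch in plain C of what the addTen kernel does per element. It is only an emulation for illustration, not how the library runs the kernel: the function name add_ten is hypothetical, the loop index stands in for get_global_id(0), and unsigned char mirrors the C# byte array (the kernel itself declares char).

```c
#include <stddef.h>

/* Host-side emulation of the addTen kernel: one loop iteration plays the
   role of one work item (i stands in for get_global_id(0)). */
void add_ten(unsigned char *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* One load, one add, one store per element: the data is touched
           once, so the transfer cost dominates the arithmetic. */
        a[i] = (unsigned char)(a[i] + 10);
    }
}
```

Each element is read once and written once with a single addition in between, which is the low compute-to-memory-access case where fewer buffer copies help the most.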
When streaming is enabled (the parameter affects all arrays), the host-side array type can add further optimization:
ClFloatArray arr = new ClFloatArray(1000);
This object contains a C++ array which is aligned to a 4096-byte boundary by default and lets the OpenCL part use the CL_MEM_USE_HOST_PTR flag to reduce the number of buffer copies even further and achieve zero-copy processing. Although this is called zero-copy, any vendor's implementation may still cache some variables (per compute unit) before implicitly updating the data.
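As an illustration of what "aligned to a 4096-byte boundary" means for a host array, the following is a minimal C sketch. It is not the library's internal allocator: the helper names alloc_page_aligned_floats and is_aligned are hypothetical, and the sketch assumes a C11 runtime that provides aligned_alloc.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_ALIGNMENT 4096u

/* Returns 1 when p sits on an `alignment`-byte boundary, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocates room for `count` floats on a 4096-byte boundary.
   The byte size is rounded up to a multiple of the alignment because
   C11 aligned_alloc expects that. Release the result with free(). */
float *alloc_page_aligned_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + PAGE_ALIGNMENT - 1) / PAGE_ALIGNMENT * PAGE_ALIGNMENT;
    if (rounded == 0) {
        rounded = PAGE_ALIGNMENT; /* avoid a zero-sized request */
    }
    return (float *)aligned_alloc(PAGE_ALIGNMENT, rounded);
}
```

A pointer obtained this way is the kind of host memory that can be handed to an OpenCL buffer created with CL_MEM_USE_HOST_PTR; whether a given driver then really avoids the copy remains vendor-specific.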
When an array is created with ClArray in this way:
ClArray<float> arr = new ClArray<float>(1000);
it will have a ClFloatArray inside, which enables zero-copy processing.
arr.fastArr=false;
swaps the internal C++ array with a C# one and disables use of zero-copy processing for that array.
Discrete graphics cards have very fast dedicated memory, so for more complex OpenCL kernels such as an "nbody" or a "quad tree", stream=false is advised; it lets these fast devices use their own memory and copy the results back later.
Explicit device selection also has the same option:
ClPlatforms.all().gpus(true)
A true value enables streaming access. This way, the number of ClNumberCruncher parameters is also reduced:
ClNumberCruncher gpu = new ClNumberCruncher(ClPlatforms.all().gpus(true), @"
    __kernel void test(__global float * data)
    {
        int id=get_global_id(0);
        for(int i=0;i<150;i++)
            data[id]+=sqrt(data[id]);
    }
");
v1.2.9_hotpatch+: ClArray objects can now determine whether they are accessed as zero-copy (the GPU directly sees system RAM) or not (the GPU sees its own dedicated memory).
ClArray.zeroCopy=true/false
such as
myXArray.zeroCopy=true;
just before a compute call like myXArray.compute(...) enables map/unmap (true) or read/write (false) at the array level instead of only at the device level. A device can therefore access both RAM and VRAM for different arrays in the same kernel.
If the device has the stream parameter set and the array has the zeroCopy field set, kernels access these arrays without copying (CPUs and iGPUs have stream set automatically if they share the same memory with the CPU).
Once a device buffer has been created in compute(), clearing or setting zeroCopy on the ClArray object has no effect for the device that was used in that compute(). The new value does apply when the array is used with a different ClNumberCruncher object, which creates a different buffer (whether that object uses the same device or not does not matter).
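The rules above can be summarized as a small decision model. The following C sketch is only one reading of the text, not the library's code: the types access_path and buffer_state and both function names are hypothetical, and it assumes that map/unmap is used only when both the device's stream setting and the array's zeroCopy field are true, with read/write used otherwise.

```c
#include <stdbool.h>

typedef enum { ACCESS_READ_WRITE, ACCESS_MAP_UNMAP } access_path;

/* Hypothetical record for one (array, ClNumberCruncher) pair;
   not a type from the library. */
typedef struct {
    bool created;      /* has a device buffer been created already? */
    access_path path;  /* path fixed at the time the buffer was created */
} buffer_state;

/* Zero-copy (map/unmap) needs both the device's stream setting and
   the array's zeroCopy field; anything else falls back to read/write. */
access_path choose_access_path(bool device_stream, bool array_zero_copy)
{
    return (device_stream && array_zero_copy) ? ACCESS_MAP_UNMAP
                                              : ACCESS_READ_WRITE;
}

/* The first compute creates the buffer and fixes the path; later calls
   on the same pair ignore any change to zeroCopy. */
access_path path_for_compute(buffer_state *s, bool device_stream,
                             bool array_zero_copy)
{
    if (!s->created) {
        s->path = choose_access_path(device_stream, array_zero_copy);
        s->created = true;
    }
    return s->path;
}
```

In this model, a second buffer_state stands for the same array used with a different ClNumberCruncher object, which is where a changed zeroCopy value would take effect.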