
Streaming Data

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Nov 29, 2017 · 6 revisions

When a ClNumberCruncher is instantiated, an optional parameter named stream is available and is true by default. With streaming enabled, all device workers are instructed to make fewer copies and use faster access paths to the buffers that are given to the API. This is faster when data is accessed only once and the ratio of compute to global-memory access is low. It is slower when the kernel accesses global memory many times and is cache-aware, because the PCI-e bridge (or any other distant connection) then becomes the bottleneck.

This program fragment is a good candidate for streaming performance optimization:

        ClNumberCruncher cr = new ClNumberCruncher(
            AcceleratorType.GPU, @"
                __kernel void addTen(__global char * a)
                {
                    int i=get_global_id(0);
                    a[i]+=10;
                }
            ", -1, -1, true); // last argument is stream: fewer copies per array (true by default)

        ClArray<byte> a = new ClArray<byte>(2500);
        a.compute(cr, 1, "addTen", 2500, 250);

The code in the block above picks CL_MEM_ALLOC_HOST_PTR for device buffers, so devices reach the array through fewer data copies. For any RAM-sharing devices in the system (such as integrated GPUs without dedicated memory) this option is always applied; other devices, such as discrete GPUs, enable it only when the parameter is set.

When streaming is enabled (the parameter affects all arrays), the host-side array type can add further optimization:

 ClFloatArray arr = new ClFloatArray(1000);

This object contains a C++ array that is aligned to a 4096-byte boundary by default and lets the OpenCL side use the CL_MEM_USE_HOST_PTR flag to reduce the number of buffer copies even further and achieve zero-copy processing. Although this is called zero-copy, a vendor's implementation may still cache some data (per compute unit) before implicitly updating it.

When an array is created through ClArray this way:

 ClArray<float> arr = new ClArray<float>(1000);

it will have a ClFloatArray inside and enable use of zero-copy processing.

 arr.fastArr=false;

swaps the internal C++ array with a C# one and disables use of zero-copy processing for that array.

Discrete graphics cards have very fast dedicated memory. For more complex OpenCL kernels such as an "nbody" or a "quad tree", stream=false is advised so that these fast devices use their own memory and copy the results back later.
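As a rough sketch of this advice, the constructor form from the first example on this page can be reused with the last argument set to false. The kernel name `heavy`, the array size and the work-group size are illustrative assumptions, and the kernel body only stands in for a genuinely compute-heavy kernel. The `using` lines are also assumptions about the library's namespaces.

```csharp
using Cekirdekler;          // assumed namespace for ClNumberCruncher
using Cekirdekler.ClArrays; // assumed namespace for ClArray<T>

// Sketch only: same constructor form as the addTen example, with stream disabled.
ClNumberCruncher cr = new ClNumberCruncher(
    AcceleratorType.GPU, @"
        __kernel void heavy(__global float * data)
        {
            int id = get_global_id(0);
            // repeated global-memory accesses per element
            for (int i = 0; i < 150; i++)
                data[id] += sqrt(data[id]);
        }
    ", -1, -1, false); // stream = false: discrete GPUs work in their own dedicated memory

ClArray<float> data = new ClArray<float>(1000);
data.compute(cr, 1, "heavy", 1000, 250); // 1000 work items, 250 per work group
```

Whether this is actually faster than the streaming default depends on the device and the kernel, so it is worth timing both settings.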


Explicit device selection also has the same option:

ClPlatforms.all().gpus(true)

A true value enables streaming access. This also reduces the number of ClNumberCruncher parameters:

ClNumberCruncher gpu = new ClNumberCruncher(ClPlatforms.all().gpus(true), @"
     __kernel void test(__global float * data)
     {
          int id=get_global_id(0);
          for(int i=0;i<150;i++)
              data[id]+=sqrt(data[id]);
     }    
");

v1.2.9_hotpatch+: ClArray objects can now determine whether they are accessed as zero-copy (the GPU sees system RAM directly) or not (the GPU sees its own dedicated memory).

ClArray.zeroCopy=true/false

For example, setting myXArray.zeroCopy=true; just before a compute call such as myXArray.compute(...) selects map/unmap (true) or read/write (false) at the array level instead of only at the device level. A device can therefore access both RAM and VRAM for different arrays in the same kernel.

If the device has the stream parameter set and the array has its zeroCopy field set, kernels access that array without copying. CPUs, and iGPUs that share memory with the CPU, have stream set automatically.
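A minimal sketch of mixing the two access modes in one kernel might look like the following. The kernel name `addArrays` and the array names are hypothetical. `nextParam(...)` is assumed here as the way to pass a second array to the same kernel; it is not shown on this page, so check it against the library before relying on it. The `using` lines are likewise assumed namespaces.

```csharp
using Cekirdekler;          // assumed namespace for ClNumberCruncher
using Cekirdekler.ClArrays; // assumed namespace for ClArray<T>
using Cekirdekler.Hardware; // assumed namespace for ClPlatforms

// Sketch only: stream is enabled at device level through gpus(true).
ClNumberCruncher gpu = new ClNumberCruncher(ClPlatforms.all().gpus(true), @"
    __kernel void addArrays(__global float * streamed, __global float * resident)
    {
        int id = get_global_id(0);
        streamed[id] += resident[id];
    }
");

ClArray<float> streamed = new ClArray<float>(1024);
ClArray<float> resident = new ClArray<float>(1024);

// Set the fields before the first compute() on this device.
streamed.zeroCopy = true;  // map/unmap: the kernel sees system RAM directly
resident.zeroCopy = false; // read/write: data is copied into dedicated device memory

// nextParam is an assumption (see the note above): it chains the second kernel argument.
streamed.nextParam(resident).compute(gpu, 1, "addArrays", 1024, 64);
```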

Once a device buffer has been created in compute(), clearing or setting zeroCopy on the ClArray object has no effect for the device that was used in that compute(). It does take effect when a different buffer is created in a different ClNumberCruncher object, whether or not that object uses the same device.