Load Balancing

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Apr 11, 2017 · 13 revisions

The load balancer uses a kind of successive over-relaxation method that converges to a stable work distribution over several iterations. To see it in action, set the performanceFeed property of the number cruncher:

            ClNumberCruncher cr = new ClNumberCruncher(
                AcceleratorType.GPU|AcceleratorType.CPU, // 2 devices to distribute load 
                @"
                    __kernel void loadBalanceTest(__global float * a)
                    {
                        int i=get_global_id(0);
                        a[i]+=sin(i);  // sine function: somewhat hard to compute                 
                    }
            ");
            cr.performanceFeed = true; // report to console

            ClArray<float> a = new ClArray<float>(1024*1024); // 1M elements = 4 MB of data, too much to copy
            a.partialRead = true;      // efficient array partitioning
            for(int i=0;i<1000;i++)
                a.compute(cr, 1, "loadBalanceTest", 1024*1024, 64); 
                // 1M workitems total, 64 workitems per workgroup

The results are written to the console:

Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 149,7993 milliseconds, compute range: 524288 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 188,5762 milliseconds, compute range: 524288 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 81,2813 milliseconds, compute range: 542336 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 224,6407 milliseconds, compute range: 506240 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 47,798 milliseconds, compute range: 614784 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 49,6483 milliseconds, compute range: 433792 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 37,2562 milliseconds, compute range: 617664 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 38,3673 milliseconds, compute range: 430912 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 43,2538 milliseconds, compute range: 619904 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 6,2722 milliseconds, compute range: 428672 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 39,9596 milliseconds, compute range: 488512 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 9,6918 milliseconds, compute range: 560064 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 23,6922 milliseconds, compute range: 396928 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 10,2896 milliseconds, compute range: 651648 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 29,7142 milliseconds, compute range: 343680 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 5,7563 milliseconds, compute range: 704896 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 19,8342 milliseconds, compute range: 267776 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 4,7323 milliseconds, compute range: 780800 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 20,6675 milliseconds, compute range: 211264 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 5,154 milliseconds, compute range: 837312 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 24,722 milliseconds, compute range: 166528 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 15,948 milliseconds, compute range: 882048 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 14,3287 milliseconds, compute range: 150720 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 9,494 milliseconds, compute range: 897856 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 11,375 milliseconds, compute range: 137024 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 14,3127 milliseconds, compute range: 911552 workitems
Device 0(system-RAM): Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz, compute time: 13,8756 milliseconds, compute range: 145920 workitems
Device 1(system-RAM): Intel(R) HD Graphics 400, compute time: 7,2847 milliseconds, compute range: 902656 workitems

This slowly converges to a balance point at which about 990k workitems are sent to the GPU and the remaining ones to the CPU. Transcendental functions are heavy and the GPU has more cores (96 in this case), so the load balancer offloads nearly all of the work to the GPU.
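The kind of iteration described above can be sketched as a damped rebalancing step. This is only an illustration of the idea, not the library's actual update rule: the class BalancerSketch, the method Rebalance and the relaxation parameter are all hypothetical names, and the step assumes that each device's throughput in the last run predicts its throughput in the next one.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the library: one damped rebalancing step.
public static class BalancerSketch
{
    // ranges[i]  : workitems given to device i in the last run
    // timesMs[i] : measured compute time of device i for that range
    // localSize  : workgroup size; every range stays a multiple of it
    // relaxation : fraction of the distance to the ideal share covered per step
    // Assumes the total range is positive and a multiple of localSize.
    public static int[] Rebalance(int[] ranges, double[] timesMs, int localSize, double relaxation)
    {
        int total = ranges.Sum();

        // Throughput (workitems per millisecond) of each device in the last run.
        double[] speed = ranges.Select((r, i) => r / Math.Max(timesMs[i], 1e-6)).ToArray();
        double speedSum = speed.Sum();

        int[] next = new int[ranges.Length];
        int assigned = 0;
        for (int i = 0; i < ranges.Length; i++)
        {
            // Share that would equalize finish times if the speeds stayed constant.
            double ideal = total * speed[i] / speedSum;
            // Move only part of the way there to damp oscillations.
            double relaxed = ranges[i] + relaxation * (ideal - ranges[i]);
            // Round down to a whole number of workgroups.
            next[i] = (int)(relaxed / localSize) * localSize;
            assigned += next[i];
        }

        // Give the rounding remainder to the fastest device so the total is preserved.
        int fastest = Array.IndexOf(speed, speed.Max());
        next[fastest] += total - assigned;
        return next;
    }
}
```

Feeding it the first pair of log lines above (524288 workitems each, 149.7993 ms and 188.5762 ms) with a relaxation of 0.5 shifts work toward device 0, which was the faster one in that run. A smaller relaxation value converges more slowly but reacts less to a single noisy measurement.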

When the compute-to-memory-access ratio is lowered by replacing the sine function with a simple addition, the balance ratio becomes nearly 50 percent. This shows that memory access latency dominates and that the CPU is as capable of fetching memory as its integrated GPU. (One could test on a system with 2 GPUs, one on a PCIe 2.0 x4 link and one on a PCIe 2.0 x16 link, and see the partitioning become 20% and 80% respectively.)


Multiple Work Types

When executions use different kernels or differently sized arrays, each compute call should be given a unique integer as its compute id. This lets the load balancer optimize each one separately and recall its state whenever it is repeated:

  a.compute(cr, 1, "kernelA", globalSize, localSize); // optimize kernelA work distribution 
  a.compute(cr, 2, "kernelB", globalSize, localSize); // optimize kernelB work distribution
  a.compute(cr, 3, "kernelC", globalSize, localSize); // optimize kernelC work distribution
  a.compute(cr, 1, "kernelA", globalSize, localSize); // remember kernelA work distribution state and optimize more
  a.compute(cr, 4, "kernelD", globalSize, localSize); // optimize kernelD work distribution
                ^
                |
          compute id

Even the data itself can change how quickly a GPU finishes a job. For example, ray-tracing geometry can make some parts of an image more expensive than others, so the best work distribution depends on the camera angle. Here, the camera angle could be used to differentiate the compute operations, so that each one is optimized separately and its state is recalled whenever the same angle (or one within some tolerance) is chosen for the camera.

Kernel: Static scene ray-tracer, Compute Id: Camera angle

Kernel: Matrix-multiplication, Compute Id: constant or matrix size

Kernel: N-body brute-force, Compute Id: another constant

Kernel: grayscale image, Compute Id: another constant

Kernel: javascript interpreter, Compute Id: javascript interpreted function hashcode

Kernel: compression, Compute Id: filename hashcode
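For the ray-tracer case, one way to turn a camera angle into a compute id is to bucket the angle by a tolerance. The sketch below is an assumption about how this might be done, not something the library provides: ComputeIdSketch, FromCameraAngle and baseId are hypothetical names.

```csharp
using System;

// Hypothetical helper, not part of the library: maps a camera angle to a compute id.
public static class ComputeIdSketch
{
    // angleDegrees     : camera angle, any real value
    // toleranceDegrees : bucket width; angles in the same bucket share one id
    // baseId           : first id of the range reserved for this kernel
    public static int FromCameraAngle(double angleDegrees, double toleranceDegrees, int baseId)
    {
        // Normalize to [0, 360) so that 370 and 10 degrees map to the same bucket.
        double normalized = ((angleDegrees % 360.0) + 360.0) % 360.0;
        int bucket = (int)Math.Floor(normalized / toleranceDegrees);
        return baseId + bucket;
    }
}
```

The result would be passed as the second argument of compute, in place of the constants 1 to 4 used earlier. Note that fixed buckets are a simplification: two angles closer than the tolerance can still fall on opposite sides of a bucket boundary and receive different ids.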


Smoothing Load Balancing Rate

Smoothing is enabled by default, which is equivalent to:

 numberCruncher.smoothLoadBalancer = true;

This instructs the balancer to save the last N work-distribution states for all devices. Because that history has to be collected first, smoothing takes effect only after a fixed number of iterations (10 in release v1.0.3).
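A history-based smoother of this kind could look like the sketch below. The source only says that the last N states are saved. Averaging them is an assumption made here for illustration, and SmoothingSketch and Smooth are hypothetical names, not library API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the library: moving average over the last N distributions.
public class SmoothingSketch
{
    private readonly Queue<double[]> history = new Queue<double[]>();
    private readonly int depth;

    // depth: number of past states to keep (N)
    public SmoothingSketch(int depth) { this.depth = depth; }

    // Records the newest per-device shares. Until 'depth' states have been
    // collected the input is returned unchanged; afterwards the average of
    // the stored states is returned.
    public double[] Smooth(double[] shares)
    {
        history.Enqueue((double[])shares.Clone());
        if (history.Count > depth) history.Dequeue();
        if (history.Count < depth) return shares;

        double[] average = new double[shares.Length];
        foreach (double[] past in history)
            for (int i = 0; i < average.Length; i++)
                average[i] += past[i] / depth;
        return average;
    }
}
```

With a depth of 10, the first nine calls pass the raw distribution through, which matches the warm-up period described above. From the tenth call on, a single outlier changes the result by at most a tenth of its deviation.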

Disabling it makes the balancer respond instantly to sudden performance spikes, which can come from sources such as OS interrupts and preemptions, turbo frequency changes and similar events.

In the future, adaptive balancer versions will be added to make balancing faster without losing the smoothing feature.


Provided that the total work is large enough, what can change a device's work-partitioning percentage at run time?

  • Turbo frequency oscillations
  • Manual overclocking
  • Total compute power of the device
  • Communication bottlenecks (such as shared PCIe lanes) and buffer copy overheads
  • Operating system interrupts
  • Asynchronous read/write/compute capabilities of the device
  • Sharing the device with other jobs (such as rendering with OpenGL or DirectX)
  • The CPU's ability (or inability) to feed all devices
  • DMA capabilities (for mapping/unmapping, i.e. streaming) of the device and motherboard