Consecutive (and repeated) Kernels
This feature is best suited to single-device scenarios. Developers can run several kernels in one call by listing their names separated by a space, comma, semicolon, minus sign, or newline character. All listed kernels are executed one after another in the same command queue, and the whole sequence is treated as a single operation for profiling and load balancing.
f.compute(gpu, 1, "sortParticles findNeighbors,calculateForces;moveParticles", 1024);
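To make the delimiter rule concrete, here is a minimal sketch of how such a name list could be split into individual kernel names. The helper `splitKernelNames` is hypothetical and written only for illustration; it is not the library's parser, and it assumes that consecutive delimiters are skipped.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the library): splits a kernel-name list
// on the delimiters named in the text: space, comma, semicolon, minus, newline.
// Assumption: runs of delimiters are skipped and produce no empty names.
std::vector<std::string> splitKernelNames(const std::string& list) {
    const std::string delimiters = " ,;-\n";
    std::vector<std::string> names;
    std::string::size_type start = list.find_first_not_of(delimiters);
    while (start != std::string::npos) {
        // End of the current name, or npos if it runs to the end of the string.
        std::string::size_type end = list.find_first_of(delimiters, start);
        names.push_back(list.substr(start, end - start));
        start = list.find_first_not_of(delimiters, end);
    }
    return names;
}
```

Under these assumptions, a list of four names mixed with spaces, commas and semicolons yields four kernel names in the order they were written, which is also the order in which the text says they are executed.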
For multi-device, multi-kernel execution, the compute method has to be called multiple times, because arrays are synchronized only in RAM (C# arrays or C++ arrays), not on the devices. The results of all devices are joined in RAM and then read by all devices on the next compute call.
v1.2.6 adds a "kernel repeat" feature that reduces total latency when kernels need to be run repeatedly. This feature changes the workflow from
READ-DATA ---> COMPUTE ---> WRITE-RESULTS
to
READ-DATA ---> COMPUTE COMPUTE COMPUTE .... (N times total) COMPUTE ---> WRITE-RESULTS
so once the data is uploaded, it can be computed on repeatedly before the results are copied back to the host side.
To repeat kernels, set the repeat count before calling the compute method:
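The difference between the two workflows can be sketched with a simple cost model. This is an illustration under stated assumptions, not measured data: the struct `StepCosts` and both functions are hypothetical names, and the model ignores per-call API overhead and any overlap between transfers and computation.

```cpp
// Illustrative cost model (assumption, not a measurement).
struct StepCosts {
    double readMs;     // uploading data to the device
    double computeMs;  // one pass of the kernel(s)
    double writeMs;    // copying results back to the host
};

// N separate compute calls: every iteration pays read + compute + write.
double forLoopCostMs(const StepCosts& c, int n) {
    return n * (c.readMs + c.computeMs + c.writeMs);
}

// One compute call with N repeats: read and write are paid only once.
double repeatCostMs(const StepCosts& c, int n) {
    return c.readMs + n * c.computeMs + c.writeMs;
}
```

In this model the saving grows with the share of time spent on transfers, so light kernels benefit the most, while a compute-heavy kernel sees little change.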
numberCruncher.repeatCount=N;
This value is 1 by default and cannot be zero or negative (if such a value is set, it is corrected to 1).
To repeat kernels together with a parameter-changing kernel (for example, one that only resets a single counter or a trivial variable) at the end of each repeat iteration, specify that kernel's name on the number cruncher object:
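The correction rule can be pictured as a clamping setter. The class `RepeatSettings` below is a hypothetical stand-in written for illustration; it does not reproduce the library's actual property implementation.

```cpp
// Hypothetical stand-in for the repeat count setting (not the library's code).
class RepeatSettings {
public:
    // Values below 1 are corrected to 1, as described in the text.
    void setRepeatCount(int n) { repeatCount_ = (n < 1) ? 1 : n; }
    int repeatCount() const { return repeatCount_; }

private:
    int repeatCount_ = 1;  // default: each kernel list runs once
};
```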
numberCruncher.repeatKernelName="reinitializeCounters";
This effectively means
READ-DATA ---> kernel reinitializeCounters kernel reinitializeCounters .... kernel reinitializeCounters ---> WRITE-RESULTS
where "kernel" can be just a single kernel or a list of kernels separated by delimiter characters such as space and comma.
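The resulting launch order can be sketched as follows. The function `buildLaunchOrder` is a hypothetical helper for illustration only; it assumes, following the diagram above, that the repeat kernel runs after every iteration, including the last one, and that an empty repeat kernel name means no extra kernel is launched.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the library): lists the kernel launches
// that happen between the single data upload and the single result download.
std::vector<std::string> buildLaunchOrder(const std::vector<std::string>& kernels,
                                          int repeatCount,
                                          const std::string& repeatKernelName) {
    if (repeatCount < 1) {
        repeatCount = 1;  // same correction the text describes for repeatCount
    }
    std::vector<std::string> order;
    for (int i = 0; i < repeatCount; ++i) {
        // The listed kernels run one after another in the written order...
        order.insert(order.end(), kernels.begin(), kernels.end());
        // ...followed by the parameter-changing kernel, if one was named.
        if (!repeatKernelName.empty()) {
            order.push_back(repeatKernelName);
        }
    }
    return order;
}
```

For two listed kernels, a repeat count of 2 and a repeat kernel named reinitializeCounters, this gives six launches: both kernels, then reinitializeCounters, then the same three again.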
When the repeat count is on the order of thousands, this feature saves nearly 50% of the compute time for light workloads. Here is a comparison between a for-loop version and a version that uses the "repeat" feature: