Header-based, reusable SYCL algorithms and data structures.
A dynamically addressable bit array that is kept in GPU registers. The API is based on the one offered by std::bitset (C++20).
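A minimal usage sketch, assuming a std::bitset-like interface; the class name, namespace, and the runtime indices `i` and `j` below are illustrative assumptions, not the library's confirmed API:

```c++
sycl::ext::bitset<32> flags{};   // hypothetical name: 32 bits held in registers
flags.set(i % 32);               // set a bit through a runtime index
bool hit = flags.test(j % 32);   // read a bit through a runtime index
flags.flip();                    // std::bitset-style bulk operations
unsigned n = flags.count();      // population count
```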
Implements a modified version of the Single-pass Parallel Prefix Scan with Decoupled Look-back, with a variable strategy: scan-then-reduce or reduce-then-scan. The strategy is chosen based on the previous partition descriptors in order to hide latencies. It performs at speeds close to memcpy and needs only ~2n memory movements. Forward-progress guarantees are required.
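A hedged sketch of how such a scan might be invoked; the entry point, namespace, and signature below are assumptions for illustration, not the library's documented API:

```c++
#include <sycl/sycl.hpp>

sycl::queue q;
size_t n = 1 << 20;
int *in  = sycl::malloc_device<int>(n, q);
int *out = sycl::malloc_device<int>(n, q);
// Hypothetical entry point: single-pass inclusive scan with decoupled look-back.
sycl::ext::decoupled_scan<sycl::plus<int>>(q, in, out, n).wait();
```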
Implements a radix-N scan-then-propagate strategy using Kogge-Stone group scans and propagation fans. This implementation demonstrates the use of Cooperative Groups and is thus experimental. The computation is performed in a single kernel launch that uses the whole device. The current bottleneck is the implementation of the SYCL group algorithms.
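The work-group scan primitive this relies on is the standard SYCL 2020 group algorithm; a minimal sketch of that building block (`q`, `in`, `out`, and the range sizes are assumed setup):

```c++
q.parallel_for(sycl::nd_range<1>{global_size, local_size}, [=](sycl::nd_item<1> it) {
    int x = in[it.get_global_linear_id()];
    // Kogge-Stone-style inclusive scan across the work-group.
    int y = sycl::inclusive_scan_over_group(it.get_group(), x, sycl::plus<int>());
    out[it.get_global_linear_id()] = y;
});
```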
Parallel reduction algorithm using the SYCL reduction interface and a recursive approach to unroll the loops. It is memory-bound, so performance is equivalent to CUB's.
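For reference, the standard SYCL 2020 reduction interface it builds on looks like this (a generic sum for illustration, not this library's internal kernels):

```c++
#include <sycl/sycl.hpp>
#include <cassert>
#include <vector>

int main() {
    sycl::queue q;
    std::vector<int> data(1024, 1);
    int sum = 0;
    {
        sycl::buffer<int> in(data.data(), data.size());
        sycl::buffer<int> out(&sum, 1);
        q.submit([&](sycl::handler &cgh) {
            sycl::accessor acc(in, cgh, sycl::read_only);
            auto red = sycl::reduction(out, cgh, sycl::plus<int>());
            cgh.parallel_for(sycl::range<1>(data.size()), red,
                             [=](sycl::id<1> i, auto &s) { s += acc[i]; });
        });
    } // the buffer destructor writes the result back
    assert(sum == 1024);
}
```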
CUDA intrinsics missing in SYCL, such as bit reversal, funnel shift, and many more. See intrinsics.hpp for the list of implemented functions.
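As an illustration of the kind of function provided (this portable fallback is a sketch, not necessarily the header's implementation), a bit reversal equivalent to CUDA's __brev:

```c++
#include <cstdint>

// Reverse the bits of a 32-bit word: swap adjacent bits, then pairs,
// then nibbles, then the four bytes.
constexpr uint32_t brev(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    return (x << 24) | ((x & 0xFF00u) << 8) | ((x >> 8) & 0xFF00u) | (x >> 24);
}
```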
Functions and classes used to access array-based types with a dynamic/runtime index. This makes it possible to force the registerization of these arrays, which is otherwise not possible on GPU (and even on CPU!). It also allows the compiler to find new optimisations (see my SYCL Summer Session Talk). For best performance when addressing bytes, pack them into 32/64-bit words, use the library to address the words, and extract the bytes yourself, as sketched below.
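A sketch of that packing scheme (the word layout and helper name are illustrative):

```c++
#include <cstdint>

// Four bytes per 32-bit word; address words with the wrapper, extract by hand.
uint8_t byte_at(const uint32_t (&words)[4], unsigned i) {
    uint32_t word = runtime_index_wrapper(words, i / 4);  // registerized word read
    return static_cast<uint8_t>(word >> ((i % 4) * 8));   // little-endian byte pick
}
```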
When writing to the array, if all work-items access only the same index, a lighter/faster version is available by defining RUNTIME_IDX_STORE_USE_SWITCH, although registerization sometimes will not happen when using it. Read speed is not affected.
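The flag is a preprocessor define, set before including the header (the header name below is an assumption):

```c++
#define RUNTIME_IDX_STORE_USE_SWITCH  // switch-based stores for same-index writes
#include "runtime_index_wrapper.hpp"  // illustrative header name
```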
There is a read version, runtime_index_wrapper_log, that uses a binary search. It reduces the number of steps, but it is often slower because of thread divergence when not all work-items access the same value.
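Its call shape presumably mirrors the linear read (the exact signature is assumed):

```c++
int v = runtime_index_wrapper_log(array, idx); // binary-search read, O(log n) steps
```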
The wrapper has specialisations for std::array, sycl::vec, C-style arrays and sycl::id. It also accepts any type that has a subscript operator, but then the user must pass the maximum accessed index as the first template parameter.
With Google Benchmark on a GTX 1660 Ti:
Benchmark                          Time       CPU  Iterations  UserCounters...
registerized_array/1073741824   57.2 ms   57.1 ms          12  items_per_second=1040G/s
stack_array/1073741824          2216 ms   2214 ms           1  items_per_second=37.3725G/s
And on a CPU:
Benchmark                          Time       CPU  Iterations  UserCounters...
registerized_array/1073741824   3.98 ms   3.95 ms         178  items_per_second=27.1994T/s
stack_array/1073741824          7.02 ms   6.99 ms         100  items_per_second=15.3549T/s
int array[10] = {0};
runtime_index_wrapper(array, i % 10, j); // performs array[i % 10] = j and returns j
assert(j == runtime_index_wrapper(array, i % 10)); // reads the value back
std::array<size_t, 10> arr = {};
sycl::ext::runtime_wrapper acc(arr);
acc.write(i % 10, j); // performs arr[i % 10] = j and returns j
assert(j == acc[i % 10]); // reads the value back
For types that have a subscript operator, but for which the implementation cannot extract the maximum accessed index, the user must provide the size when reading/writing.
std::vector<int> vec = ...;
sycl::ext::runtime_wrapper acc(vec);
acc.write<vec_size>(i, val); // performs vec[i] = val and returns val
assert(val == acc.read<vec_size>(i)); // reads the value back
This class solves the previously mentioned issue of storing bytes. It improves the look-up time by packing the bytes into bigger words (32-bit words by default).
runtime_byte_array<4> array{'S', 'Y', 'C', 'L'}; // creates a 4-byte array stored in one 32-bit word, by default
runtime_byte_array<256, uint64_t> array2{}; // creates an array of 256 zero-initialized bytes stored in 32 64-bit words
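Reads and writes would follow the wrapper API shown above (the accessor shape here is an assumption, not shown in this README):

```c++
array2.write(i % 256, byte_value); // stores one byte into the packed words
uint8_t b = array2[i % 256];       // reads it back
```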
With an array of 12 bytes:
registerized_and_optimised_byte_array/1073741824   2266 ms   2264 ms   1   items_per_second=47.4268G/s
registerized_byte_array/1073741824                 2921 ms   2918 ms   1   items_per_second=36.8014G/s
stack_byte_array/1073741824                        3433 ms   3429 ms   1   items_per_second=31.3109G/s
For arrays of 32 bytes we are still faster than both the previous wrapper and a stack-frame array. For bigger arrays (>64 bytes), the regular stack-frame array is faster because our look-up time grows too large. For smaller arrays (<16 bytes), the byte selection from the words introduces overhead and the regular wrapper performs better. Benchmark your application.
Device-wide synchronisation functions. Kernel ranges are bounded to ensure forward progress; if the Nvidia GPU is using AMS, this might not be enough.
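A minimal sketch of how such a device-wide barrier can be built on SYCL 2020 atomics (an illustration, not the library's implementation); it assumes the launch was bounded so that every work-group is co-resident:

```c++
#include <sycl/sycl.hpp>

// Single-use grid barrier: each group's leader signals arrival, then spins
// until all groups have arrived. Requires all groups to be resident at once.
void device_barrier(sycl::nd_item<1> it, unsigned *counter, unsigned num_groups) {
    sycl::group_barrier(it.get_group()); // everyone in the group arrives first
    if (it.get_local_linear_id() == 0) {
        sycl::atomic_ref<unsigned, sycl::memory_order::relaxed,
                         sycl::memory_scope::device,
                         sycl::access::address_space::global_space> c(*counter);
        c.fetch_add(1u, sycl::memory_order::acq_rel);
        while (c.load(sycl::memory_order::acquire) < num_groups) { /* spin */ }
    }
    sycl::group_barrier(it.get_group()); // release the rest of the group
}
```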