This is a collection of key reduction kernel patterns collated from other benchmarks.
- oneDPL
- RAJA custom reducer for complex min
List of benchmark kernels, and their sources.
Reduction | Original benchmark |
---|---|
r += a[i] * b[i] | Dot product |
r += c[i] | Sum of complex numbers |
(r1,r2) += (c1[i],c2[i]) | Sum of complex numbers, stored as SoA |
min(abs(c[i])) | Minimum absolute value of complex numbers |
Various scalar sums | CloverLeaf Field Summary kernel |
count, min, max, mean, std | Pandas Series.Describe |
The benchmark is structured as a driver calling routines from a linked library. Each library implements the benchmark kernels.
Build using CMake:
cmake -Bbuild -H. -DMODEL=<model> # Valid: OpenMP, Kokkos, RAJA, OpenMP-target, SYCL, oneDPL
cmake --build build
No extra stages are required to build with OpenMP (for the CPU).
The MODEL
name is OpenMP-target
; you must also specify OMP_TARGET
to indicate which backend to use. Currently, only Intel
is supported.
We recommend installing oneAPI to use the Intel backend. The HPC Kit is needed for OpenMP. Remember to run /opt/intel/oneapi/setvars.sh
(or wherever you have installed it) before building / running.
Backend | Options |
---|---|
Intel | -DMODEL=OpenMP-target -DOMP_TARGET=Intel -DCMAKE_CXX_COMPILER=icpx |
This code builds Kokkos inline.
- Download the Kokkos source.
- Add
-DKOKKOS_SRC=/path/to/downloaded/kokkos
to the CMake configure stage. - Pass any Kokkos options to the CMake configure too:
-DKokkos_...
. For example:
Backend | Options |
---|---|
CUDA | -DKokkos_ENABLE_CUDA=On -DKokkos_ENABLE_CUDA_LAMBDA=On -DKokkos_ARCH_VOLTA72=On |
- Download the latest RAJA source: git clone --recursive https://github.com/LLNL/RAJA.git
- Add
-DRAJA_SRC=/path/to/downloaded/RAJA
to the CMake configure stage. - Add other options according to the table below:
Backend | Options |
---|---|
OpenMP | -DENABLE_OpenMP=On |
CUDA | -DENABLE_CUDA=On -DCMAKE_CUDA_ARCHITECTURES=XX -DCUDA_ARCH=sm_XX -DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda |
HIP | -DENABLE_HIP=On |
We recommend installing the oneAPI basekit. Remember to run /opt/intel/oneAPI/setvars.sh
(or wherever you have installed it) before building / running.
We also recommend using a recent 'nightly' build of the oneAPI DPC++ compiler, which can be found here (see the 'assets' arrow and download dpcpp_compiler.tar.gz
.) When you have untarred the package, you should source <path-to-nightly>/startup.sh
.
Toolchain | Options |
---|---|
oneAPI | -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_CXX_FLAGS='-fsycl -fsycl-unnamed-lambda' |
We recommend installing the oneAPI basekit. Remember to run /opt/intel/oneAPI/setvars.sh
(or wherever you have installed it) before building / running.
This code has only been tested with the 2021.3 release of oneAPI.
Toolchain | Options |
---|---|
oneAPI | -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_CXX_FLAGS='-fsycl -fsycl-unnamed-lambda' |
The repository is arranged as follows:
<root>
README.md # This README file
AUTHORS # A list of contributors
LICENSE # The software license governing the software in this repository
src/
main.cpp # Main driver code
CMakeLists.txt # CMake build system configuration
config.hpp.in # Input file for printing software version at runtime
*.hpp # Benchmark definition header files
omp/*.cpp # OpenMP implementations of the benchmarks
kokkos/*.cpp # Kokkos implementations of the benchmarks
raja/*.cpp # RAJA implementations of the benchmarks
Each benchmark follows a standardised interface:
- Constructor: Set up the programming model (if required).
setup()
: Allocate and initialise benchmark data.run()
: Run the benchmark once. This will return a result that can be verified.teardown()
: Deallocate benchmark data.- Deconstructor: Programming model finalisation (if required).
The main driver code calls (and times) these routines in the above order.
The benchmark definition defines a expect()
routine, which specifies the expected result of the benchmark, as returned byrun()
.
Each implementation in a particular programming model supplies a .cpp
file for each of the benchmarks.
These take the form of the class definition for each of the benchmark classes in the corresponding headers.
At build time, one implementation (programming model) is selected, and linked against the main application. We use a PImpl approach for storing the benchmark data. Each programming model uses very different types for data arrays (pointers, Views, Buffers, etc.). Using this approach we can avoid templating the main driver code, and supply different implementations for each benchmark at link time.
To be announced.