OpenCL based ACC-backend and SMM library (#406)
* Completed implementation with passing regtests. Included validation internal to OpenCL backend (disabled by default); useful for debugging failing tests, etc.

* Introduced USE_ACCEL replacing USE_CUDA, USE_HIP, and USE_OPENCL. Some more changes as suggested by code review.

* Introduced USE_ACCEL replacing USE_CUDA, USE_HIP, and USE_OPENCL. Attempt to update cmake.in (only guessing). Changes suggested by code review.

* Collected acc_opencl_synchronous_memops into the global acc_opencl_options structure; included the svm_interop variable in the structure. Configure svm_interop depending on the OpenCL standard level (only coarse-grained SVM is planned/needed). Adjusted acc_host_mem_allocate/acc_host_mem_deallocate and acc_dev_mem_allocate/acc_dev_mem_deallocate to incorporate SVM. Renamed some backend/helper functions. Tested and fixed acc_opencl_stristr.

* Respect compile-time setting (ACC_OPENCL_SVM).

* Removed support for ACC_OPENCL_STREAM_OOOEXEC as usage depends on in-order behavior.
Removed OpenCL private test (nothing to test left after code cleanup).
Introduced environment variables to control acc_opencl_options.
Fixed acc_opencl_stristr.

* Fixed calling clGetMemObjectInfo accidentally with the wrong object. Runtime-select implementations of atomic_add_global (the new form nearly doubles performance on Nvidia OpenCL).

* Attempt to fix linker errors with additional test case (HIP/ROCm).

* Fixed warnings about explicitly deprecated CUDA/HIP functions.

* Another attempt to fix cross-dependency in CUDA/HIP backend.

* One more attempt to fix cross-dependencies.

* Disabled dbcsr_acc_test for HIP (linker error due to cross-dependency).

* Revert "Fixed warnings about explicitly deprecated CUDA/HIP functions."

This reverts commit 7fc9407.

* Prettify.

* Improved creating resource/kernel file. Introduced CONSTANT and related runtime-check; adjusted kernel-buffer kinds accordingly (kernel code). Disabled SVM support (compile-time). Code cleanup.

* Renamed CONSTANT to GLOBAL and expanded GLOBAL to either "constant" or "global". Fixes for BSD/macOS (acc_opencl.sh).

* Removed superfluous barrier.

* Allow disabling the (pre-)transposing of B-matrices (to only run the SMM-kernel). Disabled comparison against EPSILON, thereby avoiding EXIT_FAILURE when the tolerance is missed (the error margin is printed).

* Prepared for tuned kernel (introduced parameters; WIP)

* Introduced OPENCL_LIBSMM_SMM_BLOCK_M/N. Code cleanup.
* Adjusted script generating resource file (kernel header).
* Introduced compile-time WG-size (transpose kernel).

* Implemented blocking SMMs into tiles. Introduced (mini-)batchsize (only BS=1 is implemented yet).

* BE: Attempt to iteratively limit the WG-size prior to building the kernel (based on device's maximum supported WG-size).
* BE: Adjusted definition of acc_opencl_wgsize; implemented device-specific path (in addition to kernel-specific path).
* BE: Adjusted acc_opencl_wgsize to take device as an argument (rather than querying the active device internally).
* LIBSMM: Introduced/implemented OPENCL_LIBSMM_SMM_BLOCK_M/N.
* LIBSMM: Reworked kernel with more compile-time knowledge.
* Keep macro definitions (acc_opencl.sh; kernel header).

* Implemented intra-kernel (mini-)batch accumulation (disabled by default; BS=1). Normalized initial matrix values in benchmark driver.

* Fixed SMM-kernel for (mini-)batches (1 < BS). Rely on 2d-arrays for clarity (cleanup).

* Adjusted and fixed work split. Print additional norm (debug). Fixed compiler warning.

* Removed barrier (mini-batch).

* Fixed array initializer.

* Reintroduced barrier.

* Removed dead code (as suggested).

* Initial auto-tuning script (based on OpenTuner; documentation and requirements.txt to follow).
Reduced benchmark runtime to accelerate auto-tuning.

* acc_bench_smm: Introduced compile-time (VALIDATE) and runtime option (CHECK environment variable) to allow omitting validation of results.
* acc_bench_smm: Reduced number of repetitions (normally the warm-up makes timing stable enough).
* acc_bench_smm: Sanitize command line arguments.

* Adjusted filename of finally written result.

* Prettified Python script.

* Fixed file header/banner.

* Improved performance of SMM-kernel; adjusted tune_multiply.py accordingly.
* tune_multiply.py: Return a non-competitive/bad result in case of an error/invalid experiment (auto-tuning).
* tune_multiply.py: Avoid UnboundLocalError in Python code (local variable 'match' referenced before assignment).
* tune_multiply.py: Improved handling errors and error messages.
* tune_multiply.py: Adjusted defaults/seed.

* Adjusted filename (max.gflops found), and added newline (final result file).

* Extended result file for easier reuse (JSON), and merged JSON files into a CSV file.

* Implemented console output/information about merged/ignored JSON files.
* Allow custom separator (CSV file).
* Code cleanup (tune_multiply.py).

* Implemented loading tuned parameters embedded into binary or from file.

* Ensure initialization/finalization is outside of parallel region (BE and LIBSMM).
* OPENCL_LIBSMM_PARAMS_DELIMS are used to tokenize parameters (CSV file).
* Print type-id in addition to name of element type (benchmark drivers).
* Regex to match console output; optional CSV file (tune_multiply.py).
* JSON/CSV: store type-id rather than typename (smaller).
* Introduced (optional) parameter file to Make and CMake.
* Improved help/usage message, and handling errors (acc_opencl.sh).
* Support CSV parameter file (acc_opencl.sh).
* License banner (acc_opencl.sh).

* Fixed issues pointed out by ShellCheck.

* Fixed/worked around initialize/finalize issue.

* Correct initialization/finalization flow (benchmark drivers); including a workaround for #422 (CUDA).

* Missed workaround for CUDA (#422).

* Added requirements (OpenTuner). Added wrapper script to tune multiple triplets in several sessions.

* Improved console output.

* Updated various documentation pieces (WIP).

* Allow empty/no choice with respect to USE_ACCEL.

* Attempt to CI-test OpenCL backend and LIBSMM.

* Adjusted CI/build setup: build LIBXSMM and help CMake to find OpenCL.

* Extend PKG_CONFIG_PATH rather than overriding it.

* Further adjusted build/run scripts (Daint-CI).

* One more attempt to get CI up and running.

* Disabled Daint-CI runtime tests (temporarily). Prepared revised transpose kernel.

* Replaced OPENCL_LIBSMM_TRANS_WGSIZE with OPENCL_LIBSMM_TRANS_BLOCK_M.
* Sanitize command line arguments similar to acc_bench_smm.
* Folded inplace-transpose into general transpose.cl.

* Improved finding OpenCL bits (e.g., on Daint).

* Fixed nasty typo. Adjusted default GPU to P100 (to better adhere to DBCSR default).

* Improved build messages/help.

* Adjusted installation instructions for clarity.

* Adjusted existing documentation to better distinguish the OpenCL backend from the OpenCL based LIBSMM. Added documentation for both the OpenCL backend and the OpenCL based LIBSMM.

* Documented auto-tuning.

* Improved console output (tune_multiply.sh).

* Note about the opentuner.db directory. Added some details and rephrased.

* Adjusted separator (tune_multiply.sh).

* Improved documentation with some sample output (auto-tuning).
hfp authored Feb 2, 2021
1 parent 21dae0f commit ba7f143
Showing 46 changed files with 3,870 additions and 158 deletions.
14 changes: 14 additions & 0 deletions .ci/daint.cscs.ch/Jenkinsfile
@@ -59,6 +59,20 @@ pipeline {
}
}
}
stage("OpenCL") {
stages {
stage('build') {
steps {
run_batch("0:15:00", "ocl", "build")
}
}
// stage('test') {
// steps {
// run_batch("1:00:00", "ocl", "test")
// }
// }
}
}
stage("Intel") {
stages {
stage('build') {
2 changes: 1 addition & 1 deletion .ci/daint.cscs.ch/cray.build.sh
@@ -30,7 +30,7 @@ cd "${SCRATCH}/${BUILD_TAG}.cray"

cmake \
-DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment \
-DUSE_CUDA=ON \
-DUSE_ACCEL=cuda \
-DWITH_GPU=P100 \
-DBLAS_FOUND=ON -DBLAS_LIBRARIES="-lsci_cray_mpi_mp" \
-DLAPACK_FOUND=ON -DLAPACK_LIBRARIES="-lsci_cray_mpi_mp" \
2 changes: 1 addition & 1 deletion .ci/daint.cscs.ch/gnu.build.sh
@@ -28,7 +28,7 @@ cd "${SCRATCH}/${BUILD_TAG}.gnu"
cmake \
-DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment \
-DCMAKE_CROSSCOMPILING_EMULATOR="" \
-DUSE_CUDA=ON \
-DUSE_ACCEL=cuda \
-DWITH_GPU=P100 \
-DBLAS_FOUND=ON -DBLAS_LIBRARIES="-lsci_gnu_mpi_mp" \
-DLAPACK_FOUND=ON -DLAPACK_LIBRARIES="-lsci_gnu_mpi_mp" \
2 changes: 1 addition & 1 deletion .ci/daint.cscs.ch/intel.build.sh
@@ -31,7 +31,7 @@ cd "${SCRATCH}/${BUILD_TAG}.intel"

cmake \
-DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment \
-DUSE_CUDA=ON \
-DUSE_ACCEL=cuda \
-DWITH_GPU=P100 \
-DBLAS_FOUND=ON -DBLAS_LIBRARIES="-lsci_intel_mpi_mp" \
-DLAPACK_FOUND=ON -DLAPACK_LIBRARIES="-lsci_intel_mpi_mp" \
55 changes: 55 additions & 0 deletions .ci/daint.cscs.ch/ocl.build.sh
@@ -0,0 +1,55 @@
#!/bin/bash -l

#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="mc"
#SBATCH --partition="cscsci"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT

set -o errexit
set -o nounset
set -o pipefail

module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu cudatoolkit CMake/3.14.5
module unload cray-libsci_acc
module list

# Checkout and build LIBXSMM
if [ ! -d "${HOME}/libxsmm" ]; then
cd "${HOME}"
git clone https://github.com/hfp/libxsmm.git
fi
cd "${HOME}/libxsmm"
git checkout 02d6ab213a35d5fc2f6454c3b465598b0c086c17
make -j
cd ..

set -o xtrace # do not set earlier to avoid noise from module

umask 0002 # make sure group members can access the data

mkdir -p "${SCRATCH}/${BUILD_TAG}.ocl"
chmod 0775 "${SCRATCH}/${BUILD_TAG}.ocl"
cd "${SCRATCH}/${BUILD_TAG}.ocl"

# help CMake to find the OpenCL implementation
export NVSDKCOMPUTE_ROOT=${CUDATOOLKIT_HOME}
export PKG_CONFIG_PATH=${HOME}/libxsmm/lib:${PKG_CONFIG_PATH}

cmake \
-DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment \
-DCMAKE_CROSSCOMPILING_EMULATOR="" \
-DUSE_ACCEL=opencl -DUSE_SMM=libxsmm \
-DOpenCL_LIBRARY="${CUDATOOLKIT_HOME}/lib64/libOpenCL.so" \
-DBLAS_FOUND=ON -DBLAS_LIBRARIES="-lsci_gnu_mpi_mp" \
-DLAPACK_FOUND=ON -DLAPACK_LIBRARIES="-lsci_gnu_mpi_mp" \
-DMPIEXEC_EXECUTABLE="$(command -v srun)" \
-DTEST_MPI_RANKS="${SLURM_NTASKS}" \
-DTEST_OMP_THREADS="${SLURM_CPUS_PER_TASK}" \
"${WORKSPACE}" |& tee -a "${STAGE_NAME}.out"

make VERBOSE=1 -j |& tee -a "${STAGE_NAME}.out"
36 changes: 36 additions & 0 deletions .ci/daint.cscs.ch/ocl.test.sh
@@ -0,0 +1,36 @@
#!/bin/bash -l

#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="gpu"
#SBATCH --partition="cscsci"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT

set -o errexit
set -o nounset
set -o pipefail

module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu cudatoolkit CMake/3.14.5
module unload cray-libsci_acc
module list

set -o xtrace # do not set earlier to avoid noise from module

umask 0002 # make sure group members can access the data

mkdir -p "${SCRATCH}/${BUILD_TAG}.ocl"
chmod 0775 "${SCRATCH}/${BUILD_TAG}.ocl"
cd "${SCRATCH}/${BUILD_TAG}.ocl"

export CRAY_CUDA_MPS=1 # enable the CUDA proxy for MPI+CUDA
export OMP_PROC_BIND=TRUE # set thread affinity
# OMP_NUM_THREADS is set by cmake

# document the current environment
env |& tee -a "${STAGE_NAME}.out"

env CTEST_OUTPUT_ON_FAILURE=1 make test ARGS="--timeout 900" |& tee -a "${STAGE_NAME}.out"
2 changes: 1 addition & 1 deletion .github/workflows/testing-linux.yml
@@ -101,7 +101,7 @@ jobs:
cmake -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DUSE_${{ matrix.use_openmp }} \
-DUSE_HIP=ON \
-DUSE_ACCEL=hip \
-DWITH_GPU=Mi50 \
..
- name: Build
75 changes: 45 additions & 30 deletions CMakeLists.txt
@@ -90,15 +90,10 @@ set(USE_SMM
"Small Matrix Multiplication implementation to use (default: blas)")
set_property(CACHE USE_SMM PROPERTY STRINGS blas libxsmm)

option(USE_CUDA "Build with CUDA support" OFF)
option(USE_HIP "Build with HIP support" OFF)
# USE_CUDA and USE_HIP are mutually exclusive options: we either compile with
# nvcc OR with hipcc
if (USE_CUDA AND USE_HIP)
message(
FATAL_ERROR
"USE_CUDA and USE_HIP options are mutually exclusive. Please choose one.")
endif ()
set(USE_ACCEL
""
CACHE STRING "Build with acceleration support (default: none)")
set_property(CACHE USE_ACCEL PROPERTY STRINGS "" opencl cuda hip)

set(SUPPORTED_CUDA_ARCHITECTURES K20X K40 K80 P100 V100)
set(SUPPORTED_HIP_ARCHITECTURES Mi50)
@@ -117,21 +112,27 @@ enable_language(Fortran)

if (WITH_C_API AND WITH_EXAMPLES)
enable_language(CXX)
enable_language(C)
endif ()

# we're always using at least C++11
# always use at least C++11
set(CMAKE_CXX_STANDARD 11)

# =================================================================================================
# PACKAGE DISCOVERY (compiler configuration can impact package discovery)
find_package(PkgConfig)

# =================================== OpenMP and OpenMP/offload backend
# =================================== OpenMP
if (USE_OPENMP)
find_package(OpenMP REQUIRED)
endif ()

# =================================== LIBXSMM (rely on pkg-config)
if ((USE_SMM MATCHES "libxsmm") OR (USE_ACCEL MATCHES "opencl"))
pkg_check_modules(LIBXSMM IMPORTED_TARGET GLOBAL libxsmmf)
endif ()

# =================================== BLAS & LAPACK, PkgConfig
find_package(PkgConfig)
find_package(LAPACK REQUIRED) # needed for some of the integrated test routines,
# also calls find_package(BLAS)

Expand All @@ -141,8 +142,7 @@ find_package(LAPACK REQUIRED) # needed for some of the integrated test routines,
# environment for a python interpreter before searching elsewhere in the system.
# In CMake <3.15, the system is searched before the virtual environment.
if (NOT Python_EXECUTABLE)
# If the python interpreter isn't specified as a command line option, look for
# it:
# If the python interpreter is not specified (command line), try finding it:
find_package(
Python
COMPONENTS Interpreter
@@ -185,15 +185,35 @@ endif ()
if (USE_SMM MATCHES "blas")
message("-- Using BLAS for Small Matrix Multiplication")
elseif (USE_SMM MATCHES "libxsmm")
# rely on pkg-config in order to link against libxsmm
pkg_check_modules(deps REQUIRED IMPORTED_TARGET GLOBAL libxsmmf)
message("-- Using libxsmm for Small Matrix Multiplication")
if (LIBXSMM_FOUND)
message("-- Using LIBXSMM for Small Matrix Multiplication")
else ()
message(
FATAL_ERROR
"LIBXSMM is not found but requested (USE_SMM). "
"Please install PkgConfig, build LIBXSMM, and "
"set PKG_CONFIG_PATH=/path/to/libxsmm/lib")
endif ()
else ()
message(FATAL_ERROR "Unknown SMM library specified")
endif ()

# =================================== GPU backend
if (USE_CUDA OR USE_HIP)
# =================================== GPU backends
if (USE_ACCEL MATCHES "opencl")
if (NOT LIBXSMM_FOUND)
message(
FATAL_ERROR
"LIBXSMM is not found but required for "
"LIBSMM based on the ACC/OpenCL backend. "
"Please install PkgConfig, LIBXSMM, and "
"set PKG_CONFIG_PATH=/path/to/libxsmm/lib")
endif ()

find_package(OpenCL REQUIRED)
enable_language(C)
endif ()

if (USE_ACCEL MATCHES "cuda|hip")
enable_language(CXX)
set(GPU_ARCH_NUMBER_K20X 35)
set(GPU_ARCH_NUMBER_K40 35)
@@ -203,8 +223,7 @@ if (USE_CUDA OR USE_HIP)
set(GPU_ARCH_NUMBER_Mi50 gfx906)
endif ()

if (USE_CUDA)

if (USE_ACCEL MATCHES "cuda")
enable_language(CUDA)
if (CMAKE_CUDA_COMPILER_VERSION LESS 5.5)
message(FATAL_ERROR "CUDA version >= 5.5 is required.")
@@ -214,9 +233,8 @@ if (USE_CUDA)
list(FIND SUPPORTED_CUDA_ARCHITECTURES ${WITH_GPU} GPU_SUPPORTED)
if (GPU_SUPPORTED EQUAL -1)
message(
FATAL_ERROR
"GPU architecture requested (${WITH_GPU}) is not supported. Please choose from: ${SUPPORTED_CUDA_ARCHITECTURES}"
)
FATAL_ERROR "GPU architecture requested (${WITH_GPU}) is not supported. "
"Please choose from: ${SUPPORTED_CUDA_ARCHITECTURES}")
endif ()

# assume that the backend compiler for nvcc understands the -std=c++11
@@ -243,7 +261,6 @@ if (USE_CUDA)
else ()
message(STATUS "Found cuBLAS: ${CUBLAS}")
endif ()

if (WITH_CUDA_PROFILING)
find_library(
CUDA_NVTOOLSEXT nvToolsExt
@@ -257,15 +274,13 @@ endif ()

# inspired from
# https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/2_Cookbook/12_cmake_hip_add_executable
if (USE_HIP)

if (USE_ACCEL MATCHES "hip")
# Make sure the GPU required is supported
list(FIND SUPPORTED_HIP_ARCHITECTURES ${WITH_GPU} GPU_SUPPORTED)
if (GPU_SUPPORTED EQUAL -1)
message(
FATAL_ERROR
"GPU architecture requested (${WITH_GPU}) is not supported. Please choose from: ${SUPPORTED_HIP_ARCHITECTURES}"
)
FATAL_ERROR "GPU architecture requested (${WITH_GPU}) is not supported. "
"Please choose from: ${SUPPORTED_HIP_ARCHITECTURES}")
endif ()

# Set path to HIP installation, include HIP cmake utilities
5 changes: 5 additions & 0 deletions cmake/CompilerConfiguration.cmake
@@ -88,3 +88,8 @@ Please open an issue at https://github.com/cp2k/dbcsr/issues with the reported c
message("-- CMAKE_CXX_COMPILER_ID: " ${CMAKE_CXX_COMPILER_ID})
message("-- CMAKE_CXX_COMPILER full path: " ${CMAKE_CXX_COMPILER})
endif ()

# inherit C flags from CXX
set(CMAKE_C_FLAGS_RELEASE ${CMAKE_CXX_FLAGS_RELEASE})
set(CMAKE_C_FLAGS_COVERAGE ${CMAKE_CXX_FLAGS_COVERAGE})
set(CMAKE_C_FLAGS_DEBUG ${CMAKE_CXX_FLAGS_DEBUG})
