OpenCL based ACC-backend and SMM library (#406)
* Completed implementation with passing regtests. Included validation internal to OpenCL backend (disabled by default); useful for debugging failing tests, etc.

* Introduced USE_ACCEL replacing USE_CUDA, USE_HIP, and USE_OPENCL. Some more changes as suggested by code review.

* Introduced USE_ACCEL replacing USE_CUDA, USE_HIP, and USE_OPENCL. Attempt to update cmake.in (only guessing). Changes suggested by code review.

* Collected acc_opencl_synchronous_memops into the global acc_opencl_options structure; included the svm_interop variable in the structure. Configure svm_interop depending on the OpenCL standard level (only coarse-grained SVM is planned/needed). Adjusted acc_host_mem_allocate/acc_host_mem_deallocate and acc_dev_mem_allocate/acc_dev_mem_deallocate to incorporate SVM. Renamed some backend/helper functions. Tested and fixed acc_opencl_stristr.

* Respect compile-time setting (ACC_OPENCL_SVM).

* Removed support for ACC_OPENCL_STREAM_OOOEXEC as usage depends on in-order behavior.
Removed OpenCL private test (nothing to test left after code cleanup).
Introduced environment variables to control acc_opencl_options.
Fixed acc_opencl_stristr.

* Fixed calling clGetMemObjectInfo accidentally with the wrong object. Runtime-select implementations of atomic_add_global (the new form nearly doubles performance on Nvidia OpenCL).

* Attempt to fix linker errors with additional test case (HIP/ROCm).

* Fixed warnings about explicitly deprecated CUDA/HIP functions.

* Another attempt to fix cross-dependency in CUDA/HIP backend.

* One more attempt to fix cross-dependencies.

* Disabled dbcsr_acc_test for HIP (linker error due to cross-dependency).

* Revert "Fixed warnings about explicitly deprecated CUDA/HIP functions."

This reverts commit 7fc9407.

* Prettify.

* Improved creating resource/kernel file. Introduced CONSTANT and related runtime-check; adjusted kernel-buffer kinds accordingly (kernel code). Disabled SVM support (compile-time). Code cleanup.

* Renamed CONSTANT to GLOBAL and expanded GLOBAL to either "constant" or "global". Fixes for BSD/macOS (acc_opencl.sh).

* Removed superfluous barrier.

* Allow disabling the (pre-)transposing of B-matrices (to only run the SMM-kernel). Disabled comparison against EPSILON, thereby avoiding EXIT_FAILURE when the tolerance is missed (the error margin is printed).

* Prepared for tuned kernel (introduced parameters; WIP)

* Introduced OPENCL_LIBSMM_SMM_BLOCK_M/N. Code cleanup.
* Adjusted script generating resource file (kernel header).
* Introduced compile-time WG-size (transpose kernel).

* Implemented blocking SMMs into tiles. Introduced (mini-)batchsize (only BS=1 is implemented yet).

* BE: Attempt to iteratively limit the WG-size prior to building the kernel (based on device's maximum supported WG-size).
* BE: Adjusted definition of acc_opencl_wgsize; implemented device-specific path (in addition to kernel-specific path).
* BE: Adjusted acc_opencl_wgsize to take device as an argument (rather than querying the active device internally).
* LIBSMM: Introduced/implemented OPENCL_LIBSMM_SMM_BLOCK_M/N.
* LIBSMM: Reworked kernel with more compile-time knowledge.
* Keep macro definitions (acc_opencl.sh; kernel header).

* Implemented intra-kernel (mini-)batch accumulation (disabled by default; BS=1). Normalized initial matrix values in benchmark driver.

* Fixed SMM-kernel for (mini-)batches (1 < BS). Rely on 2d-arrays for clarity (cleanup).

* Adjusted and fixed work split. Print additional norm (debug). Fixed compiler warning.

* Removed barrier (mini-batch).

* Fixed array initializer.

* Reintroduced barrier.

* Removed dead code (as suggested).

* Initial auto-tuning script (based on OpenTuner; documentation and requirements.txt to follow).
Reduced benchmark runtime to accelerate auto-tuning.

* acc_bench_smm: Introduced compile-time (VALIDATE) and runtime option (CHECK environment variable) to allow omitting validation of results.
* acc_bench_smm: Reduced number of repetitions (normally the warm-up makes timing stable enough).
* acc_bench_smm: Sanitize command line arguments.

* Adjusted filename of finally written result.

* Prettified Python script.

* Fixed file header/banner.

* Improved performance of SMM-kernel; adjusted tune_multiply.py accordingly.
* tune_multiply.py: Return a non-competitive/bad result in case of an error/invalid experiment (auto-tuning).
* tune_multiply.py: Avoid UnboundLocalError in Python code (local variable 'match' referenced before assignment).
* tune_multiply.py: Improved handling errors and error messages.
* tune_multiply.py: Adjusted defaults/seed.

* Adjusted filename (max.gflops found), and added newline (final result file).

* Extended result file for easier reuse (JSON), and merged JSON files into a CSV file.

* Implemented console output/information about merged/ignored JSON files.
* Allow custom separator (CSV file).
* Code cleanup (tune_multiply.py).

* Implemented loading tuned parameters embedded into binary or from file.

* Ensure initialization/finalization is outside of parallel region (BE and LIBSMM).
* OPENCL_LIBSMM_PARAMS_DELIMS are used to tokenize parameters (CSV file).
* Print type-id in addition to name of element type (benchmark drivers).
* Regex to match console output; optional CSV file (tune_multiply.py).
* JSON/CSV: store type-id rather than typename (smaller).
* Introduced (optional) parameter file to Make and CMake.
* Improved help/usage message, and handling errors (acc_opencl.sh).
* Support CSV parameter file (acc_opencl.sh).
* License banner (acc_opencl.sh).

* Fixed issues pointed out by ShellCheck.

* Fixed/worked around initialize/finalize issue.

* Correct initialization/finalization flow (benchmark drivers); including a workaround for #422 (CUDA).

* Missed workaround for CUDA (#422).

* Added requirements (OpenTuner). Added wrapper script to tune multiple triplets in several sessions.

* Improved console output.

* Updated various documentation pieces (WIP).

* Allow empty/no choice with respect to USE_ACCEL.

* Attempt to CI-test OpenCL backend and LIBSMM.

* Adjusted CI/build setup: build LIBXSMM and help CMake to find OpenCL.

* Extend PKG_CONFIG_PATH rather than overriding it.

* Further adjusted build/run scripts (Daint-CI).

* One more attempt to get CI up and running.

* Disabled Daint-CI runtime tests (temporarily). Prepared revised transpose kernel.

* Replaced OPENCL_LIBSMM_TRANS_WGSIZE with OPENCL_LIBSMM_TRANS_BLOCK_M.
* Sanitize command line arguments similar to acc_bench_smm.
* Folded inplace-transpose into general transpose.cl.

* Improved finding OpenCL bits (e.g., on Daint).

* Fixed nasty typo. Adjusted default GPU to P100 (to better adhere to DBCSR default).

* Improved build messages/help.

* Adjusted installation instructions for clarity.

* Adjusted existing documentation to better distinguish the OpenCL backend from the OpenCL based LIBSMM. Added documentation for both the OpenCL backend and the OpenCL based LIBSMM.

* Documented auto-tuning.

* Improved console output (tune_multiply.sh).

* Note about the opentuner.db directory. Added some details and rephrased.

* Adjusted separator (tune_multiply.sh).

* Improved documentation with some sample output (auto-tuning).
hfp authored Feb 2, 2021
1 parent 21dae0f commit ba7f143
Showing 46 changed files with 3,870 additions and 158 deletions.
14 changes: 14 additions & 0 deletions .ci/daint.cscs.ch/Jenkinsfile
@@ -59,6 +59,20 @@ pipeline {
}
}
}
stage("OpenCL") {
stages {
stage('build') {
steps {
run_batch("0:15:00", "ocl", "build")
}
}
// stage('test') {
// steps {
// run_batch("1:00:00", "ocl", "test")
// }
// }
}
}
stage("Intel") {
stages {
stage('build') {
2 changes: 1 addition & 1 deletion .ci/daint.cscs.ch/cray.build.sh
@@ -30,7 +30,7 @@ cd "${SCRATCH}/${BUILD_TAG}.cray"

cmake \
-DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment \
-DUSE_CUDA=ON \
-DUSE_ACCEL=cuda \
-DWITH_GPU=P100 \
-DBLAS_FOUND=ON -DBLAS_LIBRARIES="-lsci_cray_mpi_mp" \
-DLAPACK_FOUND=ON -DLAPACK_LIBRARIES="-lsci_cray_mpi_mp" \
2 changes: 1 addition & 1 deletion .ci/daint.cscs.ch/gnu.build.sh
@@ -28,7 +28,7 @@ cd "${SCRATCH}/${BUILD_TAG}.gnu"
cmake \
-DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment \
-DCMAKE_CROSSCOMPILING_EMULATOR="" \
-DUSE_CUDA=ON \
-DUSE_ACCEL=cuda \
-DWITH_GPU=P100 \
-DBLAS_FOUND=ON -DBLAS_LIBRARIES="-lsci_gnu_mpi_mp" \
-DLAPACK_FOUND=ON -DLAPACK_LIBRARIES="-lsci_gnu_mpi_mp" \
2 changes: 1 addition & 1 deletion .ci/daint.cscs.ch/intel.build.sh
@@ -31,7 +31,7 @@ cd "${SCRATCH}/${BUILD_TAG}.intel"

cmake \
-DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment \
-DUSE_CUDA=ON \
-DUSE_ACCEL=cuda \
-DWITH_GPU=P100 \
-DBLAS_FOUND=ON -DBLAS_LIBRARIES="-lsci_intel_mpi_mp" \
-DLAPACK_FOUND=ON -DLAPACK_LIBRARIES="-lsci_intel_mpi_mp" \
55 changes: 55 additions & 0 deletions .ci/daint.cscs.ch/ocl.build.sh
@@ -0,0 +1,55 @@
#!/bin/bash -l

#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="mc"
#SBATCH --partition="cscsci"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT

set -o errexit
set -o nounset
set -o pipefail

module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu cudatoolkit CMake/3.14.5
module unload cray-libsci_acc
module list

# Checkout and build LIBXSMM
if [ ! -d "${HOME}/libxsmm" ]; then
cd "${HOME}"
git clone https://github.com/hfp/libxsmm.git
fi
cd "${HOME}/libxsmm"
git checkout 02d6ab213a35d5fc2f6454c3b465598b0c086c17
make -j
cd ..

set -o xtrace # do not set earlier to avoid noise from module

umask 0002 # make sure group members can access the data

mkdir -p "${SCRATCH}/${BUILD_TAG}.ocl"
chmod 0775 "${SCRATCH}/${BUILD_TAG}.ocl"
cd "${SCRATCH}/${BUILD_TAG}.ocl"

# help CMake to find the OpenCL implementation
export NVSDKCOMPUTE_ROOT=${CUDATOOLKIT_HOME}
export PKG_CONFIG_PATH=${HOME}/libxsmm/lib:${PKG_CONFIG_PATH}

cmake \
-DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment \
-DCMAKE_CROSSCOMPILING_EMULATOR="" \
-DUSE_ACCEL=opencl -DUSE_SMM=libxsmm \
-DOpenCL_LIBRARY="${CUDATOOLKIT_HOME}/lib64/libOpenCL.so" \
-DBLAS_FOUND=ON -DBLAS_LIBRARIES="-lsci_gnu_mpi_mp" \
-DLAPACK_FOUND=ON -DLAPACK_LIBRARIES="-lsci_gnu_mpi_mp" \
-DMPIEXEC_EXECUTABLE="$(command -v srun)" \
-DTEST_MPI_RANKS="${SLURM_NTASKS}" \
-DTEST_OMP_THREADS="${SLURM_CPUS_PER_TASK}" \
"${WORKSPACE}" |& tee -a "${STAGE_NAME}.out"

make VERBOSE=1 -j |& tee -a "${STAGE_NAME}.out"
36 changes: 36 additions & 0 deletions .ci/daint.cscs.ch/ocl.test.sh
@@ -0,0 +1,36 @@
#!/bin/bash -l

#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="gpu"
#SBATCH --partition="cscsci"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT

set -o errexit
set -o nounset
set -o pipefail

module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu cudatoolkit CMake/3.14.5
module unload cray-libsci_acc
module list

set -o xtrace # do not set earlier to avoid noise from module

umask 0002 # make sure group members can access the data

mkdir -p "${SCRATCH}/${BUILD_TAG}.ocl"
chmod 0775 "${SCRATCH}/${BUILD_TAG}.ocl"
cd "${SCRATCH}/${BUILD_TAG}.ocl"

export CRAY_CUDA_MPS=1 # enable the CUDA proxy for MPI+CUDA
export OMP_PROC_BIND=TRUE # set thread affinity
# OMP_NUM_THREADS is set by cmake

# document the current environment
env |& tee -a "${STAGE_NAME}.out"

env CTEST_OUTPUT_ON_FAILURE=1 make test ARGS="--timeout 900" |& tee -a "${STAGE_NAME}.out"
2 changes: 1 addition & 1 deletion .github/workflows/testing-linux.yml
@@ -101,7 +101,7 @@ jobs:
cmake -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DUSE_${{ matrix.use_openmp }} \
-DUSE_HIP=ON \
-DUSE_ACCEL=hip \
-DWITH_GPU=Mi50 \
..
- name: Build
75 changes: 45 additions & 30 deletions CMakeLists.txt
@@ -90,15 +90,10 @@ set(USE_SMM
"Small Matrix Multiplication implementation to use (default: blas)")
set_property(CACHE USE_SMM PROPERTY STRINGS blas libxsmm)

option(USE_CUDA "Build with CUDA support" OFF)
option(USE_HIP "Build with HIP support" OFF)
# USE_CUDA and USE_HIP are mutually exclusive options: we either compile with
# nvcc OR with hipcc
if (USE_CUDA AND USE_HIP)
message(
FATAL_ERROR
"USE_CUDA and USE_HIP options are mutually exclusive. Please choose one.")
endif ()
set(USE_ACCEL
""
CACHE STRING "Build with acceleration support (default: none)")
set_property(CACHE USE_ACCEL PROPERTY STRINGS "" opencl cuda hip)

set(SUPPORTED_CUDA_ARCHITECTURES K20X K40 K80 P100 V100)
set(SUPPORTED_HIP_ARCHITECTURES Mi50)
@@ -117,21 +112,27 @@ enable_language(Fortran)

if (WITH_C_API AND WITH_EXAMPLES)
enable_language(CXX)
enable_language(C)
endif ()

# we're always using at least C++11
# always use at least C++11
set(CMAKE_CXX_STANDARD 11)

# =================================================================================================
# PACKAGE DISCOVERY (compiler configuration can impact package discovery)
find_package(PkgConfig)

# =================================== OpenMP and OpenMP/offload backend
# =================================== OpenMP
if (USE_OPENMP)
find_package(OpenMP REQUIRED)
endif ()

# =================================== LIBXSMM (rely on pkg-config)
if ((USE_SMM MATCHES "libxsmm") OR (USE_ACCEL MATCHES "opencl"))
pkg_check_modules(LIBXSMM IMPORTED_TARGET GLOBAL libxsmmf)
endif ()

# =================================== BLAS & LAPACK, PkgConfig
find_package(PkgConfig)
find_package(LAPACK REQUIRED) # needed for some of the integrated test routines,
# also calls find_package(BLAS)

Expand All @@ -141,8 +142,7 @@ find_package(LAPACK REQUIRED) # needed for some of the integrated test routines,
# environment for a python interpreter before searching elsewhere in the system.
# In CMake <3.15, the system is searched before the virtual environment.
if (NOT Python_EXECUTABLE)
# If the python interpreter isn't specified as a command line option, look for
# it:
# If the python interpreter is not specified (command line), try finding it:
find_package(
Python
COMPONENTS Interpreter
@@ -185,15 +185,35 @@ endif ()
if (USE_SMM MATCHES "blas")
message("-- Using BLAS for Small Matrix Multiplication")
elseif (USE_SMM MATCHES "libxsmm")
# rely on pkg-config in order to link against libxsmm
pkg_check_modules(deps REQUIRED IMPORTED_TARGET GLOBAL libxsmmf)
message("-- Using libxsmm for Small Matrix Multiplication")
if (LIBXSMM_FOUND)
message("-- Using LIBXSMM for Small Matrix Multiplication")
else ()
message(
FATAL_ERROR
"LIBXSMM is not found but requested (USE_SMM). "
"Please install PkgConfig, build LIBXSMM, and "
"set PKG_CONFIG_PATH=/path/to/libxsmm/lib")
endif ()
else ()
message(FATAL_ERROR "Unknown SMM library specified")
endif ()

# =================================== GPU backend
if (USE_CUDA OR USE_HIP)
# =================================== GPU backends
if (USE_ACCEL MATCHES "opencl")
if (NOT LIBXSMM_FOUND)
message(
FATAL_ERROR
"LIBXSMM is not found but required for "
"LIBSMM based on the ACC/OpenCL backend. "
"Please install PkgConfig, LIBXSMM, and "
"set PKG_CONFIG_PATH=/path/to/libxsmm/lib")
endif ()

find_package(OpenCL REQUIRED)
enable_language(C)
endif ()

if (USE_ACCEL MATCHES "cuda|hip")
enable_language(CXX)
set(GPU_ARCH_NUMBER_K20X 35)
set(GPU_ARCH_NUMBER_K40 35)
@@ -203,8 +223,7 @@ if (USE_CUDA OR USE_HIP)
set(GPU_ARCH_NUMBER_Mi50 gfx906)
endif ()

if (USE_CUDA)

if (USE_ACCEL MATCHES "cuda")
enable_language(CUDA)
if (CMAKE_CUDA_COMPILER_VERSION LESS 5.5)
message(FATAL_ERROR "CUDA version >= 5.5 is required.")
@@ -214,9 +233,8 @@ if (USE_CUDA)
list(FIND SUPPORTED_CUDA_ARCHITECTURES ${WITH_GPU} GPU_SUPPORTED)
if (GPU_SUPPORTED EQUAL -1)
message(
FATAL_ERROR
"GPU architecture requested (${WITH_GPU}) is not supported. Please choose from: ${SUPPORTED_CUDA_ARCHITECTURES}"
)
FATAL_ERROR "GPU architecture requested (${WITH_GPU}) is not supported. "
"Please choose from: ${SUPPORTED_CUDA_ARCHITECTURES}")
endif ()

# assume that the backend compiler for nvcc understands the -std=c++11
@@ -243,7 +261,6 @@ if (USE_CUDA)
else ()
message(STATUS "Found cuBLAS: ${CUBLAS}")
endif ()

if (WITH_CUDA_PROFILING)
find_library(
CUDA_NVTOOLSEXT nvToolsExt
@@ -257,15 +274,13 @@ endif ()

# inspired from
# https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/2_Cookbook/12_cmake_hip_add_executable
if (USE_HIP)

if (USE_ACCEL MATCHES "hip")
# Make sure the GPU required is supported
list(FIND SUPPORTED_HIP_ARCHITECTURES ${WITH_GPU} GPU_SUPPORTED)
if (GPU_SUPPORTED EQUAL -1)
message(
FATAL_ERROR
"GPU architecture requested (${WITH_GPU}) is not supported. Please choose from: ${SUPPORTED_HIP_ARCHITECTURES}"
)
FATAL_ERROR "GPU architecture requested (${WITH_GPU}) is not supported. "
"Please choose from: ${SUPPORTED_HIP_ARCHITECTURES}")
endif ()

# Set path to HIP installation, include HIP cmake utilities
5 changes: 5 additions & 0 deletions cmake/CompilerConfiguration.cmake
@@ -88,3 +88,8 @@ Please open an issue at https://github.com/cp2k/dbcsr/issues with the reported c
message("-- CMAKE_CXX_COMPILER_ID: " ${CMAKE_CXX_COMPILER_ID})
message("-- CMAKE_CXX_COMPILER full path: " ${CMAKE_CXX_COMPILER})
endif ()

# inherit C flags from CXX
set(CMAKE_C_FLAGS_RELEASE ${CMAKE_CXX_FLAGS_RELEASE})
set(CMAKE_C_FLAGS_COVERAGE ${CMAKE_CXX_FLAGS_COVERAGE})
set(CMAKE_C_FLAGS_DEBUG ${CMAKE_CXX_FLAGS_DEBUG})
