OpenCL based ACC-backend and SMM library (#406)
* Completed implementation with passing regtests. Included validation internal to OpenCL backend (disabled by default); useful for debugging failing tests, etc.
* Introduced USE_ACCEL replacing USE_CUDA, USE_HIP, and USE_OPENCL. Some more changes as suggested by code review.
* Introduced USE_ACCEL replacing USE_CUDA, USE_HIP, and USE_OPENCL. Attempt to update cmake.in (only guessing). Changes suggested by code review.
* Collected acc_opencl_synchronous_memops into global acc_opencl_options structure; included svm_interop variable into structure. Configure svm_interop depending on OpenCL standard level (only coarse grained SVM is planned/needed). Adjusted acc_host_mem_allocate/acc_host_mem_deallocate and acc_dev_mem_allocate/acc_dev_mem_deallocate to incorporate SVM. Renamed some backend/helper functions. Tested and fixed acc_opencl_stristr.
* Respect compile-time setting (ACC_OPENCL_SVM).
* Removed support for ACC_OPENCL_STREAM_OOOEXEC as usage depends on in-order behavior. Removed OpenCL private test (nothing left to test after code cleanup). Introduced environment variables to control acc_opencl_options. Fixed acc_opencl_stristr.
* Fixed accidentally calling clGetMemObjectInfo with the wrong object. Runtime-select implementations of atomic_add_global (new form almost doubles perf. on Nvidia OpenCL).
* Attempt to fix linker errors with additional test case (HIP/ROCm).
* Fixed warnings about explicitly deprecated CUDA/HIP functions.
* Another attempt to fix cross-dependency in CUDA/HIP backend.
* One more attempt to fix cross-dependencies.
* Disabled dbcsr_acc_test for HIP (linker error due to cross-dependency).
* Revert "Fixed warnings about explicitly deprecated CUDA/HIP functions." This reverts commit 7fc9407.
* Prettify.
* Improved creating resource/kernel file. Introduced CONSTANT and related runtime-check; adjusted kernel-buffer kinds accordingly (kernel code). Disabled SVM support (compile-time). Code cleanup.
* Renamed CONSTANT to GLOBAL and expand GLOBAL to either "constant" or "global". Fixes for BSD/macOS (acc_opencl.sh).
* Removed superfluous barrier.
* Allow disabling (pre-)transposing B-matrices (to only run the SMM-kernel). Disabled comparison against EPSILON and thereby avoid EXIT_FAILURE when missing the tolerance (error margin is printed).
* Prepared for tuned kernel (introduced parameters; WIP).
* Introduced OPENCL_LIBSMM_SMM_BLOCK_M/N. Code cleanup.
* Adjusted script generating resource file (kernel header).
* Introduced compile-time WG-size (transpose kernel).
* Implemented blocking SMMs into tiles. Introduced (mini-)batchsize (only BS=1 is implemented so far).
* BE: Attempt to iteratively limit the WG-size prior to building the kernel (based on device's maximum supported WG-size).
* BE: Adjusted definition of acc_opencl_wgsize; implemented device-specific path (in addition to kernel-specific path).
* BE: Adjusted acc_opencl_wgsize to take device as an argument (rather than querying the active device internally).
* LIBSMM: Introduced/implemented OPENCL_LIBSMM_SMM_BLOCK_M/N.
* LIBSMM: Reworked kernel with more compile-time knowledge.
* Keep macro definitions (acc_opencl.sh; kernel header).
* Implemented intra-kernel (mini-)batch accumulation (disabled by default; BS=1). Normalized initial matrix values in benchmark driver.
* Fixed SMM-kernel for (mini-)batches (1 < BS). Rely on 2d-arrays for clarity (cleanup).
* Adjusted and fixed work split. Print additional norm (debug). Fixed compiler warning.
* Removed barrier (mini-batch).
* Fixed array initializer.
* Reintroduced barrier.
* Removed dead code (as suggested).
* Initial auto-tuning script (based on OpenTuner; documentation and requirements.txt to follow). Reduced benchmark runtime to accelerate auto-tuning.
* acc_bench_smm: Introduced compile-time (VALIDATE) and runtime option (CHECK environment variable) to allow omitting validation of results.
* acc_bench_smm: Reduced number of repetitions (normally the warm-up makes timing stable enough).
* acc_bench_smm: Sanitize command line arguments.
* Adjusted filename of finally written result.
* Prettified Python script.
* Fixed file header/banner.
* Improved performance of SMM-kernel; adjusted tune_multiply.py accordingly.
* tune_multiply.py: Return a non-competitive/bad result in case of an error/invalid experiment (auto-tuning).
* tune_multiply.py: Avoid UnboundLocalError in Python code (local variable 'match' referenced before assignment).
* tune_multiply.py: Improved error handling and error messages.
* tune_multiply.py: Adjusted defaults/seed.
* Adjusted filename (max. GFLOPS found), and added newline (final result file).
* Extend result/file for easier reuse (JSON), and merge JSONs into CSV file.
* Implemented console output/information about merged/ignored JSON files.
* Allow custom separator (CSV file).
* Code cleanup (tune_multiply.py).
* Implemented loading tuned parameters embedded into binary or from file.
* Ensure initialization/finalization is outside of parallel region (BE and LIBSMM).
* OPENCL_LIBSMM_PARAMS_DELIMS are used to tokenize parameters (CSV file).
* Print type-id in addition to name of element type (benchmark drivers).
* Regex to match console output; optional CSV file (tune_multiply.py).
* JSON/CSV: store type-id rather than typename (smaller).
* Introduced (optional) parameter file to Make and CMake.
* Improved help/usage message and error handling (acc_opencl.sh).
* Support CSV parameter file (acc_opencl.sh).
* License banner (acc_opencl.sh).
* Fixed issues pointed out by Shellcheck.
* Fixed/worked around initialize/finalize issue.
* Correct initialization/finalization flow (benchmark drivers); including a workaround for #422 (CUDA).
* Missed workaround for CUDA (#422).
* Added requirements (OpenTuner). Added wrapper script to tune multiple triplets in several sessions.
* Improved console output.
* Updated various documentation pieces (WIP).
* Allow empty/no choice with respect to USE_ACCEL.
* Attempt to CI-test OpenCL backend and LIBSMM.
* Adjusted CI/build setup: build LIBXSMM and help CMake to find OpenCL.
* Extend PKG_CONFIG_PATH rather than overriding it.
* Further adjusted build/run scripts (Daint-CI).
* One more attempt to get CI up and running.
* Disabled Daint-CI runtime tests (temporarily). Prepared revised transpose kernel.
* Replaced OPENCL_LIBSMM_TRANS_WGSIZE with OPENCL_LIBSMM_TRANS_BLOCK_M.
* Sanitize command line arguments similar to acc_bench_smm.
* Folded inplace-transpose into general transpose.cl.
* Improved finding OpenCL bits (e.g., on Daint).
* Fixed nasty typo. Adjusted default GPU to P100 (to better adhere to DBCSR default).
* Improved build messages/help.
* Adjusted installation instructions for clarity.
* Adjusted existing documentation to better accommodate/distinguish the OpenCL backend as well as the OpenCL based LIBSMM. Added documentation for both the OpenCL backend and the OpenCL based LIBSMM.
* Documented auto-tuning.
* Improved console output (tune_multiply.sh).
* Note about opentuner.db directory. Some additional details and rephrasing.
* Adjusted separator (tune_multiply.sh).
* Improved documentation with some sample output (auto-tuning).
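
For orientation, the JSON-to-CSV merge mentioned in the list above (merging per-experiment JSON results into one CSV file, reporting merged/ignored files, custom separator) might look roughly like the following. This is only a minimal sketch and not the code from tune_multiply.py: the function name `merge_json_to_csv`, the field names (`TYPEID`, `M`, `N`, `K`, `GFLOPS`), the file names, and the sample values are all made up for illustration, and the inputs are passed as in-memory strings rather than read from disk.

```python
import csv
import io
import json


def merge_json_to_csv(json_texts, delimiter=";"):
    """Merge JSON result records into one CSV string (hypothetical helper).

    json_texts maps an illustrative file name to its raw JSON text.
    Returns (csv_text, merged_names, ignored_names).
    """
    records, merged, ignored = [], [], []
    for name, text in sorted(json_texts.items()):
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            ignored.append(name)  # unreadable input: skip it, but report it
            continue
        if not isinstance(record, dict):
            ignored.append(name)  # only flat key/value records are merged
            continue
        records.append(record)
        merged.append(name)
    # The sorted union of all keys forms the CSV header.
    fields = sorted({key for record in records for key in record})
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return out.getvalue(), merged, ignored


# Made-up sample inputs; names and numbers are illustrative only.
results = {
    "a.json": '{"TYPEID": 3, "M": 23, "N": 23, "K": 23, "GFLOPS": 100.5}',
    "b.json": '{"TYPEID": 3, "M": 32, "N": 32, "K": 32, "GFLOPS": 250.0}',
    "broken.json": "{not valid json",
}
csv_text, merged, ignored = merge_json_to_csv(results)
print(csv_text, end="")
print("merged:", merged, "ignored:", ignored)
```

Keeping the separator a parameter mirrors the "custom separator" item; returning the merged and ignored names separately lets the caller decide how to print them.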