
v2024.07.0

@rhornung67 released this 12 Aug 17:14
6e81aa5

This release contains new features, bug fixes, and build improvements.

Please download the RAJAPerf-v2024.07.0.tar.gz file below. The auto-generated source archives will not work because they do not include RAJAPerf's git submodules.

  • New features and usage changes:

    • Added MATVEC_3D_STENCIL kernel to Apps group
    • Added MULTI_REDUCE kernel to Basic group. Multi-reduce is a new capability in RAJA.
    • Added HISTOGRAM kernel to Algorithm group. This kernel tests the RAJA multi-reduce capability and has algorithm options involving atomic operations, such as the ability to assess various degrees of atomic contention.
    • Added many SYCL kernel variants (note -- all RAJA SYCL variant kernels with reductions use the new RAJA reduction interface):
      • Basic group: ARRAY_OF_PTRS, COPY8, DAXPY, IF_QUAD, INIT3, INIT_VIEW1D, INIT_VIEW1D_OFFSET, MAT_MAT_SHARED, MULADDSUB, NESTED_INIT, REDUCE3_INT, TRAP_INT
      • Lcals group: DIFF_PREDICT, EOS, FIRST_DIFF, FIRST_MIN, FIRST_SUM, GEN_LIN_RECUR, HYDRO_1D, HYDRO_2D, INT_PREDICT, PLANCKIAN, TRIDIAG_ELIM
      • Polybench group: POLYBENCH_2MM, POLYBENCH_3MM, POLYBENCH_ADI, POLYBENCH_ATAX, POLYBENCH_FDTD_2D, POLYBENCH_FLOYD_WARSHALL, POLYBENCH_GEMM, POLYBENCH_GEMVER, POLYBENCH_GESUMMV, POLYBENCH_HEAT_3D, POLYBENCH_JACOBI_1D, POLYBENCH_JACOBI_2D, POLYBENCH_MVT
      • Stream group: ADD, COPY, DOT, MUL, TRIAD
      • Apps group: CONVECTION3DPA, DEL_DOT_VEC_2D, DIFFUSION3DPA, EDGE3D, ENERGY, FIR, LTIMES, LTIMES_NOVIEW, MASS3DEA, MASS3DPA, MATVEC_3D_STENCIL, PRESSURE, VOL3D, ZONAL_ACCUMULATION_3D
      • Algorithm group: REDUCE_SUM
    • Added a new kernel group, Comm, which now contains all HALO* kernels.
    • Added an occupancy calculator grid stride (occgs_<block_size>) tuning for CUDA and HIP variants of kernels with reductions. This generally improves performance at problem sizes larger than the number of threads a device can run concurrently at maximum occupancy, because the work needed to finalize a reduction is proportional to the number of blocks. The tuning launches fewer total threads than there are iterates and uses a grid stride loop to assign multiple iterates to each thread; the maximum number of blocks is chosen with the occupancy calculator to maximize occupancy. A sketch of this pattern appears after this list.
    • Added reduction "tunings" for RAJA variants of kernels with reductions to compare the performance of RAJA's default reduction interface with that of RAJA's new (experimental) reduction interface. A sketch of the two interfaces appears after this list.
    • Changed HIP variants of kernels with reductions and "fused" kernels to use pinned memory. This improves performance because pinned memory can be cached on a HIP device.
    • Added more CUDA memory space options for kernel data to compare performance, specifically CudaManagedHostPreferred, CudaManagedDevicePreferred, CudaManagedHostPreferredDeviceAccessed, and CudaManagedDevicePreferredHostAccessed (see the output generated by the --help option for more information; a sketch of the corresponding memory advice appears after this list).
    • Made the performance comparison of kernels with reductions fairer by adding a RAJA GPU block atomic (blkatm) tuning that more closely matches the base GPU kernel variant implementations. Note that there is currently false sharing/extra atomic contention when a kernel containing multiple reductions is run with the blkatm tuning; this is not yet addressed.
    • Applied new RAJA "dispatch" policies in the Comm_HALOEXCHANGE_FUSED kernel.
    • Added real multi-rank MPI implementations of the Comm_HALOEXCHANGE and Comm_HALOEXCHANGE_FUSED kernels.
    • Fixed all kernels that had problematic implementations causing correctness issues. These kernels were previously disabled by default and could only be run via a command line option; that option has been removed and all previously-problematic kernels are now enabled by default.
    • Added a new ATOMIC kernel and various options to assess atomic performance at different contention levels. A minimal contention sketch appears after this list.
    • Generalized the scan kernels and added manual tunings for them.
    • Split the "bytes per rep" counters into bytes read, bytes written, and atomic modify-write bytes. The goal is to better understand performance on hardware where the total bandwidth is not simply the sum of the read and write bandwidths.
    • Added command line options to support the selection of more kernel-specific execution settings. Pass the -h or --help command line option for usage information.
    • Added a command line option to specify the set of "warmup" kernels to run, overriding the default behavior of choosing warmup kernels based on the features used by the kernels selected to run.
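
The occupancy calculator grid stride tuning described above can be pictured with a small CUDA sketch. This is not RAJAPerf code; the kernel, helper, and variable names are illustrative and the block size is an arbitrary choice. The grid is capped using the occupancy calculator, and each thread loops over multiple iterates, so the number of blocks (and therefore the reduction finalization cost) stays bounded as the problem size grows.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative occupancy-calculator grid-stride sum reduction (hypothetical names).
__global__ void sum_occgs(const double* x, double* block_sums, std::size_t len)
{
  extern __shared__ double sdata[];

  // Grid-stride loop: each thread accumulates several iterates.
  double val = 0.0;
  for (std::size_t i = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x;
       i < len;
       i += (std::size_t)gridDim.x * blockDim.x) {
    val += x[i];
  }

  // Standard shared-memory tree reduction within the block (block size is a power of two).
  sdata[threadIdx.x] = val;
  __syncthreads();
  for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) { sdata[threadIdx.x] += sdata[threadIdx.x + s]; }
    __syncthreads();
  }
  if (threadIdx.x == 0) { block_sums[blockIdx.x] = sdata[0]; }
}

void launch_sum_occgs(const double* x, double* block_sums, std::size_t len)
{
  const int block_size = 256;
  const std::size_t shmem = block_size * sizeof(double);

  // Bound the grid with the occupancy calculator so the finalization work
  // (one partial sum per block) does not grow with the problem size.
  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, sum_occgs,
                                                block_size, shmem);
  int num_sms = 0;
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

  std::size_t needed = (len + block_size - 1) / block_size;
  std::size_t grid_size = (std::size_t)blocks_per_sm * num_sms;
  if (grid_size > needed) { grid_size = needed; }
  if (grid_size == 0)     { grid_size = 1; }

  sum_occgs<<<(unsigned)grid_size, block_size, shmem>>>(x, block_sums, len);
}
```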
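
The reduction "tunings" above contrast RAJA's default reducer objects with the newer, experimental interface that passes a reduction target into forall. The following sketch is based on the RAJA documentation rather than RAJAPerf source; the execution policy, block size, and variable names are illustrative.

```cpp
#include "RAJA/RAJA.hpp"

void sum_two_ways(const double* a, int N, double& result_default, double& result_new)
{
  using exec_pol = RAJA::cuda_exec<256>;

  // Default reduction interface: a reducer object captured by the lambda.
  RAJA::ReduceSum<RAJA::cuda_reduce, double> dsum(0.0);
  RAJA::forall<exec_pol>(RAJA::RangeSegment(0, N),
    [=] RAJA_DEVICE (int i) { dsum += a[i]; });
  result_default = dsum.get();

  // New (experimental) reduction interface: a plain variable passed through
  // RAJA::expt::Reduce, with the reduction value as an extra lambda argument.
  result_new = 0.0;
  RAJA::forall<exec_pol>(RAJA::RangeSegment(0, N),
    RAJA::expt::Reduce<RAJA::operators::plus>(&result_new),
    [=] RAJA_DEVICE (int i, double& sum) { sum += a[i]; });
}
```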
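
The CudaManaged*Preferred*Accessed memory space names above describe CUDA managed allocations tuned with memory advice. As a hedged illustration only (the helper name is hypothetical and RAJAPerf's actual implementation may differ), a "host preferred, device accessed" space might correspond to:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper illustrating a managed allocation whose pages prefer to
// live on the host but remain directly accessible from a given device.
double* alloc_managed_host_preferred_device_accessed(std::size_t n, int device)
{
  double* ptr = nullptr;
  cudaMallocManaged(&ptr, n * sizeof(double));

  // Keep the preferred location of the pages on the host (CPU) ...
  cudaMemAdvise(ptr, n * sizeof(double),
                cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

  // ... while mapping them so the device can access them in place.
  cudaMemAdvise(ptr, n * sizeof(double),
                cudaMemAdviseSetAccessedBy, device);
  return ptr;
}
```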
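
The HISTOGRAM and ATOMIC contention options above vary how many threads collide on the same memory location. A minimal CUDA illustration of the idea (not RAJAPerf code; names are illustrative) is a kernel where the number of counters sets the contention level:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// With num_counters == 1 every thread updates the same address (maximum
// contention); larger values spread the atomics over more locations.
// (double atomicAdd requires compute capability 6.0 or newer.)
__global__ void atomic_contention(double* counters, std::size_t num_counters,
                                  std::size_t len)
{
  std::size_t i = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x;
  if (i < len) {
    atomicAdd(&counters[i % num_counters], 1.0);
  }
}
```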
  • Build changes / improvements:

    • The RAJA submodule has been updated to v2024.07.0.
    • The BLT submodule has been updated to v0.6.2, the version used by the RAJA submodule.
    • Improved the Caliper instrumentation to include per-kernel RAJA features in the performance metrics.
    • Fixed issues with install process of shared library builds.
    • Changed some of the default "tuning" names that are reported so they are more consistent across the different RAJA back-ends.
    • Many CI testing improvements and updates, including CI testing with MPI enabled so the Comm_* kernels are exercised in a manner that more closely resembles how real application codes run.
  • Bug fixes / improvements:

    • Made the Basic_INDEXLIST_3LOOP kernel implementations consistent; the RAJA variants now read the last element of the counts array instead of using a reducer.
    • Changed the internal size type for arrays to allow running benchmarks at much larger problem sizes.
    • Fixed an issue that caused the Basic_INDEXLIST kernel to hang occasionally.
    • A variety of fixes and cleanups in the LC build scripts (of interest to users with access to LC machines).
    • Fixed an issue where a command line option requesting information only would run the Suite when it shouldn't.
    • Made memory usage in the base GPU variants of kernels with reductions more consistent with the RAJA reduction implementation. These variants were using memory poorly; they now use device-based memory that is host accessible to avoid making two cudaMemcpy/hipMemcpy calls. This significantly reduces host-side overheads and improves the performance of base GPU reduction kernels at smaller problem sizes.
    • Fixed compilation issues with the OpenMP target offload variants of the Basic_COPY8 kernel.
    • Fixed an issue with the Lcals_FIRST_MIN GPU kernel reduction implementation.
    • Converted all non-RAJA base and lambda GPU kernel variants so that all GPU kernel variants use the same kernel launch methods that RAJA uses internally. Also added compile-time checks of the number and types of kernel arguments so that calls to launch methods always match the kernel definitions. A sketch of this kind of check appears at the end of these notes.
    • Fixed the Base_HIP variant of the INDEXLIST kernel, which would occasionally deadlock.
    • Made internal memory usage (allocation, initialization, deallocation) for SYCL kernel variants consistent with all other variants.
    • Fixed the Sphinx theme in the Read the Docs documentation.
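
As a rough sketch of the kind of compile-time kernel argument checking mentioned in the launch method item above (illustrative only; RAJAPerf's actual launch helpers may be structured differently), a launch wrapper can compare its arguments against the kernel's signature so a mismatched call fails to compile:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <type_traits>
#include <utility>

// Illustrative launch helper (C++17): the kernel's parameter types are part of
// its function-pointer type, so the wrapper can verify the call site.
template <typename... KernArgs, typename... CallArgs>
void checked_launch(void (*kernel)(KernArgs...), dim3 grid, dim3 block,
                    CallArgs&&... args)
{
  static_assert(sizeof...(KernArgs) == sizeof...(CallArgs),
                "wrong number of kernel arguments");
  static_assert((std::is_convertible<CallArgs, KernArgs>::value && ...),
                "kernel argument types do not match the kernel definition");
  kernel<<<grid, block>>>(std::forward<CallArgs>(args)...);
}

// Example (hypothetical kernel):
//   __global__ void daxpy(double* y, const double* x, double a, std::size_t n);
//   checked_launch(daxpy, grid, block, y, x, a, n);  // OK
//   checked_launch(daxpy, grid, block, y, x, n);     // fails to compile
```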