
BlockCRS Benchmark

kyungjoo-kim edited this page May 18, 2018 · 5 revisions

This benchmark measures the performance of Tpetra::BlockCrsMatrix in a realistic application context. The code uses a 7-point stencil operator to mimic a finite-volume CFD code. The problem domain is a 3D cube distributed over MPI ranks. Internally, the code exploits node-level parallelism using Kokkos. The benchmark measures the following performance features.

  • local/global graph construction
  • local/global block crs matrix and multivector fill
  • block crs matrix vector multiplication
  • equivalent flat scalar matrix vector multiplication

This benchmark provides a baseline performance of the current Tpetra::BlockCrsMatrix implementation.
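To get a feel for how large the matrix is for a given set of inputs, the sizes can be estimated with a little arithmetic. The sketch below is not part of the benchmark; `block_problem_size` is a hypothetical helper, and it assumes a non-periodic grid in which boundary cells simply have fewer neighbors, which may differ from how the benchmark treats its boundaries.

```shell
#!/bin/bash
# Rough size estimate for a 7-point stencil on an ni x nj x nk grid of cells.
# Assumption (not confirmed by the benchmark source): non-periodic boundaries,
# so a boundary cell has fewer than six neighbors.
block_problem_size() {
  local ni=$1 nj=$2 nk=$3 bs=$4
  local cells=$(( ni * nj * nk ))
  # Every interior face couples two cells and adds two off-diagonal blocks.
  local faces=$(( (ni - 1) * nj * nk + ni * (nj - 1) * nk + ni * nj * (nk - 1) ))
  # One diagonal block per cell plus the off-diagonal blocks.
  local blocks=$(( cells + 2 * faces ))
  echo "block_rows=${cells} scalar_rows=$(( cells * bs )) nnz_blocks=${blocks} nnz_scalars=$(( blocks * bs * bs ))"
}

block_problem_size 32 32 32 5
```

Under these assumptions, a 32x32x32 grid with a block size of 5 gives 32768 block rows, 163840 scalar rows, 223232 nonzero blocks, and 5580800 scalar nonzeros.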

CMake setup

In this section, we show how to configure Trilinos for Intel and NVIDIA GPU architectures. We first show the base configuration that is common to our target architectures, then explain the CMake variables and setup specific to each target architecture.

CMake base configure

#!/bin/bash  

USE_CUDA=OFF  # ON if GPU
USE_OPENMP=ON 

EXAMPLE=ON
TEST=ON

BUILD_TYPE=RELEASE  # or DEBUG
TRILINOS_DIR=/your/trilinos/source/directory
INSTALL_DIR=/your/trilinos/install/directory

rm -rf CMakeCache.txt CMakeFiles  # clear stale CMake state from a previous configure
cmake \
    -D BUILD_SHARED_LIBS:BOOL=OFF \
    -D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
    -D Trilinos_ENABLE_INSTALL_CMAKE_CONFIG_FILES:BOOL=ON \
    -D Trilinos_ENABLE_EXAMPLES:BOOL=${EXAMPLE} \
    -D Trilinos_ENABLE_TESTS:BOOL=${TEST} \
    -D Trilinos_ENABLE_Fortran:BOOL=OFF \
    -D Trilinos_ENABLE_KokkosCore:BOOL=ON \
    -D Trilinos_ENABLE_KokkosAlgorithms:BOOL=ON \
    -D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF \
    -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
    -D Trilinos_ENABLE_Tpetra:BOOL=ON \
    -D Teuchos_ENABLE_LONG_LONG_INT:BOOL=OFF \
    -D CMAKE_BUILD_TYPE:STRING=${BUILD_TYPE} \
    -D CMAKE_CXX_COMPILER:FILEPATH="mpicxx" \
    -D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
    -D CMAKE_SKIP_RULE_DEPENDENCY=ON \
    -D CMAKE_INSTALL_PREFIX:PATH="${INSTALL_DIR}" \
    -D TPL_ENABLE_GLM=OFF \
    -D TPL_ENABLE_MPI:BOOL=ON \
    -D TPL_ENABLE_LAPACK:BOOL=ON \
    -D TPL_ENABLE_BLAS:BOOL=ON \
    -D Trilinos_ENABLE_OpenMP=${USE_OPENMP} \
    -D Kokkos_ENABLE_OpenMP:BOOL=${USE_OPENMP} \
    -D Kokkos_ENABLE_TESTS:BOOL=ON \
    -D TPL_ENABLE_CUDA:BOOL=${USE_CUDA} \
    -D TPL_ENABLE_CUSPARSE:BOOL=${USE_CUDA} \
    -D Kokkos_ENABLE_Cuda:BOOL=${USE_CUDA} \
    -D Kokkos_ENABLE_Cuda_UVM:BOOL=${USE_CUDA} \
    "${TRILINOS_DIR}"

Architecture specific CMake setup

  • specify KOKKOS_ARCH
  -D KOKKOS_ARCH="[OPT]", available options are 
               [AMD]
                 AMDAVX         = AMD CPU
               [ARM]
                 ARMv80         = ARMv8.0 Compatible CPU
                 ARMv81         = ARMv8.1 Compatible CPU
                 ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
               [IBM]
                 Power7         = IBM POWER7 and POWER7+ CPUs
                 Power8         = IBM POWER8 CPUs
                 Power9         = IBM POWER9 CPUs
               [Intel]
                 WSM            = Intel Westmere CPUs
                 SNB            = Intel Sandy/Ivy Bridge CPUs
                 HSW            = Intel Haswell CPUs
                 BDW            = Intel Broadwell Xeon E-class CPUs
                 SKX            = Intel Sky Lake Xeon E-class HPC CPUs (AVX512)
               [Intel Xeon Phi]
                 KNC            = Intel Knights Corner Xeon Phi
                 KNL            = Intel Knights Landing Xeon Phi
               [NVIDIA]
                 Kepler30       = NVIDIA Kepler generation CC 3.0
                 Kepler32       = NVIDIA Kepler generation CC 3.2
                 Kepler35       = NVIDIA Kepler generation CC 3.5
                 Kepler37       = NVIDIA Kepler generation CC 3.7
                 Maxwell50      = NVIDIA Maxwell generation CC 5.0
                 Maxwell52      = NVIDIA Maxwell generation CC 5.2
                 Maxwell53      = NVIDIA Maxwell generation CC 5.3
                 Pascal60       = NVIDIA Pascal generation CC 6.0
                 Pascal61       = NVIDIA Pascal generation CC 6.1
                 Volta70        = NVIDIA Volta generation CC 7.0
                 Volta72        = NVIDIA Volta generation CC 7.2
   For heterogeneous architectures, list the architectures separated by commas,
   e.g., "Power8,Pascal60"
  • specify LAPACK and BLAS libraries
  -D TPL_LAPACK_LIBRARIES:FILEPATH="-llapack" or "-mkl" (Intel compiler)
  -D TPL_BLAS_LIBRARIES:FILEPATH="-lblas" or "-mkl" (Intel compiler)

   If your BLAS and LAPACK libraries are located in a non-standard path, please
   append that path to LD_LIBRARY_PATH.
  • For CUDA, set CUDA-specific environment variables as follows.
export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper                                              
export CUDA_LAUNCH_BLOCKING=1                                                                                 
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1   
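
The architecture-specific settings above can be collected in a small script fragment so that they are not retyped for every target. The following is only a sketch: the function names `arch_cmake_flags` and `setup_cuda_env` are hypothetical, and the example arguments are placeholders to adapt to your machine.

```shell
#!/bin/bash
# Hypothetical helper: print the architecture-specific -D options, one per line.
# Arguments: KOKKOS_ARCH value, LAPACK link flags, BLAS link flags.
arch_cmake_flags() {
  local arch=$1 lapack=$2 blas=$3
  printf '%s\n' \
    "-DKOKKOS_ARCH=${arch}" \
    "-DTPL_LAPACK_LIBRARIES:FILEPATH=${lapack}" \
    "-DTPL_BLAS_LIBRARIES:FILEPATH=${blas}"
}

# Hypothetical helper: export the CUDA variables only for GPU builds.
# Arguments: ON or OFF, and the Trilinos source directory.
setup_cuda_env() {
  local use_cuda=$1 trilinos_dir=$2
  if [ "${use_cuda}" = "ON" ]; then
    export OMPI_CXX="${trilinos_dir}/packages/kokkos/bin/nvcc_wrapper"
    export CUDA_LAUNCH_BLOCKING=1
    export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
  fi
}

# Example: a heterogeneous POWER8 + Pascal node with reference BLAS/LAPACK.
arch_cmake_flags "Power8,Pascal60" "-llapack" "-lblas"
```

The printed options are meant to be added to the base cmake command, before the source directory argument.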

Path to benchmark source:

Path to benchmark executable:

  • $BUILD/packages/tpetra/core/example/BlockCrs/TpetraCore_BlockCrsPerfTest.exe

Command line options and default values

[kyukim @bread] BlockCrs > ./TpetraCore_BlockCrsPerfTest.exe --help
Usage: ./TpetraCore_BlockCrsPerfTest.exe [options]
  options:
  --help                               Prints this help message
  --pause-for-debugging                Pauses for user input to allow attaching a debugger
  --echo-command-line                  Echo the command-line but continue as normal
  --num-elements-i       int           Number of cells in the I dimension.
                                       (default: --num-elements-i=2)
  --num-elements-j       int           Number of cells in the J dimension.
                                       (default: --num-elements-j=2)
  --num-elements-k       int           Number of cells in the K dimension.
                                       (default: --num-elements-k=2)
  --num-procs-i          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-i=1)
  --num-procs-j          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-j=1)
  --num-procs-k          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-k=1)
  --blocksize            int           Block size. The # of DOFs coupled in a multiphysics flow problem.
                                       (default: --blocksize=5)
  --nrhs                 int           Number of right hand sides to solve for.
                                       (default: --nrhs=1)
  --repeat               int           Number of iterations of matvec operations to measure performance.
                                       (default: --repeat=100)
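
The help text requires npi*npj*npk to equal the number of MPI ranks. A launch script can check this before calling mpirun; the sketch below assumes a hypothetical helper named `check_proc_grid` and is not part of the benchmark itself.

```shell
#!/bin/bash
# Hypothetical pre-launch check: succeed only if npi * npj * npk equals
# the number of MPI ranks that will be passed to mpirun.
check_proc_grid() {
  local npi=$1 npj=$2 npk=$3 nranks=$4
  if [ $(( npi * npj * npk )) -ne "${nranks}" ]; then
    echo "processor grid ${npi}x${npj}x${npk} does not match ${nranks} MPI ranks" >&2
    return 1
  fi
}

# Example: a 4 x 8 x 1 grid on 32 ranks.
check_proc_grid 4 8 1 32 && echo "grid ok"
```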

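Suggested scaling study

In a strong-scaling study the total problem size stays fixed while the number of threads or ranks increases; in a weak-scaling study the problem size grows in proportion to the number of ranks.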

  • Single Node OpenMP Strong Scale
OMP_NUM_THREADS=4 OMP_PROC_BIND=spread OMP_PLACES=threads ./TpetraCore_BlockCrsPerfTest.exe \
  --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20 
  • Single Node CUDA
OMP_NUM_THREADS=1 ./TpetraCore_BlockCrsPerfTest.exe \
  --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20 
  • Multi Node Weak Scale
OMP_NUM_THREADS=2 OMP_PROC_BIND=spread OMP_PLACES=threads mpirun -np 32 ./TpetraCore_BlockCrsPerfTest.exe \
  --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 \
  --num-procs-i=4 --num-procs-j=8 --num-procs-k=1 \
  --blocksize=5 --nrhs=1 --repeat=20
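
A single-node strong-scaling sweep repeats the OpenMP command above with a fixed problem size and a varying thread count. The sketch below only prints the commands so they can be inspected first; `strong_scaling_commands` is a hypothetical helper, and the thread counts are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: print one benchmark command per thread count.
# The problem size stays fixed, which is what makes this a strong-scaling sweep.
# Arguments: path to the executable, followed by the thread counts to try.
strong_scaling_commands() {
  local exe=$1; shift
  local t
  for t in "$@"; do
    echo "OMP_NUM_THREADS=${t} OMP_PROC_BIND=spread OMP_PLACES=threads ${exe}" \
         "--num-elements-i=32 --num-elements-j=32 --num-elements-k=32" \
         "--blocksize=5 --nrhs=1 --repeat=20"
  done
}

strong_scaling_commands ./TpetraCore_BlockCrsPerfTest.exe 1 2 4 8
```

Piping the output to `bash` runs the sweep once the printed commands look right.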

Preliminary results:

  • Platform used:
  • Summary or screenshot: