BlockCRS Benchmark


CMake setup

We demonstrate how to configure Trilinos for Intel CPU and NVIDIA GPU architectures. First we show the base configuration common to all of our target architectures; then we show the custom CMake variables and environment setup for each.

CMake base configure

#!/bin/bash

USE_CUDA=OFF   # set ON for GPU builds
USE_OPENMP=ON

EXAMPLE=ON
TEST=ON

BUILD_TYPE=RELEASE  # or DEBUG
TRILINOS_DIR=/your/trilinos/source/directory
INSTALL_DIR=/your/trilinos/install/directory

rm -rf C*   # clear any previous CMake cache (CMakeCache.txt, CMakeFiles)
cmake \
    -D BUILD_SHARED_LIBS:BOOL=OFF \
    -D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
    -D Trilinos_ENABLE_INSTALL_CMAKE_CONFIG_FILES:BOOL=ON \
    -D Trilinos_ENABLE_EXAMPLES:BOOL=${EXAMPLE} \
    -D Trilinos_ENABLE_TESTS:BOOL=${TEST} \
    -D Trilinos_ENABLE_Fortran:BOOL=OFF \
    -D Trilinos_ENABLE_KokkosCore:BOOL=ON \
    -D Trilinos_ENABLE_KokkosAlgorithms:BOOL=ON \
    -D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF \
    -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
    -D Trilinos_ENABLE_Tpetra:BOOL=ON \
    -D Teuchos_ENABLE_LONG_LONG_INT:BOOL=OFF \
    -D CMAKE_BUILD_TYPE:STRING=${BUILD_TYPE} \
    -D CMAKE_CXX_COMPILER:FILEPATH="mpicxx" \
    -D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
    -D CMAKE_SKIP_RULE_DEPENDENCY=ON \
    -D CMAKE_INSTALL_PREFIX:PATH=${INSTALL_DIR} \
    -D TPL_ENABLE_GLM=OFF \
    -D TPL_ENABLE_MPI:BOOL=ON \
    -D TPL_ENABLE_LAPACK:BOOL=ON \
    -D TPL_ENABLE_BLAS:BOOL=ON \
    -D Trilinos_ENABLE_OpenMP=${USE_OPENMP} \
    -D Kokkos_ENABLE_OpenMP:BOOL=${USE_OPENMP} \
    -D Kokkos_ENABLE_TESTS:BOOL=ON \
    -D TPL_ENABLE_CUDA:BOOL=${USE_CUDA} \
    -D TPL_ENABLE_CUSPARSE:BOOL=${USE_CUDA} \
    -D Kokkos_ENABLE_Cuda:BOOL=${USE_CUDA} \
    -D Kokkos_ENABLE_Cuda_UVM:BOOL=${USE_CUDA} \
    ${TRILINOS_DIR}
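To use this configuration, save it as a script (the name do-configure.sh below is arbitrary) in a build directory outside the source tree, run it, then build and install; a minimal sketch:

mkdir -p /your/trilinos/build/directory
cd /your/trilinos/build/directory
./do-configure.sh   # the configure script shown above
make -j 16          # adjust the parallel build level to your machine
make install        # installs into ${INSTALL_DIR}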

Architecture specific CMake setup

  • specify KOKKOS_ARCH
  -D KOKKOS_ARCH="[OPT]", where the available options are:
               [AMD]
                 AMDAVX         = AMD CPU
               [ARM]
                 ARMv80         = ARMv8.0 Compatible CPU
                 ARMv81         = ARMv8.1 Compatible CPU
                 ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
               [IBM]
                 Power7         = IBM POWER7 and POWER7+ CPUs
                 Power8         = IBM POWER8 CPUs
                 Power9         = IBM POWER9 CPUs
               [Intel]
                 WSM            = Intel Westmere CPUs
                 SNB            = Intel Sandy/Ivy Bridge CPUs
                 HSW            = Intel Haswell CPUs
                 BDW            = Intel Broadwell Xeon E-class CPUs
                 SKX            = Intel Skylake Xeon E-class HPC CPUs (AVX512)
               [Intel Xeon Phi]
                 KNC            = Intel Knights Corner Xeon Phi
                 KNL            = Intel Knights Landing Xeon Phi
               [NVIDIA]
                 Kepler30       = NVIDIA Kepler generation CC 3.0
                 Kepler32       = NVIDIA Kepler generation CC 3.2
                 Kepler35       = NVIDIA Kepler generation CC 3.5
                 Kepler37       = NVIDIA Kepler generation CC 3.7
                 Maxwell50      = NVIDIA Maxwell generation CC 5.0
                 Maxwell52      = NVIDIA Maxwell generation CC 5.2
                 Maxwell53      = NVIDIA Maxwell generation CC 5.3
                 Pascal60       = NVIDIA Pascal generation CC 6.0
                 Pascal61       = NVIDIA Pascal generation CC 6.1
                 Volta70        = NVIDIA Volta generation CC 7.0
                 Volta72        = NVIDIA Volta generation CC 7.2
   for heterogeneous architectures, combine the architecture keywords with a comma, e.g., "Power8,Pascal60"
  • specify LAPACK and BLAS libraries
  -D TPL_LAPACK_LIBRARIES:FILEPATH="-llapack" or "-mkl" (if an Intel compiler is used)                                                                 
  -D TPL_BLAS_LIBRARIES:FILEPATH="-lblas" or "-mkl" (if an Intel compiler is used)    

   If your BLAS and LAPACK libraries are installed in a non-standard path, append that path to LD_LIBRARY_PATH (see the sketch below).
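For example, a hypothetical Intel Skylake configuration with MKL would add these options to the base configure script above (paths and flags depend on your compiler and system):

    -D KOKKOS_ARCH="SKX" \
    -D TPL_LAPACK_LIBRARIES:FILEPATH="-mkl" \
    -D TPL_BLAS_LIBRARIES:FILEPATH="-mkl" \

With a reference BLAS/LAPACK in a non-standard location instead (the path below is hypothetical):

export LD_LIBRARY_PATH=/opt/lapack/lib:${LD_LIBRARY_PATH}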
  • For CUDA, set the CUDA-specific environment variables as follows. Setting OMPI_CXX makes OpenMPI's mpicxx use Kokkos' nvcc_wrapper, which drives nvcc while behaving like a host C++ compiler.
export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper                                              
export CUDA_LAUNCH_BLOCKING=1                                                                                 
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1   

Path to benchmark source:

  • $TRILINOS_DIR/packages/tpetra/core/example/BlockCrs

Path to benchmark executable:

  • $BUILD/packages/tpetra/core/example/BlockCrs/TpetraCore_BlockCrsPerfTest.exe

Command line options and default values

[kyukim @bread] BlockCrs > ./TpetraCore_BlockCrsPerfTest.exe --help
Usage: ./TpetraCore_BlockCrsPerfTest.exe [options]
  options:
  --help                               Prints this help message
  --pause-for-debugging                Pauses for user input to allow attaching a debugger
  --echo-command-line                  Echo the command-line but continue as normal
  --num-elements-i       int           Number of cells in the I dimension.
                                       (default: --num-elements-i=2)
  --num-elements-j       int           Number of cells in the J dimension.
                                       (default: --num-elements-j=2)
  --num-elements-k       int           Number of cells in the K dimension.
                                       (default: --num-elements-k=2)
  --num-procs-i          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-i=1)
  --num-procs-j          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-j=1)
  --num-procs-k          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-k=1)
  --blocksize            int           Block size. The # of DOFs coupled in a multiphysics flow problem.
                                       (default: --blocksize=5)
  --nrhs                 int           Number of right hand sides to solve for.
                                       (default: --nrhs=1)
  --repeat               int           Number of iterations of matvec operations to measure performance.
                                       (default: --repeat=100)
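
The three --num-procs-* options must multiply to the number of MPI ranks. For example, a hypothetical 8-rank run decomposed as a 2 x 2 x 2 processor grid (mpirun stands in for your MPI launcher):

mpirun -np 8 ./TpetraCore_BlockCrsPerfTest.exe --num-procs-i=2 --num-procs-j=2 --num-procs-k=2 --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20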

Suggested scaling studies: a strong-scaling study fixes the global problem size while increasing parallel resources (threads or ranks); a weak-scaling study fixes the per-process problem size while adding ranks.

  • Single Node OpenMP Strong Scale
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

OMP_NUM_THREADS=1 ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20
OMP_NUM_THREADS=4 ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20 
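
A fuller strong-scaling sweep repeats the same fixed-size run over a range of thread counts; a minimal sketch, assuming a 32-core node:

for nt in 1 2 4 8 16 32; do
  OMP_NUM_THREADS=${nt} ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20
done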
  • Single Node CUDA
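A single-GPU run needs no thread-count sweep; a sketch, assuming the benchmark was built with USE_CUDA=ON and the CUDA environment variables above are exported:

./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20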

  • Multi Node Weak Scale
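For weak scaling, grow the processor grid while keeping the work per rank fixed. The sketch below assumes the --num-elements-* options describe the global grid, so each dimension doubles as the processor grid doubles, leaving 32^3 cells per rank (mpirun stands in for your launcher, one rank per node):

mpirun -np 1 ./TpetraCore_BlockCrsPerfTest.exe --num-procs-i=1 --num-procs-j=1 --num-procs-k=1 --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20
mpirun -np 8 ./TpetraCore_BlockCrsPerfTest.exe --num-procs-i=2 --num-procs-j=2 --num-procs-k=2 --num-elements-i=64 --num-elements-j=64 --num-elements-k=64 --blocksize=5 --nrhs=1 --repeat=20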

Preliminary results:

  • Platform used:
  • Summary or screenshot: