BlockCRS Benchmark


CMake setup

We demonstrate how to configure Trilinos for Intel CPU and NVIDIA GPU architectures. First we show the base configuration common to all of our target architectures; then we show the custom CMake variables and environment setup for each.

CMake base configure

#!/bin/bash

USE_CUDA=OFF   # set ON for GPU builds
USE_OPENMP=ON

EXAMPLE=ON
TEST=ON

BUILD_TYPE=RELEASE  # or DEBUG
TRILINOS_DIR=/your/trilinos/source/directory
INSTALL_DIR=/your/trilinos/install/directory

rm -rf C*   # clear any previous CMake cache (CMakeCache.txt, CMakeFiles)
cmake \
    -D BUILD_SHARED_LIBS:BOOL=OFF \
    -D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
    -D Trilinos_ENABLE_INSTALL_CMAKE_CONFIG_FILES:BOOL=ON \
    -D Trilinos_ENABLE_EXAMPLES:BOOL=${EXAMPLE} \
    -D Trilinos_ENABLE_TESTS:BOOL=${TEST} \
    -D Trilinos_ENABLE_Fortran:BOOL=OFF \
    -D Trilinos_ENABLE_KokkosCore:BOOL=ON \
    -D Trilinos_ENABLE_KokkosAlgorithms:BOOL=ON \
    -D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF \
    -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
    -D Trilinos_ENABLE_Tpetra:BOOL=ON \
    -D Teuchos_ENABLE_LONG_LONG_INT:BOOL=OFF \
    -D CMAKE_BUILD_TYPE:STRING=${BUILD_TYPE} \
    -D CMAKE_CXX_COMPILER:FILEPATH="mpicxx" \
    -D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
    -D CMAKE_SKIP_RULE_DEPENDENCY=ON \
    -D CMAKE_INSTALL_PREFIX:PATH=${INSTALL_DIR} \
    -D TPL_ENABLE_GLM=OFF \
    -D TPL_ENABLE_MPI:BOOL=ON \
    -D TPL_ENABLE_LAPACK:BOOL=ON \
    -D TPL_ENABLE_BLAS:BOOL=ON \
    -D Trilinos_ENABLE_OpenMP=${USE_OPENMP} \
    -D Kokkos_ENABLE_OpenMP:BOOL=${USE_OPENMP} \
    -D Kokkos_ENABLE_TESTS:BOOL=ON \
    -D TPL_ENABLE_CUDA:BOOL=${USE_CUDA} \
    -D TPL_ENABLE_CUSPARSE:BOOL=${USE_CUDA} \
    -D Kokkos_ENABLE_Cuda:BOOL=${USE_CUDA} \
    -D Kokkos_ENABLE_Cuda_UVM:BOOL=${USE_CUDA} \
    ${TRILINOS_DIR}
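To use this configuration, save it as a script (the name do-configure.sh below is arbitrary) in a build directory outside the source tree, run it, then build and install; a minimal sketch:

mkdir -p /your/trilinos/build/directory
cd /your/trilinos/build/directory
./do-configure.sh   # the configure script shown above
make -j 16          # adjust the parallel build level to your machine
make install        # installs into ${INSTALL_DIR}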

Architecture specific CMake setup

  • specify KOKKOS_ARCH
  -D KOKKOS_ARCH="[OPT]", where the available options are:
               [AMD]
                 AMDAVX         = AMD CPU
               [ARM]
                 ARMv80         = ARMv8.0 Compatible CPU
                 ARMv81         = ARMv8.1 Compatible CPU
                 ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
               [IBM]
                 Power7         = IBM POWER7 and POWER7+ CPUs
                 Power8         = IBM POWER8 CPUs
                 Power9         = IBM POWER9 CPUs
               [Intel]
                 WSM            = Intel Westmere CPUs
                 SNB            = Intel Sandy/Ivy Bridge CPUs
                 HSW            = Intel Haswell CPUs
                 BDW            = Intel Broadwell Xeon E-class CPUs
                 SKX            = Intel Skylake Xeon E-class HPC CPUs (AVX512)
               [Intel Xeon Phi]
                 KNC            = Intel Knights Corner Xeon Phi
                 KNL            = Intel Knights Landing Xeon Phi
               [NVIDIA]
                 Kepler30       = NVIDIA Kepler generation CC 3.0
                 Kepler32       = NVIDIA Kepler generation CC 3.2
                 Kepler35       = NVIDIA Kepler generation CC 3.5
                 Kepler37       = NVIDIA Kepler generation CC 3.7
                 Maxwell50      = NVIDIA Maxwell generation CC 5.0
                 Maxwell52      = NVIDIA Maxwell generation CC 5.2
                 Maxwell53      = NVIDIA Maxwell generation CC 5.3
                 Pascal60       = NVIDIA Pascal generation CC 6.0
                 Pascal61       = NVIDIA Pascal generation CC 6.1
                 Volta70        = NVIDIA Volta generation CC 7.0
                 Volta72        = NVIDIA Volta generation CC 7.2
   for heterogeneous architectures, combine the architecture keywords with a comma, e.g., "Power8,Pascal60"
  • specify LAPACK and BLAS libraries
  -D TPL_LAPACK_LIBRARIES:FILEPATH="-llapack" or "-mkl" (if an Intel compiler is used)                                                                 
  -D TPL_BLAS_LIBRARIES:FILEPATH="-lblas" or "-mkl" (if an Intel compiler is used)    

   If your BLAS and LAPACK libraries are installed in a non-standard path, append that path to LD_LIBRARY_PATH (see the sketch below).
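For example, a hypothetical Intel Skylake configuration with MKL would add these options to the base configure script above (paths and flags depend on your compiler and system):

    -D KOKKOS_ARCH="SKX" \
    -D TPL_LAPACK_LIBRARIES:FILEPATH="-mkl" \
    -D TPL_BLAS_LIBRARIES:FILEPATH="-mkl" \

With a reference BLAS/LAPACK in a non-standard location instead (the path below is hypothetical):

export LD_LIBRARY_PATH=/opt/lapack/lib:${LD_LIBRARY_PATH}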
  • For CUDA, set the CUDA-specific environment variables as follows. Setting OMPI_CXX makes OpenMPI's mpicxx use Kokkos' nvcc_wrapper, which drives nvcc while behaving like a host C++ compiler.
export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper                                              
export CUDA_LAUNCH_BLOCKING=1                                                                                 
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1   

Path to benchmark source:

  • $TRILINOS_DIR/packages/tpetra/core/example/BlockCrs

Path to benchmark executable:

  • $BUILD/packages/tpetra/core/example/BlockCrs/TpetraCore_BlockCrsPerfTest.exe

Command line options and default values

[kyukim @bread] BlockCrs > ./TpetraCore_BlockCrsPerfTest.exe --help
Usage: ./TpetraCore_BlockCrsPerfTest.exe [options]
  options:
  --help                               Prints this help message
  --pause-for-debugging                Pauses for user input to allow attaching a debugger
  --echo-command-line                  Echo the command-line but continue as normal
  --num-elements-i       int           Number of cells in the I dimension.
                                       (default: --num-elements-i=2)
  --num-elements-j       int           Number of cells in the J dimension.
                                       (default: --num-elements-j=2)
  --num-elements-k       int           Number of cells in the K dimension.
                                       (default: --num-elements-k=2)
  --num-procs-i          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-i=1)
  --num-procs-j          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-j=1)
  --num-procs-k          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-k=1)
  --blocksize            int           Block size. The # of DOFs coupled in a multiphysics flow problem.
                                       (default: --blocksize=5)
  --nrhs                 int           Number of right hand sides to solve for.
                                       (default: --nrhs=1)
  --repeat               int           Number of iterations of matvec operations to measure performance.
                                       (default: --repeat=100)
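
The three --num-procs-* options must multiply to the number of MPI ranks. For example, a hypothetical 8-rank run decomposed as a 2 x 2 x 2 processor grid (mpirun stands in for your MPI launcher):

mpirun -np 8 ./TpetraCore_BlockCrsPerfTest.exe --num-procs-i=2 --num-procs-j=2 --num-procs-k=2 --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20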

Suggested scaling studies: a strong-scaling study fixes the global problem size while increasing parallel resources (threads or ranks); a weak-scaling study fixes the per-process problem size while adding ranks.

  • Single Node OpenMP Strong Scale
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

OMP_NUM_THREADS=1 ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20
OMP_NUM_THREADS=4 ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20 
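
A fuller strong-scaling sweep repeats the same fixed-size run over a range of thread counts; a minimal sketch, assuming a 32-core node:

for nt in 1 2 4 8 16 32; do
  OMP_NUM_THREADS=${nt} ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20
done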
  • Single Node CUDA
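A single-GPU run needs no thread-count sweep; a sketch, assuming the benchmark was built with USE_CUDA=ON and the CUDA environment variables above are exported:

./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20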

  • Multi Node Weak Scale
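For weak scaling, grow the processor grid while keeping the work per rank fixed. The sketch below assumes the --num-elements-* options describe the global grid, so each dimension doubles as the processor grid doubles, leaving 32^3 cells per rank (mpirun stands in for your launcher, one rank per node):

mpirun -np 1 ./TpetraCore_BlockCrsPerfTest.exe --num-procs-i=1 --num-procs-j=1 --num-procs-k=1 --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --blocksize=5 --nrhs=1 --repeat=20
mpirun -np 8 ./TpetraCore_BlockCrsPerfTest.exe --num-procs-i=2 --num-procs-j=2 --num-procs-k=2 --num-elements-i=64 --num-elements-j=64 --num-elements-k=64 --blocksize=5 --nrhs=1 --repeat=20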

Preliminary results:

  • Platform used:
  • Summary or screenshot: