Controlling default datatypes and iterations via environment variables
Also updated readme
Exported MPI, CUDA and caliper links public
ShatrovOA committed Jan 29, 2025
1 parent a63868a commit 5fbdcd2
Showing 10 changed files with 187 additions and 99 deletions.
52 changes: 39 additions & 13 deletions README.md
@@ -22,9 +22,12 @@ Special optimization is automatically used in 3D plan in case number of MPI proc
- Fortran, C and C++ interfaces
- 2D and 3D transposition plans
- Slab and Pencil decompositions
- Host and CUDA versions
- Can be linked with multiple FFT libraries simultaneously. The execution library can be specified during plan creation. Currently supported libraries are:
- [FFTW3](https://www.fftw.org/)
- [MKL](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2024-2/fourier-transform-functions.html)
- [cuFFT](https://docs.nvidia.com/cuda/cufft/)
- [VkFFT](https://github.com/DTolm/VkFFT)


## Usage
@@ -48,7 +51,7 @@ Plan creation subroutines have two common arguments:
- executor_type - this argument specifies which external library should be used to create and execute 1D FFT plans. The default value is `DTFFT_EXECUTOR_NONE`, which means that FFTs will not be executed.

### Execution
When executing a plan, the user must provide the `transpose_type` argument. Two options are available: `DTFFT_TRANSPOSE_OUT` and `DTFFT_TRANSPOSE_IN`. The first assumes that incoming data is aligned along the X direction (fastest) and returns data aligned along the Z direction.
When executing a plan, the user must provide the `execute_type` argument. Two options are available: `DTFFT_TRANSPOSE_OUT` and `DTFFT_TRANSPOSE_IN`. The first assumes that incoming data is aligned along the X direction (fastest) and returns data aligned along the Z direction.

All plans require an additional auxiliary buffer. This buffer can be passed by the user to the `execute` method. If the user does not provide such a buffer during the first call to `execute`, the necessary memory will be allocated internally and deallocated when the user calls the `destroy` method.

@@ -60,16 +63,40 @@ To build this library modern (2008+) Fortran compiler is required. This library

| Option | Possible values | Default value | Description |
| -------- | ------- | -------- | ------- |
| DTFFT_WITH_FFTW | on / off | off | Build dtFFT with FFTW support. When enabled, the user needs to set the `FFTWDIR` environmental variable in order to find FFTW3 located in a custom directory. Both single and double precision versions of the library are required |
| DTFFT_WITH_MKL | on / off | off | Build dtFFT with MKL DFTI support |
| DTFFT_WITH_CUDA | on / off | off | Build dtFFT with CUDA support. This option requires Nvidia HPC SDK compilers for C/C++/Fortran. Make sure that the `nvcc` compiler for the correct CUDA version is in PATH |
| DTFFT_CUDA_CC_LIST | Valid CUDA CC list | 70;80;90 | List of CUDA compute capabilities to build CUDA Fortran against |
| DTFFT_WITH_FFTW | on / off | off | Build dtFFT with FFTW support. When enabled, the user needs to set the `FFTWDIR` environment variable in order to find FFTW3 located in a custom directory. Both single and double precision versions of the library are required |
| DTFFT_WITH_MKL | on / off | off | Build dtFFT with MKL DFTI support. This option requires the `MKLROOT` environment variable to be set |
| DTFFT_WITH_CUFFT | on / off | off | Build dtFFT with cuFFT support. This option automatically enables `DTFFT_WITH_CUDA` |
| DTFFT_WITH_VKFFT | on / off | off | Build dtFFT with VkFFT support. This option requires setting the additional configuration parameter `VKFFT_DIR`, pointing to vkFFT.h. This option automatically enables `DTFFT_WITH_CUDA` |
| DTFFT_BUILD_TESTS | on / off | off | Build tests |
| DTFFT_ENABLE_COVERAGE | on / off | off | Build the library with coverage support. Only possible with gfortran |
| DTFFT_BUILD_SHARED | on / off | on | Build shared library |
| DTFFT_USE_MPI | on / off | on | Use Fortran `mpi` module instead of `mpi_f08` |
| DTFFT_BUILD_C_CXX_API | on / off | on | Build C/C++ API |
| DTFFT_ENABLE_PERSISTENT_COMM | on / off | off | If you plan to execute a plan multiple times, persistent communications can be very beneficial. Be aware that such communications are created at the first call to the `execute` or `transpose` subroutines and the pointers are saved internally inside MPI. All subsequent plan executions will reuse those pointers. Take care not to free them. |
| DTFFT_WITH_CALIPER | on / off | off | Enable library profiling via Caliper. An additional parameter is required to find Caliper: `caliper_DIR` |
| DTFFT_MEASURE_WARMUP_ITERS | positive integer | 2 | Number of warmup iterations to run before plan testing when passing `DTFFT_MEASURE` or `DTFFT_PATIENT` to the effort_flag parameter during plan creation |
| DTFFT_WITH_PROFILER | on / off | off | Enable library profiling. If `DTFFT_WITH_CUDA` is enabled the library will use the nvtx3 library; otherwise Caliper will be used and an additional option may be required: `caliper_DIR` |

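As an illustration of how these options combine, a hypothetical configure line might look like the following (the installation prefix and the particular option choices are placeholders, not recommendations):

```bash
# Hypothetical configuration: enable FFTW support and tests,
# install into a user-local prefix. Paths are placeholders.
cmake -S . -B build \
  -DCMAKE_INSTALL_PREFIX="$HOME/opt/dtfft" \
  -DDTFFT_WITH_FFTW=on \
  -DDTFFT_BUILD_TESTS=on
cmake --build build
cmake --install build
```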
During configuration one should set `CMAKE_INSTALL_PREFIX` to the desired installation prefix. dtFFT can later be used in a CMake configuration with the following commands:
```cmake
# CUDAToolkit is required only for CUDA build and
# must be found before dtfft
find_package(CUDAToolkit REQUIRED)
find_package(dtfft)
add_executable(my_prog my_prog.c)
target_link_libraries(my_prog PRIVATE dtfft)
```
The provided CMake target adds include directories and links all required libraries. Make sure to point CMake to the desired dtFFT installation with `dtfft_DIR` when configuring the target program:
```bash
cmake -Ddtfft_DIR=<dtfft-installation-dir>/lib[64]/cmake/dtfft ..
```

## Useful runtime environment variables
| Name | Possible values | Default value | Description |
| -------- | ------- | -------- | ------- |
| DTFFT_ENABLE_LOG | 0 / 1 | 0 | Make dtFFT print useful runtime information |
| DTFFT_MEASURE_WARMUP_ITERS | non-negative integer | 2 | Number of warmup iterations to run before plan testing when passing `DTFFT_MEASURE` or `DTFFT_PATIENT` to the effort_flag parameter during plan creation |
| DTFFT_MEASURE_ITERS | positive integer | 5 | Number of iterations to run in order to find the best plan when passing `DTFFT_MEASURE` or `DTFFT_PATIENT` to the effort_flag parameter during plan creation |
| DTFFT_FORWARD_X_Y | 1 / 2 | 2 | Default id of transposition plan for X -> Y transpose which will be used if the plan is created with `DTFFT_ESTIMATE` or `DTFFT_MEASURE` effort_flags |
| DTFFT_BACKWARD_X_Y | 1 / 2 | 2 | Default id of transposition plan for Y -> X transpose which will be used if the plan is created with `DTFFT_ESTIMATE` or `DTFFT_MEASURE` effort_flags |
@@ -78,23 +105,22 @@ To build this library modern (2008+) Fortran compiler is required. This library
| DTFFT_FORWARD_X_Z | 1 / 2 | 2 | Default id of transposition plan for X -> Z transpose which will be used if the plan is created with `DTFFT_ESTIMATE` or `DTFFT_MEASURE` effort_flags in case Z-slab optimization is used |
| DTFFT_BACKWARD_X_Z | 1 / 2 | 2 | Default id of transposition plan for Z -> X transpose which will be used if the plan is created with `DTFFT_ESTIMATE` or `DTFFT_MEASURE` effort_flags in case Z-slab optimization is used |

## Notes for C users
C and C++ interfaces of the library are available. Simply
## Notes for C/C++ users
dtFFT provides headers for both C and C++. Simply
```c
// C header
#include <dtfft.h>
// C++ header
#include <dtfft.hpp>
```
and tell the compiler where to search for them:
```bash
mpicc ... -I<path_to_dtfft>/include ...
```
Since C arrays are stored in row-major order, which is the opposite of Fortran's column-major order, the user should pass the dimensions of the array to the planner in reverse order when creating the plan. For example, if your array is a rank-three N x M x L matrix in row-major order, you should pass the dimensions as if it were an L x M x N matrix. Also, if you are using an R2R transform and wish to perform different transform kinds along different dimensions, the ```kinds``` buffer should also be reversed.
Since C arrays are stored in row-major order, which is the opposite of Fortran's column-major order, the user should pass the dimensions of the array to the planner in reverse order when creating the plan. For example, if your array is a rank-three N x M x L matrix in row-major order, you should pass the dimensions as if it were an L x M x N matrix. Also, if you are using an R2R transform and wish to perform different transform kinds along different dimensions, the ```kinds``` buffer should also be reversed. The same goes for MPI communicators with an attached Cartesian topology.

Examples are provided in the ```tests/c``` folder.
## Next Steps

- GPU Support
- Optimize CUDA NVRTC kernels
- Add support for nvshmem
- Add support for custom NCCL installation
## Contribution

You can help this project by reporting problems, making suggestions, localizing it, or contributing to the code. Go to the issue tracker and check whether your problem/suggestion has already been reported. If not, create a new issue with a descriptive title and detail your suggestion or the steps to reproduce the problem.
2 changes: 1 addition & 1 deletion benchmark/cuda/dtfft_bench.h
@@ -62,7 +62,7 @@ void run_dtfft(bool c2c, dtfft_precision_t precision, bool enable_z_slab) {
}
}

int64_t alloc_size;
size_t alloc_size;
DTFFT_CALL( dtfft_get_alloc_size(plan, &alloc_size) );
create_time += MPI_Wtime();
if(comm_rank == 0) {
24 changes: 12 additions & 12 deletions include/dtfft.hpp
@@ -86,16 +86,16 @@ namespace dtfft
*/
template<typename T1, typename T2>
dtfft_error_code_t
execute(std::vector<T1> &in, std::vector<T2> &out, const dtfft_execute_type_t transpose_type)
{return execute(in.data(), out.data(), transpose_type, NULL);}
execute(std::vector<T1> &in, std::vector<T2> &out, const dtfft_execute_type_t execute_type)
{return execute(in.data(), out.data(), execute_type, NULL);}



/** \brief Plan execution with optional auxiliary vector
*
* \param[inout] in Incoming vector
* \param[out] out Result vector
* \param[in] transpose_type Type of transform:
* \param[in] execute_type Type of transform:
* - `DTFFT_TRANSPOSE_OUT`
* - `DTFFT_TRANSPOSE_IN`
* \param[inout] aux Optional auxiliary vector
@@ -104,41 +104,41 @@
*/
template<typename T1, typename T2, typename T3>
dtfft_error_code_t
execute(std::vector<T1> &in, std::vector<T2> &out, const dtfft_execute_type_t transpose_type, std::vector<T3> &aux)
{return execute(in.data(), out.data(), transpose_type, aux.data());}
execute(std::vector<T1> &in, std::vector<T2> &out, const dtfft_execute_type_t execute_type, std::vector<T3> &aux)
{return execute(in.data(), out.data(), execute_type, aux.data());}



/** \brief Plan execution without auxiliary buffer using C-style pointers instead of vectors
*
* \param[inout] in Incoming buffer
* \param[out] out Result buffer
* \param[in] transpose_type Type of transform:
* \param[in] execute_type Type of transform:
* - `DTFFT_TRANSPOSE_OUT`
* - `DTFFT_TRANSPOSE_IN`
*
* \return Status code of method execution
*/
dtfft_error_code_t
execute(void *in, void *out, const dtfft_execute_type_t transpose_type)
{return execute(in, out, transpose_type, NULL);}
execute(void *in, void *out, const dtfft_execute_type_t execute_type)
{return execute(in, out, execute_type, NULL);}



/** \brief Plan execution with auxiliary buffer using C-style pointers instead of vectors
*
* \param[inout] in Incoming buffer
* \param[out] out Result buffer
* \param[in] transpose_type Type of transform:
* \param[in] execute_type Type of transform:
* - `DTFFT_TRANSPOSE_OUT`
* - `DTFFT_TRANSPOSE_IN`
* \param[inout] aux Optional auxiliary buffer
*
* \return Status code of method execution
*/
dtfft_error_code_t
execute(void *in, void *out, const dtfft_execute_type_t transpose_type, void *aux)
{return dtfft_execute(_plan, in, out, transpose_type, aux);}
execute(void *in, void *out, const dtfft_execute_type_t execute_type, void *aux)
{return dtfft_execute(_plan, in, out, execute_type, aux);}



@@ -147,7 +147,7 @@ *
*
* \param[inout] in Incoming vector
* \param[out] out Transposed vector
* \param[in] transpose_type Type of transpose:
* \param[in] execute_type Type of transpose:
* - `DTFFT_TRANSPOSE_X_TO_Y`
* - `DTFFT_TRANSPOSE_Y_TO_X`
* - `DTFFT_TRANSPOSE_Y_TO_Z` (3d plan only)
20 changes: 1 addition & 19 deletions src/dtfft_parameters.F90
@@ -172,6 +172,7 @@ module dtfft_parameters
integer(int32), parameter, public :: COLOR_FFT = int(Z'00FCD05D')
integer(int32), parameter, public :: COLOR_AUTOTUNE = int(Z'006075FF')
integer(int32), parameter, public :: COLOR_AUTOTUNE2 = int(Z'0056E874')
integer(int32), parameter, public :: COLOR_DESTROY = int(Z'00000000')
integer(int32), parameter, public :: COLOR_TRANSPOSE_PALLETTE(-3:3) = [COLOR_TRANSPOSE_ZX, COLOR_TRANSPOSE_ZY, COLOR_TRANSPOSE_YX, 0, COLOR_TRANSPOSE_XY, COLOR_TRANSPOSE_YZ, COLOR_TRANSPOSE_XZ]

integer(int32), parameter, public :: DTFFT_SUCCESS = CONF_DTFFT_SUCCESS
@@ -199,26 +200,7 @@ module dtfft_parameters
integer(int32), parameter, public :: DTFFT_ERROR_GPU_NOT_SET = CONF_DTFFT_ERROR_GPU_NOT_SET
integer(int32), parameter, public :: DTFFT_ERROR_VKFFT_R2R_2D_PLAN = CONF_DTFFT_ERROR_VKFFT_R2R_2D_PLAN
integer(int32), parameter, public :: DTFFT_ERROR_NOT_DEVICE_PTR = CONF_DTFFT_ERROR_NOT_DEVICE_PTR


#if (DTFFT_FORWARD_X_Y > 2) || (DTFFT_FORWARD_X_Y <= 0)
#error "Invalid DTFFT_FORWARD_X_Y parameter"
#endif
#if (DTFFT_BACKWARD_X_Y > 2) || (DTFFT_BACKWARD_X_Y <= 0)
#error "Invalid DTFFT_BACKWARD_X_Y parameter"
#endif
#if (DTFFT_FORWARD_Y_Z > 2) || (DTFFT_FORWARD_Y_Z <= 0)
#error "Invalid DTFFT_FORWARD_Y_Z parameter"
#endif
#if (DTFFT_BACKWARD_Y_Z > 2) || (DTFFT_BACKWARD_Y_Z <= 0)
#error "Invalid DTFFT_BACKWARD_Y_Z parameter"
#endif
#if (DTFFT_FORWARD_X_Z > 2) || (DTFFT_FORWARD_X_Z <= 0)
#error "Invalid DTFFT_FORWARD_X_Z parameter"
#endif
#if (DTFFT_BACKWARD_X_Z > 2) || (DTFFT_BACKWARD_X_Z <= 0)
#error "Invalid DTFFT_BACKWARD_X_Z parameter"
#endif

#ifdef DTFFT_WITH_CUDA
integer(int8), parameter, public :: DTFFT_GPU_BACKEND_MPI_DATATYPE = CONF_DTFFT_GPU_BACKEND_MPI_DATATYPE
5 changes: 4 additions & 1 deletion src/dtfft_plan.F90
@@ -410,6 +410,8 @@ subroutine destroy(self, error_code)
if ( .not. self%is_created ) ierr = DTFFT_ERROR_PLAN_NOT_CREATED
CHECK_ERROR_AND_RETURN

REGION_BEGIN("dtfft_destroy", COLOR_DESTROY)

#ifndef DTFFT_TRANSPOSE_ONLY
select type ( self )
class is ( dtfft_plan_r2c )
@@ -474,6 +476,7 @@ subroutine destroy(self, error_code)
end block
self%ndims = -1
if ( present( error_code ) ) error_code = DTFFT_SUCCESS
REGION_END("dtfft_destroy")
end subroutine destroy

logical function get_z_slab(self, error_code)
@@ -678,7 +681,7 @@ integer(int32) function check_create_args(self, dims, comm, precision, effort_fl
integer(int32) :: top_type !< MPI Comm topology type
integer(int32) :: dim !< Counter

CHECK_INTERNAL_CALL( dtfft_init() )
CHECK_INTERNAL_CALL( init_internal() )

self%ndims = size(dims, kind=int8)
CHECK_INPUT_PARAMETER(self%ndims, VALID_DIMENSIONS, DTFFT_ERROR_INVALID_N_DIMENSIONS)
21 changes: 14 additions & 7 deletions src/dtfft_transpose_plan_cuda.F90
@@ -340,6 +340,7 @@ subroutine run_autotune_backend(comms, cart_comm, pencils, base_storage, stream,
type(cudaEvent) :: timer_start, timer_stop
character(len=:), allocatable :: testing_phase
type(backend_helper) :: helper
integer(int32) :: n_warmup_iters, n_iters
! integer(cuda_count_kind) :: free, total

if ( present(backend_id) ) then
@@ -411,33 +412,39 @@
PHASE_BEGIN(testing_phase, COLOR_AUTOTUNE)
WRITE_INFO(testing_phase)

PHASE_BEGIN("Warmup, "//int_to_str(DTFFT_MEASURE_WARMUP_ITERS)//" iterations", COLOR_TRANSPOSE)
do iter = 1, DTFFT_MEASURE_WARMUP_ITERS
n_warmup_iters = get_iters_from_env(.true.)

PHASE_BEGIN("Warmup, "//int_to_str(n_warmup_iters)//" iterations", COLOR_TRANSPOSE)
do iter = 1, n_warmup_iters
do i = 1, 2_int8 * n_transpose_plans
call plans(i)%execute(in, out, stream)
enddo
enddo
CUDA_CALL( "cudaStreamSynchronize", cudaStreamSynchronize(stream) )
PHASE_END("Warmup, "//int_to_str(DTFFT_MEASURE_WARMUP_ITERS)//" iterations")
PHASE_END("Warmup, "//int_to_str(n_warmup_iters)//" iterations")

call MPI_Barrier(cart_comm, mpi_ierr)

n_iters = get_iters_from_env(.false.)

PHASE_BEGIN("Testing, "//int_to_str(DTFFT_MEASURE_ITERS)//" iterations", COLOR_EXECUTE)
PHASE_BEGIN("Testing, "//int_to_str(n_iters)//" iterations", COLOR_EXECUTE)
total_time = 0.0
! do i = 1, 2_int8 * n_transpose_plans
CUDA_CALL( "cudaEventRecord", cudaEventRecord(timer_start, stream) )
do iter = 1, DTFFT_MEASURE_ITERS
do iter = 1, n_iters
do i = 1, 2_int8 * n_transpose_plans
call plans(i)%execute(in, out, stream)
enddo
enddo
CUDA_CALL( "cudaEventRecord", cudaEventRecord(timer_stop, stream) )
CUDA_CALL( "cudaEventSynchronize", cudaEventSynchronize(timer_stop) )
CUDA_CALL( "cudaEventElapsedTime", cudaEventElapsedTime(execution_time, timer_start, timer_stop) )
execution_time = execution_time / real(DTFFT_MEASURE_ITERS, real32)
execution_time = execution_time / real(n_iters, real32)
total_time = total_time + execution_time
! WRITE_INFO( TRANSPOSE_NAMES(plans(i)%get_tranpose_id())//" : "//double_to_str(real(execution_time, real64))//" [ms]")
! enddo

PHASE_END("Testing, "//int_to_str(DTFFT_MEASURE_ITERS)//" iterations")
PHASE_END("Testing, "//int_to_str(n_iters)//" iterations")

call MPI_Allreduce(total_time, min_execution_time, 1, MPI_REAL4, MPI_MIN, cart_comm, mpi_ierr)
call MPI_Allreduce(total_time, max_execution_time, 1, MPI_REAL4, MPI_MAX, cart_comm, mpi_ierr)
