Controlling default datatypes and iterations via environment variables
Also updated readme
Exported MPI, CUDA and caliper links public
ShatrovOA committed Jan 29, 2025
1 parent a63868a commit 5fbdcd2
Showing 10 changed files with 187 additions and 99 deletions.
52 changes: 39 additions & 13 deletions README.md
@@ -22,9 +22,12 @@ Special optimization is automatically used in 3D plan in case number of MPI proc
- Fortran, C and C++ interfaces
- 2D and 3D transposition plans
- Slab and Pencil decompositions
- Host and CUDA versions
- Can be linked with multiple FFT libraries simultaneously. The execution library can be specified during plan creation. Currently supported libraries are:
- [FFTW3](https://www.fftw.org/)
- [MKL](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2024-2/fourier-transform-functions.html)
- [cuFFT](https://docs.nvidia.com/cuda/cufft/)
- [VkFFT](https://github.com/DTolm/VkFFT)


## Usage
@@ -48,7 +51,7 @@ Plan creation subroutines have two common arguments:
- executor_type - this argument specifies which external library should be used to create and execute 1D FFT plans. The default value is `DTFFT_EXECUTOR_NONE`, which means that FFTs will not be executed.

### Execution
When executing a plan, the user must provide the `transpose_type` argument. Two options are available: `DTFFT_TRANSPOSE_OUT` and `DTFFT_TRANSPOSE_IN`. The first assumes that incoming data is aligned along the X direction (fastest) and returns data aligned along the Z direction.
When executing a plan, the user must provide the `execute_type` argument. Two options are available: `DTFFT_TRANSPOSE_OUT` and `DTFFT_TRANSPOSE_IN`. The first assumes that incoming data is aligned along the X direction (fastest) and returns data aligned along the Z direction.

All plans require an additional auxiliary buffer. This buffer can be passed by the user to the `execute` method. If the user does not provide such a buffer during the first call to `execute`, the necessary memory will be allocated internally and deallocated when the user calls the `destroy` method.

@@ -60,16 +63,40 @@ To build this library modern (2008+) Fortran compiler is required. This library

| Option | Possible values | Default value | Description |
| -------- | ------- | -------- | ------- |
| DTFFT_WITH_FFTW | on / off | off | Build dtFFT with FFTW support. When enabled, the user needs to set the `FFTWDIR` environmental variable in order to find FFTW3 located in a custom directory. Both single and double precision versions of the library are required |
| DTFFT_WITH_MKL | on / off | off | Build dtFFT with MKL DFTI support |
| DTFFT_WITH_CUDA | on / off | off | Build dtFFT with CUDA support. This option requires Nvidia HPC SDK compilers for C/C++/Fortran. Make sure that the `nvcc` compiler for the correct CUDA version is in PATH |
| DTFFT_CUDA_CC_LIST | Valid CUDA CC list | 70;80;90 | List of CUDA compute capabilities to build CUDA Fortran against |
| DTFFT_WITH_FFTW | on / off | off | Build dtFFT with FFTW support. When enabled, the user needs to set the `FFTWDIR` environment variable in order to find FFTW3 located in a custom directory. Both single and double precision versions of the library are required |
| DTFFT_WITH_MKL | on / off | off | Build dtFFT with MKL DFTI support. This option requires the `MKLROOT` environment variable to be set |
| DTFFT_WITH_CUFFT | on / off | off | Build dtFFT with cuFFT support. This option automatically enables `DTFFT_WITH_CUDA` |
| DTFFT_WITH_VKFFT | on / off | off | Build dtFFT with VkFFT support. This option requires setting the additional configuration parameter `VKFFT_DIR`, pointing to vkFFT.h. This option automatically enables `DTFFT_WITH_CUDA` |
| DTFFT_BUILD_TESTS | on / off | off | Build tests |
| DTFFT_ENABLE_COVERAGE | on / off | off | Build the library with coverage support. Only possible with gfortran |
| DTFFT_BUILD_SHARED | on / off | on | Build shared library |
| DTFFT_USE_MPI | on / off | on | Use Fortran `mpi` module instead of `mpi_f08` |
| DTFFT_BUILD_C_CXX_API | on / off | on | Build C/C++ API |
| DTFFT_ENABLE_PERSISTENT_COMM | on / off | off | If you plan to execute a plan multiple times, persistent communications can be very beneficial. Be aware that such communications are created at the first call to the `execute` or `transpose` subroutines and the pointers are saved internally inside MPI. All subsequent plan executions will reuse those pointers. Take care not to free them. |
| DTFFT_WITH_CALIPER | on / off | off | Enable library profiling via Caliper. An additional parameter is required to find Caliper: `caliper_DIR` |
| DTFFT_MEASURE_WARMUP_ITERS | positive integer | 2 | Number of warmup iterations to run before plan testing when passing `DTFFT_MEASURE` or `DTFFT_PATIENT` to the effort_flag parameter during plan creation |
| DTFFT_WITH_PROFILER | on / off | off | Enable library profiling. If `DTFFT_WITH_CUDA` is enabled the library will use the nvtx3 library; otherwise Caliper will be used and an additional option may be required: `caliper_DIR` |

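As an illustration of how these options combine, a hypothetical configure line might look like the following (the installation prefix and the particular option choices are placeholders, not recommendations):

```bash
# Hypothetical configuration: enable FFTW support and tests,
# install into a user-local prefix. Paths are placeholders.
cmake -S . -B build \
  -DCMAKE_INSTALL_PREFIX="$HOME/opt/dtfft" \
  -DDTFFT_WITH_FFTW=on \
  -DDTFFT_BUILD_TESTS=on
cmake --build build
cmake --install build
```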
During configuration one should set `CMAKE_INSTALL_PREFIX` to the desired installation prefix. dtFFT can later be used in a CMake configuration with the following commands:
```cmake
# CUDAToolkit is required only for CUDA build and
# must be found before dtfft
find_package(CUDAToolkit REQUIRED)
find_package(dtfft)
add_executable(my_prog my_prog.c)
target_link_libraries(my_prog PRIVATE dtfft)
```
The provided CMake target adds include directories and links all required libraries. Make sure to point CMake to the desired dtFFT installation with `dtfft_DIR` when configuring the target program:
```bash
cmake -Ddtfft_DIR=<dtfft-installation-dir>/lib[64]/cmake/dtfft ..
```

## Useful runtime environment variables
| Name | Possible values | Default value | Description |
| -------- | ------- | -------- | ------- |
| DTFFT_ENABLE_LOG | 0 / 1 | 0 | Make dtFFT print useful runtime information |
| DTFFT_MEASURE_WARMUP_ITERS | non-negative integer | 2 | Number of warmup iterations to run before plan testing when passing `DTFFT_MEASURE` or `DTFFT_PATIENT` to the effort_flag parameter during plan creation |
| DTFFT_MEASURE_ITERS | positive integer | 5 | Number of iterations to run in order to find the best plan when passing `DTFFT_MEASURE` or `DTFFT_PATIENT` to the effort_flag parameter during plan creation |
| DTFFT_FORWARD_X_Y | 1 / 2 | 2 | Default id of transposition plan for X -> Y transpose which will be used if the plan is created with `DTFFT_ESTIMATE` or `DTFFT_MEASURE` effort_flags |
| DTFFT_BACKWARD_X_Y | 1 / 2 | 2 | Default id of transposition plan for Y -> X transpose which will be used if the plan is created with `DTFFT_ESTIMATE` or `DTFFT_MEASURE` effort_flags |
@@ -78,23 +105,22 @@ To build this library modern (2008+) Fortran compiler is required. This library
| DTFFT_FORWARD_X_Z | 1 / 2 | 2 | Default id of transposition plan for X -> Z transpose which will be used if the plan is created with `DTFFT_ESTIMATE` or `DTFFT_MEASURE` effort_flags in case Z-slab optimization is used |
| DTFFT_BACKWARD_X_Z | 1 / 2 | 2 | Default id of transposition plan for Z -> X transpose which will be used if the plan is created with `DTFFT_ESTIMATE` or `DTFFT_MEASURE` effort_flags in case Z-slab optimization is used |

## Notes for C users
C and C++ interfaces of the library are available. Simply
## Notes for C/C++ users
dtFFT provides headers for both C and C++. Simply
```c
// C header
#include <dtfft.h>
// C++ header
#include <dtfft.hpp>
```
and tell the compiler where to search for them:
```bash
mpicc ... -I<path_to_dtfft>/include ...
```
Since C arrays are stored in row-major order, which is the opposite of Fortran's column-major order, the user should pass the dimensions of the array to the planner in reverse order when creating the plan. For example, if your array is a rank-three N x M x L matrix in row-major order, you should pass the dimensions as if it were an L x M x N matrix. Also, if you are using an R2R transform and wish to perform different transform kinds along different dimensions, the ```kinds``` buffer should also be reversed.
Since C arrays are stored in row-major order, which is the opposite of Fortran's column-major order, the user should pass the dimensions of the array to the planner in reverse order when creating the plan. For example, if your array is a rank-three N x M x L matrix in row-major order, you should pass the dimensions as if it were an L x M x N matrix. Also, if you are using an R2R transform and wish to perform different transform kinds along different dimensions, the ```kinds``` buffer should also be reversed. The same goes for MPI communicators with an attached Cartesian topology.

Examples are provided in the ```tests/c``` folder.
## Next Steps

- GPU Support
- Optimize CUDA NVRTC kernels
- Add support for nvshmem
- Add support for custom NCCL installation
## Contribution

You can help this project by reporting problems, making suggestions, localizing it, or contributing to the code. Go to the issue tracker and check whether your problem/suggestion has already been reported. If not, create a new issue with a descriptive title and detail your suggestion or the steps to reproduce the problem.
2 changes: 1 addition & 1 deletion benchmark/cuda/dtfft_bench.h
@@ -62,7 +62,7 @@ void run_dtfft(bool c2c, dtfft_precision_t precision, bool enable_z_slab) {
}
}

int64_t alloc_size;
size_t alloc_size;
DTFFT_CALL( dtfft_get_alloc_size(plan, &alloc_size) );
create_time += MPI_Wtime();
if(comm_rank == 0) {
24 changes: 12 additions & 12 deletions include/dtfft.hpp
@@ -86,16 +86,16 @@ namespace dtfft
*/
template<typename T1, typename T2>
dtfft_error_code_t
execute(std::vector<T1> &in, std::vector<T2> &out, const dtfft_execute_type_t transpose_type)
{return execute(in.data(), out.data(), transpose_type, NULL);}
execute(std::vector<T1> &in, std::vector<T2> &out, const dtfft_execute_type_t execute_type)
{return execute(in.data(), out.data(), execute_type, NULL);}



/** \brief Plan execution with optional auxiliary vector
*
* \param[inout] in Incoming vector
* \param[out] out Result vector
* \param[in] transpose_type Type of transform:
* \param[in] execute_type Type of transform:
* - `DTFFT_TRANSPOSE_OUT`
* - `DTFFT_TRANSPOSE_IN`
* \param[inout] aux Optional auxiliary vector
@@ -104,41 +104,41 @@
*/
template<typename T1, typename T2, typename T3>
dtfft_error_code_t
execute(std::vector<T1> &in, std::vector<T2> &out, const dtfft_execute_type_t transpose_type, std::vector<T3> &aux)
{return execute(in.data(), out.data(), transpose_type, aux.data());}
execute(std::vector<T1> &in, std::vector<T2> &out, const dtfft_execute_type_t execute_type, std::vector<T3> &aux)
{return execute(in.data(), out.data(), execute_type, aux.data());}



/** \brief Plan execution without auxiliary buffer using C-style pointers instead of vectors
*
* \param[inout] in Incoming buffer
* \param[out] out Result buffer
* \param[in] transpose_type Type of transform:
* \param[in] execute_type Type of transform:
* - `DTFFT_TRANSPOSE_OUT`
* - `DTFFT_TRANSPOSE_IN`
*
* \return Status code of method execution
*/
dtfft_error_code_t
execute(void *in, void *out, const dtfft_execute_type_t transpose_type)
{return execute(in, out, transpose_type, NULL);}
execute(void *in, void *out, const dtfft_execute_type_t execute_type)
{return execute(in, out, execute_type, NULL);}



/** \brief Plan execution with auxiliary buffer using C-style pointers instead of vectors
*
* \param[inout] in Incoming buffer
* \param[out] out Result buffer
* \param[in] transpose_type Type of transform:
* \param[in] execute_type Type of transform:
* - `DTFFT_TRANSPOSE_OUT`
* - `DTFFT_TRANSPOSE_IN`
* \param[inout] aux Optional auxiliary buffer
*
* \return Status code of method execution
*/
dtfft_error_code_t
execute(void *in, void *out, const dtfft_execute_type_t transpose_type, void *aux)
{return dtfft_execute(_plan, in, out, transpose_type, aux);}
execute(void *in, void *out, const dtfft_execute_type_t execute_type, void *aux)
{return dtfft_execute(_plan, in, out, execute_type, aux);}



@@ -147,7 +147,7 @@ *
*
* \param[inout] in Incoming vector
* \param[out] out Transposed vector
* \param[in] transpose_type Type of transpose:
* \param[in] execute_type Type of transpose:
* - `DTFFT_TRANSPOSE_X_TO_Y`
* - `DTFFT_TRANSPOSE_Y_TO_X`
* - `DTFFT_TRANSPOSE_Y_TO_Z` (3d plan only)
20 changes: 1 addition & 19 deletions src/dtfft_parameters.F90
@@ -172,6 +172,7 @@ module dtfft_parameters
integer(int32), parameter, public :: COLOR_FFT = int(Z'00FCD05D')
integer(int32), parameter, public :: COLOR_AUTOTUNE = int(Z'006075FF')
integer(int32), parameter, public :: COLOR_AUTOTUNE2 = int(Z'0056E874')
integer(int32), parameter, public :: COLOR_DESTROY = int(Z'00000000')
integer(int32), parameter, public :: COLOR_TRANSPOSE_PALLETTE(-3:3) = [COLOR_TRANSPOSE_ZX, COLOR_TRANSPOSE_ZY, COLOR_TRANSPOSE_YX, 0, COLOR_TRANSPOSE_XY, COLOR_TRANSPOSE_YZ, COLOR_TRANSPOSE_XZ]

integer(int32), parameter, public :: DTFFT_SUCCESS = CONF_DTFFT_SUCCESS
@@ -199,26 +200,7 @@ module dtfft_parameters
integer(int32), parameter, public :: DTFFT_ERROR_GPU_NOT_SET = CONF_DTFFT_ERROR_GPU_NOT_SET
integer(int32), parameter, public :: DTFFT_ERROR_VKFFT_R2R_2D_PLAN = CONF_DTFFT_ERROR_VKFFT_R2R_2D_PLAN
integer(int32), parameter, public :: DTFFT_ERROR_NOT_DEVICE_PTR = CONF_DTFFT_ERROR_NOT_DEVICE_PTR


#if (DTFFT_FORWARD_X_Y > 2) || (DTFFT_FORWARD_X_Y <= 0)
#error "Invalid DTFFT_FORWARD_X_Y parameter"
#endif
#if (DTFFT_BACKWARD_X_Y > 2) || (DTFFT_BACKWARD_X_Y <= 0)
#error "Invalid DTFFT_BACKWARD_X_Y parameter"
#endif
#if (DTFFT_FORWARD_Y_Z > 2) || (DTFFT_FORWARD_Y_Z <= 0)
#error "Invalid DTFFT_FORWARD_Y_Z parameter"
#endif
#if (DTFFT_BACKWARD_Y_Z > 2) || (DTFFT_BACKWARD_Y_Z <= 0)
#error "Invalid DTFFT_BACKWARD_Y_Z parameter"
#endif
#if (DTFFT_FORWARD_X_Z > 2) || (DTFFT_FORWARD_X_Z <= 0)
#error "Invalid DTFFT_FORWARD_X_Z parameter"
#endif
#if (DTFFT_BACKWARD_X_Z > 2) || (DTFFT_BACKWARD_X_Z <= 0)
#error "Invalid DTFFT_BACKWARD_X_Z parameter"
#endif

#ifdef DTFFT_WITH_CUDA
integer(int8), parameter, public :: DTFFT_GPU_BACKEND_MPI_DATATYPE = CONF_DTFFT_GPU_BACKEND_MPI_DATATYPE
5 changes: 4 additions & 1 deletion src/dtfft_plan.F90
@@ -410,6 +410,8 @@ subroutine destroy(self, error_code)
if ( .not. self%is_created ) ierr = DTFFT_ERROR_PLAN_NOT_CREATED
CHECK_ERROR_AND_RETURN

REGION_BEGIN("dtfft_destroy", COLOR_DESTROY)

#ifndef DTFFT_TRANSPOSE_ONLY
select type ( self )
class is ( dtfft_plan_r2c )
@@ -474,6 +476,7 @@ subroutine destroy(self, error_code)
end block
self%ndims = -1
if ( present( error_code ) ) error_code = DTFFT_SUCCESS
REGION_END("dtfft_destroy")
end subroutine destroy

logical function get_z_slab(self, error_code)
@@ -678,7 +681,7 @@ integer(int32) function check_create_args(self, dims, comm, precision, effort_fl
integer(int32) :: top_type !< MPI Comm topology type
integer(int32) :: dim !< Counter

CHECK_INTERNAL_CALL( dtfft_init() )
CHECK_INTERNAL_CALL( init_internal() )

self%ndims = size(dims, kind=int8)
CHECK_INPUT_PARAMETER(self%ndims, VALID_DIMENSIONS, DTFFT_ERROR_INVALID_N_DIMENSIONS)
21 changes: 14 additions & 7 deletions src/dtfft_transpose_plan_cuda.F90
@@ -340,6 +340,7 @@ subroutine run_autotune_backend(comms, cart_comm, pencils, base_storage, stream,
type(cudaEvent) :: timer_start, timer_stop
character(len=:), allocatable :: testing_phase
type(backend_helper) :: helper
integer(int32) :: n_warmup_iters, n_iters
! integer(cuda_count_kind) :: free, total

if ( present(backend_id) ) then
@@ -411,33 +412,39 @@
PHASE_BEGIN(testing_phase, COLOR_AUTOTUNE)
WRITE_INFO(testing_phase)

PHASE_BEGIN("Warmup, "//int_to_str(DTFFT_MEASURE_WARMUP_ITERS)//" iterations", COLOR_TRANSPOSE)
do iter = 1, DTFFT_MEASURE_WARMUP_ITERS
n_warmup_iters = get_iters_from_env(.true.)

PHASE_BEGIN("Warmup, "//int_to_str(n_warmup_iters)//" iterations", COLOR_TRANSPOSE)
do iter = 1, n_warmup_iters
do i = 1, 2_int8 * n_transpose_plans
call plans(i)%execute(in, out, stream)
enddo
enddo
CUDA_CALL( "cudaStreamSynchronize", cudaStreamSynchronize(stream) )
PHASE_END("Warmup, "//int_to_str(DTFFT_MEASURE_WARMUP_ITERS)//" iterations")
PHASE_END("Warmup, "//int_to_str(n_warmup_iters)//" iterations")

call MPI_Barrier(cart_comm, mpi_ierr)

n_iters = get_iters_from_env(.false.)

PHASE_BEGIN("Testing, "//int_to_str(DTFFT_MEASURE_ITERS)//" iterations", COLOR_EXECUTE)
PHASE_BEGIN("Testing, "//int_to_str(n_iters)//" iterations", COLOR_EXECUTE)
total_time = 0.0
! do i = 1, 2_int8 * n_transpose_plans
CUDA_CALL( "cudaEventRecord", cudaEventRecord(timer_start, stream) )
do iter = 1, DTFFT_MEASURE_ITERS
do iter = 1, n_iters
do i = 1, 2_int8 * n_transpose_plans
call plans(i)%execute(in, out, stream)
enddo
enddo
CUDA_CALL( "cudaEventRecord", cudaEventRecord(timer_stop, stream) )
CUDA_CALL( "cudaEventSynchronize", cudaEventSynchronize(timer_stop) )
CUDA_CALL( "cudaEventElapsedTime", cudaEventElapsedTime(execution_time, timer_start, timer_stop) )
execution_time = execution_time / real(DTFFT_MEASURE_ITERS, real32)
execution_time = execution_time / real(n_iters, real32)
total_time = total_time + execution_time
! WRITE_INFO( TRANSPOSE_NAMES(plans(i)%get_tranpose_id())//" : "//double_to_str(real(execution_time, real64))//" [ms]")
! enddo

PHASE_END("Testing, "//int_to_str(DTFFT_MEASURE_ITERS)//" iterations")
PHASE_END("Testing, "//int_to_str(n_iters)//" iterations")

call MPI_Allreduce(total_time, min_execution_time, 1, MPI_REAL4, MPI_MIN, cart_comm, mpi_ierr)
call MPI_Allreduce(total_time, max_execution_time, 1, MPI_REAL4, MPI_MAX, cart_comm, mpi_ierr)
