[WIP] Generic Optimizations #79
base: main
Conversation
src/common/subgroup.hpp
Outdated
```diff
@@ -100,12 +101,15 @@ __attribute__((always_inline)) inline void cross_sg_naive_dft(T& real, T& imag,
   T res_imag = 0;

   unrolled_loop<0, N, 1>([&](int idx_in) __attribute__((always_inline)) {
-    const T multi_re = twiddle<T>::Re[N][idx_in * idx_out % N];
+    const T multi_re = static_cast<T>(
+        sycl::cos(static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) / static_cast<float>(N)));
```
Can you put `static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) / static_cast<float>(N)` in a variable? It's a lot to read.
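A possible shape for that refactor (a sketch only: `std::cos` stands in for `sycl::cos`, and the names `theta` and `twiddle_real` are hypothetical, not from the PR):

```cpp
#include <cmath>

// Sketch of the requested refactor: hoist the long argument expression
// into a named variable. std::cos stands in for sycl::cos here, and the
// names `theta` / `twiddle_real` are illustrative, not from the PR.
template <typename T, int N>
T twiddle_real(int idx_in, int idx_out) {
  const float theta = static_cast<float>(-2 * M_PI) *
                      static_cast<float>(idx_in * idx_out % N) /
                      static_cast<float>(N);
  return static_cast<T>(std::cos(theta));
}
```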
src/common/subgroup.hpp
Outdated
```diff
   }
-  return -twiddle<T>::Im[N][idx_in * idx_out % N];
+  return static_cast<T>(sycl::sin(static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) /
```
negation missing
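For context (my summary, not from the PR): the forward-DFT twiddle is exp(-2·pi·i·k·n/N), so its imaginary part carries a minus sign, and because sine is odd, `-sin(x)` equals `sin(-x)`. That makes it easy to drop one of the two negations when rewriting a `-twiddle<T>::Im[...]` table lookup into a `sycl::sin` call. A plain-C++ sketch of the correct sign convention (names illustrative):

```cpp
#include <cmath>

// Imaginary part of the forward twiddle, with the negation carried
// inside the sine argument. Illustrative helper, not from the PR.
inline double twiddle_im(int kn, int N) {
  // sin(-x) == -sin(x), so this equals -sin(2*pi*kn/N).
  return std::sin(-2.0 * M_PI * kn / N);
}
```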
src/common/subgroup.hpp
Outdated
```diff
   }
-  return -twiddle<T>::Im[N * M][k * n];
+  return static_cast<T>(
```
negation missing
src/common/subgroup.hpp
Outdated
```cpp
#include <common/twiddle.hpp>
#include <common/twiddle_calc.hpp>
```
Are `twiddle.hpp` and `twiddle_calc.hpp` needed?
src/common/subgroup.hpp
Outdated
```cpp
const T multi_re = static_cast<T>(sycl::cos(theta));
const T multi_im = static_cast<T>(sycl::sin(theta));
```
We can skip the multiplication by `M_PI` (which, fun fact, is not part of the C++ standard) if we use `cospi` and `sinpi`.
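The idea, sketched in plain C++ (note: `my_cospi`/`my_sinpi` below are hand-rolled stand-ins for the SYCL built-ins `sycl::cospi`/`sycl::sinpi`, which compute cos(pi·x)/sin(pi·x) directly; the helper names are illustrative): the angle is kept in units of pi, so the call site never multiplies by `M_PI`.

```cpp
#include <cmath>

// Stand-ins for sycl::cospi / sycl::sinpi, which compute cos(pi*x) and
// sin(pi*x) without the caller multiplying by M_PI. Illustrative only;
// the real built-ins avoid this explicit multiplication internally.
inline float my_cospi(float x) { return std::cos(static_cast<float>(M_PI) * x); }
inline float my_sinpi(float x) { return std::sin(static_cast<float>(M_PI) * x); }

// Twiddle with the angle expressed in units of pi:
// theta_pi = -2 * (k*n % N) / N, no M_PI at the call site.
template <int N>
void twiddle(int kn, float& re, float& im) {
  const float theta_pi =
      -2.0f * static_cast<float>(kn % N) / static_cast<float>(N);
  re = my_cospi(theta_pi);
  im = my_sinpi(theta_pi);
}
```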
src/common/subgroup.hpp
Outdated
```cpp
using theta_t = T;
#else
using theta_t = float;
```
What if we're doing a double-precision DFT? If it's acceptable to use single-precision twiddles in a double-precision DFT, do we get a speedup if we use half-precision twiddles in a single-precision DFT?
It is not acceptable. We previously had accuracy issues with twiddles being calculated at lower precision due to fast math; lower-precision twiddles will create the same issues.
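To illustrate the accuracy concern (a standalone sketch, not code from the PR): computing the twiddle angle and sine in single precision leaves an error several orders of magnitude above double-precision round-off (~1e-16), which is what a double-precision DFT would then accumulate at every butterfly.

```cpp
#include <cmath>

// Error of a float-computed twiddle relative to the double-precision
// value, for index k out of N. Standalone illustration; names are not
// from the PR.
inline double float_twiddle_error(int k, int N) {
  // Single-precision path: constant, division and sin all in float.
  const float theta_f = static_cast<float>(-2.0 * M_PI) *
                        static_cast<float>(k) / static_cast<float>(N);
  // Double-precision reference.
  const double theta_d = -2.0 * M_PI * static_cast<double>(k) / N;
  return std::abs(static_cast<double>(std::sin(theta_f)) - std::sin(theta_d));
}
```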
> What if we're doing a double-precision DFT? If it's acceptable to use single-precision twiddles in a double-precision DFT
Hence the introduction of `PORTFFT_USE_FAST_TRIG_APPROX`: it is left to the user whether they are okay with some mixed-precision compute at the cost of some accuracy. (I haven't run into accuracy issues so far, for both float and double tests, on Nvidia, AMD and locally; maybe because the other twiddles are computed at full precision on the host.)
> do we get a speedup if we use half-precision twiddles in a single-precision DFT?
No, the speedup does not come from lowering the precision of the sin/cos computation itself; rather, the idea is to generate very specific instructions which, depending on the target device, may or may not be scheduled on different hardware.
For example, on Nvidia:
- With optimization level O3, when the input to `sycl::sin`/`sycl::cos` (I have not tested it using `sycl::sinpi` yet) is float, this generates `sin.approx.ftz`, resulting in the SASS instruction `MUFU.SIN`. MUFU (previously known as SFU on pre-Volta architectures, I think) is a separate set of ALUs whose function is just to calculate transcendental functions, plus a few others like square root and reciprocal. When sin/cos get scheduled on these (rather than using the long polynomial approximation), they use a slightly less precise approximation which is much faster to compute (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions for accuracy ranges). The MUFU takes far fewer cycles (I remember reading that MUFU needs just 4 cycles to produce a result per op per warp, but I can't recall which doc that was in). Besides the faster compute, this opens up ILP possibilities: since the sin/cos compute runs on separate hardware, normal ALU ops and shuffle ops can proceed as long as there are no data dependencies (see this line). As far as I know there are no special units on AMD to compute sin/cos, but there this generates the `V_SIN_F32` vector operation, which is still faster than fetching each twiddle from global memory (see the Vega instruction set for more, especially the global fetching mechanisms), hence a speedup there as well.
- That being said, the faster compute should be able to beat the caching effect. For example, if `MUFU.SIN` is not generated, the result is a long polynomial approximation, which is a lot slower than our current implementation. I have yet to compare the two approaches on GPUs which have a large amount of L1 (like 512KB) and cache loads aggressively, but lack such special transcendental-computing capabilities. If our current approach is preferred on such devices, maybe I will use another CMake variable to guide the compute accordingly.
- AMD does have `V_SIN_F16`, so fp16 could be faster, but I am shying away from fp16 compute at the moment, as it might hamper portability.
Some generic optimizations benefiting subgroup and workgroup DFTs and bringing performance improvements on all devices
Checklist
Tick if relevant: