[WIP] Generic Optimizations #79
base: main
Conversation
src/common/subgroup.hpp
Outdated
```diff
@@ -100,12 +101,15 @@ __attribute__((always_inline)) inline void cross_sg_naive_dft(T& real, T& imag,
   T res_imag = 0;

   unrolled_loop<0, N, 1>([&](int idx_in) __attribute__((always_inline)) {
-    const T multi_re = twiddle<T>::Re[N][idx_in * idx_out % N];
+    const T multi_re = static_cast<T>(
+        sycl::cos(static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) / static_cast<float>(N)));
```
Can you put `static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) / static_cast<float>(N)` in a variable? It's a lot to read.
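A possible shape for that refactor (a sketch only: `std::cos` stands in for `sycl::cos`, and the names `theta` and `twiddle_real` are hypothetical, not from the PR):

```cpp
#include <cmath>

// Sketch of the requested refactor: hoist the long argument expression
// into a named variable. std::cos stands in for sycl::cos here, and the
// names `theta` / `twiddle_real` are illustrative, not from the PR.
template <typename T, int N>
T twiddle_real(int idx_in, int idx_out) {
  const float theta = static_cast<float>(-2 * M_PI) *
                      static_cast<float>(idx_in * idx_out % N) /
                      static_cast<float>(N);
  return static_cast<T>(std::cos(theta));
}
```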
src/common/subgroup.hpp
Outdated
```diff
   }
-  return -twiddle<T>::Im[N][idx_in * idx_out % N];
+  return static_cast<T>(sycl::sin(static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) /
```
negation missing
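For context (my summary, not from the PR): the forward-DFT twiddle is exp(-2·pi·i·k·n/N), so its imaginary part carries a minus sign, and because sine is odd, `-sin(x)` equals `sin(-x)`. That makes it easy to drop one of the two negations when rewriting a `-twiddle<T>::Im[...]` table lookup into a `sycl::sin` call. A plain-C++ sketch of the correct sign convention (names illustrative):

```cpp
#include <cmath>

// Imaginary part of the forward twiddle, with the negation carried
// inside the sine argument. Illustrative helper, not from the PR.
inline double twiddle_im(int kn, int N) {
  // sin(-x) == -sin(x), so this equals -sin(2*pi*kn/N).
  return std::sin(-2.0 * M_PI * kn / N);
}
```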
src/common/subgroup.hpp
Outdated
```diff
   }
-  return -twiddle<T>::Im[N * M][k * n];
+  return static_cast<T>(
```
negation missing
src/common/subgroup.hpp
Outdated
```cpp
#include <common/twiddle.hpp>
#include <common/twiddle_calc.hpp>
```
Are `twiddle.hpp` and `twiddle_calc.hpp` needed?
src/common/subgroup.hpp
Outdated
```cpp
const T multi_re = static_cast<T>(sycl::cos(theta));
const T multi_im = static_cast<T>(sycl::sin(theta));
```
We can skip the multiplication by `M_PI` (which, fun fact, is not part of the C++ standard) if we use `cospi` and `sinpi`.
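The idea, sketched in plain C++ (note: `my_cospi`/`my_sinpi` below are hand-rolled stand-ins for the SYCL built-ins `sycl::cospi`/`sycl::sinpi`, which compute cos(pi·x)/sin(pi·x) directly; the helper names are illustrative): the angle is kept in units of pi, so the call site never multiplies by `M_PI`.

```cpp
#include <cmath>

// Stand-ins for sycl::cospi / sycl::sinpi, which compute cos(pi*x) and
// sin(pi*x) without the caller multiplying by M_PI. Illustrative only;
// the real built-ins avoid this explicit multiplication internally.
inline float my_cospi(float x) { return std::cos(static_cast<float>(M_PI) * x); }
inline float my_sinpi(float x) { return std::sin(static_cast<float>(M_PI) * x); }

// Twiddle with the angle expressed in units of pi:
// theta_pi = -2 * (k*n % N) / N, no M_PI at the call site.
template <int N>
void twiddle(int kn, float& re, float& im) {
  const float theta_pi =
      -2.0f * static_cast<float>(kn % N) / static_cast<float>(N);
  re = my_cospi(theta_pi);
  im = my_sinpi(theta_pi);
}
```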
src/common/subgroup.hpp
Outdated
```cpp
using theta_t = T;
#else
using theta_t = float;
```
What if we're doing a double-precision DFT? If it's acceptable to use single-precision twiddles in a double-precision DFT, do we get a speedup if we use half-precision twiddles in a single-precision DFT?
It is not acceptable. We previously had accuracy issues with twiddles being calculated at lower precision due to fast math; lower-precision twiddles will create the same issues.
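To illustrate the accuracy concern (a standalone sketch, not code from the PR): computing the twiddle angle and sine in single precision leaves an error several orders of magnitude above double-precision round-off (~1e-16), which is what a double-precision DFT would then accumulate at every butterfly.

```cpp
#include <cmath>

// Error of a float-computed twiddle relative to the double-precision
// value, for index k out of N. Standalone illustration; names are not
// from the PR.
inline double float_twiddle_error(int k, int N) {
  // Single-precision path: constant, division and sin all in float.
  const float theta_f = static_cast<float>(-2.0 * M_PI) *
                        static_cast<float>(k) / static_cast<float>(N);
  // Double-precision reference.
  const double theta_d = -2.0 * M_PI * static_cast<double>(k) / N;
  return std::abs(static_cast<double>(std::sin(theta_f)) - std::sin(theta_d));
}
```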
> What if we're doing a double-precision DFT? If it's acceptable to use single-precision twiddles in a double-precision DFT
Hence the introduction of `PORTFFT_USE_FAST_TRIG_APPROX`: it is left to the user whether they are okay with some mixed-precision compute at the cost of some accuracy. (I haven't run into accuracy issues so far, for both float and double tests, on Nvidia, AMD and locally; maybe because the other twiddles are computed at full precision on the host.)
> do we get a speedup if we use half-precision twiddles in a single-precision DFT?
No, the speedup does not come from lowering the precision of the sin/cos computation itself; rather, the idea is to generate very specific instructions which, depending on the target device, may or may not be scheduled on different hardware.
For example, on Nvidia:
- With optimization level O3, when the input to `sycl::sin`/`sycl::cos` (I have not tested it using `sycl::sinpi` yet) is float, this generates `sin.approx.ftz`, resulting in the SASS instruction `MUFU.SIN`. MUFU (previously known as SFU on pre-Volta architectures, I think) is a separate set of ALUs whose function is just to calculate transcendental functions, plus a few others like square root and reciprocal. When sin/cos get scheduled on these (rather than using the long polynomial approximation), they use a slightly less precise approximation which is much faster to compute (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions for accuracy ranges). The MUFU takes far fewer cycles (I remember reading that MUFU needs just 4 cycles to produce a result per op per warp, but I can't recall which doc that was in). Besides the faster compute, this opens up ILP possibilities: since the sin/cos compute runs on separate hardware, normal ALU ops and shuffle ops can proceed as long as there are no data dependencies (see this line). As far as I know there are no special units on AMD to compute sin/cos, but there this generates the `V_SIN_F32` vector operation, which is still faster than fetching each twiddle from global memory (see the Vega instruction set for more, especially the global fetching mechanisms), hence a speedup there as well.
- That being said, the faster compute should be able to beat the caching effect. For example, if `MUFU.SIN` is not generated, the result is a long polynomial approximation, which is a lot slower than our current implementation. I have yet to compare the two approaches on GPUs which have a large amount of L1 (like 512KB) and cache loads aggressively, but lack such special transcendental-computing capabilities. If our current approach is preferred on such devices, maybe I will use another CMake variable to guide the compute accordingly.
- AMD does have `V_SIN_F16`, so fp16 could be faster, but I am shying away from fp16 compute at the moment, as it might hamper portability.
Some generic optimizations benefiting subgroup and workgroup DFTs and bringing performance improvements on all devices
Checklist
Tick if relevant: