Is it a good idea to use GCN cross lane instruction for optimization? #510
Comments
Seems like even sub_group functions are not used for AMD but only for Intel.
Is this related to the subgroup shuffling, which is already implemented for NVIDIA and Intel but not used for AMD?
Partially yes.
I'm happy to review a pull request for this feature and/or provide some guidance for anyone that wants to develop this. I don't have time myself (nor the hardware to test on), so we'll have to rely on the community.
@CNugteren if we only replace the LDS reads/writes without changing the logic much, I'm not sure that will improve performance a lot. It seems like "invert" and "transpose" could be improved a lot. Basically, any frequent data exchange between threads in a wavefront could potentially be sped up. Any suggestions on this?
Regarding optimizing the loads/stores from memory, I'm not sure there is that much to gain, but it depends on the matrix dimensions of course. In the ideal case GEMM is compute-bound and not memory-bound. But I'm not familiar with AMD's recent GPU architectures and thus I can't say much about the actual benefits of these load instructions you are talking about.
Regarding improving transpose or invert functions, I also don't think that is where the big gains are, because ideally they don't consume much time; it is the matrix-multiplication kernel itself afterwards that matters most. But again this depends on the actual parameters the user supplies to the CLBlast program. And also every small bit can help, so contributions there are also welcome.
I think the main benefit could be by using these cross-lane operations on AMD GPUs in the same way the current 'shuffle' instructions are used: to move data across threads in a cheap way, instead of going through the local SRAM memories or caches. But again I haven't studied recent AMD architectures much so I don't know about the impact these instructions can have on the total picture.
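To make the "shuffle instead of local memory" idea concrete, here is a rough host-side sketch in plain C. It is not kernel code and not taken from CLBlast: it only models a wavefront as an array of per-lane registers, and the names `WAVE_SIZE`, `exchange_via_local` and `exchange_via_shuffle` are made up for illustration. The 64-lane width is an assumption based on GCN.

```c
#include <assert.h>

#define WAVE_SIZE 64  /* assumption: 64 lanes per wavefront, as on GCN */

/* Route 1: every lane stores its register to a shared scratch array
 * (a stand-in for LDS), then every lane loads the slot it wants. */
static void exchange_via_local(const float *regs, const int *src, float *out) {
    float scratch[WAVE_SIZE];
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        scratch[lane] = regs[lane];            /* store phase */
    }
    /* a real kernel would need synchronisation between the two phases */
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        assert(src[lane] >= 0 && src[lane] < WAVE_SIZE);
        out[lane] = scratch[src[lane]];        /* load phase */
    }
}

/* Route 2: a cross-lane shuffle reads the source lane's register
 * directly, with no scratch storage in between. */
static void exchange_via_shuffle(const float *regs, const int *src, float *out) {
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        assert(src[lane] >= 0 && src[lane] < WAVE_SIZE);
        out[lane] = regs[src[lane]];
    }
}
```

Both routes produce the same values for the same `src` pattern; the difference on real hardware would be that the second route skips the scratch storage and the synchronisation around it.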
Can "shuffle" be applied to any OpenCL kernel? Are there any candidate kernels to investigate?
Found this article interesting: https://cnugteren.github.io/tutorial/pages/page10.html
CLBlast/src/kernels/level3/xgemm_part3.opencl Line 240 in bcd294a
It seems like we can replace it with AMD OpenCL's extension for subgroup shuffling. Not sure how much that would improve the speed; the time saved on LDS reads/writes may not be much. But more wavefronts may be able to run concurrently if we save some LDS usage.
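As a back-of-the-envelope illustration of the LDS point, the sketch below computes how many workgroups could be resident on one compute unit if LDS were the only limit. The function name is hypothetical, and the 64 KiB per compute unit used in the note afterwards is an assumption based on GCN; real occupancy also depends on register usage and wavefront slots, which this ignores.

```c
#include <assert.h>

/* Upper bound on workgroups resident on one compute unit,
 * considering only local memory (LDS) usage. */
static int workgroups_limited_by_lds(int lds_per_cu_bytes,
                                     int lds_per_workgroup_bytes) {
    assert(lds_per_cu_bytes > 0);
    assert(lds_per_workgroup_bytes > 0);
    return lds_per_cu_bytes / lds_per_workgroup_bytes;  /* integer division rounds down */
}
```

With an assumed 64 KiB of LDS per compute unit, cutting a kernel's LDS footprint from 16 KiB to 8 KiB per workgroup raises this bound from 4 to 8 workgroups.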
The main kernel would be the level 3 GEMM kernel (the regular, not 'direct' one). That kernel covers most of the compute heavy computations of CLBlast.
Yes that is the same I think, although that tutorial is quite old compared to the current CLBlast kernel implementation, so some things might have changed.
Indeed, see also the links I posted above that point at the Intel and NVIDIA implementations. You can probably add an AMD version there, and then run the CLBlast GEMM tuner and see if you get more performance out.
@CNugteren while working on a PR that uses cross-lane instructions for subgroup shuffling, I have a question. Here it seems like the instruction:
You can see the definition of CLBlast/src/kernels/level3/xgemm_part1.opencl Line 162 in bcd294a
And thus, you can use the define:
#if VWN == 1
  // your code
#else
  // regular fallback code
#endif
Or you could have a specific implementation for
The current AMD PR doesn't work with 64-bit precision, where a double needs two registers. I will change the PR.
@tyler-utah what do you think? It's only using one instruction with a 32-bit operand. How is that supposed to work with 64-bit precision or N greater than 2?
That NVIDIA feature is simply guarded to only activate in single precision: You can do something similar for AMD.
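If someone wants to go beyond the single-precision guard, one possible approach is to move a double as two 32-bit halves, one shuffle each. The plain C sketch below only models that idea on the host: `shuffle32` is a made-up stand-in for a 32-bit cross-lane instruction, `shuffle_double` and `WAVE_SIZE` are hypothetical names, and none of this is the code from the PR.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WAVE_SIZE 64  /* assumption: 64 lanes per wavefront, as on GCN */

/* Stand-in for a 32-bit cross-lane instruction:
 * returns the 32-bit register held by lane `src`. */
static uint32_t shuffle32(const uint32_t *lane_regs, int src) {
    assert(src >= 0 && src < WAVE_SIZE);
    return lane_regs[src];
}

/* Shuffle a double by moving its low and high 32-bit halves separately
 * and joining them again on the receiving side. */
static double shuffle_double(const double *lane_vals, int src) {
    uint32_t lo[WAVE_SIZE], hi[WAVE_SIZE];
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        uint64_t bits;
        memcpy(&bits, &lane_vals[lane], sizeof bits);  /* reinterpret the bits safely */
        lo[lane] = (uint32_t)(bits & 0xFFFFFFFFu);
        hi[lane] = (uint32_t)(bits >> 32);
    }
    uint64_t joined = ((uint64_t)shuffle32(hi, src) << 32) | shuffle32(lo, src);
    double result;
    memcpy(&result, &joined, sizeof result);
    return result;
}
```

The cost is two cross-lane operations per double instead of one, so whether this beats the regular fallback path would have to be measured with the tuner.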
Many CUDA optimization methods can be migrated to AMD OpenCL. Besides the smaller LDS, one big barrier is that OpenCL doesn't have a cross-lane function like CUDA's shfl. However, inline assembly is well supported by the ROCm compiler on Navi cards. We can use DPP instructions to exchange registers between threads even faster. Is anyone interested in this work?