
[Perf] Is it possible for kernels wrapped with AOT to have similar performance compared with the original ones? #37

xinji1 opened this issue Jul 24, 2024 · 1 comment


xinji1 commented Jul 24, 2024

Thanks for the related updates! I just updated my kernels with the latest AOT and found no big difference in performance. Here are two small questions:

  1. Have you ever tested the performance of AOT-wrapped kernels? How much gain comes from reducing the launch overhead?
  2. Take the kernel attn_fwd as an example. My first AOT + kernels implementation (similar to attn_fwd but with a paged_attention setting) was based on 24a3fe9cb57. Are there any performance-boosting code changes between 24a3fe9cb57 and the latest branch? Specifically:
  • I just noticed that the original attn_fwd.py has been split into fwd_kernel_common.py, fwd_kernel_inner.py, and fwd_kernel.py. Does this help AOT?

I would appreciate it if you could take some time to answer these questions.

xinyazhang (Collaborator) commented Jul 24, 2024

Have you ever tested the performance of kernels with aot wrapped

Yes, they are close in TFLOPS. The corresponding tests can be found in test/performance_forward.py and tritonsrc/performance_forward.py. (The two scripts are mostly identical, but they use different backends because they live in different directories and consequently load different attn_torch_function modules.)
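For context, a comparison like this usually times both call paths and converts the measured latency into TFLOPS via the attention FLOP count. The snippet below is only a hedged sketch of that calculation, not the code in performance_forward.py; run_triton_attn_fwd and run_aot_attn_fwd are hypothetical placeholders for the Triton-source and AOT-compiled call paths.

```python
# Minimal TFLOPS-comparison sketch; run_triton_attn_fwd / run_aot_attn_fwd
# are hypothetical wrappers, not AOTriton APIs.
import torch
from triton.testing import do_bench

def attn_fwd_tflops(batch, heads, seqlen, head_dim, ms):
    # Forward attention is dominated by two matmuls (Q @ K^T and P @ V),
    # each costing 2 * seqlen * seqlen * head_dim FLOPs per (batch, head).
    flops = 4.0 * batch * heads * seqlen * seqlen * head_dim
    return flops / (ms * 1e-3) / 1e12

B, H, N, D = 4, 16, 2048, 64
q = torch.randn(B, H, N, D, device='cuda', dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# do_bench reports the kernel runtime in milliseconds.
# ms_triton = do_bench(lambda: run_triton_attn_fwd(q, k, v))
# ms_aot    = do_bench(lambda: run_aot_attn_fwd(q, k, v))
# print(attn_fwd_tflops(B, H, N, D, ms_triton), attn_fwd_tflops(B, H, N, D, ms_aot))
```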

Is there any performance boosting code changes from 24a3fe9cb57 to latest branch? Specifically,

A few notable changes (not merged yet):

  1. Bump the Triton compiler to the latest upstream
  2. Migrate away from tl.make_block_ptr, since upstream Triton is no longer maintaining it (see the sketch after this list)
  3. Add extra autotune Configs when generating the tuning database

They can be found in #36.
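As an illustration of item 2, the sketch below shows what migrating away from tl.make_block_ptr generally looks like in a Triton kernel: the block pointer describing a 2-D tile is replaced by explicit offset/stride arithmetic plus a load mask. This is a generic, hypothetical helper (load_k_block and its arguments are made up for the example, assuming BLOCK_DMODEL covers the full head dimension), not the actual code in fwd_kernel_inner.py.

```python
# Hypothetical Triton device helper illustrating the migration away from
# tl.make_block_ptr; it would be called from a @triton.jit attention kernel.
import triton
import triton.language as tl

@triton.jit
def load_k_block(K, stride_kn, stride_kk, start_n, N_CTX,
                 BLOCK_N: tl.constexpr, BLOCK_DMODEL: tl.constexpr):
    # Old style: describe the 2-D tile with a block pointer.
    # k_ptrs = tl.make_block_ptr(base=K, shape=(N_CTX, BLOCK_DMODEL),
    #                            strides=(stride_kn, stride_kk),
    #                            offsets=(start_n, 0),
    #                            block_shape=(BLOCK_N, BLOCK_DMODEL),
    #                            order=(1, 0))
    # k = tl.load(k_ptrs, boundary_check=(0,))

    # New style: explicit offsets and strides, with a mask for the ragged tail.
    offs_n = start_n + tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_DMODEL)
    k_ptrs = K + offs_n[:, None] * stride_kn + offs_d[None, :] * stride_kk
    return tl.load(k_ptrs, mask=offs_n[:, None] < N_CTX, other=0.0)
```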

I just noticed that the original attn_fwd.py has been split into fwd_kernel_common.py, fwd_kernel_inner.py and fwd_kernel.py. Is it helpful to aot?

No, it only helps the readability of the code by avoiding long files.
