
[Perf] Is it possible for kernels wrapped with AOT to have similar performance compared with the original ones? #37

xinji1 opened this issue Jul 24, 2024 · 1 comment


xinji1 commented Jul 24, 2024

Thanks for the related updates! I just updated my kernels with the latest AOT and found no big difference in performance. Here are two small questions:

  1. Have you ever tested the performance of AOT-wrapped kernels? How much gain comes from reducing the launch overhead?
  2. Take the kernel attn_fwd as an example. My first AOT + kernels implementation (similar to attn_fwd but with a paged_attention setting) was based on 24a3fe9cb57. Are there any performance-boosting code changes between 24a3fe9cb57 and the latest branch? Specifically:
  • I just noticed that the original attn_fwd.py has been split into fwd_kernel_common.py, fwd_kernel_inner.py, and fwd_kernel.py. Does this help AOT?

I would appreciate it if you could take some time to answer these questions.

xinyazhang (Collaborator) commented Jul 24, 2024

Have you ever tested the performance of kernels with aot wrapped

Yes, they are close in TFLOPS. The corresponding tests can be found in test/performance_forward.py and tritonsrc/performance_forward.py. (The two scripts are mostly identical, but they use different backends because they live in different directories and consequently load different attn_torch_function modules.)
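For context, a comparison like this usually times both call paths and converts the measured latency into TFLOPS via the attention FLOP count. The snippet below is only a hedged sketch of that calculation, not the code in performance_forward.py; run_triton_attn_fwd and run_aot_attn_fwd are hypothetical placeholders for the Triton-source and AOT-compiled call paths.

```python
# Minimal TFLOPS-comparison sketch; run_triton_attn_fwd / run_aot_attn_fwd
# are hypothetical wrappers, not AOTriton APIs.
import torch
from triton.testing import do_bench

def attn_fwd_tflops(batch, heads, seqlen, head_dim, ms):
    # Forward attention is dominated by two matmuls (Q @ K^T and P @ V),
    # each costing 2 * seqlen * seqlen * head_dim FLOPs per (batch, head).
    flops = 4.0 * batch * heads * seqlen * seqlen * head_dim
    return flops / (ms * 1e-3) / 1e12

B, H, N, D = 4, 16, 2048, 64
q = torch.randn(B, H, N, D, device='cuda', dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# do_bench reports the kernel runtime in milliseconds.
# ms_triton = do_bench(lambda: run_triton_attn_fwd(q, k, v))
# ms_aot    = do_bench(lambda: run_aot_attn_fwd(q, k, v))
# print(attn_fwd_tflops(B, H, N, D, ms_triton), attn_fwd_tflops(B, H, N, D, ms_aot))
```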

Is there any performance boosting code changes from 24a3fe9cb57 to latest branch? Specifically,

A few notable changes (not merged yet):

  1. Bump the Triton compiler to the latest upstream
  2. Migrate away from tl.make_block_ptr, since upstream Triton is no longer maintaining it (see the sketch after this list)
  3. Add extra autotune Configs when generating the tuning database

They can be found in #36.
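As an illustration of item 2, the sketch below shows what migrating away from tl.make_block_ptr generally looks like in a Triton kernel: the block pointer describing a 2-D tile is replaced by explicit offset/stride arithmetic plus a load mask. This is a generic, hypothetical helper (load_k_block and its arguments are made up for the example, assuming BLOCK_DMODEL covers the full head dimension), not the actual code in fwd_kernel_inner.py.

```python
# Hypothetical Triton device helper illustrating the migration away from
# tl.make_block_ptr; it would be called from a @triton.jit attention kernel.
import triton
import triton.language as tl

@triton.jit
def load_k_block(K, stride_kn, stride_kk, start_n, N_CTX,
                 BLOCK_N: tl.constexpr, BLOCK_DMODEL: tl.constexpr):
    # Old style: describe the 2-D tile with a block pointer.
    # k_ptrs = tl.make_block_ptr(base=K, shape=(N_CTX, BLOCK_DMODEL),
    #                            strides=(stride_kn, stride_kk),
    #                            offsets=(start_n, 0),
    #                            block_shape=(BLOCK_N, BLOCK_DMODEL),
    #                            order=(1, 0))
    # k = tl.load(k_ptrs, boundary_check=(0,))

    # New style: explicit offsets and strides, with a mask for the ragged tail.
    offs_n = start_n + tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_DMODEL)
    k_ptrs = K + offs_n[:, None] * stride_kn + offs_d[None, :] * stride_kk
    return tl.load(k_ptrs, mask=offs_n[:, None] < N_CTX, other=0.0)
```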

I just noticed that the original attn_fwd.py has been split into fwd_kernel_common.py, fwd_kernel_inner.py and fwd_kernel.py. Is it helpful to aot?

No, it only helps the readability of the code by avoiding long files.
