Add explicit multiply-reduce GEMM kernel #621

brunomazzottiamd
Collaborator

Add a kernel that implements GEMM with explicit multiply-reduce instructions for small block sizes. Such
small block sizes aren't natively supported by the tl.dot operator.
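
For readers unfamiliar with the idea, here is a minimal plain-Python sketch of what "multiply-reduce" means for a blocked GEMM: each K block contributes elementwise products that are then summed into the accumulator, instead of going through a dedicated dot-product operator. This is not the kernel from this PR; the function name `multreduce_gemm` and the `block_k` parameter are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: a plain-Python model of a blocked multiply-reduce GEMM.
# `multreduce_gemm` and `block_k` are hypothetical names, not taken from the PR.

def multreduce_gemm(a, b, block_k=4):
    """Compute C = A @ B via elementwise multiply followed by a sum-reduction over K."""
    m, k = len(a), len(a[0])
    assert len(b) == k, "inner dimensions must match"
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Walk the K dimension in blocks, as a tiled kernel would.
    for k0 in range(0, k, block_k):
        k1 = min(k0 + block_k, k)
        for i in range(m):
            for j in range(n):
                # Multiply step: elementwise products for this K block.
                products = [a[i][kk] * b[kk][j] for kk in range(k0, k1)]
                # Reduce step: sum the products into the accumulator.
                c[i][j] += sum(products)
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0]]  # M = 1, like the skinny shapes benchmarked in this PR
    b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(multreduce_gemm(a, b, block_k=2))
```

In a Triton kernel the same pattern would presumably be written with a broadcasted product of the two tiles followed by `tl.sum` along the K axis, which is what allows tile shapes smaller than the ones tl.dot accepts.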

Although numerically correct, this kernel performed worse than the corresponding GEMM kernel that
uses tl.dot with a minimum block size of 16:

MI300 Results for FP16:

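| trans | M | N | K | Dot (GiB/s) | Multiply-Reduce (GiB/s) | Speedup |
|-------|---|-------|-------|---------|---------|------|
| TN | 1 | 8192 | 28672 | 3491.33 | 2869.87 | 0.82 |
| TN | 1 | 6144 | 6144 | 3858.22 | 2673.33 | 0.69 |
| TN | 1 | 4096 | 4096 | 2352.54 | 1680.93 | 0.71 |
| TN | 2 | 16384 | 16384 | 3412.17 | 3318.44 | 0.97 |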
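
The Speedup column is consistent with the ratio of multiply-reduce throughput to dot throughput, so values below 1 mean the multiply-reduce kernel is slower. A small sketch that reproduces the column from the numbers reported above (the variable names are illustrative):

```python
# Recompute the Speedup column as multiply-reduce throughput / dot throughput.
# Each tuple is (M, N, K, dot_gibps, multreduce_gibps), copied from the results above.
results = [
    (1, 8192, 28672, 3491.33, 2869.87),
    (1, 6144, 6144, 3858.22, 2673.33),
    (1, 4096, 4096, 2352.54, 1680.93),
    (2, 16384, 16384, 3412.17, 3318.44),
]

speedups = [round(mr / dot, 2) for (_, _, _, dot, mr) in results]

if __name__ == "__main__":
    for (m, n, k, _, _), s in zip(results, speedups):
        print(f"M={m} N={n} K={k} speedup={s}")
```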

The code is based on tune_gemm matmul_kernel from commit cf44637 (see triton-mlir branch).

@brunomazzottiamd
Collaborator Author

The related issue is https://github.com/ROCm/triton-internal/issues/169.

@brunomazzottiamd
Collaborator Author

@xiaohuguo2023 told me we can merge PRs by ourselves once we have one approval. I'll do it.

@brunomazzottiamd brunomazzottiamd merged commit 1d2e066 into ROCm:main_perf Aug 6, 2024
4 checks passed
@brunomazzottiamd brunomazzottiamd deleted the 169-add-multreduce-gemm-kernel branch August 14, 2024 13:48