Fix the bug that, for block_k = 16 MMA, compilation crashes on Ampere. #15
Closed
- …ng#4374) Update LLVM version to llvm/llvm-project@dd7d81e
- …ng#4410) Included the use of the non-deprecated version of createMCObjectStreamer (needed after llvm/llvm-project@f1422a8).
- …ich exists in gcc-defaults. (triton-lang#4548) The llvm build check is trying to get http://ftp.de.debian.org/debian/pool/main/g/gcc-defaults/gcc-aarch64-linux-gnu_13.2.0-7_amd64.deb, which does not exist and therefore fails. Updating the version to an existing one (14.1.0-2).
  - [x] I am not making a trivial change, such as fixing a typo in a comment.
  - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
  - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
  - [x] This PR does not need a test because it is not a functional change; it should fix the git-checks builds.
  - [x] I have not added any `lit` tests.
For tracking: this has been opened upstream as triton-lang#4768.
The original issue is reported here: triton-lang#3435. The issue happens during compilation, when arith.sitofp (from i8 to fp16) operates on a tensor operand that has a dot_op layout with the first dimension of the tensor equal to 16 and opIdx = 1.

For example:
%104 = arith.sitofp %103 : tensor<16x64xi8, #triton_gpu.dot_op<{opIdx = 1, parent = #mma, kWidth = 4}>> to tensor<16x64xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #mma, kWidth = 4}>>

Investigation shows that the bug happens in the TritonGPUToLLVM pass. In the corner case (block_k = 16 and opIdx = 1), extra elements are unpacked in include/triton/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.h, lines 186-194. The code unpacks extra elements because of an implicit assumption in lib/Dialect/TritonGPU/IR/Dialect.h, line 2000, that at least 4 reps (e.g., i32) will be loaded.

Therefore, in our patch, the extra loaded elements are dropped in this corner case (block_k = 16 and opIdx = 1).
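For context, below is a hedged sketch of a Triton kernel that should reach this code path. It is not taken from the issue or the PR; the kernel name, shapes, and pointer arithmetic are illustrative. The opIdx = 1 operand is loaded as i8 and converted to f16 inside a dot with BLOCK_K = 16, matching the tensor<16x64xi8> operand in the example above.

```python
# Hypothetical reproducer (assumed from the description above), for an
# Ampere GPU and a Triton build predating this fix; all names, shapes,
# and strides are illustrative, not taken from the original issue.
import torch
import triton
import triton.language as tl

@triton.jit
def int8_dot_kernel(a_ptr, b_ptr, c_ptr,
                    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                    BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    # a: f16 tile, the opIdx = 0 operand of the dot.
    a = tl.load(a_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :])
    # b: i8 tile, the opIdx = 1 operand; the .to(tl.float16) below lowers
    # to an arith.sitofp that ends up on the dot_op layout once layouts
    # propagate, as in the IR example above.
    b = tl.load(b_ptr + offs_k[:, None] * BLOCK_N + offs_n[None, :])
    c = tl.dot(a, b.to(tl.float16))
    tl.store(c_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :], c)

a = torch.randn((32, 16), device="cuda", dtype=torch.float16)
b = torch.randint(-128, 127, (16, 64), device="cuda", dtype=torch.int8)
c = torch.empty((32, 64), device="cuda", dtype=torch.float32)
# BLOCK_K = 16 with an i8 -> f16 conversion on the second operand is the
# corner case this PR addresses.
int8_dot_kernel[(1,)](a, b, c, BLOCK_M=32, BLOCK_N=64, BLOCK_K=16)
```

Per the description above, compiling such a kernel on Ampere crashed before the patch; with the extra loaded elements dropped, compilation should succeed.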
The core Triton team is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.
Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)