
[SWP] attempt to remove a workaround for a triton llvm codegen bug #4774

Merged: 2 commits into triton-lang:main on Sep 27, 2024

Conversation

manman-ren (Collaborator) commented Sep 21, 2024

Triton LLVM codegen has a bug where local_loads from #shared to #mma layout can lead to invalid code if the loaded shape is smaller than the mma tile. Remove the workaround.
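For concreteness, here is a minimal sketch of the failing pattern, assembled from the test-case IR quoted later in this thread (the SSA value names are illustrative and the #shared layout definition is elided):

#mma = #triton_gpu.nvidia_mma<{versionMajor = 2, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [16, 8]}>
// dim 1 of the loaded tensor (1) is smaller than dim 1 of the 16x8 mma tile
%v = triton_gpu.local_load %buf : !tt.memdesc<128x1xi32, #shared, #triton_gpu.shared_memory, mutable> -> tensor<128x1xi32, #mma>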

See #3561.

Verified with the test case at https://pastebin.com/xxP3cFmy (test.mlir): running

triton-opt test.mlir -tritongpu-pipeline=num-stages=3 --convert-scf-to-cf --allocate-shared-memory --convert-triton-gpu-to-llvm

completes without issue.

The unit test case added in #4798 also shows no issue.

manman-ren marked this pull request as draft September 21, 2024 01:12
manman-ren (Collaborator, Author) commented:

@jlebar I am not sure I am checking for the failure correctly. I tried running

triton-opt test.mlir -tritongpu-pipeline=num-stages=3 --convert-scf-to-cf --allocate-shared-memory --convert-triton-gpu-to-llvm

but didn't hit any issue. From #3549, it sounds like we should hit an assertion.
Thanks!

CC @pawelszczerbuk

jlebar (Collaborator) commented Sep 22, 2024

> From #3549, it sounds like we should hit an assertion.

I believe that assertion was removed by @Jokeren when he fixed the relevant codepath.

Jokeren (Contributor) commented Sep 22, 2024

One of our recent PRs may have fixed this problem. Do you have a test case, by the way? If so, I'm happy to dig deeper.

manman-ren (Collaborator, Author) commented:

I am using dont_pipeline_128x1 from test/TritonGPU/loop-pipeline.mlir (https://pastebin.com/xxP3cFmy). If we remove the HACK from the compiler, we will start pipelining the load in the test case, which should trigger the invalid-code issue for local_loads from #shared to #mma layout when the loaded shape is smaller than the mma tile.

Jokeren (Contributor) commented Sep 22, 2024

> I am using dont_pipeline_128x1 from test/TritonGPU/loop-pipeline.mlir (https://pastebin.com/xxP3cFmy). If we remove the HACK from the compiler, we will start pipelining the load in the test case, which should trigger the invalid-code issue for local_loads from #shared to #mma layout when the loaded shape is smaller than the mma tile.

To confirm, are you saying that the bug is triggered by the following IR?

%161 = triton_gpu.convert_layout %151 : tensor<128x1xi32, #blocked> -> tensor<128x1xi32, #mma>

Jokeren self-assigned this Sep 22, 2024
Jokeren (Contributor) commented Sep 22, 2024

Honestly, I'm not sure whether constructing an MMA layout larger than the tensor size makes sense.

In theory we could repeat values across registers and threads. I would like to get your thoughts.
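As a rough sketch of that replication reading (an editor's illustration, not semantics confirmed in this thread), the repetition factor along each dimension would be the tile extent over the tensor extent whenever the tensor is smaller than the tile:

$$
\mathrm{reps}_i = \left\lceil \frac{\mathrm{tile}_i}{\mathrm{shape}_i} \right\rceil
\qquad\Rightarrow\qquad
128 \times 1 \text{ vs. } \{16, 8\}: \quad
\mathrm{reps}_0 = \lceil 16/128 \rceil = 1, \quad
\mathrm{reps}_1 = \lceil 8/1 \rceil = 8.
$$

Under this reading, each element of the 128x1 tensor would be held by 8 lanes along dimension 1; whether that duplication is acceptable is exactly the question raised above.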

manman-ren (Collaborator, Author) commented:

> > I am using dont_pipeline_128x1 from test/TritonGPU/loop-pipeline.mlir (https://pastebin.com/xxP3cFmy). If we remove the HACK from the compiler, we will start pipelining the load in the test case, which should trigger the invalid-code issue for local_loads from #shared to #mma layout when the loaded shape is smaller than the mma tile.
>
> To confirm, are you saying that the bug is triggered by the following IR?
>
> %161 = triton_gpu.convert_layout %151 : tensor<128x1xi32, #blocked> -> tensor<128x1xi32, #mma>

https://pastebin.com/eN5HP8XW has the MLIR after SWP, if we pipeline the load.

Specifically, this op should trigger the issue:
%14 = triton_gpu.local_load %arg5 : !tt.memdesc<128x1xi32, #shared, #triton_gpu.shared_memory, mutable> -> tensor<128x1xi32, #mma>
#mma = #triton_gpu.nvidia_mma<{versionMajor = 2, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [16, 8]}>

i.e., loading a 128x1 tensor for an MMAv2 dot with tile {16, 8} is bad because 1 < 8.

This is how I understand it from the comments :]

Jokeren (Contributor) commented Sep 24, 2024

#4798
@manman-ren

manman-ren marked this pull request as ready for review September 25, 2024 20:15
Summary: Triton LLVM codegen has a bug where local_loads from #shared to
#mma layout can lead to invalid code if the loaded shape is smaller
than the mma tile. Remove the workaround.

See triton-lang#3561.

manman-ren merged commit e7ec3fe into triton-lang:main Sep 27, 2024
7 checks passed
ThomasRaoux added a commit that referenced this pull request Oct 8, 2024
This change is causing failures in some internal tests. There must still
be some miscompile associated with this.
sfzhu93 pushed a commit to sfzhu93/triton that referenced this pull request Oct 11, 2024
pawelszczerbuk added a commit to pawelszczerbuk/triton that referenced this pull request Oct 23, 2024
pawelszczerbuk added a commit to pawelszczerbuk/triton that referenced this pull request Oct 24, 2024
pawelszczerbuk added a commit to pawelszczerbuk/triton that referenced this pull request Oct 25, 2024
pawelszczerbuk added a commit that referenced this pull request Oct 28, 2024
…odegen bug (#4873)" (#4973)

After investigation of the differences caused by #4774 in the internal tests, we concluded that they were introduced by a change in the layouts selected for the reduce operations. Re-introducing that change, as it is functionally correct and should be beneficial for performance.
AlexAUT pushed a commit to AlexAUT/triton that referenced this pull request Oct 29, 2024
…odegen bug (triton-lang#4873)" (triton-lang#4973)
