
[SWP] attempt to remove a workaround for a triton llvm codegen bug #4774

Merged: 2 commits into triton-lang:main on Sep 27, 2024

Conversation

manman-ren (Collaborator) commented Sep 21, 2024

Triton LLVM codegen has a bug where local_loads from #shared to #mma layout can lead to invalid code if the loaded shape is smaller than the mma tile. Remove the workaround.
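For concreteness, here is a minimal sketch of the failing pattern, assembled from the test-case IR quoted later in this thread (the SSA value names are illustrative and the #shared layout definition is elided):

#mma = #triton_gpu.nvidia_mma<{versionMajor = 2, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [16, 8]}>
// dim 1 of the loaded tensor (1) is smaller than dim 1 of the 16x8 mma tile
%v = triton_gpu.local_load %buf : !tt.memdesc<128x1xi32, #shared, #triton_gpu.shared_memory, mutable> -> tensor<128x1xi32, #mma>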

See #3561.

Verified with the test case at https://pastebin.com/xxP3cFmy (test.mlir): running

triton-opt test.mlir -tritongpu-pipeline=num-stages=3 --convert-scf-to-cf --allocate-shared-memory --convert-triton-gpu-to-llvm

completes without issue.

The unit test case added in #4798 also shows no issue.

manman-ren marked this pull request as draft September 21, 2024 01:12
manman-ren (Collaborator, Author) commented:

@jlebar I am not sure I am checking for the failure correctly. I tried running

triton-opt test.mlir -tritongpu-pipeline=num-stages=3 --convert-scf-to-cf --allocate-shared-memory --convert-triton-gpu-to-llvm

but didn't hit any issue. From #3549, it sounds like we should hit an assertion.
Thanks!

CC @pawelszczerbuk

jlebar (Collaborator) commented Sep 22, 2024

> From #3549, it sounds like we should hit an assertion.

I believe that assertion was removed by @Jokeren when he fixed the relevant codepath.

Jokeren (Contributor) commented Sep 22, 2024

One of our recent PRs may have fixed this problem. Do you have a test case, by the way? If so, I'm happy to dig deeper.

manman-ren (Collaborator, Author) commented:

I am using dont_pipeline_128x1 from test/TritonGPU/loop-pipeline.mlir (https://pastebin.com/xxP3cFmy). If we remove the HACK from the compiler, we will start pipelining the load in the test case, which should trigger the invalid-code issue for local_loads from #shared to #mma layout when the loaded shape is smaller than the mma tile.

Jokeren (Contributor) commented Sep 22, 2024

> I am using dont_pipeline_128x1 from test/TritonGPU/loop-pipeline.mlir (https://pastebin.com/xxP3cFmy). If we remove the HACK from the compiler, we will start pipelining the load in the test case, which should trigger the invalid-code issue for local_loads from #shared to #mma layout when the loaded shape is smaller than the mma tile.

To confirm, are you saying that the bug is triggered by the following IR?

%161 = triton_gpu.convert_layout %151 : tensor<128x1xi32, #blocked> -> tensor<128x1xi32, #mma>

Jokeren self-assigned this Sep 22, 2024
Jokeren (Contributor) commented Sep 22, 2024

Honestly, I'm not sure whether constructing an MMA layout larger than the tensor size makes sense.

In theory we could repeat values across registers and threads. I would like to get your thoughts.
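As a rough sketch of that replication reading (an editor's illustration, not semantics confirmed in this thread), the repetition factor along each dimension would be the tile extent over the tensor extent whenever the tensor is smaller than the tile:

$$
\mathrm{reps}_i = \left\lceil \frac{\mathrm{tile}_i}{\mathrm{shape}_i} \right\rceil
\qquad\Rightarrow\qquad
128 \times 1 \text{ vs. } \{16, 8\}: \quad
\mathrm{reps}_0 = \lceil 16/128 \rceil = 1, \quad
\mathrm{reps}_1 = \lceil 8/1 \rceil = 8.
$$

Under this reading, each element of the 128x1 tensor would be held by 8 lanes along dimension 1; whether that duplication is acceptable is exactly the question raised above.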

manman-ren (Collaborator, Author) commented:

> > I am using dont_pipeline_128x1 from test/TritonGPU/loop-pipeline.mlir (https://pastebin.com/xxP3cFmy). If we remove the HACK from the compiler, we will start pipelining the load in the test case, which should trigger the invalid-code issue for local_loads from #shared to #mma layout when the loaded shape is smaller than the mma tile.
>
> To confirm, are you saying that the bug is triggered by the following IR?
>
> %161 = triton_gpu.convert_layout %151 : tensor<128x1xi32, #blocked> -> tensor<128x1xi32, #mma>

https://pastebin.com/eN5HP8XW has the MLIR after SWP, if we pipeline the load.

Specifically, this op should trigger the issue:
%14 = triton_gpu.local_load %arg5 : !tt.memdesc<128x1xi32, #shared, #triton_gpu.shared_memory, mutable> -> tensor<128x1xi32, #mma>
#mma = #triton_gpu.nvidia_mma<{versionMajor = 2, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [16, 8]}>

i.e., loading a 128x1 tensor for an MMAv2 dot with tile {16, 8} is bad because 1 < 8.

This is how I understand it from the comments :]

Jokeren (Contributor) commented Sep 24, 2024

#4798
@manman-ren

manman-ren marked this pull request as ready for review September 25, 2024 20:15
Summary: Triton LLVM codegen has a bug where local_loads from #shared to
#mma layout can lead to invalid code if the loaded shape is smaller
than the mma tile. Remove the workaround.

See triton-lang#3561.

manman-ren merged commit e7ec3fe into triton-lang:main Sep 27, 2024
7 checks passed
ThomasRaoux added a commit that referenced this pull request Oct 8, 2024
This change is causing failures in some internal tests. There must still
be some miscompile associated with this.
sfzhu93 pushed a commit to sfzhu93/triton that referenced this pull request Oct 11, 2024
pawelszczerbuk added a commit to pawelszczerbuk/triton that referenced this pull request Oct 23, 2024
pawelszczerbuk added a commit to pawelszczerbuk/triton that referenced this pull request Oct 24, 2024
pawelszczerbuk added a commit to pawelszczerbuk/triton that referenced this pull request Oct 25, 2024
pawelszczerbuk added a commit that referenced this pull request Oct 28, 2024
…odegen bug (#4873)" (#4973)

After investigation of the differences caused by #4774 in the internal tests, we concluded that they were introduced by a change in the layouts selected for the reduce operations. Re-introducing that change, as it is functionally correct and should be beneficial for performance.
AlexAUT pushed a commit to AlexAUT/triton that referenced this pull request Oct 29, 2024
…odegen bug (triton-lang#4873)" (triton-lang#4973)
