[SWP] attempt to remove a workaround for a triton llvm codegen bug #4774
Conversation
One of our recent PRs may have fixed this problem. Do you have a test case, BTW? If so, happy to dig deeper.
I am using dont_pipeline_128x1 from test/TritonGPU/loop-pipeline.mlir (https://pastebin.com/xxP3cFmy). If we remove the HACK from the compiler, we will start pipelining the load in the test case, which should trigger the issue of invalid code for local_loads from #shared to #mma layout when the loaded shape is smaller than the mma tile.
To confirm, are you saying that the bug is triggered at the following IR?
Honestly, I'm not sure constructing an MMA layout larger than the tensor size makes sense. In theory we could repeat values across registers and threads. I'd like to get your thoughts.
https://pastebin.com/eN5HP8XW has the MLIR after SWP, if we pipeline the load. Specifically, this op should trigger the issue: loading a 128x1 tensor for an MMAv2 dot with tile {16,8} is bad because 1 < 8. This is how I understand it from the comments :]
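The shape constraint described above can be sketched as a small predicate. This is a hypothetical illustration, not Triton's actual implementation; the helper name `smaller_than_mma_tile` is invented for this example.

```python
# Hypothetical sketch of the condition discussed above: a local_load from
# #shared into an #mma layout is problematic when any dimension of the
# loaded tensor is smaller than the corresponding mma tile dimension.
def smaller_than_mma_tile(tensor_shape, mma_tile):
    return any(dim < tile for dim, tile in zip(tensor_shape, mma_tile))

# A 128x1 tensor against an MMAv2 tile of {16, 8}: 1 < 8, so it hits the bug.
print(smaller_than_mma_tile((128, 1), (16, 8)))   # True
print(smaller_than_mma_tile((128, 16), (16, 8)))  # False
```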
Summary: Triton LLVM codegen has a bug where local_loads from #shared to #mma layout can lead to invalid code if the loaded shape is smaller than the mma tile. Remove the workaround. See triton-lang#3561.
This change is causing failures in some internal tests. There must still be some miscompile associated with this.
This reverts commit 9e72047.
…odegen bug (#4873)" (#4973) After investigating the differences caused by #4774 in the internal tests, we concluded that they were introduced by a change in the layouts selected for the reduce operations. Re-introducing that change, as it is functionally correct and should be beneficial for performance.
Triton LLVM codegen has a bug where local_loads from #shared to #mma layout can lead to invalid code if the loaded shape is smaller than the mma tile. Remove the workaround.
See #3561.
Verified with test case https://pastebin.com/xxP3cFmy (test.mlir) that running

```
triton-opt test.mlir -tritongpu-pipeline=num-stages=3 --convert-scf-to-cf --allocate-shared-memory --convert-triton-gpu-to-llvm
```

has no issue.
Unit test case added in #4798 also shows no issue.