Cherry pick upstream new passes #390
Commits on Nov 9, 2023
upgrade llvm to b1115f8c (NFC) (triton-lang#2403)
Co-authored-by: Thomas Raoux <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Phil Tillet <[email protected]>
Commit SHA: d2f8b50
Commit SHA: b3276e7
[OPTIMIZER] Thread local reduction optimization (triton-lang#2542)
Co-authored-by: Phil Tillet <[email protected]>
Commit SHA: ad3b3ff
[BACKEND] Fix multiple bugs in WGMMA (triton-lang#2457)
Fix dependencies in the wgmma_wait op to prevent the scheduler from moving it past the uses of the wgmma accumulator. We need to explicitly represent the dependency between the wait and the accumulator uses; otherwise LLVM is free to re-order them. This lets us remove a workaround that prevented the re-ordering, and also remove the wait op added in the loop during pipelining. Also fix the descriptor calculation for wgmma: we should compute the same descriptor for the whole warpgroup. Added a workaround for a bug exposed by the different timing these changes introduce: we shouldn't insert operations between the loop and async_wait, or we may have race conditions.
Commit SHA: c070b98
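The reordering hazard this commit fixes can be sketched as a toy dependency-graph exercise. This is purely illustrative, not Triton's IR or scheduler: op names mirror the commit message, and "scheduler" here just means any pass free to pick a topological order of the dependency graph.

```python
# Illustrative sketch (hypothetical names): a scheduler only respects
# ordering constraints that exist as edges in the dependency graph.
from itertools import permutations

def topo_orders(nodes, edges):
    """All orderings of `nodes` consistent with `edges`
    (an edge (a, b) means a must come before b)."""
    for order in permutations(nodes):
        pos = {n: i for i, n in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in edges):
            yield order

nodes = ["wgmma", "wgmma_wait", "use_acc"]

# Without an explicit wait -> use edge, some legal schedule hoists the
# accumulator use above the wait (a race on the accumulator).
weak = [("wgmma", "wgmma_wait"), ("wgmma", "use_acc")]
assert any(o.index("use_acc") < o.index("wgmma_wait")
           for o in topo_orders(nodes, weak))

# The fix: represent the dependency explicitly, so every legal
# schedule keeps the use after the wait.
strong = weak + [("wgmma_wait", "use_acc")]
assert all(o.index("wgmma_wait") < o.index("use_acc")
           for o in topo_orders(nodes, strong))
```

The same reasoning explains why the earlier workaround could be deleted: once the ordering is an edge in the graph, no extra barrier is needed to keep the scheduler honest.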
[BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (triton-lang#2526)
Now that the attribute-related bug is fixed in MLIR, we can use arith ops for cmp and select.
Commit SHA: ac4ee36
[BACKEND] Dedup elementwise in LLVM IR based on constancy (triton-lang#2512)
### Summary
When Triton GPU IR is lowered into LLVM IR, we can use constancy information about the results of elementwise ops to deduplicate otherwise redundant computation. That is the contribution of this PR: constancy is checked and, where possible, some values in LLVM IR are reused multiple times instead of computing equal values separately.
The change benefits PyTorch 2 / TorchInductor-generated Triton code: the leftmost sub-indices extracted from the flat index by div / mod operations can be equal, given a sufficiently large 2^n factor in the rightmost dimension(s), which makes the computation producing those sub-indices redundant. Under the necessary constancy conditions, the redundant indexing arithmetic can therefore be deduplicated. We observe up to a 29% decrease in the latency of some of our jagged tensor kernels.
Commit SHA: 6d45d6c
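The div/mod constancy argument above can be sketched in plain Python. This is a hypothetical illustration of the idea, not the pass itself: `RIGHT` stands in for the rightmost dimension extent, and the "group" models a contiguous run of elements whose leftmost sub-index is provably constant.

```python
# Hypothetical sketch: splitting a flat index into (row, col) with div / mod.
# When the rightmost extent is a power of two that divides the group size,
# the leftmost sub-index (the div) is constant across the group, so it only
# needs to be computed once and reused.
RIGHT = 8                      # rightmost dimension extent (assumed 2^n)
flat = list(range(32, 40))     # 8 contiguous flat indices handled together

# Naive lowering: one division per element.
naive = [(i // RIGHT, i % RIGHT) for i in flat]

# Dedup under constancy: hoist the division, reuse its result.
row = flat[0] // RIGHT
dedup = [(row, i % RIGHT) for i in flat]

assert naive == dedup
assert len({r for r, _ in naive}) == 1   # leftmost sub-index is constant
```

In the real pass this happens on LLVM IR values rather than Python tuples, but the saving is the same: the redundant div (and any arithmetic downstream of it) is emitted once per group instead of once per element.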
[BACKEND] Pipeliner refactoring (triton-lang#2565)
Refactor the pipeliner pass to make it more generic. The main change is that the pipeliner is now broken into two pieces: one that calculates a modulo schedule and creates async ops based on the IR, and an expander that generates the pipelined IR from that modulo schedule. The advantage of separating the two is that we can create different schedules without changing the expander, which also allows for more complex schedules. For now, the schedule generated for the matmul case roughly matches the one picked by the previous pipeliner, to avoid changes. This also creates a different sequence of insert/extract slices for the alloc; we should probably change shared alloc to use memory semantics.
Commit SHA: d1de5aa
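The schedule/expander split described above can be sketched in a few lines. This is a toy model under stated assumptions, not Triton's actual API: `expand` plays the role of the expander, the `(op, stage)` list plays the role of the modulo schedule, and ops are just strings.

```python
# Hypothetical sketch: an expander that turns a modulo schedule into a
# flat op sequence (prologue, steady state, and epilogue fall out of the
# same loop). The schedule decides *when* each op runs; the expander only
# mechanically emits it.
def expand(schedule, n_iters):
    """schedule: list of (op_name, stage); returns the emitted op sequence."""
    n_stages = max(s for _, s in schedule) + 1
    emitted = []
    # The time axis is stretched by the pipeline depth: the op with stage s
    # for logical iteration i runs at time t = i + s.
    for t in range(n_iters + n_stages - 1):
        for op, stage in schedule:
            i = t - stage
            if 0 <= i < n_iters:
                emitted.append(f"{op}[{i}]")
    return emitted

# Two-stage schedule: loads run one iteration ahead of the compute.
sched = [("load", 0), ("mma", 1)]
print(expand(sched, 3))
# → ['load[0]', 'load[1]', 'mma[0]', 'load[2]', 'mma[1]', 'mma[2]']
```

Changing `sched` (say, adding a third stage) changes the emitted pipeline without touching `expand`, which is the point of the refactoring: schedules can vary independently of the expansion machinery.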