Cherry pick upstream new passes #390
Conversation
@@ -188,7 +188,7 @@ void LoopPipeliner::collectValueDep(Value v, int stage,
     return;

   // Loop-invariant value, skip
-  if (v.getParentRegion() != &forOp.getLoopBody())
+  if (v.getParentRegion() != &forOp.getRegion())
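For context, this rename tracks upstream MLIR, where scf::ForOp::getLoopBody() was folded into getRegion(). A minimal sketch of the invariance check against the new API (isDefinedOutsideLoop is an illustrative helper, not the pass's actual function):

```cpp
// Minimal sketch, assuming MLIR's scf dialect. getLoopBody() was removed
// upstream in favor of getRegion(), so the loop-invariance test now compares
// against the loop's region directly.
#include "mlir/Dialect/SCF/IR/SCF.h"

static bool isDefinedOutsideLoop(mlir::Value v, mlir::scf::ForOp forOp) {
  // A value whose defining region is not the loop body cannot vary across
  // iterations, so dependence collection can skip it.
  return v.getParentRegion() != &forOp.getRegion();
}
```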
@sjw36 Can you take a look at this part?
@@ -116,7 +116,7 @@ initialize_module(llvm::Module *module, const std::string &triple,
   opt.NoNaNsFPMath = true;
   llvm::TargetMachine *machine = target->createTargetMachine(
       module->getTargetTriple(), proc, features, opt, llvm::Reloc::PIC_,
-      std::nullopt, llvm::CodeGenOpt::Aggressive);
+      std::nullopt, llvm::CodeGenOptLevel::Aggressive);
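This follows LLVM's rename of the nested llvm::CodeGenOpt::Level enum to the scoped enum llvm::CodeGenOptLevel. A hedged sketch of a compatibility alias for building against both old and new LLVM (the version cutoff is an assumption about when the rename landed):

```cpp
// Sketch of a compatibility shim; the LLVM_VERSION_MAJOR cutoff is an
// assumption about when llvm::CodeGenOpt::Level became llvm::CodeGenOptLevel.
#include "llvm/Config/llvm-config.h"
#include "llvm/Support/CodeGen.h"

#if LLVM_VERSION_MAJOR >= 18
using TritonCodeGenOptLevel = llvm::CodeGenOptLevel; // scoped enum
#else
using TritonCodeGenOptLevel = llvm::CodeGenOpt::Level; // legacy nested enum
#endif
// ...then pass TritonCodeGenOptLevel::Aggressive to createTargetMachine.
```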
@micmelesse Can you review the changes in this file? They are due to a recent change in LLVM.
Force-pushed b53d3b2 to 4fb831f
Force-pushed 17e9099 to 85216ea
Force-pushed 4fb831f to d0ab35d
Co-authored-by: Thomas Raoux <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Phil Tillet <[email protected]>
Fix dependencies in the wgmma_wait op to prevent the scheduler from moving it past the uses of the wgmma accumulator. We need to explicitly represent the dependency between the wait and the accumulator uses, otherwise LLVM is free to re-order them. This allows us to remove a workaround that prevented the re-ordering, and we can also remove the wait op added in the loop during pipelining. Also fix the descriptor calculation for wgmma: we should calculate the same descriptor for the whole warpgroup. Added a workaround for a bug that was exposed by the different timing resulting from these changes: we shouldn't insert operations between the loop and async_wait, or we may have race conditions.
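The general shape of the fix, as a hedged sketch (WarpGroupDotWaitOp is a stand-in name, not necessarily the exact Triton op): thread the accumulator through the wait op so the ordering constraint becomes a real def-use edge visible to both MLIR and LLVM.

```cpp
// Illustrative sketch only. Making the wait op produce the accumulator value
// turns the ordering constraint into an SSA dependency, so passes cannot
// sink accumulator uses above the wait.
mlir::OpBuilder b(acc.getContext());
b.setInsertionPointAfterValue(acc);
auto wait = b.create<WarpGroupDotWaitOp>(acc.getLoc(), acc.getType(), acc,
                                         /*pendings=*/0);
// Route every use of the accumulator (except the wait itself) through the
// wait's result.
acc.replaceAllUsesExcept(wait.getResult(), wait);
```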
…on-lang#2526) Now that the bug related to attributes is fixed in MLIR, we can use arith ops for cmp and select ops.
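For illustration, the switch means emitting the upstream arith ops directly; a minimal sketch, assuming an OpBuilder `b`, a Location `loc`, and integer values `lhs`/`rhs` in scope:

```cpp
// Minimal sketch: cmp + select expressed with upstream arith ops instead of
// dialect-specific equivalents.
#include "mlir/Dialect/Arith/IR/Arith.h"

mlir::Value cond = b.create<mlir::arith::CmpIOp>(
    loc, mlir::arith::CmpIPredicate::slt, lhs, rhs);
mlir::Value minVal = b.create<mlir::arith::SelectOp>(loc, cond, lhs, rhs);
```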
…g#2512) ### Summary When Triton GPU IR is lowered into LLVM IR, we can make use of the constancy information about the results of elementwise ops to deduplicate otherwise redundant computation. That is the contribution of this PR: the constancy is checked and, where possible, some of the values in the LLVM IR are reused multiple times instead of being computed separately as equal values. The change is beneficial for PyTorch 2 / TorchInductor-generated Triton code, as the leftmost sub-indices extracted from the flat index by div / mod operations can be equal, given a sufficiently large 2^n factor in the rightmost dimension(s). This makes the computation producing those sub-indices redundant, so under the necessary constancy conditions the indexing arithmetic can be deduplicated. We observe up to a 29% decrease in the latency of some of our jagged tensor kernels.
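A sketch of the deduplication idea (`constancy` and `emitOneElement` are illustrative stand-ins for Triton's axis-info analysis and per-element lowering, not the actual helpers):

```cpp
// Illustrative sketch: if axis analysis proves the result is constant across
// groups of `constancy` consecutive elements, lower one representative per
// group and reuse it, instead of emitting `constancy` identical chains.
llvm::SmallVector<mlir::Value> results(numElems);
for (unsigned i = 0; i < numElems; i += constancy) {
  mlir::Value v = emitOneElement(rewriter, loc, i); // computed once per group
  for (unsigned j = 0; j < constancy && i + j < numElems; ++j)
    results[i + j] = v;                             // reused, not recomputed
}
```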
Refactor the pipeliner pass to make it more generic. The main change is that the pipeliner is now broken into two pieces: one that calculates a modulo schedule and creates async ops based on the IR, and an expander that generates the pipelined IR from that modulo schedule. The advantage of separating the two is that it lets us create different schedules without having to change the expander, and it allows for more complex schedules. For now the schedule generated for the matmul case roughly matches the one picked by the previous pipeliner, in order to avoid changes. This also creates a different sequence of insert/extract slice ops for the alloc. We should probably change shared alloc to use memory semantics.
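The split mirrors the interface shape of MLIR's upstream SCF pipeliner, where a schedule callback assigns each op a stage and a generic expander rewrites the loop from it. A hedged sketch against that upstream interface (the load/compute staging heuristic and SomeAsyncCopyOp are illustrative):

```cpp
// Sketch against MLIR's scf::PipeliningOption: the schedule is just a list of
// (op, stage) pairs, and the expander consumes it to emit the prologue,
// steady-state kernel, and epilogue.
#include "mlir/Dialect/SCF/Transforms/Transforms.h"

void buildSchedule(
    mlir::scf::ForOp forOp,
    std::vector<std::pair<mlir::Operation *, unsigned>> &schedule) {
  for (mlir::Operation &op : forOp.getBody()->without_terminator()) {
    // Hypothetical staging: issue copies in stage 0 and everything else in
    // stage 1, so loads for iteration i+1 overlap compute for iteration i.
    unsigned stage = llvm::isa<SomeAsyncCopyOp>(op) ? 0 : 1;
    schedule.emplace_back(&op, stage);
  }
}
```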
Force-pushed d0ab35d to d1de5aa
Some of this PR is now merged from #410. Included commits:
This PR brought in the upstream's new pipeliner. In order to resolve the conflicts, it also cherry-picked the following PRs:
Among these commits, only triton-lang#2542 improves the performance of FA, by about 1 TFLOPS.
This PR is not planned to be merged into the current IFU; it will wait for the IFU to merge first.