
Cherry pick upstream new passes #390

Closed
wants to merge 7 commits

Conversation


@zhanglx13 commented on Nov 3, 2023

This PR brings in the upstream's new pipeliner. In order to resolve conflicts, it also cherry-picks the following upstream PRs:

Among these commits, only triton-lang#2542 improves the performance of FA, by about 1 TFLOPS.

This PR is not planned to be merged into the current IFU. It will wait for the IFU to merge first.

@zhanglx13 changed the title from Cherry pick upstream new passes to [DO NOT MERGE] Cherry pick upstream new passes on Nov 3, 2023
@@ -188,7 +188,7 @@ void LoopPipeliner::collectValueDep(Value v, int stage,
     return;

   // Loop-invariant value, skip
-  if (v.getParentRegion() != &forOp.getLoopBody())
+  if (v.getParentRegion() != &forOp.getRegion())

@sjw36 Can you take a look at this part?

@@ -116,7 +116,7 @@ initialize_module(llvm::Module *module, const std::string &triple,
   opt.NoNaNsFPMath = true;
   llvm::TargetMachine *machine = target->createTargetMachine(
       module->getTargetTriple(), proc, features, opt, llvm::Reloc::PIC_,
-      std::nullopt, llvm::CodeGenOpt::Aggressive);
+      std::nullopt, llvm::CodeGenOptLevel::Aggressive);

@micmelesse Can you review the changes in this file? They are due to a recent change in LLVM.

@zhanglx13 changed the base branch from ifu231005-rebase to triton-mlir on November 7, 2023 15:36
@zhanglx13 changed the title from [DO NOT MERGE] Cherry pick upstream new passes to Cherry pick upstream new passes on Nov 7, 2023
joker-eph and others added 7 commits November 9, 2023 14:22
Co-authored-by: Thomas Raoux <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Phil Tillet <[email protected]>
Fix dependencies in the wgmma_wait op to prevent the scheduler from moving
it past the uses of the wgmma accumulator. We need to explicitly represent
the dependency between the wait and the accumulator uses; otherwise LLVM
is free to re-order them. This allows us to remove a workaround that
prevented the re-ordering, and we can also remove the wait op added in
the loop during pipelining.

Also fix the descriptor calculation for wgmma: we should calculate the
same descriptor for the whole warpgroup.

Added a workaround for a bug that was exposed by different timing due to
these changes: we shouldn't insert operations between the loop and
async_wait, or we may have race conditions.
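As a rough illustration of the scheduling freedom involved (a toy model in plain Python, not the actual compiler or LLVM's scheduler; the op names are just labels): a scheduler that only has to preserve use-def edges may legally place a wait after code that implicitly depends on it, unless the dependency is made explicit.

```python
# Toy model: a "scheduler" may emit any ordering of ops that respects the
# explicit use-def edges it is given (pairs of (producer, consumer)).
from itertools import permutations

def legal_orders(ops, deps):
    for order in permutations(ops):
        pos = {op: i for i, op in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in deps):
            yield order

ops = ["wgmma", "wgmma_wait", "use_accumulator"]

# Only the edges wgmma -> wait and wgmma -> use are explicit.
implicit = {("wgmma", "wgmma_wait"), ("wgmma", "use_accumulator")}
# Additionally make the accumulator used below an explicit result of the wait.
explicit = implicit | {("wgmma_wait", "use_accumulator")}

def wait_can_move_past_use(deps):
    return any(o.index("use_accumulator") < o.index("wgmma_wait")
               for o in legal_orders(ops, deps))

print(wait_can_move_past_use(implicit))  # True: re-ordering is legal
print(wait_can_move_past_use(explicit))  # False: the use must follow the wait
```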
…on-lang#2526)

Now that the attribute-related bug is fixed in MLIR, we can use arith
ops for cmp and select ops.
…g#2512)

### Summary

When Triton GPU IR is lowered into LLVM IR, we can make use of the
constancy information about the result of the elementwise ops to
deduplicate otherwise redundant computation. That is the contribution of
this PR: the constancy is checked and, if possible, some of the values
in LLVM IR are reused multiple times instead of computing equal values
separately.

The change is beneficial for the PyTorch 2 / TorchInductor-generated
Triton code, as the leftmost sub-indices extracted from the flat index
by div / mod operations can be equal, given a sufficiently large 2^n
factor in the rightmost dimension(s). This makes the
computation resulting in those sub-indices redundant. Consequently,
under the necessary constancy conditions, the redundant indexing
arithmetic can be deduplicated. We observe up to a 29% decrease in the
latency of some of our jagged tensor kernels.
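A minimal plain-Python sketch of the constancy being exploited (illustrative only; BLOCK, N, and pid are made-up values, not from the PR): when a program instance handles a contiguous block of flat indices and the rightmost dimension is a multiple of the block size, the row index obtained by integer division is the same for every element of the block, so it only needs to be computed once.

```python
BLOCK = 128          # elements handled by one program instance
N = 1024             # rightmost (contiguous) dimension, a multiple of BLOCK
pid = 37             # an arbitrary program id

offsets = [pid * BLOCK + i for i in range(BLOCK)]   # flat indices of the block
rows = [off // N for off in offsets]                # leftmost sub-index (div)
cols = [off % N for off in offsets]                 # rightmost sub-index (mod)

assert len(set(rows)) == 1       # constant across the block -> deduplicable
assert len(set(cols)) == BLOCK   # genuinely varies, must be computed per element
print(rows[0], cols[0], cols[-1])  # 4 640 767
```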
Refactor the pipeliner pass in order to make it more generic. The main
change is that the pipeliner is now broken into two pieces: one that
calculates a modulo schedule and creates async ops based on the IR, and
an expander that generates the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it allows us to
create different schedules without having to change the expander, and it
allows for more complex schedules.
For now, the schedule generated for the matmul case roughly matches the
schedule picked by the previous pipeliner in order to avoid changes.

This also creates a different sequence of insert/extract slices for the
alloc. We should probably change shared alloc to use memory semantics.
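A toy sketch of that split (plain Python, not the actual pass; the op names and the two-stage schedule are assumptions for illustration): the schedule assigns each op a pipeline stage, and the expander turns any such assignment into prologue, steady-state, and epilogue code without knowing how the stages were chosen.

```python
# Toy expander: given a modulo schedule (op -> stage), emit what runs at each
# time step. The stage-s copy of iteration i runs at time step i + s.
def expand(ops_with_stages, num_iters):
    max_stage = max(s for _, s in ops_with_stages)
    steps = []
    for t in range(num_iters + max_stage):
        step = [(op, t - s) for op, s in ops_with_stages if 0 <= t - s < num_iters]
        steps.append(step)
    return steps

# A schedule resembling the matmul case: loads run one stage ahead of the dot.
for t, step in enumerate(expand([("load", 0), ("dot", 1)], num_iters=4)):
    print(t, step)
# 0 [('load', 0)]                     prologue: first load only
# 1 [('load', 1), ('dot', 0)]         steady state: load of i+1 overlaps dot of i
# 2 [('load', 2), ('dot', 1)]
# 3 [('load', 3), ('dot', 2)]
# 4 [('dot', 3)]                      epilogue: drain the last dot
```

Trying a different schedule only means changing the (op, stage) list; the expander above stays the same, which is the point of the separation.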
@jayfurmanek

Some of this PR is now merged from: #410

Included commits:
triton-lang#2542
triton-lang#2526
triton-lang#2512
triton-lang#2565

@zhanglx13 closed this on Dec 15, 2023