Cherry pick upstream new passes #390
Conversation
@@ -188,7 +188,7 @@ void LoopPipeliner::collectValueDep(Value v, int stage,
     return;

   // Loop-invariant value, skip
-  if (v.getParentRegion() != &forOp.getLoopBody())
+  if (v.getParentRegion() != &forOp.getRegion())
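For context, this rename tracks upstream MLIR, where scf::ForOp::getLoopBody() was folded into getRegion(). A minimal sketch of the invariance check against the new API (isDefinedOutsideLoop is an illustrative helper, not the pass's actual function):

```cpp
// Minimal sketch, assuming MLIR's scf dialect. getLoopBody() was removed
// upstream in favor of getRegion(), so the loop-invariance test now compares
// against the loop's region directly.
#include "mlir/Dialect/SCF/IR/SCF.h"

static bool isDefinedOutsideLoop(mlir::Value v, mlir::scf::ForOp forOp) {
  // A value whose defining region is not the loop body cannot vary across
  // iterations, so dependence collection can skip it.
  return v.getParentRegion() != &forOp.getRegion();
}
```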
@sjw36 Can you take a look at this part?
@@ -116,7 +116,7 @@ initialize_module(llvm::Module *module, const std::string &triple,
   opt.NoNaNsFPMath = true;
   llvm::TargetMachine *machine = target->createTargetMachine(
       module->getTargetTriple(), proc, features, opt, llvm::Reloc::PIC_,
-      std::nullopt, llvm::CodeGenOpt::Aggressive);
+      std::nullopt, llvm::CodeGenOptLevel::Aggressive);
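This follows LLVM's rename of the nested llvm::CodeGenOpt::Level enum to the scoped enum llvm::CodeGenOptLevel. A hedged sketch of a compatibility alias for building against both old and new LLVM (the version cutoff is an assumption about when the rename landed):

```cpp
// Sketch of a compatibility shim; the LLVM_VERSION_MAJOR cutoff is an
// assumption about when llvm::CodeGenOpt::Level became llvm::CodeGenOptLevel.
#include "llvm/Config/llvm-config.h"
#include "llvm/Support/CodeGen.h"

#if LLVM_VERSION_MAJOR >= 18
using TritonCodeGenOptLevel = llvm::CodeGenOptLevel; // scoped enum
#else
using TritonCodeGenOptLevel = llvm::CodeGenOpt::Level; // legacy nested enum
#endif
// ...then pass TritonCodeGenOptLevel::Aggressive to createTargetMachine.
```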
@micmelesse Can you review the changes in this file? They are due to a recent change in LLVM.
Force-pushed b53d3b2 to 4fb831f
Force-pushed 17e9099 to 85216ea
Force-pushed 4fb831f to d0ab35d
Co-authored-by: Thomas Raoux <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Phil Tillet <[email protected]>
Fix dependencies in the wgmma_wait op to prevent the scheduler from moving it past the uses of the wgmma accumulator. We need to explicitly represent the dependency between the wait and the accumulator uses, otherwise LLVM is free to re-order them. This allows us to remove a workaround that prevented the re-ordering, and we can also remove the wait op added in the loop during pipelining. Also fix the descriptor calculation for wgmma: we should calculate the same descriptor for the whole warpgroup. Added a workaround for a bug that was exposed by the different timing resulting from these changes: we shouldn't insert operations between the loop and async_wait, or we may have race conditions.
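The general shape of the fix, as a hedged sketch (WarpGroupDotWaitOp is a stand-in name, not necessarily the exact Triton op): thread the accumulator through the wait op so the ordering constraint becomes a real def-use edge visible to both MLIR and LLVM.

```cpp
// Illustrative sketch only. Making the wait op produce the accumulator value
// turns the ordering constraint into an SSA dependency, so passes cannot
// sink accumulator uses above the wait.
mlir::OpBuilder b(acc.getContext());
b.setInsertionPointAfterValue(acc);
auto wait = b.create<WarpGroupDotWaitOp>(acc.getLoc(), acc.getType(), acc,
                                         /*pendings=*/0);
// Route every use of the accumulator (except the wait itself) through the
// wait's result.
acc.replaceAllUsesExcept(wait.getResult(), wait);
```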
…on-lang#2526) Now that the bug related to attributes is fixed in MLIR, we can use arith ops for cmp and select ops.
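For illustration, the switch means emitting the upstream arith ops directly; a minimal sketch, assuming an OpBuilder `b`, a Location `loc`, and integer values `lhs`/`rhs` in scope:

```cpp
// Minimal sketch: cmp + select expressed with upstream arith ops instead of
// dialect-specific equivalents.
#include "mlir/Dialect/Arith/IR/Arith.h"

mlir::Value cond = b.create<mlir::arith::CmpIOp>(
    loc, mlir::arith::CmpIPredicate::slt, lhs, rhs);
mlir::Value minVal = b.create<mlir::arith::SelectOp>(loc, cond, lhs, rhs);
```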
…g#2512) ### Summary When Triton GPU IR is lowered into LLVM IR, we can make use of the constancy information about the results of elementwise ops to deduplicate otherwise redundant computation. That is the contribution of this PR: the constancy is checked and, where possible, some of the values in the LLVM IR are reused multiple times instead of being computed separately as equal values. The change is beneficial for PyTorch 2 / TorchInductor-generated Triton code, as the leftmost sub-indices extracted from the flat index by div / mod operations can be equal, given a sufficiently large 2^n factor in the rightmost dimension(s). This makes the computation producing those sub-indices redundant, so under the necessary constancy conditions the indexing arithmetic can be deduplicated. We observe up to a 29% decrease in the latency of some of our jagged tensor kernels.
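A sketch of the deduplication idea (`constancy` and `emitOneElement` are illustrative stand-ins for Triton's axis-info analysis and per-element lowering, not the actual helpers):

```cpp
// Illustrative sketch: if axis analysis proves the result is constant across
// groups of `constancy` consecutive elements, lower one representative per
// group and reuse it, instead of emitting `constancy` identical chains.
llvm::SmallVector<mlir::Value> results(numElems);
for (unsigned i = 0; i < numElems; i += constancy) {
  mlir::Value v = emitOneElement(rewriter, loc, i); // computed once per group
  for (unsigned j = 0; j < constancy && i + j < numElems; ++j)
    results[i + j] = v;                             // reused, not recomputed
}
```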
Refactor the pipeliner pass to make it more generic. The main change is that the pipeliner is now broken into two pieces: one that calculates a modulo schedule and creates async ops based on the IR, and an expander that generates the pipelined IR from that modulo schedule. The advantage of separating the two is that it lets us create different schedules without having to change the expander, and it allows for more complex schedules. For now the schedule generated for the matmul case roughly matches the one picked by the previous pipeliner, in order to avoid changes. This also creates a different sequence of insert/extract slice ops for the alloc. We should probably change shared alloc to use memory semantics.
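The split mirrors the interface shape of MLIR's upstream SCF pipeliner, where a schedule callback assigns each op a stage and a generic expander rewrites the loop from it. A hedged sketch against that upstream interface (the load/compute staging heuristic and SomeAsyncCopyOp are illustrative):

```cpp
// Sketch against MLIR's scf::PipeliningOption: the schedule is just a list of
// (op, stage) pairs, and the expander consumes it to emit the prologue,
// steady-state kernel, and epilogue.
#include "mlir/Dialect/SCF/Transforms/Transforms.h"

void buildSchedule(
    mlir::scf::ForOp forOp,
    std::vector<std::pair<mlir::Operation *, unsigned>> &schedule) {
  for (mlir::Operation &op : forOp.getBody()->without_terminator()) {
    // Hypothetical staging: issue copies in stage 0 and everything else in
    // stage 1, so loads for iteration i+1 overlap compute for iteration i.
    unsigned stage = llvm::isa<SomeAsyncCopyOp>(op) ? 0 : 1;
    schedule.emplace_back(&op, stage);
  }
}
```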
Force-pushed d0ab35d to d1de5aa
Some of this PR is now merged from #410. Included commits:
This PR brought in the upstream's new pipeliner. In order to resolve the conflicts, it also cherry-picked the following PRs:
Among these commits, only triton-lang#2542 improves the performance of FA, by about 1 TFLOPS.
This PR is not planned to be merged into the current IFU; it will wait for the IFU to merge first.