[AMD] Hoist Q out of the loop for FA optimization #4666
Conversation
This change seems to undo some of the ordering that the new pipeliner sets. For instance, the pipeliner will issue the global loads for all buffers first, then write those to shared memory. Is there a perf benefit to interleaving these (i.e. load(A) -> local_store(A), load(B) -> local_store(B))?
```cpp
Operation *loadOp = localStore->getOperand(0).getDefiningOp();

if (isa<tt::LoadOp>(loadOp)) {
  localAlloc->moveAfter(loadOp);
```
What about all the pointer logic that these ops depend on? You must capture/move the cone of logic for all operands of the localStore. The local_store may also be in the loop, and it may not be valid to hoist it.
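A minimal sketch of what capturing that cone could look like, using the BackwardSliceOptions-based getBackwardSlice() overload from mlir/Analysis/SliceAnalysis.h; the helper name gatherCone is illustrative, not the PR's code:

```cpp
#include "mlir/Analysis/SliceAnalysis.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Collect every op the local_store transitively depends on within its own
// block (pointer arithmetic, splats, ...). All of these would have to move
// together with the store for the reordering to be valid.
static SetVector<Operation *> gatherCone(Operation *localStore) {
  SetVector<Operation *> cone;
  BackwardSliceOptions opts;
  opts.filter = [&](Operation *op) {
    return op->getBlock() == localStore->getBlock();
  };
  getBackwardSlice(localStore, &cone, opts);
  return cone;
}
```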
Check that localAlloc->getBlock(), localStore->getBlock(), and loadOp->getBlock() are all equal to verify the ops are in the same block.
And check below where the logic cone is captured (line 130).
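A small sketch of that guard, assuming localAlloc, localStore, and loadOp are the matched operations:

```cpp
#include "mlir/IR/Operation.h"

// Bail out unless the alloc, the store, and the global load all live in the
// same block; reordering across block boundaries is not safe here.
static bool inSameBlock(mlir::Operation *localAlloc,
                        mlir::Operation *localStore,
                        mlir::Operation *loadOp) {
  mlir::Block *block = localAlloc->getBlock();
  return block == localStore->getBlock() && block == loadOp->getBlock();
}
```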
+1
I think this is for the Q tensor in FA kernels, and for Q the load is outside of the loop, so stream-pipeline will not touch it.
Ok, but all the logic must be checked to verify loop invariance. This can be checked when gathering the cone.
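One way that loop-invariance check could look when the cone is gathered; a sketch only, assuming the enclosing scf.for (if any) has already been located:

```cpp
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "llvm/ADT/SetVector.h"

using namespace mlir;

// Every op in the gathered cone must be defined outside the enclosing loop;
// otherwise hoisting the local_store (and its cone) would change semantics.
static bool coneIsLoopInvariant(const llvm::SetVector<Operation *> &cone,
                                scf::ForOp forOp) {
  if (!forOp)
    return true; // no enclosing loop: trivially invariant
  for (Operation *op : cone)
    if (forOp->isAncestor(op))
      return false;
  return true;
}
```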
Thanks! A few nits. Also +1 to what Simon said. I think we are starting to see conflicting needs, so we should restructure this pass. I'm fine landing this fix first and then doing that next.
third_party/amd/lib/TritonAMDGPUTransforms/ReorderInstructions.cpp: resolved review threads (outdated)
```cpp
Operation *loadOp = localStore->getOperand(0).getDefiningOp();

if (isa<tt::LoadOp>(loadOp)) {
  localAlloc->moveAfter(loadOp);
```
+1
Force-pushed 8871709 to 6a967e0
Thanks for the detailed review! I addressed @antiagainst's code cleanup comments and added some more checks to better verify that the code affects only these very specific patterns.

@sjw36 I'm not sure I quite follow all of the comments. The way I see it, this code affects only two very particular cases and should not undo anything the pipeliner has done. If the tensor is pipelined, the argument to local_load (or even local_store) is a block argument, so this code bails (via the getDefiningOp(); if (someOp) ... checks in the code). So the code only affects tensors that are not pipelined. Moreover, it only affects cases where localAlloc has one or two uses (local_load, or local_load and local_store); again, the code bails if it's not exactly this pattern with this number of uses. So I'm not sure what pointer arithmetic could be broken here. The pipeliner lit tests are passing.

Is there a particular test case or config you are worried about that I could run to examine the resulting TTGIR in more detail?
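A condensed sketch of the bail-out logic being described, assuming tt aliases mlir::triton as in the PR's file. Pipelined buffers feed local_store/local_load through loop-carried block arguments, which have no defining op:

```cpp
#include "triton/Dialect/Triton/IR/Dialect.h"

namespace tt = mlir::triton;

// For a pipelined tensor the operand is a block argument, so getDefiningOp()
// returns nullptr and the pattern is rejected; only a direct tt.load source
// (an unpipelined tensor such as Q in a Flash Attention kernel) matches.
static bool hasHoistableSource(mlir::Operation *localStore) {
  mlir::Operation *src = localStore->getOperand(0).getDefiningOp();
  return src && mlir::isa<tt::LoadOp>(src);
}
```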
Code reviewed with @oplavsic; the pattern makes sense and captures all necessary ops.
Minor nits, and please add a check for specifically moving local_* ops out of the loop.
Move writing to LDS and reading from LDS right after the load of a tensor from global memory. This PR does the reordering by considering two possible patterns, depending on whether the write to LDS happens through an optional local_alloc argument or through a local_store instruction: 1) load -> local_alloc -> local_store -> local_load; 2) load -> local_alloc -> local_load.
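A hedged sketch of matching those two patterns, using the local_alloc/local_store/local_load op names from Triton's ttg dialect; the helper is illustrative and assumes LocalAllocOp's optional source operand and the usual getSrc() accessors:

```cpp
#include "triton/Dialect/TritonGPU/IR/Dialect.h"

namespace ttg = mlir::triton::gpu;

// Pattern 2: load -> local_alloc(%src) -> local_load, where the alloc is
// initialized directly from the loaded tensor.
// Pattern 1: load -> local_alloc -> local_store -> local_load, where the
// write to LDS goes through an explicit local_store.
static mlir::Operation *findGlobalLoad(ttg::LocalAllocOp alloc) {
  if (alloc.getSrc())
    return alloc.getSrc().getDefiningOp();
  for (mlir::Operation *user : alloc->getUsers())
    if (auto store = llvm::dyn_cast<ttg::LocalStoreOp>(user))
      return store.getSrc().getDefiningOp();
  return nullptr;
}
```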