[AMD] Hoist Q out of the loop for FA optimization #4666
Conversation
This change seems to undo some of the ordering that the new pipeliner sets. For instance, the pipeliner will issue the global loads for all buffers first, then write those to shared memory. Is there a perf benefit to interleaving these (i.e. load(A) -> local_store(A), load(B) -> local_store(B))?
```cpp
Operation *loadOp = localStore->getOperand(0).getDefiningOp();

if (isa<tt::LoadOp>(loadOp)) {
  localAlloc->moveAfter(loadOp);
```
What about all the pointer logic that these ops depend on? You must capture/move the cone of logic for all operands of the localStore. The local_store may also be in the loop, and it may not be valid to hoist it.
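A minimal sketch of what capturing that cone could look like, using the BackwardSliceOptions-based getBackwardSlice() overload from mlir/Analysis/SliceAnalysis.h; the helper name gatherCone is illustrative, not the PR's code:

```cpp
#include "mlir/Analysis/SliceAnalysis.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Collect every op the local_store transitively depends on within its own
// block (pointer arithmetic, splats, ...). All of these would have to move
// together with the store for the reordering to be valid.
static SetVector<Operation *> gatherCone(Operation *localStore) {
  SetVector<Operation *> cone;
  BackwardSliceOptions opts;
  opts.filter = [&](Operation *op) {
    return op->getBlock() == localStore->getBlock();
  };
  getBackwardSlice(localStore, &cone, opts);
  return cone;
}
```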
Check that localAlloc->getBlock(), localStore->getBlock(), and loadOp->getBlock() are all equal to verify the ops are in the same block.
And check below where the logic cone is captured (line 130).
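A small sketch of that guard, assuming localAlloc, localStore, and loadOp are the matched operations:

```cpp
#include "mlir/IR/Operation.h"

// Bail out unless the alloc, the store, and the global load all live in the
// same block; reordering across block boundaries is not safe here.
static bool inSameBlock(mlir::Operation *localAlloc,
                        mlir::Operation *localStore,
                        mlir::Operation *loadOp) {
  mlir::Block *block = localAlloc->getBlock();
  return block == localStore->getBlock() && block == loadOp->getBlock();
}
```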
+1
I think this is for the Q tensor in FA kernels, and for Q the load is outside of the loop, so stream-pipeline will not touch it.
Ok, but all the logic must be checked to verify loop invariance. This can be checked when gathering the cone.
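One way that loop-invariance check could look when the cone is gathered; a sketch only, assuming the enclosing scf.for (if any) has already been located:

```cpp
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "llvm/ADT/SetVector.h"

using namespace mlir;

// Every op in the gathered cone must be defined outside the enclosing loop;
// otherwise hoisting the local_store (and its cone) would change semantics.
static bool coneIsLoopInvariant(const llvm::SetVector<Operation *> &cone,
                                scf::ForOp forOp) {
  if (!forOp)
    return true; // no enclosing loop: trivially invariant
  for (Operation *op : cone)
    if (forOp->isAncestor(op))
      return false;
  return true;
}
```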
Thanks! A few nits. Also +1 to what Simon said. I think we are starting to see conflicting needs, so we should restructure this pass. I'm fine landing this fix first and then doing that next.
third_party/amd/lib/TritonAMDGPUTransforms/ReorderInstructions.cpp: resolved review threads (outdated)
```cpp
Operation *loadOp = localStore->getOperand(0).getDefiningOp();

if (isa<tt::LoadOp>(loadOp)) {
  localAlloc->moveAfter(loadOp);
```
+1
Force-pushed 8871709 to 6a967e0
Thanks for the detailed review! I addressed @antiagainst's code cleanup comments and added some more checks to better verify that the code affects only these very specific patterns.

@sjw36 I'm not sure I quite follow all of the comments. The way I see it, this code affects only two very particular cases and should not undo anything the pipeliner has done. If the tensor is pipelined, the argument to local_load (or even local_store) is a block argument, so this code bails (via the getDefiningOp(); if (someOp) ... checks in the code). So the code only affects tensors that are not pipelined. Moreover, it only affects cases where localAlloc has one or two uses (local_load, or local_load and local_store); again, the code bails if it's not exactly this pattern with this number of uses. So I'm not sure what pointer arithmetic could be broken here. The pipeliner lit tests are passing.

Is there a particular test case or config you are worried about that I could run to examine the resulting TTGIR in more detail?
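A condensed sketch of the bail-out logic being described, assuming tt aliases mlir::triton as in the PR's file. Pipelined buffers feed local_store/local_load through loop-carried block arguments, which have no defining op:

```cpp
#include "triton/Dialect/Triton/IR/Dialect.h"

namespace tt = mlir::triton;

// For a pipelined tensor the operand is a block argument, so getDefiningOp()
// returns nullptr and the pattern is rejected; only a direct tt.load source
// (an unpipelined tensor such as Q in a Flash Attention kernel) matches.
static bool hasHoistableSource(mlir::Operation *localStore) {
  mlir::Operation *src = localStore->getOperand(0).getDefiningOp();
  return src && mlir::isa<tt::LoadOp>(src);
}
```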
Code reviewed with @oplavsic; the pattern makes sense and captures all necessary ops.
Minor nits, and please add a check for specifically moving local_* ops out of the loop.
Move writing to LDS and reading from LDS right after the load of a tensor from global memory. This PR does the reordering by considering two possible patterns, depending on whether the write to LDS happens through an optional local_alloc argument or through a local_store instruction: 1) load -> local_alloc -> local_store -> local_load; 2) load -> local_alloc -> local_load.
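A hedged sketch of matching those two patterns, using the local_alloc/local_store/local_load op names from Triton's ttg dialect; the helper is illustrative and assumes LocalAllocOp's optional source operand and the usual getSrc() accessors:

```cpp
#include "triton/Dialect/TritonGPU/IR/Dialect.h"

namespace ttg = mlir::triton::gpu;

// Pattern 2: load -> local_alloc(%src) -> local_load, where the alloc is
// initialized directly from the loaded tensor.
// Pattern 1: load -> local_alloc -> local_store -> local_load, where the
// write to LDS goes through an explicit local_store.
static mlir::Operation *findGlobalLoad(ttg::LocalAllocOp alloc) {
  if (alloc.getSrc())
    return alloc.getSrc().getDefiningOp();
  for (mlir::Operation *user : alloc->getUsers())
    if (auto store = llvm::dyn_cast<ttg::LocalStoreOp>(user))
      return store.getSrc().getDefiningOp();
  return nullptr;
}
```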