Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMD] Hoist Q out of the loop for FA optimization #4666

Merged
merged 4 commits into from
Sep 9, 2024

Conversation

oplavsic
Copy link
Contributor

@oplavsic oplavsic commented Sep 6, 2024

Move writing to LDS and reading from LDS right after the loading of a tensor from global memory. This PR does reordering by considering 2 possible patterns depending on whether writing to LDS is done using an optional local_alloc argument or a local_store instruction: 1) load -> local_alloc -> local_store -> local_load, 2) load -> local_alloc -> local_load.

Copy link
Contributor

@sjw36 sjw36 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems to undo some of the ordering that the new pipeliner sets. For instance, the pipeliner will issue the global loads for all buffers first, then write those to shared memory. Is there a perf benefit to interleaving these (ie. load(A)->local_store(A) - load(B)->local_store(B)?

Operation *loadOp = localStore->getOperand(0).getDefiningOp();

if (isa<tt::LoadOp>(loadOp)) {
localAlloc->moveAfter(loadOp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about all the pointer logic that these ops depend on?
You must capture/move the cone of logic for all operands of the localStore.
The local_store may also be in the loop and it may not be valid to hoist it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check localAlloc->getBlock() == localStore->getBlock() == loadOp->getBlock() to verify in the same block.

And check below where the logic cone is captured (line 130).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@zhanglx13
Copy link
Collaborator

I think this is for the Q tensor in FA kernels. And for Q the load is outside of the loop so stream-pipeline will not touch it.
Then if you see local_alloc/load/store inside the loop, you can find its definingOp. And if the definingOp is outside the loop, instead of inside the loop or being a blockArg, we know this is for Q.
More generally, such local_xxx can be hoisted out of the loop since we don't want to 1) use more LDS, and 2) load from LDS at every iteration of the loop.

@sjw36
Copy link
Contributor

sjw36 commented Sep 6, 2024

I think this is for the Q tensor in FA kernels. And for Q the load is outside of the loop so stream-pipeline will not touch it. Then if you see local_alloc/load/store inside the loop, you can find its definingOp. And if the definingOp is outside the loop, instead of inside the loop or being a blockArg, we know this is for Q. More generally, such local_xxx can be hoisted out of the loop since we don't want to 1) use more LDS, and 2) load from LDS at every iteration of the loop.

Ok, all the logic must be checked to verify loop invariance. This can be checked when gathering the cone.

Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A few nits. Also +1 to what Simon said. I think we are starting to see conflicting needs so we should restructure this pass. I'm fine landing this fix first and then doing that next.

Operation *loadOp = localStore->getOperand(0).getDefiningOp();

if (isa<tt::LoadOp>(loadOp)) {
localAlloc->moveAfter(loadOp);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@oplavsic oplavsic force-pushed the hoist_q branch 4 times, most recently from 8871709 to 6a967e0 Compare September 9, 2024 13:33
@oplavsic
Copy link
Contributor Author

oplavsic commented Sep 9, 2024

Thanks for detailed review! I addressed @antiagainst code cleanup comments and added some more checks to better verify that the code is affecting only these very specific patterns. @sjw36 I'm not sure I quite follow all of the comments. The way I see it, this code affects only two very particular cases. It should not undo anything that pipeliner has done. If the tensor is pipelined and argument to local_load (or even local_store) is the block argument, this code will bail (on getDefiningOp(); if(SomeOp) ... checks that are in the code). So the code should only affect tensors that are not pipelined. Moreover, it affects only cases where localAlloc has 1 or 2 uses (local_load, or local_load and local_store), and again, the code bails if it's not exactly this pattern with this number of uses. So I'm not sure I understand what pointer arithmetic can be broken here. Pipeliner lit tests are passing. Is there some particular test case or config in general you are worried about that I can run and see in more detail (by examining TTGIR) cases that you are worried about?

Copy link
Contributor

@sjw36 sjw36 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code reviewed with @oplavsic, the pattern makes sense and captures all necessary ops.
Minor nits and please add check for specifically moving local_ ops out of loop.

@antiagainst antiagainst marked this pull request as ready for review September 9, 2024 21:51
@antiagainst antiagainst changed the title [AMD] Bring back hoist Q out of the loop FA optimization [AMD] Hoist Q out of the loop FA optimization Sep 9, 2024
@antiagainst antiagainst changed the title [AMD] Hoist Q out of the loop FA optimization [AMD] Hoist Q out of the loop for FA optimization Sep 9, 2024
@antiagainst antiagainst merged commit e192dba into triton-lang:main Sep 9, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants