[BACKEND] Implement generic swizzling when lowering convert_layout #6982
Conversation
I'll run benchmarks and do a couple minor clean-ups tomorrow. Will also add a couple lit tests, although there is already one for the fp8 transpose which shows that we can indeed vectorise it.
@@ -68,7 +68,7 @@ tt.func @matmul_loop(%lb : index, %ub : index, %step : index, %A : !tt.ptr<f16>,
     // Shared memory is available after a tensor's liveness range ends
     // expected-remark @below {{reusable}}
-    // expected-remark @below {{size = 4608}}
+    // expected-remark @below {{size = 8192}}
Seems like shared memory usage has increased a lot.
These often come from being able to vectorise more than before (and, as such, not being able to do as many reps).
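As a back-of-the-envelope illustration (all numbers below are hypothetical, not taken from this test): if the scratch buffer holds one rep of the tensor at a time, then halving the number of reps doubles the scratch size.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical sizing arithmetic: scratch bytes = totalElems / reps * elemBytes,
// so wider vectorisation (fewer reps) means a larger scratch buffer per rep.
int main() {
  const int64_t totalElems = 64 * 64; // assumed tensor size
  const int64_t elemBytes = 2;        // assumed fp16 elements
  for (int reps : {4, 2}) {           // fewer reps <=> wider vectorised accesses
    int64_t scratchBytes = totalElems / reps * elemBytes;
    std::printf("reps = %d -> scratch = %lld bytes\n", reps,
                static_cast<long long>(scratchBytes));
  }
  return 0;
}
```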
  return smem.getTotalOutDimSize() / reps;
}

static unsigned getNumScratchElemsPaddedCvt(RankedTensorType srcTy,
Is it there only for the isStMatrix case?
yep
  auto logBankConflicts = std::min<int32_t>(
      std::max<int32_t>(0, lenSegment - A.size() - segment.size()), A.size());
  // Conflict-free
  for (int i = logBankConflicts; i < A.size(); ++i)
This ^ operator here isn't clear to me, but we can chat offline.
This part is in the explanation of the algorithm in the paper, but yes, I agree it is quite a tricky part.
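For a rough sense of the clamping arithmetic only (this is my reading of the snippet, with made-up sizes, not the paper's definition): the expression bounds the number of bases of A that may introduce bank conflicts to the range [0, A.size()].

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical worked example of the clamp in the snippet above; the concrete
// sizes of lenSegment, A, and segment are invented for illustration.
int main() {
  int32_t lenSegment = 10;          // assumed segment length (in bits)
  std::vector<int32_t> A(5);        // assume 5 basis vectors
  std::vector<int32_t> segment(3);  // assume 3 basis vectors

  int32_t logBankConflicts = std::min<int32_t>(
      std::max<int32_t>(0, lenSegment - static_cast<int32_t>(A.size()) -
                               static_cast<int32_t>(segment.size())),
      static_cast<int32_t>(A.size()));

  // min(max(0, 10 - 5 - 3), 5) == 2: the first two bases of A may conflict,
  // and the "conflict-free" loop starts at i = logBankConflicts = 2.
  std::printf("logBankConflicts = %d\n", logBankConflicts);
  return 0;
}
```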
We implement a generic swizzling algorithm by @apgoucher that, given two linear layouts, finds the optimal shared memory layout that maximises read/write vectorisation and, subject to that, minimises bank conflicts.
We also implement an algorithm to find the minimum tile size necessary to perform the convert_layout given the restrictions above, and we use it to perform the convert_layout iteratively. This PR does not yet implement a lowering to ldmatrix/stmatrix; we'll do that in a future PR.
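For intuition only, here is a toy CPU model of that iterative structure (not the actual GPU lowering; sizes, names, and layouts are made up): each rep is staged through a small scratch tile, written with the source layout and read back with the destination layout. On the GPU the tile lives in shared memory and its layout is the swizzled one chosen by the algorithm.

```cpp
#include <array>
#include <cstdio>

// Toy model: convert a 4x4 tensor from row-major to column-major order
// through a 2x4 scratch tile, one "rep" (pair of rows) at a time.
int main() {
  constexpr int M = 4, N = 4, TileM = 2;       // assumed sizes
  std::array<int, M * N> src{}, dst{};
  for (int i = 0; i < M * N; ++i) src[i] = i;  // row-major input

  std::array<int, TileM * N> tile{};           // stand-in for the shared-memory tile
  for (int rep = 0; rep < M / TileM; ++rep) {
    // "Store" phase: write this rep into the tile with the source layout.
    for (int i = 0; i < TileM; ++i)
      for (int j = 0; j < N; ++j)
        tile[i * N + j] = src[(rep * TileM + i) * N + j];
    // "Load" phase: read the tile back with the destination (column-major) layout.
    for (int j = 0; j < N; ++j)
      for (int i = 0; i < TileM; ++i)
        dst[j * M + rep * TileM + i] = tile[i * N + j];
  }

  for (int v : dst) std::printf("%d ", v);     // prints the transposed element order
  std::printf("\n");
  return 0;
}
```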