[BACKEND] Implement generic swizzling when lowering convert_layout #6982


Open · wants to merge 15 commits into main
Conversation

lezcano (Contributor) commented May 29, 2025

We implement a generic swizzling algorithm by @apgoucher that, given two linear layouts, finds the optimal shared memory layout: one that maximises read/write vectorisation and, subject to that, minimises bank conflicts.

We also implement an algorithm to find the minimum tile size necessary to perform the convert_layout given the restrictions above, and we use it to perform the convert_layout iteratively.

This PR does not yet implement a lowering to ldmatrix/stmatrix, we'll do that in a future PR.
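For context, the kind of swizzling being generalised can be illustrated with the classic XOR scheme for a fixed 32x32 tile. This is a hand-rolled sketch for intuition only (all names here are hypothetical); the PR instead derives such layouts generically from the two linear layouts:

```cpp
#include <cassert>

// Classic XOR swizzle for a row-major 32x32 tile in shared memory with
// 32 banks: element (row, col) is stored at column (col ^ row), so the
// elements of any tile column land in 32 distinct banks and a
// column-wise read is bank-conflict free. Hypothetical sketch, not the
// PR's implementation.
constexpr int kTileDim = 32;

int swizzledOffset(int row, int col) {
  return row * kTileDim + (col ^ row);
}
```

Reading tile column `c`, row `r` touches bank `(c ^ r) % 32`; the XOR makes these distinct across rows, which is the bank-conflict-free property the generic algorithm optimises for.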

@lezcano lezcano requested review from Jokeren and ptillet as code owners May 29, 2025 16:29
@lezcano lezcano marked this pull request as draft May 29, 2025 16:30
@lezcano lezcano changed the title [BACKEND][DNR] Implement generic swizzling when lowering convert_layout [BACKEND][WIP] Implement generic swizzling when lowering convert_layout May 29, 2025
@lezcano lezcano closed this May 29, 2025
@lezcano lezcano reopened this May 29, 2025
@lezcano lezcano marked this pull request as ready for review May 29, 2025 21:38
@lezcano lezcano changed the title [BACKEND][WIP] Implement generic swizzling when lowering convert_layout [BACKEND] Implement generic swizzling when lowering convert_layout May 29, 2025
@lezcano lezcano requested a review from ThomasRaoux May 29, 2025 21:38
lezcano (Contributor, Author) commented May 29, 2025

I'll run benchmarks and do a couple of minor clean-ups tomorrow. I'll also add a couple of lit tests, although there is already one for the fp8 transpose, which shows that we can indeed vectorise it.

@@ -68,7 +68,7 @@ tt.func @matmul_loop(%lb : index, %ub : index, %step : index, %A : !tt.ptr<f16>,

 // Shared memory is available after a tensor's liveness range ends
 // expected-remark @below {{reusable}}
-// expected-remark @below {{size = 4608}}
+// expected-remark @below {{size = 8192}}
Contributor:
Seems like shared memory usage has been increased a lot

Contributor Author:
These often come from being able to vectorise more than before (and, as such, not being able to do as many reps).

return smem.getTotalOutDimSize() / reps;
}

static unsigned getNumScratchElemsPaddedCvt(RankedTensorType srcTy,
Contributor:
Is it there only for the isStMatrix case?

Contributor Author:
yep
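The `smem.getTotalOutDimSize() / reps` excerpt above can be read as: when the conversion is tiled into `reps` iterations, the scratch buffer only has to hold one tile. A toy standalone version (the function name and framing are my assumptions, not the PR's code):

```cpp
#include <cassert>

// Toy reading of the excerpt: the total conversion footprint divided by
// the number of reps gives the per-iteration scratch requirement. Fewer
// reps (e.g. because wider vectorised accesses allow bigger tiles)
// therefore mean a larger scratch buffer. Names are hypothetical.
unsigned scratchElemsPerRep(unsigned totalOutDimSize, unsigned reps) {
  return totalOutDimSize / reps;
}
```

This matches the discussion above: more vectorisation means fewer reps, which means more shared memory per rep.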

auto logBankConflicts = std::min<int32_t>(
std::max<int32_t>(0, lenSegment - A.size() - segment.size()), A.size());
// Conflict-free
for (int i = logBankConflicts; i < A.size(); ++i)
Contributor:
This ^ operator here isn't clear to me, but we can chat offline

lezcano (Contributor, Author) commented Jun 1, 2025:
This part is covered in the explanation of the algorithm in the paper, but yes, I agree it is quite a tricky part.
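For intuition, the clamp in the excerpt can be exercised standalone. This is a hypothetical reading of the formula (container sizes passed as plain integers; the interpretation in the comment is my own, not taken from the paper):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Standalone version of the clamp from the excerpt: the (log2) number
// of bank conflicts is how far the segment length exceeds the combined
// number of basis vectors available in A and the segment, clamped to
// the range [0, |A|]. Hypothetical reading of the formula above.
int32_t computeLogBankConflicts(int32_t lenSegment, int32_t sizeA,
                                int32_t sizeSegment) {
  return std::min<int32_t>(
      std::max<int32_t>(0, lenSegment - sizeA - sizeSegment), sizeA);
}
```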
