[AMD] Introduce an OptimizeLDSUsage pass #3730
Conversation
int tmpCvtLDS = getCvtOpLDSUsage(tmpCvt);
int newCvtLDS = getCvtOpLDSUsage(newEpilogueCvt);
if (tmpCvtLDS <= LDSSize && newCvtLDS <= LDSSize) {
  int LDSUsage = std::max(tmpCvtLDS, newCvtLDS);
@oplavsic
I've changed this part of the algorithm: https://github.com/openai/triton/pull/3730/files#diff-0d63e5cd9cf58151489fd9a5206b43a0902939004e58f3a7ec5258fa7d473267L227
Was it crucial?
To clarify what this PR is doing: at the moment we have an LDS optimization in the DecomposeUnsupportedConversions pass. The current approach can not optimize a convert_layout op in the Hopper flash attention test, so LDS overflows.
The first item is needed because the old set of intermediate layouts was not able to optimize the conversions found in Hopper FA. The second item is needed to generalize the optimization. For example, take a look at this example:
%1 consumes 16 KB of LDS, %2 requires ~64 KB of LDS for a scratch buffer. P.S. I had some concerns that the new optimization could affect existing benchmarks. I had an offline conversation with the author of the original optimization (@oplavsic) and we decided that it is best to leave the old optimization functionally the same, but move some functions to a common place and make them parameterizable.
 * ->
 * %1 = cvtOp %0 (srcLayout -> dstLayout)
 * %2 = cvtOp %0 (srcLayout -> tmpLayout)
 * %3 = cvtOp %1 (tmpLayout -> dstLayout)
Should this be %3 = cvtOp %2?
This function creates two cvtOps based on a given cvtOp. Could you be more specific in the comment about which cvtOp is the new one and which is the old one?
// LDS reduction is possible by changing the shape of WarpsPerCta attribute in
// mfma layout. The implicit LDS usage of cvt(mfma->blocked) op depends on the
// number of warps per CTA that mfma layout uses along x dimension and block
// layout uses across y dimension.
It's a little confusing whether x refers to the row or column. We can use dim 0 and dim 1 instead.
// LDS usage of this op is roughly calculated as:
// LDS_USAGE = getShapePerCTA(mfma_layout)[0] * getShapePerCTA(blocked_layout)[1] * sizeof(data_type)
// LDS_USAGE = warpsPerCTA(mfma_layout)[0] * warpsPerCta(blocked_layout)[1] * C,
// where C = 32 * sizePerWarp(blocked_layout)[1] * threadsPerWarp(blocked_layout)[1] * sizeof(data_type)
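For illustration, the quoted estimate can be sketched as plain arithmetic. This is a hypothetical helper, not the pass's actual code: the parameter names and the hardcoded 32 are taken verbatim from the comment above (the 32 is exactly what the review question below is about).

```cpp
#include <cstdint>

// Hypothetical sketch of the LDS usage estimate quoted above.
// The 32 and the dim-1 parameters follow the comment literally and may
// only be valid for mfma32-style layouts.
int64_t estimateCvtLDSUsage(int64_t mfmaWarpsDim0, int64_t blockedWarpsDim1,
                            int64_t blockedSizePerThreadDim1,
                            int64_t blockedThreadsPerWarpDim1,
                            int64_t elemBytes) {
  // C = 32 * sizePerWarp(blocked)[1] * threadsPerWarp(blocked)[1] * sizeof(data_type)
  int64_t c =
      32 * blockedSizePerThreadDim1 * blockedThreadsPerWarpDim1 * elemBytes;
  // LDS_USAGE = warpsPerCTA(mfma)[0] * warpsPerCTA(blocked)[1] * C
  return mfmaWarpsDim0 * blockedWarpsDim1 * c;
}
```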
Why is 32 hardcoded? Is it assuming mfma32 is used?
To be honest, I did not look deep into this comment, I just copied it from the original algorithm.
It was implemented a long time ago; we probably had only mfma32 at the time.
I'll take a closer look and adjust.
for (int i = 0; i < tmpLayouts.size(); i++) {
  auto tmpLayout = tmpLayouts[i];
  std::tie(tmpCvt, newEpilogueCvt) =
      createNewConvertOps(builder, cvtOp, tmpLayout);
In this loop, we only want to know the index of the tmpLayout that gives us the min LDS usage. Do we really need to create the cvtOps and erase them at the end of each iteration?
This creation/deletion is needed because the algorithm uses the getScratchConfigForCvtLayout(ConvertLayoutOp, unsigned&, unsigned&)
function from Allocation.cpp
to estimate LDS usage.
I can introduce a new interface, so we can avoid this redundant work.
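The reviewer's suggestion could look like the following sketch: select the index of the candidate layout with minimal estimated LDS usage via a pure cost callback, without materializing and erasing convert ops on each iteration. The template, the callback, and the limit check are illustrative assumptions, not the pass's real interface.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical argmin over candidate temporary layouts: returns the index
// of the layout with the smallest estimated LDS usage that fits within
// ldsLimit, or -1 if none fits. No IR is created or erased here.
template <typename Layout>
int findMinLDSLayout(const std::vector<Layout> &tmpLayouts,
                     const std::function<int64_t(const Layout &)> &ldsCost,
                     int64_t ldsLimit) {
  int best = -1;
  int64_t bestCost = std::numeric_limits<int64_t>::max();
  for (int i = 0; i < static_cast<int>(tmpLayouts.size()); ++i) {
    int64_t cost = ldsCost(tmpLayouts[i]); // estimate only, no op creation
    if (cost <= ldsLimit && cost < bestCost) {
      bestCost = cost;
      best = i;
    }
  }
  return best;
}
```

Only the winning index would then be used to create the two convert ops once, after the loop.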
 * @return mapping from operation to list of live LDS buffers
 */
std::map<mlir::Operation *, SmallVector<Allocation::BufferId>>
analyzeBufferLiveness(FunctionOpInterface func, const Allocation *allocations) {
This is not AMD specific. Maybe we should put it in Analysis/Allocation.cpp?
}

SmallVector<triton::gpu::ConvertLayoutOp>
findLDSBottleneck(ModuleAllocation &allocAnalysis, FunctionOpInterface func) {
We could also put this in the common part since it can benefit the NV path. But then again, NV GPUs have pretty large shared memory ...
@binarman I have a question regarding
In this example,
namespace {

constexpr int LDSSize = 65536;
Could we not hardcode it but pass it from the front end?
@zhanglx13 About the condition: it filters out cases which will definitely overflow LDS, and there is no early exit.
Yes, at least the early return condition needs to be removed.
Now I see, I've missed this early return, thank you!
module attributes {"triton_gpu.num-warps" = 8 : i32, "triton_gpu.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) attributes {noinline = false} {
    %1 = triton_gpu.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !tt.memdesc<128x128xf16, #shared>
    %2 = triton_gpu.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
Sorry, I forgot to mention that I think this cvtOp is decomposed just because it uses more than 64 KB of LDS, since padding is used. Therefore, this test does not test the functionality that a cvtOp can still be decomposed even if it uses less than 64 KB of LDS.
Added a new test: it uses fp16 instead of fp32, so the cvt scratch buffer is 2x smaller.
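The arithmetic behind the two tests can be sketched as follows. This is back-of-the-envelope reasoning only: the real allocation in Triton may add padding, which is exactly why the original fp32 case exceeds 64 KB.

```cpp
#include <cstdint>

// Unpadded scratch-buffer size for a rows x cols tensor conversion.
// Padding is ignored, so real LDS usage can be larger than this.
constexpr int64_t scratchBytes(int64_t rows, int64_t cols, int64_t elemBytes) {
  return rows * cols * elemBytes;
}

// 128x128 fp32 already fills the whole 64 KB of LDS before padding,
// while the fp16 variant leaves headroom even with some padding.
static_assert(scratchBytes(128, 128, 4) == 64 * 1024, "fp32 case");
static_assert(scratchBytes(128, 128, 2) == 32 * 1024, "fp16 case");
```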
third_party/amd/backend/compiler.py
Outdated
@@ -147,6 +147,8 @@ def make_llir(src, metadata, options):
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    amd.passes.ttgpuir.add_decompose_unsupported_conversions(pm)
    lds_size = 65536
I am not sure where to place the code choosing the LDS size, so it is a plain constant at this point.
Let's introduce some interface in a later PR.
It should be convenient to rebase onto Lei's PR #3808
(converting to draft as we chatted -- need to first get all issues addressed from the AMD side before making it open)
@antiagainst @zhanglx13
namespace triton {
namespace AMD {

constexpr int kPtrBitWidth = 64;
Do we really need to hardcode the pointer bitwidth? Can we just use inline constant?
This part is copied from Allocation.cpp
(it is not part of the public interface).
Maybe I can actually move this part into some public interface, for example the Analysis/Utility
module.
This is what I was talking about: binarman#6
res.LDS = std::numeric_limits<typeof(res.LDS)>::max();

triton::gpu::ConvertLayoutOp tmpCvt;
triton::gpu::ConvertLayoutOp newEpilogueCvt;
The above three lines are not used.
threadsPerWarp[rank - 2] = warpSize / threadsPerWarp[rank - 1];
auto order = triton::gpu::getOrder(srcEnc);
auto layoutCTA = triton::gpu::getCTALayout(srcEnc);
auto fallbackLayout = triton::gpu::BlockedEncodingAttr::get(
- For this fallbackLayout, all the components except warpsPerCTA are loop invariants. Maybe we can create a base BlockLayout out of the loop and use createTmpLayout(blockEnc, warpsPerCTA) inside the loop to update only the warpsPerCTA?
- Why is 8 chosen in warpSize / 8?
- In general, why do we need this fallbackLayout? Is it covered by either srcEnc or dstEnc?
- Why is 8 chosen in warpSize / 8?

For wave64 it will be [8, 8]; for wave32 it will be [4, 8]. This is done to make the layout tile "square", so no dimension of the minimal tile dominates.

- In general, why do we need this fallbackLayout? Is it covered by either srcEnc or dstEnc?

In some cases a different warpsPerCTA for the src or dst layout is not enough to reduce LDS usage, but some other layouts can be appropriate. These fallback layouts are designed to have as compact a tile as possible, i.e. elementsPerThread = [1, ..., 1], and a threadsPerWarp that is as "square" as possible.
I believe that in most cases the fallback layout will be chosen as the temporary layout. This could be non-optimal in terms of performance, but it is fine, because without this transformation the kernel will not compile at all.
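The "as square as possible" choice described above can be sketched like this. The helper name is hypothetical; the pass computes this on MLIR layout attributes rather than plain integers.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical sketch: fix dim 1 of threadsPerWarp to 8 lanes and give
// the remaining lanes to dim 0, so the per-warp tile stays near-square.
std::pair<int64_t, int64_t> squareThreadsPerWarp(int64_t warpSize) {
  int64_t dim1 = 8;
  return {warpSize / dim1, dim1}; // wave64 -> [8, 8], wave32 -> [4, 8]
}
```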
  return;
}

triton::gpu::ConvertLayoutOp tmpCvt;
are we using this tmpCvt?
Nope, I will rewrite this part as done in the DecomposeUnsupportedConversions pass.
if (offset + size > LDSSize) {
  auto maxScratchBufferSize = computeMaxScratchBufferSize(
      cvtOp, funcAnalysis, liveBuffers[cvtOp]);
  candidates.push_back({cvtOp, maxScratchBufferSize});
This function is very confusing to me.
- Why do we need opBuffer? Just to check that it's valid?
- Does liveBuffers[cvtOp] include opBuffer? To put it another way, is one of the bufIds the scratch buffer allocated for this cvtOp?
- It seems to me that this function assumes there is at most one extra buffer that can overlap with the buffer for this cvtOp. If there are more live buffers that overlap with this cvtOp, we should still push cvtOp into candidates only once, but compute maxScratchBufferSize based on all overlapping live buffers.
Why do we need opBuffer? Just to check that it's valid?

Sorry, this is a leftover after refactoring. I used to pass it to computeMaxScratchBufferSize, but then started computing it inside the function.

Does liveBuffers[cvtOp] include opBuffer? To put it another way, is one of the bufIds the scratch buffer allocated for this cvtOp?

Yes, the scratch buffer is the same as the "long-living" buffers; the only difference is that its lifetime is limited to one operation.

It seems to me that this function assumes there is at most one extra buffer that can overlap with the buffer for this cvtOp. If there are more live buffers that overlap with this cvtOp, we should still push cvtOp into candidates only once, but compute maxScratchBufferSize based on all overlapping live buffers.

No, there can be any number of buffers whose lifetimes overlap with the scratch buffer.
Let me remove the loop from this function; it should make the algorithm clearer.
int64_t scratchBufferSize = allocation->getAllocatedSize(scratchBufferId);
size_t totalLDSConsumption = 0;
for (auto buf : liveBuffers)
  totalLDSConsumption = std::max(
If all liveBuffers are live at this cvtOp, should we use sum instead of max here?
Max is a more conservative metric in this sense. Consider that we have "holes" in memory:
say the green buffer is the scratch buffer we want to optimize, and the violet and blue buffers are long-living buffers in shared layout.
A hole is created because the pink tensor is allocated on tick 1 and deallocated on tick 2, while the previously allocated violet tensor continues to live.
Summing the buffer sizes would suggest we have 24 KB (3 * 8 KB) for the scratch buffer, but in reality we probably want to make it smaller.
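The max-based estimate can be sketched as follows. The struct and helper are illustrative stand-ins for the Allocation API: free contiguous space for the scratch buffer is bounded by the highest end offset among live buffers, because holes between allocations may not be usable.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative stand-in for an allocated LDS buffer (not the real API).
struct LiveBuffer {
  int64_t offset;
  int64_t size;
};

// Conservative upper bound on scratch space: everything above the highest
// occupied end offset. Summing sizes instead would count unusable holes.
int64_t maxScratchBytes(const std::vector<LiveBuffer> &live, int64_t ldsSize) {
  int64_t highestEnd = 0;
  for (const auto &b : live)
    highestEnd = std::max(highestEnd, b.offset + b.size);
  return ldsSize - highestEnd;
}
```

For two 8 KB buffers at offsets 0 and 16 KB (an 8 KB hole between them), summing sizes would promise 48 KB of a 64 KB LDS, while the max-based bound yields only 40 KB, which is the contiguous space actually guaranteed.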
 * space available for scratch buffer.
 */
int64_t
computeMaxScratchBufferSize(triton::gpu::ConvertLayoutOp op,
Maybe computeTargetBufferSize? I feel like "target" or "desired" is more accurate about what we want to do here.
third_party/amd/backend/compiler.py
Outdated
@@ -169,6 +169,9 @@ def make_llir(src, metadata, options):
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    amd.passes.ttgpuir.add_decompose_unsupported_conversions(pm, options.arch)
    # experimental parameter, specifies custom LDS usage limit
Can you elaborate on this parameter? What does it mean when it's set to a non-zero and zero?
Especially when it's set to non-zero value, does it mean the total LDS usage is guaranteed to be lower than that? Or is it just a hint?
Approved. Could you please add more comments for custom_lds_size? See below.
} // namespace

namespace mlir::triton::AMD {
@antiagainst You changed this part from multiple lines to one; is that the preferred code style now?
Imported from GitHub PR openxla/xla#15477 This build break is introduced by openxla/xla#15257 and ROcm has a new optimized LDS pass on openai/triton triton-lang/triton#3730 @xla-rotation Copybara import of the project: -- 6f86fdbd090a4fc3fa2346ba6969d7ddeae773e3 by Chao Chen <[email protected]>: updated rocm triton OptimizeLDSUsage pass due to triton-lang/triton#3730 Merging this change closes #15477 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#15477 from ROCm:ci_hotfix_20240730 6f86fdbd090a4fc3fa2346ba6969d7ddeae773e3 PiperOrigin-RevId: 657498811
This PR introduces an OptimizeLDSUsage pass, which generalizes the LDS optimization
that was part of the DecomposeUnsupportedConversions pass.
Overall, it tries to reduce LDS usage of convert ops by adding an intermediate layout
to the conversion.
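The core idea can be sketched with illustrative numbers (not real layout computations): a direct src->dst conversion needs one large scratch buffer, while src->tmp followed by tmp->dst needs only the larger of two smaller buffers, since the two scratch buffers are not live at the same time.

```cpp
#include <algorithm>
#include <cstdint>

// Peak scratch usage of a decomposed conversion: the two convert ops run
// one after the other, so only the larger of the two buffers is live.
int64_t decomposedPeakScratch(int64_t srcToTmpBytes, int64_t tmpToDstBytes) {
  return std::max(srcToTmpBytes, tmpToDstBytes);
}
```

With a well-chosen intermediate layout, each half of the conversion can need far less scratch space than the direct conversion, letting a kernel that would overflow LDS compile.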