[BACKEND][AMD] Enable swizzling SMEM for transposed operand (#3666)

A transposed operand is accessed in the opposite order from the original operand, so the K dimension index has to be flipped when deciding whether to swizzle shared memory (SMEM). Enabling swizzling for this case appears to help performance: I see about a 10% improvement on our internal model. This is a backport of ROCm#474.
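
To illustrate the change, here is a minimal standalone sketch of the index selection touched by this commit. The helper names `kDimIndex` and `isKDimInner` are hypothetical and not part of Triton. The sketch assumes a 2D operand where `order[0]` names the fastest-varying dimension, and it only mirrors the `kDimNum` lines in the diff, not the bank and vector-size math that follows them.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper (not Triton API): which dimension of a dot operand is K.
// Operand A (opIdx 0) is M x K, so K is dim 1; operand B (opIdx 1) is K x N,
// so K is dim 0. A transposed operand swaps the two dimensions.
int kDimIndex(int opIdx, bool needTrans) {
  int kDimNum = (opIdx == 0) ? 1 : 0;
  if (needTrans)
    kDimNum = 1 - kDimNum; // the two lines added by this commit
  return kDimNum;
}

// Hypothetical helper: the swizzled path is taken only when K is the
// innermost (fastest-varying) dimension, i.e. when order[0] equals kDimNum.
bool isKDimInner(const std::array<unsigned, 2> &order, int opIdx,
                 bool needTrans) {
  return order[0] == static_cast<unsigned>(kDimIndex(opIdx, needTrans));
}
```

With `order = {1, 0}` and `opIdx = 0`, the check passes for a non-transposed operand and fails for a transposed one. With `order = {0, 1}` the transposed operand passes, which is the case that was skipped before this change.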
triton-lang · Apr 15, 2024 · d117047
1 parent 3657381
Showing 1 changed file with 2 additions and 0 deletions.
diff --git a/include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td b/include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
@@ -230,6 +230,8 @@ compared to 1*64 when the hasLeadingOffset is false.
         // ---- begin GFX908/GFX90A ----
         if (auto mfmaEnc = dotOpEnc.getParent().dyn_cast<AMDMfmaEncodingAttr>()) {
           int kDimNum = dotOpEnc.getOpIdx() == 0 ? 1 : 0;
+          if (needTrans)
+            kDimNum = 1 - kDimNum;
           bool isKDimInner = (order[0] == kDimNum);
           if (isKDimInner) {
             const int numBanks = 32;