[Dot] [MFMA] [FMA] Update Dot implementation to support upstream tests #260
Conversation
@micmelesse Can you try this PR to see if it resolves the issue in test_core?
Yes, it fixes the issue. I'll move the tests to test_core_amd.py soon.
Force-pushed from c16e849 to 851a103
@pytest.mark.parametrize("M, N, K, num_warps, col_a, col_b, epilogue, allow_tf32, dtype", | ||
[(*shape, 2, False, False, epilogue, allow_tf32, dtype) | ||
@pytest.mark.parametrize("M, N, K, num_warps, col_a, col_b, epilogue, allow_tf32, in_dtype, out_dtype", | ||
[(*shape, 2, False, False, epilogue, allow_tf32, in_dtype, out_dtype) | ||
for shape in [(64, 64, 64), (32, 32, 32)] |
Why are we not testing (16,16,16)? It is in upstream and when I try it I get a segfault.
If you are running them on an MI100/MI200 GPU, the compiler tries to use MFMA instructions, which have a minimum M/N size of 32.
@micmelesse
Update: I've added a workaround that falls back to FMA instructions, so the 16x16x16 tests work now.
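For context, a minimal sketch (not taken from this PR; the kernel, test name, and tolerances are illustrative only) of how the small-tile case can be exercised once the FMA fallback is in place:

import pytest
import torch
import triton
import triton.language as tl

@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr, M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    rm = tl.arange(0, M)
    rn = tl.arange(0, N)
    rk = tl.arange(0, K)
    a = tl.load(a_ptr + rm[:, None] * K + rk[None, :])
    b = tl.load(b_ptr + rk[:, None] * N + rn[None, :])
    # For 16x16x16 the MFMA path is rejected and the dot lowers to FMA instead.
    c = tl.dot(a, b)
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], c)

@pytest.mark.parametrize("M, N, K", [(16, 16, 16), (32, 32, 32), (64, 64, 64)])
def test_small_dot(M, N, K):
    a = torch.randn((M, K), device="cuda", dtype=torch.float16)
    b = torch.randn((K, N), device="cuda", dtype=torch.float16)
    c = torch.empty((M, N), device="cuda", dtype=torch.float32)
    dot_kernel[(1,)](a, b, c, M=M, N=N, K=K)
    torch.testing.assert_close(c, torch.matmul(a, b).float(), atol=1e-2, rtol=1e-2)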
Force-pushed from 851a103 to a429da0
        if ret_cast_scalar_ty == tl.float16:
            _0 = builder.create_splat(builder.get_fp16(0), [M, N])
        else:
            _0 = builder.create_splat(builder.get_fp32(0), [M, N])
        ret_ty = tl.block_type(ret_cast_scalar_ty, [M, N])
        ret = tl.tensor(builder.create_dot(lhs.handle, rhs.handle, _0, allow_tf32),
                        ret_ty)
        return cast(ret, ret_scalar_ty, builder)
    if is_hip() and mfma_supported(M, N, lhs.type.shape[1], allow_tf32, ret_scalar_ty) and ret_scalar_ty.primitive_bitwidth < 32:
        if lhs.type.scalar.is_int():
            ret_dot_scalar_ty = tl.int32
            _0 = builder.create_splat(builder.get_int32(0), [M, N])
        else:
            ret_dot_scalar_ty = tl.float32
            _0 = builder.create_splat(builder.get_fp32(0), [M, N])
        ret_ty = tl.block_type(ret_dot_scalar_ty, [M, N])
        ret = tl.tensor(builder.create_dot(lhs.handle, rhs.handle, _0, allow_tf32),
                        ret_ty)
        return cast(ret, ret_scalar_ty, builder)
This part is related to supporting FP16 output.
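At the language level this corresponds to requesting a float16 result from tl.dot while accumulation still happens in float32 and the result is cast on return. A minimal sketch, assuming a Triton build where tl.dot accepts the out_dtype keyword used by the upstream tests (kernel and names are illustrative, not from this PR):

import triton
import triton.language as tl

@triton.jit
def dot_fp16_out_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    # The dot accumulates in fp32 internally; the fp16 result comes from the final cast.
    c = tl.dot(a, b, out_dtype=tl.float16)
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], c)

On builds without the out_dtype keyword, an explicit tl.dot(a, b).to(tl.float16) exercises the same cast-on-output path.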
lib/Analysis/Utility.cpp (outdated diff)
+static bool supportMFMAGranularity(int dim_size) {
+  std::vector<int> supported_granularity{32};
+  for (int granularity: supported_granularity)
+    if (dim_size % granularity == 0)
+      return true;
+  return false;
+}
+
 bool supportMFMA(triton::DotOp op) {
-  auto aElemTy = op.getA().getType().cast<RankedTensorType>().getElementType();
-  auto bElemTy = op.getB().getType().cast<RankedTensorType>().getElementType();
+  auto aTy = op.getA().getType().cast<RankedTensorType>();
+  auto bTy = op.getB().getType().cast<RankedTensorType>();
+
+  auto aShape = aTy.getShape();
+  auto bShape = bTy.getShape();
+
+  assert(aShape[1] == bShape[0]);
+  if (!supportMFMAGranularity(aShape[0]) ||
+      !supportMFMAGranularity(aShape[1]) ||
+      !supportMFMAGranularity(bShape[1]))
+    return false;
+
+  auto aElemTy = aTy.getElementType();
+  auto bElemTy = bTy.getElementType();
This part enables the fallback to the FMA implementation for small matrix sizes in the C++ part of the compiler.
python/triton/language/semantic.py (outdated diff)
def mfma_supported_granularity(dim_size) -> bool:
    supported_granularity = [32]
    for granularity in supported_granularity:
        if dim_size % granularity == 0:
            return True
    return False


def mfma_supported(M, N, K, allow_tf32, ret_scalar_ty) -> bool:
    if not gpu_has_mfma():
        return False
    if not mfma_supported_granularity(M) or \
       not mfma_supported_granularity(N) or \
       not mfma_supported_granularity(K):
        return False
This part enables the fallback to the FMA implementation for small matrix sizes in the Python part of the compiler.
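As a quick illustration (a standalone sketch, not code from the PR) of the effect of these checks: only shapes whose M, N, and K are all multiples of 32 stay on MFMA, everything else falls back to FMA.

# Standalone mirror of mfma_supported_granularity above, for illustration only.
def mfma_supported_granularity(dim_size) -> bool:
    return any(dim_size % granularity == 0 for granularity in [32])

for m, n, k in [(16, 16, 16), (32, 32, 32), (64, 64, 64), (64, 64, 16)]:
    path = "MFMA" if all(mfma_supported_granularity(d) for d in (m, n, k)) else "FMA fallback"
    print(f"{m}x{n}x{k}: {path}")
# 16x16x16 and 64x64x16 take the FMA fallback; 32x32x32 and 64x64x64 can use MFMA.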
Force-pushed from 020d111 to defaa55
lib/Analysis/Utility.cpp (outdated diff)
@@ -149,13 +149,34 @@ bool supportMMA(triton::DotOp op, int version) {
 }

+#ifdef USE_ROCM
+static bool supportMFMAGranularity(int m, int n, int k) {
+  // these limitations are dtype dependent, in future we may relax them
+  int granularityMN = 32;
Should we define these as constants somewhere so we can change them easily if needed?
Right, this should be a constant.
As for the place: I actually cannot think of a better place for these constants.
P.S. If you are worried about the duplication between the C++ and Python code: I want to refactor this and remove the Python part eventually, so there will be only one place with these constants.
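One possible shape for that, sketched for the Python half only (the constant and function names are hypothetical, not from this PR); the C++ side would mirror them until the Python path is removed as mentioned above:

# Hypothetical module-level constants so the granularity lives in one place.
MFMA_GRANULARITY_MN = 32
MFMA_GRANULARITY_K = 32

def mfma_supported_shape(m, n, k) -> bool:
    # All dimensions must be multiples of the supported MFMA tile granularity.
    return (m % MFMA_GRANULARITY_MN == 0 and
            n % MFMA_GRANULARITY_MN == 0 and
            k % MFMA_GRANULARITY_K == 0)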
Force-pushed from 6ffd66a to f0659ee
This PR adds a cast of the output tensor to the requested data type.
Force-pushed from f0659ee to a1e8311
Michael is out; request addressed.
This PR adds: