
[MFMA] Switch between MFMA types #352

Merged

Conversation

@binarman binarman commented Oct 9, 2023

This PR introduces a matrix_instr_nonkdim flag to switch between MFMA 16 and MFMA 32.
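
For context, here is a minimal usage sketch, assuming matrix_instr_nonkdim is passed at launch time next to the existing num_warps and waves_per_eu options; the kernel below and its sizes are illustrative placeholders, not code from this PR:

import torch
import triton
import triton.language as tl

@triton.jit
def small_matmul(a_ptr, b_ptr, c_ptr,
                 M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    # Single-tile matmul: small enough that no K-loop is needed.
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], tl.dot(a, b))

# matrix_instr_nonkdim=16 requests MFMA 16x16, 32 requests MFMA 32x32,
# and 0 (the default) lets the compiler choose.
a = torch.randn(64, 64, device='cuda', dtype=torch.float16)
b = torch.randn(64, 64, device='cuda', dtype=torch.float16)
c = torch.empty(64, 64, device='cuda', dtype=torch.float32)
small_matmul[(1,)](a, b, c, M=64, N=64, K=64,
                   num_warps=4, waves_per_eu=2, matrix_instr_nonkdim=16)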

binarman commented Oct 9, 2023

We need to merge #251 first.

@binarman binarman force-pushed the mfma16_support_kernel_parameter branch from 15c204d to 1b3f6a0 Compare October 10, 2023 16:36
@alefimov-amd alefimov-amd marked this pull request as ready for review October 10, 2023 16:38
@@ -0,0 +1,251 @@
#include "mlir/IR/TypeUtilities.h"

binarman (Author):
@zhanglx13

I've separated this code from the common AccelerateMatmul pass so I can add an additional option to it.

Do you think it is okay to do this in this PR, or would it be better to split it out?

@binarman binarman changed the title [WIP] Mfma16 support kernel parameter [MFMA] Switch between MFMA types Oct 10, 2023
@@ -309,14 +310,15 @@ def make_hash(fn, arch, env_vars, **kwargs):
num_ctas = kwargs.get("num_ctas", 1)
num_stages = kwargs.get("num_stages", 3)
waves_per_eu = kwargs.get("waves_per_eu", 0)
matrix_instr_nonkdim = kwargs.get("matrix_instr_nonkdim", 0);


@alefimov-amd @oplavsic After dealing with tuning parameters for a while, I'm wondering why we need to add new tuning parameters explicitly instead of treating them as constants.


The only benefit of adding them explicitly is that we can still tune them even when they are not explicitly defined as kernel arguments.

binarman (Author):

Do we have an example of such use?

Maybe I did not understand your idea correctly, but I feel that this could be more error-prone.

P.S. I also feel that adding tons of parameters is not the best approach, and we probably need to find a more elegant way to add them.


When I added pre_load_v as a tuning parameter, I just added it to the autotuner config and as a tl.constexpr kernel argument. Nothing changed in compiler.py, and it is treated like BLOCK_M rather than num_warps.

It seems there are two kinds of kernel arguments: meta-parameters like BLOCK_M, and compilation options like num_warps, according to the explanation here: https://github.com/ROCmSoftwarePlatform/triton/blob/461d72e5477d1659dc05e10060db4db3907c958f/python/tutorials/03-matrix-multiplication.py#L162
The only difference between the two is whether we can set default values for them. For meta-parameters, if nothing is set, there will be an error like missing 1 required positional argument: 'pre_load_v'.

P.S. I think both kinds are compilation options, since the kernel needs to be recompiled whenever the value changes.
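
A minimal sketch of this approach, assuming the knob is exposed purely as a tl.constexpr argument and swept by the autotuner; the kernel body, config values, and key below are illustrative:

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 64, 'pre_load_v': 0}, num_warps=4),
        triton.Config({'BLOCK_M': 64, 'pre_load_v': 1}, num_warps=4),
        triton.Config({'BLOCK_M': 128, 'pre_load_v': 1}, num_warps=8),
    ],
    key=['N_CTX'],
)
@triton.jit
def attn_like_kernel(q_ptr, k_ptr, v_ptr, o_ptr, N_CTX,
                     BLOCK_M: tl.constexpr, pre_load_v: tl.constexpr):
    # pre_load_v is specialized at compile time, exactly like BLOCK_M:
    # each value the autotuner tries produces a separately compiled kernel,
    # and compiler.py needs no changes.
    if pre_load_v:
        pass  # pre-load the V tile before the inner loop
    else:
        pass  # load the V tile inside the inner loop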

binarman (Author):

This is an interesting approach; it is definitely worth trying.

I have only one concern about it.
The user has to declare this constant themselves, and if they make a mistake, the mistake will not be reported.

For example, if we use a MATRIX_INSTR_NONKDIM constant to control MFMA behavior, the user can write this code (note the misspelled name):

@triton.jit
def kernel(MTRIX_INSTR_NONKDIM: tl.constexpr):
    ...

kernel[grid](MTRIX_INSTR_NONKDIM = 16)

This code is correct in terms of the language, but it does not do what we want, and the mistake is not reported.

binarman (Author):

I am thinking that maybe we can introduce an additional decorator to pass AMD-specific options to a kernel without messing with upstream interfaces.
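
A rough sketch of that decorator idea, assuming the AMD-specific options stay ordinary launch-time keyword arguments as in this PR; the wrapper and its names are hypothetical, not an existing Triton API:

import functools

# Hypothetical whitelist of AMD-specific launch options.
_AMD_OPTIONS = {'waves_per_eu', 'matrix_instr_nonkdim'}

def amd_options(**opts):
    # Validate eagerly, so a misspelled option name is reported up front
    # instead of being silently ignored.
    unknown = set(opts) - _AMD_OPTIONS
    if unknown:
        raise ValueError(f'unknown AMD option(s): {sorted(unknown)}')

    def decorator(jit_fn):
        class _Launcher:
            def __getitem__(self, grid):
                # Forward the stored options as regular launch kwargs.
                return functools.partial(jit_fn[grid], **opts)
        return _Launcher()

    return decorator

# Usage sketch:
#
# @amd_options(matrix_instr_nonkdim=16, waves_per_eu=2)
# @triton.jit
# def kernel(x_ptr, BLOCK: tl.constexpr):
#     ...
#
# kernel[grid](x, BLOCK=128)  # AMD options injected automatically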


I see your point.
MATRIX_INSTR_NONKDIM and waves_per_eu are needed explicitly in the lowering passes. However, pre_load_v and BLOCK_M are only needed in the Python-level frontend, so the compile() function doesn't care about them.


Upstream has already added a lot of Hopper-specific parameters to the list, which is far from clean.
I agree that we should have a "bag" for AMD options, and we should also suggest that upstream put all these NVIDIA parameters into another bag.

@binarman binarman force-pushed the mfma16_support_kernel_parameter branch from 1b3f6a0 to f43b54e Compare October 12, 2023 18:30
@binarman binarman force-pushed the mfma16_support_kernel_parameter branch from f43b54e to fcdb690 Compare October 16, 2023 20:49
Option<"matrixCoreVersion", "matrix-core-version",
"int32_t", /*default*/"0",
"device matrix core version">,
Option<"matrixInstructionSize", "matrix-instructio-size",


typo

"device matrix core version">,
Option<"matrixInstructionSize", "matrix-instructio-size",
"int32_t", /*default*/"0",
"enforce matrix intrucion MN size">


typo

// layout are 32 apart: [[0 0 0 0 32 32 32 32 ...] [1 1 1 1 33 33 33 33
// ...] ...]. for mfma 16x16 adjacent threads in y dimension in
// transposed MFMA layout are 16 apart: [[0 0 0 0 16 16 16 16 32 32 32
// 32 ...] [1 1 1 1 33 33 33 33 ...] ...].

Is it possible to get the waveSize from the gpu dialect or mfma layout?

binarman (Author):

Unfortunately, no.
However, the MFMA layout appears in the IR only if the target is a CDNA architecture, which only supports a wavefront size of 64.

I think it should be safe to use a constant here.
In my opinion, we should report an MFMA layout on a non-CDNA GPU as an error.


If you really want, it is possible to infer waveSize from mfmaLayout by computing the product of mfmaLayout.threadsPerWarp, but that is a little "ugly" in my opinion.


@zhanglx13 zhanglx13 left a comment


LGTM

Some notes:

@binarman binarman force-pushed the mfma16_support_kernel_parameter branch from fcdb690 to c0a0664 Compare October 18, 2023 13:11
@alefimov-amd alefimov-amd merged commit 20f316b into ROCm:triton-mlir Oct 18, 2023
2 checks passed
scxiao pushed a commit that referenced this pull request Oct 20, 2023
This PR introduces matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.