-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304
base: master
Are you sure you want to change the base?
[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304
Conversation
a5b829d
to
6ca4f1b
Compare
982e2c2
to
6e05cb1
Compare
@a-sidorova, Could you please review as well, as you are reviewing #28229. The test cases passed on arm for snippets MatMul. Thank you! |
# define SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64(PASS_PLACE, PASS, ...) \ | ||
backend_passes.emplace_back(PassPosition(PASS_PLACE), std::make_shared<PASS>(__VA_ARGS__)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to improve readability and since we don't use SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64
yet, we can lave only SNIPPETS_REGISTER_PASS_RELATIVE_ARM64
definition here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64
is removed, thanks Alexandra!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate why we need to have the separate aarch64-specific pass BrgemmTPPBlocking
? This pass already exists and used on x64. Can we use one pass for the both arhs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the block numbers could be different on aarch64 and x64 to add separate pass. It seems the pass implementation could be the same, we should just update where(such as here) the block size is set if different block is needed. One common BrgemmTPPBlocking pass is used now. Thanks Alexandra!
void jit_brgemm_emitter::emit_impl(const std::vector<size_t>& in, const std::vector<size_t>& out) const { | ||
validate_arguments(in, out); | ||
std::unordered_set<size_t> exclude = {}; | ||
store_context(exclude); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that we will merge #27391 soon. This PR efficently provides efficient work with reg spills - we will able to spill only needed (live) registers
Just for information and to align with other our activities 😊
if (ENABLE_SNIPPETS_LIBXSMM_TPP) | ||
ov_add_compiler_flags(-Wno-missing-declarations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate why you need to add this flag? Can we avoid it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to suppress "warning as error" in compile libxsmm. Otherwise there are compile error such as "intel_cpu/thirdparty/libxsmm/src/generator_common_aarch64.c:60:6: error: no previous declaration for ‘libxsmm_generator_vcvt_f32i8_aarch64_sve’ [-Werror=missing-declarations]"
@@ -52,7 +52,7 @@ ov_dependent_option (ENABLE_GPU_DEBUG_CAPS "enable GPU debug capabilities at run | |||
ov_dependent_option (ENABLE_CPU_DEBUG_CAPS "enable CPU debug capabilities at runtime" ON "ENABLE_DEBUG_CAPS;ENABLE_INTEL_CPU" OFF) | |||
ov_dependent_option (ENABLE_SNIPPETS_DEBUG_CAPS "enable Snippets debug capabilities at runtime" ON "ENABLE_DEBUG_CAPS" OFF) | |||
|
|||
ov_dependent_option (ENABLE_SNIPPETS_LIBXSMM_TPP "allow Snippets to use LIBXSMM Tensor Processing Primitives" OFF "ENABLE_INTEL_CPU AND X86_64" OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also RISCV64, AArch32 etc. Can we add only supported archs to condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ENABLE_INTEL_CPU AND (X86_64 OR AARCH64)" is set to TPP condition. Thanks Alexandra!
#ifndef OPENVINO_ARCH_X86_64 | ||
config.update(DIM_CAST(M), DIM_CAST(N), DIM_CAST(K), 0, 0, 0, beta); | ||
return; | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say that there should be common cross-arch base class with method init_runtime_params(M,N,K,LDA,LDB,LDC)
.
Then x64 dnnl executors update LDB
and LDA
if needed. aarch64 (tpp) should call update_config(...)
with these parameters.
If we use ifdef
in common code, I think this is a sign of problematic code and we should resolve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base class is created in src/plugins/intel_cpu/src/emitters/snippets/brgemm_base.hpp
which has M, N, K, beta, LDA, LDB and LDC that are common for TPP and onednn. TPP and onednn have 1. their own static params and hash as different type and params needed. 2) have own LDA, LDB and LDC update method. Thanks Alexandra!
size_t BrgemmKernelConfig::StaticParams::compute_hash(dnnl::impl::cpu::aarch64::cpu_isa_t aarch_isa) { | ||
return hash_combine(0, aarch_isa); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use aarch_isa
in this config/kernel/executor. Then this attribute should be missed in params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aarch_isa
is removed, Thanks Alexandra!
const auto num_ins = expr->get_node()->get_input_size(); | ||
const auto num_outs = expr->get_node()->get_output_size(); | ||
|
||
size_t io_strides[num_ins + num_outs]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it variable length array? If it is, as far as I know, ISO C++ forbids it. Can we use std::vector
at least here or just 3 size_t variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector
is used, Thanks Alexandra!
auto refreshed_compile_flag = | ||
config.get_beta() == 0 ? config.get_compile_flags() | LIBXSMM_GEMM_FLAG_BETA_0 : compile_flag; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already update compile flags on L121 in update_config
. Then config.get_compile_flags()
will return already updated flags
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right, removed the repeated update
. Thanks Alexandra!
std::shared_ptr<libxsmm_gemmfunction> brgemm_kernel = nullptr; | ||
}; | ||
|
||
class BrgemmKernelExecutor : public BrgemmBaseKernelExecutor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, is there any differences of config/executor/kernel between aarch64?
Is there any chance to have only one kernel executor for x64 and aarc64? As far as I know, all aarch-dependent params are hidden in TPP functions/kernels (this is pro of libxsmm).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, you meant to reuse config/executor/kernel in TPP scope on x64 and aarch64. If yes I have created common tpp BrgemmKernelConfig
and BrgemmKernelExecutor
This could be reused for both x64 and aarch64. Onednn backed kernel have differernt JIT config, could not be the same. I also applied the comment . A super base brgemm config and executor created for tpp and onedenn backend. Thank you! Alexandra!
6e05cb1
to
96e274c
Compare
654635d
to
6b55e68
Compare
6b55e68
to
cebafe9
Compare
cebafe9
to
5a8b56e
Compare
@a-sidorova, your comments are applied, could you please take a look? |
@a-sidorova, @v-Golubev , @IvanNovoselov , I covered #28229 into this PR, as It will depend on this PR. After the refactor, most is reused and easy to cover x64-TPP on this PR. Thanks. |
Details:
Tickets: