[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304

chenhu-wang · 2025-01-08T06:59:48Z

Details:

Snippets MatMul via brgemm emitter and executor on aarch64 with TPP
Snippets MatMul via brgemm emitter and executor on x64 with TPP

Tickets:

CVS-151344

chenhu-wang · 2025-01-15T08:17:21Z

@a-sidorova, Could you please review as well, as you are reviewing #28229. The test cases passed on arm for snippets MatMul. Thank you!

a-sidorova · 2025-01-15T08:55:24Z

src/plugins/intel_cpu/src/nodes/subgraph.cpp

+#    define SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64(PASS_PLACE, PASS, ...) \
+        backend_passes.emplace_back(PassPosition(PASS_PLACE), std::make_shared<PASS>(__VA_ARGS__))


Just to improve readability and since we don't use SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64 yet, we can lave only SNIPPETS_REGISTER_PASS_RELATIVE_ARM64 definition here.

SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64is removed, thanks Alexandra!

a-sidorova · 2025-01-15T09:02:10Z

src/plugins/intel_cpu/src/transformations/tpp/aarch64/pass/lowered/brgemm_tpp_blocking.cpp

Could you please elaborate why we need to have the separate aarch64-specific pass BrgemmTPPBlocking? This pass already exists and used on x64. Can we use one pass for the both arhs?

I thought the block numbers could be different on aarch64 and x64 to add separate pass. It seems the pass implementation could be the same, we should just update where(such as here) the block size is set if different block is needed. One common BrgemmTPPBlocking pass is used now. Thanks Alexandra!

a-sidorova · 2025-01-15T09:08:09Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/jit_brgemm_emitter.cpp

+void jit_brgemm_emitter::emit_impl(const std::vector<size_t>& in, const std::vector<size_t>& out) const {
+    validate_arguments(in, out);
+    std::unordered_set<size_t> exclude = {};
+    store_context(exclude);


Please note that we will merge #27391 soon. This PR efficently provides efficient work with reg spills - we will able to spill only needed (live) registers

Just for information and to align with other our activities 😊

a-sidorova · 2025-01-15T09:18:35Z

src/plugins/intel_cpu/thirdparty/CMakeLists.txt

 if (ENABLE_SNIPPETS_LIBXSMM_TPP)
+    ov_add_compiler_flags(-Wno-missing-declarations)


Could you elaborate why you need to add this flag? Can we avoid it?

This is to suppress "warning as error" in compile libxsmm. Otherwise there are compile error such as "intel_cpu/thirdparty/libxsmm/src/generator_common_aarch64.c:60:6: error: no previous declaration for ‘libxsmm_generator_vcvt_f32i8_aarch64_sve’ [-Werror=missing-declarations]"

a-sidorova · 2025-01-15T09:20:07Z

cmake/features.cmake

@@ -52,7 +52,7 @@ ov_dependent_option (ENABLE_GPU_DEBUG_CAPS "enable GPU debug capabilities at run
 ov_dependent_option (ENABLE_CPU_DEBUG_CAPS "enable CPU debug capabilities at runtime" ON "ENABLE_DEBUG_CAPS;ENABLE_INTEL_CPU" OFF)
 ov_dependent_option (ENABLE_SNIPPETS_DEBUG_CAPS "enable Snippets debug capabilities at runtime" ON "ENABLE_DEBUG_CAPS" OFF)

-ov_dependent_option (ENABLE_SNIPPETS_LIBXSMM_TPP "allow Snippets to use LIBXSMM Tensor Processing Primitives" OFF "ENABLE_INTEL_CPU AND X86_64" OFF)


There is also RISCV64, AArch32 etc. Can we add only supported archs to condition?

"ENABLE_INTEL_CPU AND (X86_64 OR AARCH64)" is set to TPP condition. Thanks Alexandra!

a-sidorova · 2025-01-15T09:26:43Z

src/plugins/intel_cpu/src/emitters/snippets/brgemm_base.cpp

+#ifndef OPENVINO_ARCH_X86_64
+    config.update(DIM_CAST(M), DIM_CAST(N), DIM_CAST(K), 0, 0, 0, beta);
+    return;
+#endif
+


I'd say that there should be common cross-arch base class with method init_runtime_params(M,N,K,LDA,LDB,LDC).
Then x64 dnnl executors update LDB and LDA if needed. aarch64 (tpp) should call update_config(...) with these parameters.

If we use ifdef in common code, I think this is a sign of problematic code and we should resolve it.

The base class is created in src/plugins/intel_cpu/src/emitters/snippets/brgemm_base.hpp which has M, N, K, beta, LDA, LDB and LDC that are common for TPP and onednn. TPP and onednn have 1. their own static params and hash as different type and params needed. 2) have own LDA, LDB and LDC update method. Thanks Alexandra!

a-sidorova · 2025-01-15T09:27:49Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.cpp

+size_t BrgemmKernelConfig::StaticParams::compute_hash(dnnl::impl::cpu::aarch64::cpu_isa_t aarch_isa) {
+    return hash_combine(0, aarch_isa);
+}


We don't use aarch_isa in this config/kernel/executor. Then this attribute should be missed in params

aarch_isa is removed, Thanks Alexandra!

a-sidorova · 2025-01-15T09:37:59Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.cpp

+    const auto num_ins = expr->get_node()->get_input_size();
+    const auto num_outs = expr->get_node()->get_output_size();
+
+    size_t io_strides[num_ins + num_outs];


Is it variable length array? If it is, as far as I know, ISO C++ forbids it. Can we use std::vector at least here or just 3 size_t variables?

std::vector is used, Thanks Alexandra!

a-sidorova · 2025-01-15T09:43:49Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.cpp

+    auto refreshed_compile_flag =
+        config.get_beta() == 0 ? config.get_compile_flags() | LIBXSMM_GEMM_FLAG_BETA_0 : compile_flag;


You already update compile flags on L121 in update_config. Then config.get_compile_flags() will return already updated flags

Yes, you are right, removed the repeated update. Thanks Alexandra!

a-sidorova · 2025-01-15T09:47:40Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.hpp

+    std::shared_ptr<libxsmm_gemmfunction> brgemm_kernel = nullptr;
+};
+
+class BrgemmKernelExecutor : public BrgemmBaseKernelExecutor,


By the way, is there any differences of config/executor/kernel between aarch64?
Is there any chance to have only one kernel executor for x64 and aarc64? As far as I know, all aarch-dependent params are hidden in TPP functions/kernels (this is pro of libxsmm).

If I understand correctly, you meant to reuse config/executor/kernel in TPP scope on x64 and aarch64. If yes I have created common tpp BrgemmKernelConfig and BrgemmKernelExecutor This could be reused for both x64 and aarch64. Onednn backed kernel have differernt JIT config, could not be the same. I also applied the comment . A super base brgemm config and executor created for tpp and onedenn backend. Thank you! Alexandra!

chenhu-wang · 2025-01-26T09:02:54Z

@a-sidorova, your comments are applied, could you please take a look?

chenhu-wang · 2025-01-27T08:05:08Z

@a-sidorova, @v-Golubev , @IvanNovoselov , I covered #28229 into this PR, as It will depend on this PR. After the refactor, most is reused and easy to cover x64-TPP on this PR. Thanks.

chenhu-wang requested review from a team as code owners January 8, 2025 06:59

github-actions bot added the category: CPU OpenVINO CPU plugin label Jan 8, 2025

chenhu-wang marked this pull request as draft January 8, 2025 07:28

github-actions bot added the category: build OpenVINO cmake script / infra label Jan 9, 2025

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 3 times, most recently from a5b829d to 6ca4f1b Compare January 9, 2025 08:09

chenhu-wang marked this pull request as ready for review January 13, 2025 06:37

chenhu-wang requested a review from a team as a code owner January 13, 2025 06:37

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 15 times, most recently from 982e2c2 to 6e05cb1 Compare January 15, 2025 05:33

a-sidorova reviewed Jan 15, 2025

View reviewed changes

a-sidorova self-assigned this Jan 15, 2025

v-Golubev self-assigned this Jan 16, 2025

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from 6e05cb1 to 96e274c Compare January 22, 2025 06:36

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 3 times, most recently from 654635d to 6b55e68 Compare January 26, 2025 04:19

chenhu-wang added 5 commits January 26, 2025 13:25

brgemm emitter and executor

f5399c1

executor cache

db698f3

update arm passes, test enable

84199b6

update cmake

abd6f3d

refactor tpp on x64 and aarch64

9a52cf2

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from 6b55e68 to cebafe9 Compare January 26, 2025 05:29

apply Alexandra comments

5a8b56e

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from cebafe9 to 5a8b56e Compare January 26, 2025 08:23

Matmul with TPP on x64

668504d

chenhu-wang mentioned this pull request Jan 27, 2025

[CPU] Matmul emitters with TPP backend executor #28229

Open

chenhu-wang changed the title ~~[CPU][ARM]Snippets MatMul via brgemm emitter and executor~~ [CPU][ARM][X64]Snippets MatMul via brgemm emitter and executor Jan 27, 2025

chenhu-wang changed the title ~~[CPU][ARM][X64]Snippets MatMul via brgemm emitter and executor~~ [CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor Jan 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304

[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304

chenhu-wang commented Jan 8, 2025 •

edited

Loading

chenhu-wang commented Jan 15, 2025 •

edited

Loading

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025 •

edited

Loading

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025

a-sidorova Jan 15, 2025

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025 •

edited

Loading

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025 •

edited

Loading

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025

a-sidorova Jan 15, 2025

chenhu-wang Jan 26, 2025

chenhu-wang commented Jan 26, 2025

chenhu-wang commented Jan 27, 2025

		# define SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64(PASS_PLACE, PASS, ...) \
		backend_passes.emplace_back(PassPosition(PASS_PLACE), std::make_shared<PASS>(__VA_ARGS__))

		if (ENABLE_SNIPPETS_LIBXSMM_TPP)
		ov_add_compiler_flags(-Wno-missing-declarations)

		auto refreshed_compile_flag =
		config.get_beta() == 0 ? config.get_compile_flags() \| LIBXSMM_GEMM_FLAG_BETA_0 : compile_flag;

[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304

Are you sure you want to change the base?

[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304

Conversation

chenhu-wang commented Jan 8, 2025 • edited Loading

Details:

Tickets:

chenhu-wang commented Jan 15, 2025 • edited Loading

Choose a reason for hiding this comment

chenhu-wang Jan 26, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenhu-wang Jan 26, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenhu-wang Jan 26, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenhu-wang commented Jan 26, 2025

chenhu-wang commented Jan 27, 2025

chenhu-wang commented Jan 8, 2025 •

edited

Loading

chenhu-wang commented Jan 15, 2025 •

edited

Loading

chenhu-wang Jan 26, 2025 •

edited

Loading

chenhu-wang Jan 26, 2025 •

edited

Loading

chenhu-wang Jan 26, 2025 •

edited

Loading