
optimize thread local cache for brgemm #353

Merged: 14 commits merged into main on Oct 9, 2024
Conversation

@crazydemo (Contributor)

This PR uses std::vector for the thread-local cache.
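For context, a minimal sketch of the kind of thread-local std::vector cache this PR describes. The entry layout (desc/kernel/palette_buffer) and the refresh-on-miss logic are assumptions for illustration, not the exact code in this PR:

```cpp
#include <cstdint>
#include <vector>

// Assumed entry layout for illustration; the real brgemm_cache_info_t lives
// in the runtime sources.
struct brgemm_cache_info_t {
  void *desc = nullptr;           // dispatched descriptor (stored by pointer)
  void *kernel = nullptr;         // compiled brgemm kernel handle
  char *palette_buffer = nullptr; // AMX palette configuration buffer
};

// Global cache filled at dispatch time (synchronization omitted in this sketch).
static std::vector<brgemm_cache_info_t> g_cache;

// Hot path: each thread keeps a private copy indexed by kernel id, so executing
// an already-cached kernel never takes the global read lock.
inline brgemm_cache_info_t &get_tl_entry(int64_t kernel_idx) {
  thread_local std::vector<brgemm_cache_info_t> tl_cache;
  if (kernel_idx >= static_cast<int64_t>(tl_cache.size()))
    tl_cache = g_cache; // miss: refresh the local copy; assumes a prior dispatch grew g_cache
  return tl_cache[kernel_idx];
}
```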

@crazydemo (Contributor, Author)

This PR brings a slight performance gain.

@ciyongch (Contributor)

How much perf gain does this patch bring?
Does this perf gain also come from changing "desc" to "*desc"? There could be some bubbles in the new vector, since each individual thread might have fewer "kernel_idx" entries than the global cache?

@crazydemo (Contributor, Author)

The perf gain comes partially from desc -> *desc (a 1% gain) and, more significantly, from using std::vector (another 5% gain).

| main | main+ptr | this PR | main / main+ptr | main / this PR | dtype | cmd |
|---|---|---|---|---|---|---|
| 0.103666 | 0.106254 | 0.103906 | 0.975646 | 0.99769 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=16x512x256x128 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.343486 | 0.318651 | 0.31864 | 1.077938 | 1.077972 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=512x1024x1024x512x256 --has_bias=1x1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 2.5479 | 2.518856 | 2.525348 | 1.011531 | 1.00893 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=32 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 7.437604 | 7.396616 | 7.19349 | 1.005541 | 1.033935 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.080631 | 0.079522 | 0.058027 | 1.013946 | 1.389532 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=16x512x256x128 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.171087 | 0.16779 | 0.167892 | 1.019654 | 1.019032 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=512x1024x1024x512x256 --dtype=bf16 --has_bias=1x1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.77902 | 0.788544 | 0.788094 | 0.987921 | 0.988486 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=32 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 2.101495 | 2.119327 | 2.059455 | 0.991586 | 1.020413 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| geomean | | | 1.010062 | 1.060706 | | |

      return tl_cache;
    }
    brgemm_desc_t desc;
Contributor:

We are using a global temporary desc without synchronization?

Contributor (Author):

This is to save the desc object's address, which will be used in `g_cache.push_back(brgemm_cache_info_t{&desc, kernel, palette_buffer});`. Otherwise, the address would become invalid once the dispatch function returns.

@huanghaixin008 (Contributor), Sep 24, 2024:

But now every &desc points to the same object, and brgemm_desc_init (which modifies the global desc) has no synchronization at all?
If we really need to use a pointer to desc, I think we can set a maximum number of dispatched kernels and reserve a desc vector pool of that size to avoid vector reallocation; then we can safely point to the vector elements (a sketch of this idea follows).
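A minimal sketch of this reserve-a-pool idea, purely illustrative: the capacity constant, the mutex, and the empty brgemm_desc_t stand-in are assumptions, not code from this PR.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct brgemm_desc_t { /* real descriptor fields elided */ };

// Hypothetical cap on the number of dispatched kernels.
constexpr std::size_t MAX_DISPATCHED_KERNELS = 1024;

static std::vector<brgemm_desc_t> g_desc_pool;
static std::mutex g_desc_pool_lock;

// Capacity is reserved up front and never exceeded, so push_back never
// reallocates and pointers to earlier elements stay valid for the cache.
brgemm_desc_t *alloc_desc_slot() {
  std::lock_guard<std::mutex> guard(g_desc_pool_lock);
  if (g_desc_pool.capacity() < MAX_DISPATCHED_KERNELS)
    g_desc_pool.reserve(MAX_DISPATCHED_KERNELS);
  g_desc_pool.emplace_back();
  return &g_desc_pool.back();
}
```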

    @@ -77,35 +77,36 @@ int64_t dnnl_brgemm_dispatch(int64_t M, int64_t N, int64_t K, int64_t LDA,
                                 int64_t LDB, int64_t LDC, int64_t stride_a,
                                 int64_t stride_b, float beta, int64_t dtypeA,
                                 int64_t dtypeB) {
    -  brgemm_desc_t desc;
    +  std::shared_ptr<brgemm_desc_t> desc_ptr = std::make_shared<brgemm_desc_t>();
Contributor (Author):

@huanghaixin008 how about this way: store the desc in a shared_ptr?

Contributor:

It's OK, but we need to change the desc pointer in brgemm_cache_info_t to a shared_ptr as well; otherwise the pointer would be released after the dispatch function returns.
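A hedged sketch of the shared_ptr variant being discussed here: the cache entry owns the descriptor, so nothing dangles once dispatch returns. The stand-in struct bodies and the field names besides the desc pointer are assumptions.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct brgemm_desc_t { /* real descriptor fields elided */ };

struct brgemm_cache_info_t {
  std::shared_ptr<brgemm_desc_t> desc; // shared ownership instead of a raw &desc
  void *kernel;                        // assumed kernel handle
  char *palette_buffer;                // assumed AMX palette buffer
};

static std::vector<brgemm_cache_info_t> g_cache;

int64_t dispatch_sketch() {
  auto desc_ptr = std::make_shared<brgemm_desc_t>();
  // ... brgemm_desc_init(desc_ptr.get(), ...) would run here in the real code ...
  g_cache.push_back(brgemm_cache_info_t{desc_ptr, nullptr, nullptr});
  // The descriptor now lives as long as the cache entry, not the stack frame.
  return static_cast<int64_t>(g_cache.size()) - 1;
}
```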

@Menooker

It occurs to me that we might not need any locking at run time... Maybe we can fall back to the original solution. Here is my justification:

- The kernel pool is append-only. Once a kernel is added to the array, it will not be modified or deleted.
- We can use std::deque instead of vector. It will not reallocate the array when resizing, so it is safe to push_back a new desc even while other threads are reading.
- We only need a lock for appending to (writing) the pool. Reading the pool does not really need a lock (see the sketch after this list).
- We can fall back to the shared unique std::queue<desc> solution. It does not require thread-local storage (TLS is fast, but still consumes some time), and there is no need for an indirect pointer or shared_ptr.
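A rough sketch of the scheme described above, assuming entries are append-only and readers only touch indices that have already been published to them; the entry struct and lock names are placeholders rather than code from this repository.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>

// Placeholder entry layout, as elsewhere in this discussion.
struct brgemm_cache_info_t {
  void *desc;
  void *kernel;
  char *palette_buffer;
};

// std::deque keeps existing elements at stable addresses while it grows,
// which is what the lock-free read side relies on here.
static std::deque<brgemm_cache_info_t> g_pool;
static std::mutex g_append_lock;

// Write side (dispatch): the only place that takes the lock.
int64_t append_entry(const brgemm_cache_info_t &info) {
  std::lock_guard<std::mutex> guard(g_append_lock);
  g_pool.push_back(info);
  return static_cast<int64_t>(g_pool.size()) - 1;
}

// Read side (execute): no lock; only valid for ids returned by append_entry.
const brgemm_cache_info_t &read_entry(int64_t idx) { return g_pool[idx]; }
```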

@crazydemo (Contributor, Author)

I checked the performance of the current thread-local solution against the deque solution. The comparison is below:
thread_local has a 2.4% perf gain on the following cases.

| this PR | deque | this PR / deque | dtype | cmd |
|---|---|---|---|---|
| 0.102334 | 0.105171 | 0.973028 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=16x512x256x128 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.289466 | 0.346502 | 0.835395 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=512x1024x1024x512x256 --has_bias=1x1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 2.52066 | 2.520477 | 1.000073 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=32 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 7.481903 | 7.219393 | 1.036362 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.081189 | 0.080492 | 1.00866 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=16x512x256x128 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.166998 | 0.169265 | 0.986609 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=512x1024x1024x512x256 --dtype=bf16 --has_bias=1x1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.775384 | 0.787742 | 0.984313 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=32 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 2.018033 | 2.014215 | 1.001896 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| geomean | | 0.976508 | | |

@crazydemo (Contributor, Author)

With @zhczhong's help, we got the following updated data on bf16, showing that the deque and thread-local solutions have similar performance.

| data type | bs | deque | thread local | deque / thread local |
|---|---|---|---|---|
| bf16 | 128 | 0.005458 | 0.005775 | 0.945164 |
| bf16 | 128 | 0.010434 | 0.010398 | 1.003462 |
| bf16 | 128 | 0.006627 | 0.006268 | 1.057243 |
| bf16 | 128 | 0.018755 | 0.0182 | 1.030526 |
| bf16 | 128 | 0.013196 | 0.013407 | 0.984251 |
| bf16 | 128 | 0.01079 | 0.0109 | 0.989908 |
| bf16 | 128 | 0.009203 | 0.009742 | 0.944711 |
| geomean | | | | 0.992869 |

@crazydemo (Contributor, Author)

Using a static vector with a default size as the global cache to store the brgemm kernels gives the best performance.

    };

    -static std::vector<brgemm_cache_info_t> g_cache;
    +static std::vector<brgemm_cache_info_t> g_cache(DEFAULT_KERNEL_SIZE);
     static int64_t kernel_id = -1;
@huanghaixin008 (Contributor), Sep 25, 2024:

Maybe change the name to g_kernel_id? This variable has the same name as some function parameters and could cause confusion.

Contributor (Author):

Thanks for the comment, fixed.


    -  if (g_kernel_id >= DEFAULT_KERNEL_SIZE) {
    +  if (g_kernel_id >= (int64_t)g_cache.size()) {
    +    g_cache.resize(g_kernel_id + 1);
Contributor:

This should probably have some constraints and an eviction policy.

Contributor (Author):

Thanks for the advice. A constraint has been added. I believe an eviction policy is not necessary in this case: in the vast majority of scenarios, the number of cached kernels does not exceed 1024 entries.
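Putting the merged pieces together, a hedged sketch of the dispatch-side registration path: g_cache, g_kernel_id, DEFAULT_KERNEL_SIZE, and the resize check appear in the diffs above, while the constant values, MAX_CACHE_SIZE, the lock, and the entry layout are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

struct brgemm_cache_info_t {
  void *desc;
  void *kernel;
  char *palette_buffer;
};

static constexpr int64_t DEFAULT_KERNEL_SIZE = 1024; // value assumed for the sketch
static constexpr int64_t MAX_CACHE_SIZE = 8192;      // hypothetical hard constraint
static std::vector<brgemm_cache_info_t> g_cache(DEFAULT_KERNEL_SIZE);
static int64_t g_kernel_id = -1;
static std::mutex g_dispatch_lock;

// Dispatch-side slow path: grow the cache only when the pre-sized vector is
// exhausted; the hot execute path just indexes g_cache by kernel id.
int64_t register_kernel(const brgemm_cache_info_t &info) {
  std::lock_guard<std::mutex> guard(g_dispatch_lock);
  ++g_kernel_id;
  assert(g_kernel_id < MAX_CACHE_SIZE && "too many dispatched brgemm kernels");
  if (g_kernel_id >= static_cast<int64_t>(g_cache.size()))
    g_cache.resize(g_kernel_id + 1); // rare: past DEFAULT_KERNEL_SIZE
  g_cache[g_kernel_id] = info;
  return g_kernel_id;
}
```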

@crazydemo merged commit d794dc7 into main on Oct 9, 2024 (6 checks passed).
@crazydemo deleted the zhangyan/fix_perf branch on October 9, 2024 at 02:57.
Successfully merging this pull request may close these issues.

Performance regression caused by read lock in brgemm