
optimize thread local cache for brgemm #353

Merged: 14 commits merged into main on Oct 9, 2024
Conversation

@crazydemo (Contributor)

This PR uses std::vector for the thread-local cache.
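For context, a minimal sketch of the kind of thread-local std::vector cache this PR describes. The entry layout (desc/kernel/palette_buffer) and the refresh-on-miss logic are assumptions for illustration, not the exact code in this PR:

```cpp
#include <cstdint>
#include <vector>

// Assumed entry layout for illustration; the real brgemm_cache_info_t lives
// in the runtime sources.
struct brgemm_cache_info_t {
  void *desc = nullptr;           // dispatched descriptor (stored by pointer)
  void *kernel = nullptr;         // compiled brgemm kernel handle
  char *palette_buffer = nullptr; // AMX palette configuration buffer
};

// Global cache filled at dispatch time (synchronization omitted in this sketch).
static std::vector<brgemm_cache_info_t> g_cache;

// Hot path: each thread keeps a private copy indexed by kernel id, so executing
// an already-cached kernel never takes the global read lock.
inline brgemm_cache_info_t &get_tl_entry(int64_t kernel_idx) {
  thread_local std::vector<brgemm_cache_info_t> tl_cache;
  if (kernel_idx >= static_cast<int64_t>(tl_cache.size()))
    tl_cache = g_cache; // miss: refresh the local copy; assumes a prior dispatch grew g_cache
  return tl_cache[kernel_idx];
}
```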

@crazydemo (Contributor, Author)

This PR brings a slight performance gain.

@ciyongch (Contributor)

How much perf gain does this patch bring?
Does this perf gain also come from changing "desc" to "*desc"? There could be some bubbles in the new vector, since each individual thread might have fewer "kernel_idx" entries than the global cache?

@crazydemo (Contributor, Author)

The perf gain comes partially from desc -> *desc (a 1% gain) and, more significantly, from using std::vector (another 5% gain).

| main | main+ptr | this PR | main / main+ptr | main / this PR | dtype | cmd |
|---|---|---|---|---|---|---|
| 0.103666 | 0.106254 | 0.103906 | 0.975646 | 0.99769 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=16x512x256x128 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.343486 | 0.318651 | 0.31864 | 1.077938 | 1.077972 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=512x1024x1024x512x256 --has_bias=1x1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 2.5479 | 2.518856 | 2.525348 | 1.011531 | 1.00893 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=32 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 7.437604 | 7.396616 | 7.19349 | 1.005541 | 1.033935 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.080631 | 0.079522 | 0.058027 | 1.013946 | 1.389532 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=16x512x256x128 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.171087 | 0.16779 | 0.167892 | 1.019654 | 1.019032 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=512x1024x1024x512x256 --dtype=bf16 --has_bias=1x1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.77902 | 0.788544 | 0.788094 | 0.987921 | 0.988486 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=32 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 2.101495 | 2.119327 | 2.059455 | 0.991586 | 1.020413 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| geomean | | | 1.010062 | 1.060706 | | |

      return tl_cache;
    }
    brgemm_desc_t desc;
Contributor:

We are using a global temporary desc without synchronization?

Contributor (Author):

This is to save the desc object's address, which will be used in `g_cache.push_back(brgemm_cache_info_t{&desc, kernel, palette_buffer});`. Otherwise, the address would become invalid once the dispatch function returns.

@huanghaixin008 (Contributor), Sep 24, 2024:

But now every &desc points to the same object, and brgemm_desc_init (which modifies the global desc) has no synchronization at all?
If we really need to use a pointer to desc, I think we can set a maximum number of dispatched kernels and reserve a desc vector pool of that size to avoid vector reallocation; then we can safely point to the vector elements (a sketch of this idea follows).
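A minimal sketch of this reserve-a-pool idea, purely illustrative: the capacity constant, the mutex, and the empty brgemm_desc_t stand-in are assumptions, not code from this PR.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct brgemm_desc_t { /* real descriptor fields elided */ };

// Hypothetical cap on the number of dispatched kernels.
constexpr std::size_t MAX_DISPATCHED_KERNELS = 1024;

static std::vector<brgemm_desc_t> g_desc_pool;
static std::mutex g_desc_pool_lock;

// Capacity is reserved up front and never exceeded, so push_back never
// reallocates and pointers to earlier elements stay valid for the cache.
brgemm_desc_t *alloc_desc_slot() {
  std::lock_guard<std::mutex> guard(g_desc_pool_lock);
  if (g_desc_pool.capacity() < MAX_DISPATCHED_KERNELS)
    g_desc_pool.reserve(MAX_DISPATCHED_KERNELS);
  g_desc_pool.emplace_back();
  return &g_desc_pool.back();
}
```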

    @@ -77,35 +77,36 @@ int64_t dnnl_brgemm_dispatch(int64_t M, int64_t N, int64_t K, int64_t LDA,
                                 int64_t LDB, int64_t LDC, int64_t stride_a,
                                 int64_t stride_b, float beta, int64_t dtypeA,
                                 int64_t dtypeB) {
    -  brgemm_desc_t desc;
    +  std::shared_ptr<brgemm_desc_t> desc_ptr = std::make_shared<brgemm_desc_t>();
Contributor (Author):

@huanghaixin008 how about this way: store the desc in a shared_ptr?

Contributor:

It's OK, but we need to change the desc pointer in brgemm_cache_info_t to a shared_ptr as well; otherwise the pointer would be released after the dispatch function returns.
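A hedged sketch of the shared_ptr variant being discussed here: the cache entry owns the descriptor, so nothing dangles once dispatch returns. The stand-in struct bodies and the field names besides the desc pointer are assumptions.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct brgemm_desc_t { /* real descriptor fields elided */ };

struct brgemm_cache_info_t {
  std::shared_ptr<brgemm_desc_t> desc; // shared ownership instead of a raw &desc
  void *kernel;                        // assumed kernel handle
  char *palette_buffer;                // assumed AMX palette buffer
};

static std::vector<brgemm_cache_info_t> g_cache;

int64_t dispatch_sketch() {
  auto desc_ptr = std::make_shared<brgemm_desc_t>();
  // ... brgemm_desc_init(desc_ptr.get(), ...) would run here in the real code ...
  g_cache.push_back(brgemm_cache_info_t{desc_ptr, nullptr, nullptr});
  // The descriptor now lives as long as the cache entry, not the stack frame.
  return static_cast<int64_t>(g_cache.size()) - 1;
}
```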

@Menooker

It occurs to me that we might not need any locking at run time... Maybe we can fall back to the original solution. Here is my justification:

- The kernel pool is append-only. Once a kernel is added to the array, it will not be modified or deleted.
- We can use std::deque instead of vector. It will not reallocate the array when resizing, so it is safe to push_back a new desc even while other threads are reading.
- We only need a lock for appending to (writing) the pool. Reading the pool does not really need a lock (see the sketch after this list).
- We can fall back to the shared unique std::queue<desc> solution. It does not require thread-local storage (TLS is fast, but still consumes some time), and there is no need for an indirect pointer or shared_ptr.
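A rough sketch of the scheme described above, assuming entries are append-only and readers only touch indices that have already been published to them; the entry struct and lock names are placeholders rather than code from this repository.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>

// Placeholder entry layout, as elsewhere in this discussion.
struct brgemm_cache_info_t {
  void *desc;
  void *kernel;
  char *palette_buffer;
};

// std::deque keeps existing elements at stable addresses while it grows,
// which is what the lock-free read side relies on here.
static std::deque<brgemm_cache_info_t> g_pool;
static std::mutex g_append_lock;

// Write side (dispatch): the only place that takes the lock.
int64_t append_entry(const brgemm_cache_info_t &info) {
  std::lock_guard<std::mutex> guard(g_append_lock);
  g_pool.push_back(info);
  return static_cast<int64_t>(g_pool.size()) - 1;
}

// Read side (execute): no lock; only valid for ids returned by append_entry.
const brgemm_cache_info_t &read_entry(int64_t idx) { return g_pool[idx]; }
```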

@crazydemo (Contributor, Author)

I checked the performance of the current thread-local solution against the deque solution. The comparison is below:
thread_local has a 2.4% perf gain on the following cases.

| this PR | deque | this PR / deque | dtype | cmd |
|---|---|---|---|---|
| 0.102334 | 0.105171 | 0.973028 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=16x512x256x128 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.289466 | 0.346502 | 0.835395 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=512x1024x1024x512x256 --has_bias=1x1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 2.52066 | 2.520477 | 1.000073 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=32 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 7.481903 | 7.219393 | 1.036362 | f32 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=4096x4096x11008x4096 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.081189 | 0.080492 | 1.00866 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=16x512x256x128 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.166998 | 0.169265 | 0.986609 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=512x1024x1024x512x256 --dtype=bf16 --has_bias=1x1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 0.775384 | 0.787742 | 0.984313 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=32 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| 2.018033 | 2.014215 | 1.001896 | bf16 | OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 python -m benchgc --mode=P --driver=pattern --case mlp --batch_size=128 --hidden_size_list=4096x4096x11008x4096 --dtype=bf16 --has_bias=1x1x1 --act_type=relu --warm_up 50 --repeat 50 |
| geomean | | 0.976508 | | |

@crazydemo (Contributor, Author)

With @zhczhong's help, we got the following updated data on bf16, showing that the deque and thread-local solutions have similar performance.

| data type | bs | deque | thread local | deque / thread local |
|---|---|---|---|---|
| bf16 | 128 | 0.005458 | 0.005775 | 0.945164 |
| bf16 | 128 | 0.010434 | 0.010398 | 1.003462 |
| bf16 | 128 | 0.006627 | 0.006268 | 1.057243 |
| bf16 | 128 | 0.018755 | 0.0182 | 1.030526 |
| bf16 | 128 | 0.013196 | 0.013407 | 0.984251 |
| bf16 | 128 | 0.01079 | 0.0109 | 0.989908 |
| bf16 | 128 | 0.009203 | 0.009742 | 0.944711 |
| geomean | | | | 0.992869 |

@crazydemo (Contributor, Author)

Using a static vector with a default size as the global cache to store the brgemm kernels gives the best performance.

    };

    -static std::vector<brgemm_cache_info_t> g_cache;
    +static std::vector<brgemm_cache_info_t> g_cache(DEFAULT_KERNEL_SIZE);
     static int64_t kernel_id = -1;
@huanghaixin008 (Contributor), Sep 25, 2024:

Maybe change the name to g_kernel_id? This variable has the same name as some function parameters and could cause confusion.

Contributor (Author):

Thanks for the comment, fixed.


    -  if (g_kernel_id >= DEFAULT_KERNEL_SIZE) {
    +  if (g_kernel_id >= (int64_t)g_cache.size()) {
    +    g_cache.resize(g_kernel_id + 1);
Contributor:

This should probably have some constraints and an eviction policy.

Contributor (Author):

Thanks for the advice. A constraint has been added. I believe an eviction policy is not necessary in this case: in the vast majority of scenarios, the number of cached kernels does not exceed 1024 entries.
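Putting the merged pieces together, a hedged sketch of the dispatch-side registration path: g_cache, g_kernel_id, DEFAULT_KERNEL_SIZE, and the resize check appear in the diffs above, while the constant values, MAX_CACHE_SIZE, the lock, and the entry layout are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

struct brgemm_cache_info_t {
  void *desc;
  void *kernel;
  char *palette_buffer;
};

static constexpr int64_t DEFAULT_KERNEL_SIZE = 1024; // value assumed for the sketch
static constexpr int64_t MAX_CACHE_SIZE = 8192;      // hypothetical hard constraint
static std::vector<brgemm_cache_info_t> g_cache(DEFAULT_KERNEL_SIZE);
static int64_t g_kernel_id = -1;
static std::mutex g_dispatch_lock;

// Dispatch-side slow path: grow the cache only when the pre-sized vector is
// exhausted; the hot execute path just indexes g_cache by kernel id.
int64_t register_kernel(const brgemm_cache_info_t &info) {
  std::lock_guard<std::mutex> guard(g_dispatch_lock);
  ++g_kernel_id;
  assert(g_kernel_id < MAX_CACHE_SIZE && "too many dispatched brgemm kernels");
  if (g_kernel_id >= static_cast<int64_t>(g_cache.size()))
    g_cache.resize(g_kernel_id + 1); // rare: past DEFAULT_KERNEL_SIZE
  g_cache[g_kernel_id] = info;
  return g_kernel_id;
}
```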

@crazydemo merged commit d794dc7 into main on Oct 9, 2024 (6 checks passed).
@crazydemo deleted the zhangyan/fix_perf branch on October 9, 2024 at 02:57.
Successfully merging this pull request may close these issues.

Performance regression caused by read lock in brgemm