optimize thread local cache for brgemm #353
Conversation
This PR brings a slight performance gain.
How much perf gain from this patch?
The perf gain partially comes from
  return tl_cache;
}
brgemm_desc_t desc;
We are using a global temporary desc without synchronization?
This is for saving the desc object's address, which will be used in g_cache.push_back(brgemm_cache_info_t{&desc, kernel, palette_buffer}). Otherwise, the address would be invalid when the dispatch function returns.
But now every &desc points to the same object, and brgemm_desc_init (which modifies the global desc) has no synchronization at all?
If we really need to use a pointer to desc, I think we can set a maximal number of dispatched kernels and reserve a desc vector pool of that size to avoid vector resizing; then we can use pointers to the vector elements.
@@ -77,35 +77,36 @@ int64_t dnnl_brgemm_dispatch(int64_t M, int64_t N, int64_t K, int64_t LDA,
                              int64_t LDB, int64_t LDC, int64_t stride_a,
                              int64_t stride_b, float beta, int64_t dtypeA,
                              int64_t dtypeB) {
-  brgemm_desc_t desc;
+  std::shared_ptr<brgemm_desc_t> desc_ptr = std::make_shared<brgemm_desc_t>();
@huanghaixin008 how about this way: store the desc in a shared_ptr?
It's OK, but we need to change the desc pointer in brgemm_cache_info_t to a shared_ptr as well, or the pointer would be released after the dispatch function returns.
It occurs to me that we might not need any locking at run time... Maybe we can fall back to the original solution. Here is my justification:
checked the current
Force-pushed from 76346ac to 0fcf3e9
With @zhczhong's help, we got the following updated data on bf16, showing
Using a static vector with a default size as the global cache to store brgemm kernels got the best performance.
Force-pushed from ecada2c to 866e332
Force-pushed from 866e332 to 5c04e70
 };

-static std::vector<brgemm_cache_info_t> g_cache;
+static std::vector<brgemm_cache_info_t> g_cache(DEFAULT_KERNEL_SIZE);
 static int64_t kernel_id = -1;
maybe change the name to g_kernel_id? This variable has the same name as some function parameters and could cause confusion.
Thanks for the comment, fixed.
-  if (g_kernel_id >= DEFAULT_KERNEL_SIZE) {
+  if (g_kernel_id >= (int64_t)g_cache.size()) {
+    g_cache.resize(g_kernel_id + 1);
This should probably have some constraints and an eviction policy.
Thanks for the advice. Added a constraint. I believe an eviction policy is not necessary in this case; in the vast majority of scenarios, the number of kernels does not exceed 1024.
This PR uses std::vector for the thread-local cache.