[Snippets] SplitDimensionM: heuristic update #28180
Conversation
Force-pushed from 205e16c to 78a43ee
@a-sidorova @IvanNovoselov could you please take a look? Thanks in advance
src/common/snippets/include/snippets/pass/split_dimension_m.hpp (outdated diff)
```cpp
    return splited;
}

std::pair<size_t, size_t> SplitDimensionM::compute_aggressive_heuristic(size_t batch_dim, size_t m_dim, size_t optimal_parallelism_work_amount) {
    constexpr size_t min_kernel_m = 32;
```
In the second case in `compute_ideal_cases_heuristic`, `min_kernel_m` is 64, while here it is 32. What about always using 64 and defining it as a static const attribute of the class? Or is there a difference between the heuristics, so that we really need a smaller `min_kernel_m` in the aggressive one?
I agree that it's better to have a single `min_kernel_m` value, but I think it should be 32, not 64. The value 64 was set empirically to avoid cases where the external repacking feature doesn't work and the overhead of repacking duplication inside the kernel outweighs the benefit of the splitting. If external repacking works (and it seems it will work in all cases after the tokenization adjustments), we can safely lower `min_kernel_m` for `compute_ideal_cases_heuristic`.
I applied your suggestion: `min_kernel_m` is now a static class member equal to 32.
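For illustration, a minimal compilable sketch of what such a class-level constant could look like — the class shape here is hypothetical; the real declaration lives in split_dimension_m.hpp:

```cpp
#include <cstddef>

// Hypothetical sketch of the shared constant: one class-level value instead of
// per-heuristic local constants, so all heuristics agree on the minimal kernel M.
class SplitDimensionM {
public:
    // Smallest kernel-side M block the splitting heuristics are allowed to produce.
    static constexpr std::size_t min_kernel_m = 32;
};
```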
```cpp
// If the M dim is big enough, the aggressive heuristic is used to minimize kernel_m.
// For a smaller M dim, the conservative heuristic is used to preserve the old behaviour.
const bool big_m_dim = m_dim >= 4000;
```
By the way, we could also support the case with a small `M`. If `batch < optimal_parallelism_work_amount` and `M` is quite small (for example, `M < 64`), nothing needs to be updated or split — let's execute as-is. I don't insist on doing it in this PR, but I have some models (for example, action-recognition or levit) with small values of batch and `M` where this pass is applied and produces `M = 4` or even `M = 1`, which leads to perf degradation. A sketch of such an early exit follows below.
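A minimal sketch of the suggested early exit, assuming a hypothetical helper and the example threshold of 64 from the comment above (the real pass signature differs):

```cpp
#include <cstddef>

// Hypothetical guard: returns true when splitting should be skipped entirely.
// If the batch alone can't reach the optimal parallel work amount and M is
// already small, splitting would only shrink the kernel further (down to
// M = 4 or M = 1 in the mentioned models), so the shape is better left as-is.
bool skip_m_split(std::size_t batch_dim, std::size_t m_dim,
                  std::size_t optimal_parallelism_work_amount) {
    constexpr std::size_t small_m_threshold = 64;  // example value from the comment
    return batch_dim < optimal_parallelism_work_amount && m_dim < small_m_threshold;
}
```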
I like your idea. In third-party brgemm heuristics, I saw that the minimal allowed `m_kernel` is 16; we could probably take that into account in our heuristics. It's also important that `SplitDimensionM::split` is used in the CPU callback (via `can_be_optimized`), so if it returns false, MHA tokenization doesn't happen. So another question we need to answer is whether we should even tokenize such MHAs at all.
I think we should tokenize these cases too, just without splitting dimension M. These small kernels will be processed quickly by Snippets anyway. For now, I suggest coming back to this when we have time. Could you please add a comment with this small discussion to the ticket, as a reminder of everything that's needed?
- I changed `SplitDimensionM::split` behavior based on perf experiments. There are now no limitations on the M dim for the heuristics; they are simply called in the following order: `split_ideally` -> `split_minimize_kernel_wa` -> `split_fallback_increase_parallel_wa`. Each next function is called only if the previous one didn't finish successfully (see the sketch below).
- Based on the ticket's description, I can conclude that this PR already solves the described problem (I added a corresponding test case to `split_dim_m.cpp`), so I created another ticket for the small-MHA tokenization task: CVS-160154
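A rough sketch of that fallback chain — the heuristic names come from the PR, while the bodies and the free-function signature here are placeholders:

```cpp
#include <cstddef>
#include <utility>

using SplitResult = std::pair<std::size_t, std::size_t>;  // {batch_m, kernel_m}

// Placeholder heuristics: in the real pass these inspect the shape and the
// available parallelism; here they only illustrate the success/failure contract.
static bool split_ideally(std::size_t, std::size_t, SplitResult&) { return false; }
static bool split_minimize_kernel_wa(std::size_t, std::size_t, SplitResult&) { return false; }
static bool split_fallback_increase_parallel_wa(std::size_t, std::size_t, SplitResult&) { return false; }

// Sketch of the dispatch described above: heuristics are tried in order of
// preference, and each later one runs only if the previous one failed.
bool split(std::size_t batch_dim, std::size_t m_dim, SplitResult& result) {
    return split_ideally(batch_dim, m_dim, result) ||
           split_minimize_kernel_wa(batch_dim, m_dim, result) ||
           split_fallback_increase_parallel_wa(batch_dim, m_dim, result);
}
```

Short-circuit `||` keeps the order of preference explicit while letting each heuristic commit its result only on success.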
Force-pushed from 78a43ee to 4242663
Force-pushed from 4242663 to 321e549
The changes will be merged into master within PR #28179.