Parallelization of ConstProp compilation #3042
base: main
Conversation
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
LGTM
return [fun = std::move(fun)](llvm::MutableArrayRef<WideNum> data) -> void {
  for (WideNum &n : data)
    n = fun(n);
static inline Transformer functionTransformer(
suggestion: if you make this non-static, I think you can get the context from disposablePool.getContext() and then you don't need to pass ctx
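For concreteness, here is a minimal sketch of that suggestion, not the PR's actual code; the class name ElementsAttrBuilder, the Transformer alias, and the disposablePool member are assumptions about the surrounding code:

```cpp
// Sketch only: non-static, so the context comes from the pool and callers
// no longer need to pass ctx.
ElementsAttrBuilder::Transformer ElementsAttrBuilder::functionTransformer(
    std::function<WideNum(WideNum)> fun) {
  mlir::MLIRContext *ctx = disposablePool.getContext();
  (void)ctx; // would feed the parallel loop inside the transformer
  return [fun = std::move(fun)](llvm::MutableArrayRef<WideNum> data) -> void {
    for (WideNum &n : data)
      n = fun(n);
  };
}
```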
@sorenlassen Nice to see you back giving us advice! Thanks for the feedback, always appreciated :-)
@sorenlassen Nice to see you again. That's what I wanted to do. Thanks!
std::mutex mtx;
size_t beginOffset = 0;
maybe slightly simpler if you omit the mutex and make beginOffset atomic and use beginOffset.fetch_add() to increment it and at the same time read its old value
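A small sketch of that idea; claimChunks, chunkSize, and totalSize are illustrative names, not identifiers from the PR:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

// Each worker claims the next chunk by atomically bumping beginOffset;
// fetch_add returns the old value, so the increment and the read of the
// previous offset happen in one atomic step -- no mutex needed.
void claimChunks(std::atomic<size_t> &beginOffset, size_t chunkSize,
    size_t totalSize) {
  for (;;) {
    size_t begin = beginOffset.fetch_add(chunkSize);
    if (begin >= totalSize)
      break;
    size_t end = std::min(begin + chunkSize, totalSize);
    // ... process elements [begin, end) ...
    (void)end;
  }
}
```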
Same comment as above, no need to use a lock to compute the lower and upper bound of the work to be done by a given thread.
batch.emplace_back(std::make_pair(idxoffs.flattenedIndex, idxoffs[0]));

std::mutex mtx;
size_t beginOffset = 0;
I need to know a bit more about the range of threadNumber, but it feels like every thread in the pool is assigned a number from 0 to ctx->getNumThreads(). If that is the case, there is absolutely no need to use a lock and an induction on the value beginOffset.
Here is the code you need, assuming that the work starts at 0:
int t = threadNumber;
int tmax = ctx->getNumThreads();
int UB = batch.size(); // Assume here that LB=0, and batch.size() is the total number of iterations.
int tileSize = UB / tmax; // integer division already floors.
int leftovers = UB % tmax;
int beginOffset;
if (t < leftovers) {
  tileSize++; // for the first few threads, it is as if the block size is larger by 1.
  beginOffset = t * tileSize;
} else {
  beginOffset = t * tileSize + leftovers; // for the last threads, it is as if we shift the start by leftovers.
}
int endOffset = beginOffset + tileSize;
Check it out with UB=18 and tmax=4: you will see that you get the ranges [0, 5), [5, 10), [10, 14), [14, 18).
Note that the code computes all its values from scratch, with no induction variable. All it needs is t, tmax, and UB.
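For reference, a standalone check of the formula above with the reviewer's example (UB=18, tmax=4); it prints [0,5) [5,10) [10,14) [14,18):

```cpp
#include <cstdio>

int main() {
  const int UB = 18, tmax = 4; // total iterations and number of threads
  for (int t = 0; t < tmax; ++t) {
    int tileSize = UB / tmax;
    int leftovers = UB % tmax;
    int beginOffset;
    if (t < leftovers) {
      tileSize++;
      beginOffset = t * tileSize;
    } else {
      beginOffset = t * tileSize + leftovers;
    }
    int endOffset = beginOffset + tileSize;
    std::printf("[%d,%d) ", beginOffset, endOffset);
  }
  std::printf("\n");
  return 0;
}
```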
Updated. Thanks!
    }
  }
};
parallelFor(ctx, 0, ctx->getNumThreads(), work);
});
}
I also assume that the work up there assumes that there are batch.size() reductions that can all be done in parallel.
Since for quantization we have "whole tensor" quantization, there are cases where we have only 1 reduction.
That can also be done in parallel. Say you have 1000 elements and 10 threads: each thread processes its own 100 numbers and saves its result in its slot in an array of 10 partial sums. Then, after the parallel region, just reduce these 10 values sequentially. You will still get a near 10x speedup.
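A minimal sketch of that partial-sum idea, assuming a plain std::vector<double> input and an even split across threads (the tiling formula above would work just as well); parallelSum and its parameters are illustrative, not code from the PR:

```cpp
#include <cstddef>
#include <vector>

#include "mlir/IR/Threading.h"

double parallelSum(mlir::MLIRContext *ctx, const std::vector<double> &data) {
  size_t numThreads = ctx->getNumThreads();
  std::vector<double> partial(numThreads, 0.0); // one slot per thread
  mlir::parallelFor(ctx, 0, numThreads, [&](size_t t) {
    size_t begin = data.size() * t / numThreads;
    size_t end = data.size() * (t + 1) / numThreads;
    for (size_t i = begin; i < end; ++i)
      partial[t] += data[i]; // each thread writes only its own slot
  });
  // After the parallel region, reduce the few per-thread results sequentially.
  double sum = 0.0;
  for (double p : partial)
    sum += p;
  return sum;
}
```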
Also, should we check whether, when batch.size() is small, we want to do things sequentially? It would probably be good in case we have a few very small tensors. You can easily print out the sizes on stderr for a few benchmarks and see if you have such cases.
std::mutex mtx;
size_t beginOffset = 0;
Same comment as above, no need to use a lock to compute the lower and upper bound of the work to be done by a given thread.
for (WideNum &n : batch)
  n = fun(n);
};
parallelFor(ctx, 0, ctx->getNumThreads(), work);
As mentioned before, please check that there is enough work to warrant going parallel. I suspect that if the reduction is very small, then we really want to do it sequentially and it will be faster.
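A sketch of the kind of guard this suggests; the threshold constant and the wrapper are assumptions, not code from the PR:

```cpp
#include <cstddef>

#include "mlir/IR/Threading.h"

// Illustrative threshold; the right value should come from benchmarking.
constexpr size_t kMinWorkForParallel = 1024;

template <typename WorkFn>
void runMaybeParallel(mlir::MLIRContext *ctx, size_t numElements, WorkFn &&work) {
  size_t numThreads = ctx->getNumThreads();
  if (numElements < kMinWorkForParallel || numThreads <= 1) {
    // Tiny workloads: run every tile on the calling thread, skipping pool overhead.
    for (size_t t = 0; t < numThreads; ++t)
      work(t);
    return;
  }
  mlir::parallelFor(ctx, 0, numThreads, work);
}
```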
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
@imaihal please ping me when you have implemented the changes, I will then review it again. Thanks for working on accelerating the compiler, much appreciated. If you know of other opportunities that are not exploited yet, maybe you can add a "todo" in the code or in the description of this PR so that we don't lose such opportunities.
Signed-off-by: Haruki Imai <[email protected]>
All the changes look good to me now.
Do you want to test whether there are cases where there is very little work and the code should go into sequential mode? Maybe report here if you have seen this in benchmarks?
So I approved the PR. It would still be good to know if we should consider doing some of the work in sequential mode: this PR could help 90% of the cases but hurt performance for small ones, so while it may look good in general, we would leave performance on the table by not doing the small cases sequentially.
Yes. I will test and add a threshold (maybe if the loop length is small, it should run in sequential mode).
LGTM, thanks for adding the new algo for selecting the batch bounds.
Will you add a lit test to make sure the parallel version of the algo works?
To accelerate compilation time, this PR parallelizes the compilation of ConstProp using the parallelFor() API in MLIR. Basically, this improves constant propagation for reduction computations.
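For readers unfamiliar with it, a minimal illustration of the MLIR parallelFor() API referred to above (not the PR's code); the lambda runs once per index over the context's thread pool, or sequentially when multithreading is disabled:

```cpp
#include <cstddef>

#include "mlir/IR/Threading.h"

void forEachBatch(mlir::MLIRContext *ctx, size_t numBatches) {
  mlir::parallelFor(ctx, 0, numBatches, [&](size_t i) {
    // ... constant-fold / reduce batch i independently of the others ...
    (void)i;
  });
}
```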