
Parallelization of ConstProp compilation #3042

Open
wants to merge 9 commits into base: main

Conversation

@imaihal (Collaborator) commented Jan 14, 2025

To accelerate compilation time, this PR parallelizes the compilation of ConstProp using the parallelFor() API in MLIR. In particular, this improves constant propagation for reduction computations.
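For context, here is a minimal sketch (not the PR's actual code) of the pattern this relies on: mlir::parallelFor() dispatches one callback per worker thread of the context's thread pool, and each worker handles a contiguous slice of the data. The double buffer and the doubling transform below are simplified stand-ins for the real constant-folding work.

    #include "mlir/IR/MLIRContext.h"
    #include "mlir/IR/Threading.h"
    #include <vector>

    // Sketch: apply an element-wise transform to `data`, one contiguous slice
    // per worker thread in the MLIRContext's thread pool.
    void transformInParallel(mlir::MLIRContext *ctx, std::vector<double> &data) {
      size_t numThreads = ctx->getNumThreads();
      size_t n = data.size();
      mlir::parallelFor(ctx, 0, numThreads, [&](size_t t) {
        size_t begin = n * t / numThreads;
        size_t end = n * (t + 1) / numThreads;
        for (size_t i = begin; i < end; ++i)
          data[i] = data[i] * 2.0; // stand-in for the real element transform
      });
    }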

imaihal marked this pull request as ready for review January 17, 2025 06:29
@sorenlassen (Member) left a comment

LGTM

return [fun = std::move(fun)](llvm::MutableArrayRef<WideNum> data) -> void {
  for (WideNum &n : data)
    n = fun(n);
static inline Transformer functionTransformer(
Member:

suggestion: if you make this non-static, I think you can get the context from disposablePool.getContext() and then you don't need to pass ctx
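A minimal, self-contained analogue of this suggestion (the Pool and Builder classes below are hypothetical stand-ins for onnx-mlir's DisposablePool and the class that owns functionTransformer, not the real types): once the function is a non-static member, it can pull the context from the pool member instead of taking a ctx argument.

    #include <functional>
    #include <utility>

    struct Context {};                        // stand-in for mlir::MLIRContext
    struct Pool {                             // stand-in for DisposablePool
      Context *getContext() const { return ctx; }
      Context *ctx = nullptr;
    };

    class Builder {                           // hypothetical owner of the factory
    public:
      explicit Builder(Pool &pool) : disposablePool(pool) {}

      // Non-static member: no ctx parameter, the context comes from the pool.
      std::function<void(int &)> functionTransformer(std::function<int(int)> fun) {
        Context *ctx = disposablePool.getContext();
        (void)ctx;                            // would be forwarded to parallelFor
        return [fun = std::move(fun)](int &n) { n = fun(n); };
      }

    private:
      Pool &disposablePool;
    };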

Collaborator:

@sorenlassen Nice to see you back giving us advice! Thanks for the feedback, always appreciated :-)

Collaborator Author:

@sorenlassen Nice to see you again. That's what I wanted to do. Thanks!

Comment on lines 254 to 255
std::mutex mtx;
size_t beginOffset = 0;
Member:

maybe slightly simpler if you omit the mutex, make beginOffset atomic, and use beginOffset.fetch_add() to increment it and read its old value at the same time
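A minimal sketch of the atomic variant, detached from the PR's code: each worker claims its chunk with a single fetch_add, which increments the shared offset and returns the pre-increment value, so no mutex is needed.

    #include <atomic>
    #include <cstddef>

    std::atomic<size_t> beginOffset{0};

    // Hypothetical per-thread helper: reserves `chunkSize` elements and returns
    // the start of the reserved range (the counter's value before the add).
    size_t claimChunk(size_t chunkSize) {
      return beginOffset.fetch_add(chunkSize);
    }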

Collaborator:

Same comment as above, no need to use a lock to compute the lower and upper bound of the work to be done by a given thread.

batch.emplace_back(std::make_pair(idxoffs.flattenedIndex, idxoffs[0]));

std::mutex mtx;
size_t beginOffset = 0;
Collaborator:

I need to know a bit more about the range of threadNumber, but it feels like every thread in the pool is assigned a number from 0 to ctx->getNumThreads().

If that is the case, there is absolutely no need for a lock and an induction on the value of beginOffset.

Here is the code you need, assuming that the work starts at 0:

   int t = threadNumber;
   int tmax = ctx->getNumThreads();
   int UB = batch.size(); // Assume LB = 0, and batch.size() is the total number of iterations.
   int tileSize = UB / tmax; // Integer division floors.
   int leftovers = UB % tmax;
   int beginOffset;
   if (t < leftovers) {
     tileSize++; // For the first few threads, it is as if the tile size is larger by 1.
     beginOffset = t * tileSize;
   } else {
     beginOffset = t * tileSize + leftovers; // For the last threads, it is as if we shift the start by leftovers.
   }
   int endOffset = beginOffset + tileSize;

Check it out with UB = 18 and tmax = 4: you will see you get the ranges [0, 5), [5, 10), [10, 14), [14, 18).

Note that the code computes all of its values from scratch, with no induction variable. All it needs is t, tmax, and UB.
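For reference, the arithmetic can be checked with a standalone program using the UB = 18, tmax = 4 example (this is just a sanity check, not part of the PR):

    #include <cstdio>

    int main() {
      const int tmax = 4, UB = 18;
      for (int t = 0; t < tmax; ++t) {
        int tileSize = UB / tmax;    // integer division = floor
        int leftovers = UB % tmax;
        int beginOffset;
        if (t < leftovers) {
          tileSize++;                // first `leftovers` threads get one extra element
          beginOffset = t * tileSize;
        } else {
          beginOffset = t * tileSize + leftovers;
        }
        int endOffset = beginOffset + tileSize;
        std::printf("thread %d: [%d, %d)\n", t, beginOffset, endOffset);
      }
      return 0; // prints [0, 5), [5, 10), [10, 14), [14, 18)
    }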

Collaborator Author:

Updated. Thanks!

}
}
};
parallelFor(ctx, 0, ctx->getNumThreads(), work);
});
}
Collaborator:

I also assume that the code above presumes there are batch.size() reductions that can all be done in parallel.

Since quantization has a "whole tensor" mode, there are cases where we have only one reduction. That can also be done in parallel: say you have 1000 elements and 10 threads. Each thread processes its own 100 numbers and saves its result in its own slot in an array of 10 partial sums. Then, after the parallel region, just reduce these 10 values sequentially. You will still get a near 10x speedup.

Also, should we check whether, when batch.size() is small, we want to do things sequentially? That would probably be good in case we have a few very small tensors. You can easily print the sizes to stderr for a few benchmarks and see whether you have such cases.
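A minimal sketch of the partial-sum idea described above (not the PR's code; a plain double sum stands in for the real reduction): each worker reduces its own slice into its own slot, and the tmax partial results are combined sequentially afterwards.

    #include "mlir/IR/MLIRContext.h"
    #include "mlir/IR/Threading.h"
    #include <vector>

    double parallelSum(mlir::MLIRContext *ctx, const std::vector<double> &data) {
      size_t tmax = ctx->getNumThreads();
      std::vector<double> partials(tmax, 0.0);
      size_t n = data.size();
      mlir::parallelFor(ctx, 0, tmax, [&](size_t t) {
        size_t begin = n * t / tmax;
        size_t end = n * (t + 1) / tmax;
        for (size_t i = begin; i < end; ++i)
          partials[t] += data[i];    // each thread writes only its own slot
      });
      double total = 0.0;
      for (double p : partials)      // cheap sequential combine of tmax values
        total += p;
      return total;
    }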

Comment on lines 254 to 255
std::mutex mtx;
size_t beginOffset = 0;
Collaborator:

Same comment as above, no need to use a lock to compute the lower and upper bound of the work to be done by a given thread.

for (WideNum &n : batch)
  n = fun(n);
};
parallelFor(ctx, 0, ctx->getNumThreads(), work);
Collaborator:

As mentioned before, please check that there is enough work to warrant going parallel. I suspect that if the reduction is very small, then we really want to do it sequentially, and it will be faster.
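A minimal sketch of such a size check (the threshold value is made up, not measured): skip the thread pool entirely when the input is too small to amortize the dispatch overhead.

    #include "mlir/IR/MLIRContext.h"
    #include "mlir/IR/Threading.h"
    #include <vector>

    void transformMaybeParallel(mlir::MLIRContext *ctx, std::vector<double> &data) {
      constexpr size_t kParallelThreshold = 1024; // hypothetical cutoff
      auto transformRange = [&](size_t begin, size_t end) {
        for (size_t i = begin; i < end; ++i)
          data[i] += 1.0;               // stand-in for the real element transform
      };
      if (data.size() < kParallelThreshold) {
        transformRange(0, data.size()); // small input: stay sequential
        return;
      }
      size_t tmax = ctx->getNumThreads();
      mlir::parallelFor(ctx, 0, tmax, [&](size_t t) {
        transformRange(data.size() * t / tmax, data.size() * (t + 1) / tmax);
      });
    }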

@AlexandreEichenberger (Collaborator) commented:

@imaihal please ping me when you have implemented the changes, I will then review it again. Thanks for working on accelerating the compiler, much appreciated.

If you know of other opportunities that are not exploited yet, maybe you can add a "todo" in the code or in the description of this PR so that we don't lose such opportunities.

Signed-off-by: Haruki Imai <[email protected]>
@AlexandreEichenberger (Collaborator) left a comment

All the changes look good to me now.

Do you want to test whether there are cases with very little work where the code should run in sequential mode? Maybe report here if you have seen this in benchmarks.

So I approved the PR; it would be good to know whether we should consider doing some of the work in sequential mode (this PR could help 90% of the cases but hurt performance for the small ones, so while it may look good overall, we would still leave performance on the table by not handling the small cases sequentially).

@imaihal (Collaborator Author) commented Jan 28, 2025

@AlexandreEichenberger

Do you want to test whether there are cases with very little work where the code should run in sequential mode? Maybe report here if you have seen this in benchmarks.

Yes. I will test and add a threshold (maybe if the loop length is small, it should run in sequential mode).
I will also add the test code you suggested in another PR.

@AlexandreEichenberger (Collaborator) left a comment

LGTM, thanks for adding the new algo for selecting the batch bounds.

Will you add a lit test to make sure the parallel version of the algorithm works?
