Parallelization of ConstProp compilation #3042
base: main
Conversation
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
LGTM
return [fun = std::move(fun)](llvm::MutableArrayRef<WideNum> data) -> void {
  for (WideNum &n : data)
    n = fun(n);
static inline Transformer functionTransformer(
suggestion: if you make this non-static, I think you can get the context from disposablePool.getContext() and then you don't need to pass ctx
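For concreteness, here is a minimal sketch of that suggestion, not the PR's actual code; the class name ElementsAttrBuilder, the Transformer alias, and the disposablePool member are assumptions about the surrounding code:

```cpp
// Sketch only: non-static, so the context comes from the pool and callers
// no longer need to pass ctx.
ElementsAttrBuilder::Transformer ElementsAttrBuilder::functionTransformer(
    std::function<WideNum(WideNum)> fun) {
  mlir::MLIRContext *ctx = disposablePool.getContext();
  (void)ctx; // would feed the parallel loop inside the transformer
  return [fun = std::move(fun)](llvm::MutableArrayRef<WideNum> data) -> void {
    for (WideNum &n : data)
      n = fun(n);
  };
}
```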
@sorenlassen Nice to see you back giving us advice! Thanks for the feedback, always appreciated :-)
@sorenlassen Nice to see you again. That's what I wanted to do. Thanks!
std::mutex mtx;
size_t beginOffset = 0;
maybe slightly simpler if you omit the mutex and make beginOffset atomic and use beginOffset.fetch_add() to increment it and at the same time read its old value
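A small sketch of that idea; claimChunks, chunkSize, and totalSize are illustrative names, not identifiers from the PR:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

// Each worker claims the next chunk by atomically bumping beginOffset;
// fetch_add returns the old value, so the increment and the read of the
// previous offset happen in one atomic step -- no mutex needed.
void claimChunks(std::atomic<size_t> &beginOffset, size_t chunkSize,
    size_t totalSize) {
  for (;;) {
    size_t begin = beginOffset.fetch_add(chunkSize);
    if (begin >= totalSize)
      break;
    size_t end = std::min(begin + chunkSize, totalSize);
    // ... process elements [begin, end) ...
    (void)end;
  }
}
```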
Same comment as above, no need to use a lock to compute the lower and upper bound of the work to be done by a given thread.
batch.emplace_back(std::make_pair(idxoffs.flattenedIndex, idxoffs[0]));

std::mutex mtx;
size_t beginOffset = 0;
I need to know a bit more about the range of threadNumber, but it feels like every thread in the pool is assigned a number from 0 to ctx->getNumThreads(). If that is the case, there is absolutely no need to use a lock and an induction on the value beginOffset.
Here is the code you need, assuming that the work starts at 0:
int t = threadNumber;
int tmax = ctx->getNumThreads();
int UB = batch.size(); // Assume here that LB=0, and batch.size() is the total number of iterations.
int tileSize = UB / tmax; // integer division already floors.
int leftovers = UB % tmax;
int beginOffset;
if (t < leftovers) {
  tileSize++; // for the first few threads, it is as if the block size is larger by 1.
  beginOffset = t * tileSize;
} else {
  beginOffset = t * tileSize + leftovers; // for the last threads, it is as if we shift the start by leftovers.
}
int endOffset = beginOffset + tileSize;
Check it out with UB=18 and tmax=4: you will see that you get the ranges [0, 5), [5, 10), [10, 14), [14, 18).
Note that the code computes all its values from scratch, with no induction variable. All it needs is t, tmax, and UB.
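For reference, a standalone check of the formula above with the reviewer's example (UB=18, tmax=4); it prints [0,5) [5,10) [10,14) [14,18):

```cpp
#include <cstdio>

int main() {
  const int UB = 18, tmax = 4; // total iterations and number of threads
  for (int t = 0; t < tmax; ++t) {
    int tileSize = UB / tmax;
    int leftovers = UB % tmax;
    int beginOffset;
    if (t < leftovers) {
      tileSize++;
      beginOffset = t * tileSize;
    } else {
      beginOffset = t * tileSize + leftovers;
    }
    int endOffset = beginOffset + tileSize;
    std::printf("[%d,%d) ", beginOffset, endOffset);
  }
  std::printf("\n");
  return 0;
}
```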
Updated. Thanks!
    }
  }
};
parallelFor(ctx, 0, ctx->getNumThreads(), work);
});
}
I also assume that the work up there assumes that there are batch.size() reductions that can all be done in parallel.
Since for quantization we have "whole tensor" quantization, there are cases where we have only 1 reduction.
That can also be done in parallel. Say you have 1000 elements and 10 threads: each thread processes its own 100 numbers and saves its result in its slot in an array of 10 partial sums. Then, after the parallel region, just reduce these 10 values sequentially. You will still get a near 10x speedup.
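A minimal sketch of that partial-sum idea, assuming a plain std::vector<double> input and an even split across threads (the tiling formula above would work just as well); parallelSum and its parameters are illustrative, not code from the PR:

```cpp
#include <cstddef>
#include <vector>

#include "mlir/IR/Threading.h"

double parallelSum(mlir::MLIRContext *ctx, const std::vector<double> &data) {
  size_t numThreads = ctx->getNumThreads();
  std::vector<double> partial(numThreads, 0.0); // one slot per thread
  mlir::parallelFor(ctx, 0, numThreads, [&](size_t t) {
    size_t begin = data.size() * t / numThreads;
    size_t end = data.size() * (t + 1) / numThreads;
    for (size_t i = begin; i < end; ++i)
      partial[t] += data[i]; // each thread writes only its own slot
  });
  // After the parallel region, reduce the few per-thread results sequentially.
  double sum = 0.0;
  for (double p : partial)
    sum += p;
  return sum;
}
```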
Also, should we check whether, when batch.size() is small, we want to do things sequentially? It would probably be good in case we have a few very small tensors. You can easily print out the sizes on stderr for a few benchmarks and see if you have such cases.
std::mutex mtx;
size_t beginOffset = 0;
Same comment as above, no need to use a lock to compute the lower and upper bound of the work to be done by a given thread.
for (WideNum &n : batch)
  n = fun(n);
};
parallelFor(ctx, 0, ctx->getNumThreads(), work);
As mentioned before, please check that there is enough work to warrant going parallel. I suspect that if the reduction is very small, then we really want to do it sequentially and it will be faster.
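A sketch of the kind of guard this suggests; the threshold constant and the wrapper are assumptions, not code from the PR:

```cpp
#include <cstddef>

#include "mlir/IR/Threading.h"

// Illustrative threshold; the right value should come from benchmarking.
constexpr size_t kMinWorkForParallel = 1024;

template <typename WorkFn>
void runMaybeParallel(mlir::MLIRContext *ctx, size_t numElements, WorkFn &&work) {
  size_t numThreads = ctx->getNumThreads();
  if (numElements < kMinWorkForParallel || numThreads <= 1) {
    // Tiny workloads: run every tile on the calling thread, skipping pool overhead.
    for (size_t t = 0; t < numThreads; ++t)
      work(t);
    return;
  }
  mlir::parallelFor(ctx, 0, numThreads, work);
}
```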
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
@imaihal please ping me when you have implemented the changes, I will then review it again. Thanks for working on accelerating the compiler, much appreciated. If you know of other opportunities that are not exploited yet, maybe you can add a "todo" in the code or in the description of this PR so that we don't lose such opportunities.
Signed-off-by: Haruki Imai <[email protected]>
All the changes look good to me now.
Do you want to test whether there are cases where there is very little work and the code should go into sequential mode? Maybe report here if you have seen this in benchmarks?
So I approved the PR. It would still be good to know if we should consider doing some of the work in sequential mode: this PR could help 90% of the cases but hurt performance for small ones, so while it may look good in general, we would leave performance on the table by not doing the small cases sequentially.
Yes. I will test and add a threshold (maybe if the loop length is small, it should run in sequential mode).
LGTM, thanks for adding the new algo for selecting the batch bounds.
Will you add a lit test to make sure the parallel version of the algo works?
To accelerate compilation time, this PR parallelizes the compilation of ConstProp using the parallelFor() API in MLIR. Basically, this improves constant propagation for reduction computations.
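For readers unfamiliar with it, a minimal illustration of the MLIR parallelFor() API referred to above (not the PR's code); the lambda runs once per index over the context's thread pool, or sequentially when multithreading is disabled:

```cpp
#include <cstddef>

#include "mlir/IR/Threading.h"

void forEachBatch(mlir::MLIRContext *ctx, size_t numBatches) {
  mlir::parallelFor(ctx, 0, numBatches, [&](size_t i) {
    // ... constant-fold / reduce batch i independently of the others ...
    (void)i;
  });
}
```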