[SCFToCalyx] Lower SCF Parallel Op To Calyx #7409
Conversation
@jiahanxie353 this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.
I have a few C++/coding-related comments, but I'll postpone these since you're in draft mode. So currently, you've created N components that have the same M cells, which are subsequently invoked in parallel in a par block.
The alternative is what you've questioned, i.e., creating a component that takes the M cells as inputs. This should be possible, but would require wiring up the correct inputs and outputs.
I marked it as a draft for two reasons.
Makes sense! Discussion needed: we end up with groups like

calyx.group @bb0_5 {
  ....
  calyx.assign %arg_mem_1.write_en = %true : i1
  ....
}

across different components, and since the same port then has multiple drivers, we can't lower it to Verilog. Any thoughts?
Okay, the reasons for marking it a draft make sense, @jiahanxie353!
It is not sufficient to show that we are going to write to different memory locations. The problem is that a hardware memory can only support a single read or write per clock cycle (because it is a single-ported memory). For cases like this, you need to "bank" the memory into separate parts so that the accesses go to entirely different physical memories.
First, I'd recommend reading the Dahlia paper to get a sense of what is going on. Next, we'd need to think more carefully about how this should be supported. @andrewb1999's work is adding this capability in a more general fashion, but we might need to bank memories somehow. Also tagging @ethanuppal, since he's been thinking about similar problems in his language.
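To make the banking idea concrete, here is a minimal C++ sketch of cyclic (round-robin) banking, where a logical address maps to bank addr % N at offset addr / N. All names here are hypothetical illustrations, not code from this patch:

#include <cstdint>
#include <vector>

// Cyclic banking: logical address `addr` of a memory banked N ways
// lives in bank (addr % N) at offset (addr / N). Two accesses can
// happen in the same cycle iff they hit different (single-ported) banks.
struct BankedAddr {
  uint64_t bank;   // which physical memory
  uint64_t offset; // address within that bank
};

BankedAddr toBanked(uint64_t addr, uint64_t numBanks) {
  return {addr % numBanks, addr / numBanks};
}

// Returns true if all addresses in one parallel step land in distinct banks.
bool conflictFree(const std::vector<uint64_t> &addrs, uint64_t numBanks) {
  std::vector<bool> used(numBanks, false);
  for (uint64_t a : addrs) {
    uint64_t b = a % numBanks;
    if (used[b])
      return false; // two accesses to the same single-ported bank
    used[b] = true;
  }
  return true;
}

For example, with numBanks = 4, parallel accesses to addresses 4 and 3 map to banks 0 and 3 and are conflict-free, while addresses 9 and 5 both map to bank 1 and are not.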
Is there a general algorithm to determine an "optimal" bank size for each memory? Consider a somewhat made-up example with appropriate complexity:

module {
func.func @main(%buffer : memref<16xi32>) {
%idx_one = arith.constant 1 : index
%idx_two = arith.constant 2 : index
%idx_three = arith.constant 3 : index
%idx_four = arith.constant 4 : index
%idx_five = arith.constant 5 : index
%one = arith.constant 1 : i32
%six = arith.constant 6 : i32
%val = arith.addi %one, %six : i32
%step1 = arith.constant 1 : index
%step2 = arith.constant 2 : index
%lb1 = arith.constant 0 : index
%lb2 = arith.constant 1 : index
%ub1 = arith.constant 5 : index
%ub2 = arith.constant 4 : index
%mem1 = memref.alloc() : memref<25xi32>
scf.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%step1, %step2) {
%mul1 = arith.muli %iv1, %idx_five : index
%load_idx1 = arith.addi %mul1, %idx_four : index
%load_val1 = memref.load %mem1[%load_idx1] : memref<25xi32>
%mul2 = arith.muli %iv1, %idx_two : index
%load_idx2 = arith.addi %mul2, %idx_three : index
%load_val2 = memref.load %mem1[%load_idx2] : memref<25xi32>
%sum1 = arith.addi %load_val1, %load_val2 : i32
%mul3 = arith.muli %iv2, %idx_one : index
%load_idx3 = arith.addi %mul3, %idx_one : index
%load_val3 = memref.load %mem1[%load_idx3] : memref<25xi32>
%mul4 = arith.muli %iv1, %idx_four : index
%load_idx4 = arith.addi %mul4, %idx_two : index
%load_val4 = memref.load %buffer[%load_idx4] : memref<16xi32>
%sum2 = arith.addi %load_val3, %load_val4 : i32
%store_idx = arith.muli %iv2, %idx_three : index
%store_val = arith.addi %sum1, %sum2 : i32
memref.store %store_val, %buffer[%store_idx] : memref<16xi32>
}
return
}
}

And I have generated the memory access patterns:
And I realize that, for this example, according to the documentation:
Since (iv1=0, iv2=1) and (iv1=0, iv2=3) both access mem1's 4th and 3rd elements, there is a potential data race, and we cannot do any parallelism here. First, this example is completely artificial, and maybe I'm over-complicating it. But if this were to happen, should we reject this kind of access pattern in the first place? And if we take a step back and consider a somewhat simplified access pattern with only one induction variable:
What is the algorithm for determining the banking factor?
- "consume"s two banks: the first bank (4 % 4 = 0) and the fourth bank (3 % 4 = 3);
- "consume"s one bank, i.e., the second bank (9 % 4 = 1, 5 % 4 = 1);
- "consume"s two banks: the third (14 % 4 = 2) and the fourth (7 % 4 = 3);
- "consume"s two banks: the fourth (19 % 4 = 3) and the second (9 % 4 = 1);
- "consume"s two banks: the first (24 % 4 = 0) and the second (1 % 4 = 1).

Putting different groups together and determining which ones can be executed in parallel almost sounds like a graph coloring problem. Is there a general way to do it? Any suggestions on how to tackle this?
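On the graph-coloring framing: one general (if not optimal) way to think about it is to build a conflict graph over the parallel steps, with an edge whenever two steps hit the same bank, and greedily color it; steps sharing a color are pairwise conflict-free and can be scheduled under the same par. This is only a sketch of the idea with hypothetical names, not a claim about what this patch or TraceBank does:

#include <map>
#include <set>

// Vertices are parallel "steps" (e.g., Calyx groups); an edge means two
// steps touch the same memory bank and must not run in the same phase.
using Graph = std::map<int, std::set<int>>;

// Greedy coloring: each step gets the smallest color unused by its
// neighbors; steps with the same color can execute in parallel.
std::map<int, int> greedyColor(const Graph &conflicts) {
  std::map<int, int> color;
  for (const auto &[v, neighbors] : conflicts) {
    std::set<int> taken;
    for (int n : neighbors) {
      auto it = color.find(n);
      if (it != color.end())
        taken.insert(it->second);
    }
    int c = 0;
    while (taken.count(c))
      ++c;
    color[v] = c;
  }
  return color;
}

Optimal coloring is NP-hard in general, which is consistent with the heuristics caveat below.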
You can look at the TraceBank paper for details on how to automatically determine good banking factors. I think optimality exists, but for large programs you might have to use heuristics.
I second TraceBank.
force-pushed from 0498000 to 53549ec
Whoo, finally finished implementing everything! It's a lot trickier than I thought. Please let me know what you think!
class CliqueGraph {
public:
  CliqueGraph(
      const std::unordered_map<VertexType, std::set<VertexType>> &adjacencyList)
The only thing that needs to be fixed is the data types used. I had to use standard std::vector, std::map, etc., to get it working, such as here; but I thought it'd be better to use LLVM-native types such as DenseMap, DenseSet, etc. Somehow that required extra work, like implementing a customized getTombstoneKey and more. Can I get some pointers on that?
    redundantRemoved.push_back(simplifiedStep);
  }
  // 2. Combine steps with identical access into a single step
  std::set<std::set<int>> seenStep;
Similarly here: I couldn't get a DenseSet of DenseSets working unless I implemented extra methods, so I'm not entirely sure what to do.
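For reference, the "extra methods" LLVM asks for come from specializing llvm::DenseMapInfo for the key type. A minimal sketch for a hypothetical key type (IntPair stands in for the real vertex type, which is an assumption here):

#include <climits>

#include "llvm/ADT/DenseMapInfo.h"
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/Hashing.h"

// Hypothetical key type standing in for the real vertex type.
struct IntPair {
  int a, b;
  bool operator==(const IntPair &o) const { return a == o.a && b == o.b; }
};

namespace llvm {
template <> struct DenseMapInfo<IntPair> {
  // DenseMap/DenseSet reserve two key values to mark empty and erased
  // buckets; they must never collide with real keys, which is why the
  // customization is required in the first place.
  static inline IntPair getEmptyKey() { return {INT_MIN, INT_MIN}; }
  static inline IntPair getTombstoneKey() { return {INT_MIN + 1, INT_MIN}; }
  static unsigned getHashValue(const IntPair &v) {
    return hash_combine(v.a, v.b);
  }
  static bool isEqual(const IntPair &l, const IntPair &r) { return l == r; }
};
} // namespace llvm

With that in place, llvm::DenseSet<IntPair> works. A DenseSet of DenseSets would additionally need DenseMapInfo for the inner DenseSet type itself, which is why falling back to std::set<std::set<int>> is a pragmatic stopgap here.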
force-pushed from 53549ec to cbbc22d
Hey @jiahanxie353, I'm going to focus on the implementation first. There are >1,300 new lines of code here; is it possible to break this up into multiple PRs? Second, can you give some insight into the approach you've taken?
uint endIdx = bankSize * (bankCnt + 1);
auto allAttrs = cstAttr.getValues<Attribute>();
SmallVector<Attribute, 8> extractedElements;
for (auto idx = beginIdx; idx < endIdx; idx++)
Should be:

for (uint i = 0; i < bankSize; i++) {
  uint idx = bankCnt + availableBanks * i;
  extractedElements.push_back(allAttrs[idx]);
}

instead of the for (auto idx = beginIdx; idx < endIdx; idx++) loop, for round-robin banking.
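To illustrate what the suggestion changes: the original loop gives each bank a contiguous block of elements, while the suggested loop interleaves them round-robin. A tiny self-contained sketch (the constants are hypothetical):

#include <cstdio>

int main() {
  const unsigned availableBanks = 4, bankSize = 2; // 8 elements total
  for (unsigned bankCnt = 0; bankCnt < availableBanks; ++bankCnt) {
    // Block partitioning (original): [bankSize*bankCnt, bankSize*(bankCnt+1)),
    //   so bank 0 holds {0,1}, bank 1 holds {2,3}, ...
    // Round-robin (suggested): element bankCnt + availableBanks * i,
    //   so bank 0 holds {0,4}, bank 1 holds {1,5}, ...
    std::printf("bank %u:", bankCnt);
    for (unsigned i = 0; i < bankSize; ++i)
      std::printf(" %u", bankCnt + availableBanks * i);
    std::printf("\n");
  }
  return 0;
}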
Thanks, @cgyurgyik!
Sure, I'll try!
Sure! My strategy is to unroll the scf.parallel body into one block per combination of induction-variable values. For example, given:
module {
func.func @main() {
%c2 = arith.constant 2 : index
%c1 = arith.constant 1 : index
%c3 = arith.constant 3 : index
%c0 = arith.constant 0 : index
%alloc = memref.alloc() : memref<6xi32>
%alloc_1 = memref.alloc() : memref<6xi32>
scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
%4 = arith.shli %arg3, %c2 : index
%5 = arith.addi %4, %arg2 : index
%6 = memref.load %alloc_1[%5] : memref<6xi32>
%7 = arith.shli %arg2, %c1 : index
%8 = arith.addi %7, %arg3 : index
memref.store %6, %alloc[%8] : memref<6xi32>
scf.reduce
}
return
}
}

it will have 4 blocks in total, corresponding to the induction-variable combinations:

scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
^bb0:
%induction_var1, %induction_var2 = 0, 0
%4 = arith.shli %induction_var2, %c2 : index
%5 = arith.addi %4, %induction_var1 : index
%6 = memref.load %alloc_1[%5] : memref<6xi32>
%7 = arith.shli %induction_var1, %c1 : index
%8 = arith.addi %7, %induction_var2 : index
memref.store %6, %alloc[%8] : memref<6xi32>
scf.reduce
^bb1:
%induction_var1, %induction_var2 = 0, 1
%4 = arith.shli %induction_var2, %c2 : index
%5 = arith.addi %4, %induction_var1 : index
%6 = memref.load %alloc_1[%5] : memref<6xi32>
%7 = arith.shli %induction_var1, %c1 : index
%8 = arith.addi %7, %induction_var2 : index
memref.store %6, %alloc[%8] : memref<6xi32>
^bb2:
similar...
^bb3:
similar...
}
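To make the unrolling concrete, here is a minimal sketch (hypothetical helper code, not the patch's implementation) that enumerates the induction-variable combinations for the bounds above; each combination becomes one block:

#include <cstdio>
#include <utility>
#include <vector>

// Enumerates induction-variable values for
//   scf.parallel (%arg2, %arg3) = (0, 0) to (3, 2) step (2, 1)
// yielding the four blocks ^bb0..^bb3 above.
int main() {
  std::vector<std::pair<int, int>> combos;
  for (int arg2 = 0; arg2 < 3; arg2 += 2)   // lb 0, ub 3, step 2 -> {0, 2}
    for (int arg3 = 0; arg3 < 2; arg3 += 1) // lb 0, ub 2, step 1 -> {0, 1}
      combos.emplace_back(arg2, arg3);
  for (auto [a, b] : combos)
    std::printf("block for (%%arg2, %%arg3) = (%d, %d)\n", a, b);
  return 0;
}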
And for a case where different parallel blocks access the same memory, e.g.:

scf.parallel (%arg2) = (%c0) to (%c2) step (%c1) {
^bb0:
%induction_var = 0
memref.store %1, %alloc[%induction_var] : memref<2xi32>
scf.reduce
^bb1:
%induction_var = 1
memref.store %1, %alloc[%induction_var] : memref<2xi32>
}

it will raise an error, because we can't access the same memory (%alloc here) from different parallel blocks.

For nested scf.parallel ops, the structure would look like:
scf.parallel0 {
^bb0_0:
scf.parallel0_0 {
^bb0_0_0:
op1;
op2;
^bb0_0_1:
op3;
op4;
}
^bb0_1:
scf.parallel0_1 {
^bb0_1_0:
op5;
op6;
^bb0_1_1:
op7;
op8;
}
op9
}
Please let me know if you have any questions and/or suggestions, thanks!
Very cool! Thanks for the detailed explanation, it definitely clears things up. It would be great if you could split this up into several PRs, e.g., leave the nested scf.parallel support for a follow-up PR.
This patch lowers the scf.parallel op to Calyx by invoking multiple calyx.components in parallel. I'm not pushing test files for now because some required features are not upstream yet. But I can give an example; say we have:
The output is:
The design choice is worth discussing together:
Re. why I created multiple FuncOps for the body of scf.parallel: to invoke something in Calyx in parallel using par, there are two candidates:

calyx.group
calyx.component

The former is not expressive enough to hold the body of scf.parallel; the latter is, and since we know FuncOp will eventually be lowered to calyx.component, we can just invoke the components in parallel in the control section by merely changing the operands passed to them.

Question:
Can we invoke the same calyx.component in parallel? If so, I can create just one @par_func instance instead of multiple ones.