[Transform][Vectorization] canonicalize vector with physical vector #100
base: main
Conversation
Example: given a matmul + ReLU function:

```mlir
func.func @fc_relu(%lhs: tensor<512x512xf32>, %rhs: tensor<512x512xf32>,
                   %bias: tensor<512x512xf32>, %output: tensor<512x512xf32>)
                   -> tensor<512x512xf32> {
  %matmul = linalg.matmul ins(%lhs, %rhs: tensor<512x512xf32>, tensor<512x512xf32>)
                          outs(%output: tensor<512x512xf32>) -> tensor<512x512xf32>
  // Elementwise addition.
  %biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
    ins(%matmul, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  // Elementwise max with 0 (ReLU).
  %c0f = arith.constant 0.0 : f32
  // expected-remark @below {{elementwise binary}}
  %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
    ins(%biased, %c0f : tensor<512x512xf32>, f32)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  func.return %relued : tensor<512x512xf32>
}
```

After `LowerToTileVector`, the elementwise linalg ops are rewritten as vector ops whose shape matches the whole tensor; the matmul is left in place for a later brgemm-based optimization:

```mlir
// -----// IR Dump After LowerToTileVector (lower-to-tile-vector) //----- //
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<512x512xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  // The matmul op is left for a later brgemm-based optimization.
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = vector.transfer_read %0[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %2 = vector.transfer_read %arg2[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %3 = arith.addf %1, %2 : vector<512x512xf32>
  %4 = arith.maximumf %3, %cst : vector<512x512xf32>
  %5 = vector.transfer_write %4, %arg3[%c0, %c0] {in_bounds = [true, true]} : vector<512x512xf32>, tensor<512x512xf32>
  return %5 : tensor<512x512xf32>
}
```

`CPUPhysicalRegisterPass` then canonicalizes the tile-shaped vectors into physical-register-sized vectors (`vector<16xf32>`) inside an `scf.for` loop nest:

```mlir
// -----// IR Dump After CPUPhysicalRegisterPass (CPU-physical-register-pass) //----- //
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %c16 = arith.constant 16 : index
  %c512 = arith.constant 512 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant dense<0.000000e+00> : vector<16xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  // The matmul op is left for a later brgemm-based optimization.
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = scf.for %arg4 = %c0 to %c512 step %c1 iter_args(%arg5 = %arg3) -> (tensor<512x512xf32>) {
    %2 = scf.for %arg6 = %c0 to %c512 step %c16 iter_args(%arg7 = %arg5) -> (tensor<512x512xf32>) {
      %3 = vector.transfer_read %arg2[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %4 = vector.transfer_read %0[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %5 = arith.addf %4, %3 : vector<16xf32>
      %6 = arith.maximumf %5, %cst : vector<16xf32>
      %7 = vector.transfer_write %6, %arg7[%arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<512x512xf32>
      scf.yield %7 : tensor<512x512xf32>
    }
    scf.yield %2 : tensor<512x512xf32>
  }
  return %1 : tensor<512x512xf32>
}
```
@BRUCE11111, do we need a microkernel definition for this first?
Hi~ Peter! Thanks for the suggestion! What does microkernel mean here? And why do you think we need it?
I think Petr's question comes from your example: to fully handle matmul lowering, we need a microkernel definition to provide the brgemm lowering. Matmul lowering and brgemm lowering are not part of this PR. Please consider providing another example to avoid confusion, such as RMSNorm, or any matmul-free chain like the sketch below.
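A minimal sketch of such a matmul-free input, derived from the example above simply by dropping the matmul (the function and argument names are illustrative and not part of this PR):

```mlir
// Bias add + ReLU only: every op here can be handled by lower-to-tile-vector
// and then canonicalized to physical vectors, with no brgemm/microkernel needed.
func.func @bias_relu(%input: tensor<512x512xf32>, %bias: tensor<512x512xf32>,
                     %output: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %c0f = arith.constant 0.0 : f32
  // Elementwise addition.
  %biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
    ins(%input, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  // Elementwise max with 0 (ReLU).
  %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
    ins(%biased, %c0f : tensor<512x512xf32>, f32)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  func.return %relued : tensor<512x512xf32>
}
```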
Waiting for the community's PR to be merged to fix the remaining CI errors.
Performance data:
Approved. In the future, can we have an iterative process with smaller PRs to review?
Okay~ Thanks~
Tracking issue: #331

Tasks (illustrative snippets of these ops are sketched below):
- `vector.multi_reduction` with graph compiler reduce implementation.
- `vector.transpose` with graph compiler transpose implementation.
- `vector.broadcast`.
- `vector.shapecast` with graph compiler reorder implementation.

Performance data:
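For reference, the vector ops named in the task list above look like this in standard MLIR vector-dialect syntax (a self-contained illustrative snippet; the shapes and function name are arbitrary, and `vector.shapecast` here refers to the upstream `vector.shape_cast` op):

```mlir
func.func @vector_op_examples(%src: vector<4x8xf32>, %acc: vector<4xf32>, %s: f32)
    -> (vector<4xf32>, vector<8x4xf32>, vector<4x8xf32>, vector<32xf32>) {
  // Reduce along dimension 1; target of the graph compiler reduce implementation.
  %red = vector.multi_reduction <add>, %src, %acc [1] : vector<4x8xf32> to vector<4xf32>
  // Permute dimensions; target of the graph compiler transpose implementation.
  %t = vector.transpose %src, [1, 0] : vector<4x8xf32> to vector<8x4xf32>
  // Broadcast a scalar into a 2-D vector.
  %b = vector.broadcast %s : f32 to vector<4x8xf32>
  // Reshape without moving data; target of the graph compiler reorder implementation.
  %sc = vector.shape_cast %src : vector<4x8xf32> to vector<32xf32>
  return %red, %t, %b, %sc : vector<4xf32>, vector<8x4xf32>, vector<4x8xf32>, vector<32xf32>
}
```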