[Transform][Vectorization] canonicalize vector with physical vector #100

Open · wants to merge 81 commits into base: main
Conversation

@BRUCE11111 (Contributor) commented May 27, 2024

Tracking issue 331

Tasks:

  • Lower linalg named operations and some tensor operations to math/arith.
  • Lower elementwise operations to physical vectors.
  • Fuse elementwise operations.
  • Migrate vector.multi_reduction to the graph compiler reduce implementation.
  • Migrate vector.transpose to the graph compiler transpose implementation.
  • Optimize vector.broadcast.
  • Migrate vector.shapecast to the graph compiler reorder implementation.
  • Fuse the reduce operation with elementwise operations.
  • Fuse the reorder operation.
  • Fuse the transpose operation.
  • Fuse the broadcast operation.

Performance data:

| operation | shape | without this PR | with this PR | improvement | comments |
| --- | --- | --- | --- | --- | --- |
| linalg.transpose | 16x1024xf32 | 0.006 | 0.003 | +50% | Tested on SPR machine-10223, branch yifei/mlp_benching_new |
| linalg.transpose | 1024x1024xf32 | 0.99 | 0.75 | +24% | Tested on SPR machine-10223, branch yifei/mlp_benching_new |
| linalg.broadcast + linalg.add | 1024xf32 broadcast to 1024x1024, then add | 1.13 | 0.9 | +20% | Tested on SPR machine-10223, branch yifei/mlp_benching_new |
| tensor.pack | 1x1024x1024 | 1.91 | 0.6 | +68% | Tested on SPR machine-10223, branch yifei/mlp_benching_new |
| linalg.reduce | 128x1024x1024xf32 → 128xf32 | 5.28 | 2.8 | +46% | Tested on ICX, current branch vs. main |
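
The "performance" column reads as the relative runtime reduction, (before − after) / before, truncated to a whole percent. A quick sketch to sanity-check those figures (the helper name is ours, not from the PR):

```python
# Hypothetical helper reproducing the performance column as the relative
# runtime reduction (before - after) / before, truncated to a whole percent.
def reduction_pct(before, after):
    return int((before - after) / before * 100)

# (name, runtime without this PR, runtime with this PR) from the table above.
rows = [
    ("linalg.transpose 16x1024xf32", 0.006, 0.003),   # +50%
    ("linalg.transpose 1024x1024xf32", 0.99, 0.75),   # +24%
    ("linalg.broadcast + linalg.add", 1.13, 0.9),     # +20%
    ("tensor.pack 1x1024x1024", 1.91, 0.6),           # +68%
    ("linalg.reduce 128x1024x1024xf32", 5.28, 2.8),   # +46%
]
for name, before, after in rows:
    print(f"{name}: +{reduction_pct(before, after)}%")
```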

@BRUCE11111 BRUCE11111 added the WIP work in progress label May 27, 2024
@BRUCE11111 BRUCE11111 changed the title [Transform][Vectorize] lower linalg named op to math/arith and canonicalize vector with physical vector [Transform][Vectorization] lower linalg named op to math/arith and canonicalize vector with physical vector May 27, 2024
@BRUCE11111 (Contributor, Author) commented May 27, 2024

Example:

Given matmul + ReLU:

```mlir
func.func @fc_relu(%lhs: tensor<512x512xf32>, %rhs: tensor<512x512xf32>,
                   %bias: tensor<512x512xf32>, %output: tensor<512x512xf32>)
                   -> tensor<512x512xf32> {
  %matmul = linalg.matmul ins(%lhs, %rhs: tensor<512x512xf32>, tensor<512x512xf32>)
                          outs(%output: tensor<512x512xf32>) -> tensor<512x512xf32>

  // Elementwise addition.
  %biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
    ins(%matmul, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>

  // Elementwise max with 0 (ReLU).
  %c0f = arith.constant 0.0 : f32
  // expected-remark @below {{elementwise binary}}
  %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
    ins(%biased, %c0f : tensor<512x512xf32>, f32)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  func.return %relued : tensor<512x512xf32>
}
```

// -----// IR Dump After LowerToTileVector (lower-to-tile-vector) //----- //

```mlir
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<512x512xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  // the matmul op is left untouched, to be optimized by brgemm lowering
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = vector.transfer_read %0[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %2 = vector.transfer_read %arg2[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %3 = arith.addf %1, %2 : vector<512x512xf32>
  %4 = arith.maximumf %3, %cst : vector<512x512xf32>
  %5 = vector.transfer_write %4, %arg3[%c0, %c0] {in_bounds = [true, true]} : vector<512x512xf32>, tensor<512x512xf32>
  return %5 : tensor<512x512xf32>
}
```
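
After LowerToTileVector, the two transfer_reads, the addf, and the maximumf operate on the whole 512x512 tensor as one virtual "tile" vector. As a rough semantic model of that epilogue (pure Python, names ours, not the pass itself):

```python
# Pure-Python model of the post-matmul epilogue in the tile-vector IR:
# one whole-tensor elementwise add (arith.addf) followed by a max with 0.0
# (arith.maximumf against the zero splat). Illustrative only.
def fc_relu_epilogue(matmul_out, bias):
    return [
        [max(a + b, 0.0) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(matmul_out, bias)
    ]

# Tiny example standing in for the full 512x512 tensors:
print(fc_relu_epilogue([[1.0, -2.0]], [[0.5, -0.5]]))  # [[1.5, 0.0]]
```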

// -----// IR Dump After CPUPhysicalRegisterPass (CPU-physical-register-pass) //----- //

```mlir
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %c16 = arith.constant 16 : index
  %c512 = arith.constant 512 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant dense<0.000000e+00> : vector<16xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  // the matmul op is left untouched, to be optimized by brgemm lowering
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = scf.for %arg4 = %c0 to %c512 step %c1 iter_args(%arg5 = %arg3) -> (tensor<512x512xf32>) {
    %2 = scf.for %arg6 = %c0 to %c512 step %c16 iter_args(%arg7 = %arg5) -> (tensor<512x512xf32>) {
      %3 = vector.transfer_read %arg2[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %4 = vector.transfer_read %0[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %5 = arith.addf %4, %3 : vector<16xf32>
      %6 = arith.maximumf %5, %cst : vector<16xf32>
      %7 = vector.transfer_write %6, %arg7[%arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<512x512xf32>
      scf.yield %7 : tensor<512x512xf32>
    }
    scf.yield %2 : tensor<512x512xf32>
  }
  return %1 : tensor<512x512xf32>
}
```
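
CPUPhysicalRegisterPass canonicalizes the 512x512 virtual vector into physical-width operations: an outer scf.for walks rows one at a time, and an inner scf.for steps through each row in 16-lane chunks, applying the add + ReLU per chunk. A pure-Python sketch of that loop nest (under our naming; 16 lanes of f32 is the physical width assumed here, matching one AVX-512 register):

```python
# Sketch of the loop nest emitted by CPUPhysicalRegisterPass: the outer loop
# walks rows with step 1, the inner loop walks columns with step 16, and the
# body performs add + max-with-zero on one 16-element chunk, i.e. what a
# single vector<16xf32> arith.addf / arith.maximumf pair computes.
VEC = 16  # assumed physical vector width (16 x f32)

def tiled_add_relu(matmul_out, bias, rows, cols):
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):                 # scf.for %arg4 = 0 to rows step 1
        for j in range(0, cols, VEC):     # scf.for %arg6 = 0 to cols step 16
            for k in range(VEC):          # one vector<16xf32> add + max
                s = matmul_out[i][j + k] + bias[i][j + k]   # arith.addf
                result[i][j + k] = max(s, 0.0)              # arith.maximumf
    return result
```

The inner loop assumes cols is a multiple of 16, as in the 512x512 example; a remainder loop would be needed otherwise.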

@kurapov-peter (Contributor) commented:
@BRUCE11111, do we need microkernel definition for this first?

@BRUCE11111 (Contributor, Author) replied:

> @BRUCE11111, do we need microkernel definition for this first?

Hi, Peter! Thanks for the suggestion! What do you mean by "microkernel" here, and why do you think we need it?

@ZhennanQin (Contributor) replied:

> @BRUCE11111, do we need microkernel definition for this first?
>
> Hi, Peter! Thanks for the suggestion! What do you mean by "microkernel" here, and why do you think we need it?

I think Peter's question comes from your example: to fully handle matmul lowering, we need the microkernel definition to provide brgemm lowering.

Matmul and brgemm lowering are not part of this PR. To avoid confusion, please consider providing a different example, such as RMSNorm.

@BRUCE11111 (Contributor, Author) commented:
Waiting for the upstream community's PR to merge, which will fix the remaining CI errors.

@BRUCE11111 (Contributor, Author) commented Sep 20, 2024

Performance data:

| operation | shape | without this PR | with this PR | improvement | comments |
| --- | --- | --- | --- | --- | --- |
| linalg.transpose | 16x1024xf32 | 0.006 | 0.003 | +50% | Tested on SPR machine-10223, branch yifei/mlp_benching_new |
| linalg.transpose | 1024x1024xf32 | 0.99 | 0.75 | +24% | Tested on SPR machine-10223, branch yifei/mlp_benching_new |
| linalg.broadcast + linalg.add | 1024xf32 broadcast to 1024x1024, then add | 1.13 | 0.9 | +20% | Tested on SPR machine-10223, branch yifei/mlp_benching_new |
| tensor.pack | 1x1024x1024 | 1.91 | 0.6 | +68% | Tested on SPR machine-10223, branch yifei/mlp_benching_new |
| linalg.reduce | 128x1024x1024xf32 → 128xf32 | 5.28 | 2.8 | +46% | Tested on ICX, current branch vs. main |

@BRUCE11111 BRUCE11111 added this to the 0.1 CPU - General milestone Sep 24, 2024
@BRUCE11111 BRUCE11111 mentioned this pull request Sep 27, 2024
@lmontigny left a comment:
Approved. Can we have an iterative process with smaller PRs to review in the future?

@BRUCE11111 (Contributor, Author) replied:

> Approved. Can we have an iterative process with smaller PRs to review in the future?

Okay~ Thanks~
