[Transform][Vectorization] canonicalize vector with physical vector #100
base: main
Conversation
Example: given a matmul + ReLU function:

```mlir
func.func @fc_relu(%lhs: tensor<512x512xf32>, %rhs: tensor<512x512xf32>,
                   %bias: tensor<512x512xf32>, %output: tensor<512x512xf32>)
                   -> tensor<512x512xf32> {
  %matmul = linalg.matmul ins(%lhs, %rhs: tensor<512x512xf32>, tensor<512x512xf32>)
                          outs(%output: tensor<512x512xf32>) -> tensor<512x512xf32>
  // Elementwise addition.
  %biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
    ins(%matmul, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  // Elementwise max with 0 (ReLU).
  %c0f = arith.constant 0.0 : f32
  // expected-remark @below {{elementwise binary}}
  %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
    ins(%biased, %c0f : tensor<512x512xf32>, f32)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  func.return %relued : tensor<512x512xf32>
}
```

After `LowerToTileVector`, the elementwise linalg ops are rewritten as vector ops whose shape matches the whole tensor; the matmul is left in place for a later brgemm-based optimization:

```mlir
// -----// IR Dump After LowerToTileVector (lower-to-tile-vector) //----- //
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<512x512xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  // The matmul op is left for a later brgemm-based optimization.
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = vector.transfer_read %0[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %2 = vector.transfer_read %arg2[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %3 = arith.addf %1, %2 : vector<512x512xf32>
  %4 = arith.maximumf %3, %cst : vector<512x512xf32>
  %5 = vector.transfer_write %4, %arg3[%c0, %c0] {in_bounds = [true, true]} : vector<512x512xf32>, tensor<512x512xf32>
  return %5 : tensor<512x512xf32>
}
```

`CPUPhysicalRegisterPass` then canonicalizes the tile-shaped vectors into physical-register-sized vectors (`vector<16xf32>`) inside an `scf.for` loop nest:

```mlir
// -----// IR Dump After CPUPhysicalRegisterPass (CPU-physical-register-pass) //----- //
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %c16 = arith.constant 16 : index
  %c512 = arith.constant 512 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant dense<0.000000e+00> : vector<16xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  // The matmul op is left for a later brgemm-based optimization.
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = scf.for %arg4 = %c0 to %c512 step %c1 iter_args(%arg5 = %arg3) -> (tensor<512x512xf32>) {
    %2 = scf.for %arg6 = %c0 to %c512 step %c16 iter_args(%arg7 = %arg5) -> (tensor<512x512xf32>) {
      %3 = vector.transfer_read %arg2[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %4 = vector.transfer_read %0[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %5 = arith.addf %4, %3 : vector<16xf32>
      %6 = arith.maximumf %5, %cst : vector<16xf32>
      %7 = vector.transfer_write %6, %arg7[%arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<512x512xf32>
      scf.yield %7 : tensor<512x512xf32>
    }
    scf.yield %2 : tensor<512x512xf32>
  }
  return %1 : tensor<512x512xf32>
}
```
@BRUCE11111, do we need a microkernel definition for this first?
Hi~ Peter! Thanks for the suggestion! What does microkernel mean here? And why do you think we need it?
I think Petr's question comes from your example: to fully handle matmul lowering, we need a microkernel definition to provide the brgemm lowering. Matmul lowering and brgemm lowering are not part of this PR. Please consider providing another example to avoid confusion, such as RMSNorm, or any matmul-free chain like the sketch below.
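A minimal sketch of such a matmul-free input, derived from the example above simply by dropping the matmul (the function and argument names are illustrative and not part of this PR):

```mlir
// Bias add + ReLU only: every op here can be handled by lower-to-tile-vector
// and then canonicalized to physical vectors, with no brgemm/microkernel needed.
func.func @bias_relu(%input: tensor<512x512xf32>, %bias: tensor<512x512xf32>,
                     %output: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %c0f = arith.constant 0.0 : f32
  // Elementwise addition.
  %biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
    ins(%input, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  // Elementwise max with 0 (ReLU).
  %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
    ins(%biased, %c0f : tensor<512x512xf32>, f32)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  func.return %relued : tensor<512x512xf32>
}
```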
Waiting for the community's PR to be merged to fix the remaining CI errors.
Performance data:
Approved. In the future, can we have an iterative process with smaller PRs to review?
Okay~ Thanks~
Tracking issue: #331

Tasks (illustrative snippets of these ops are sketched below):
- `vector.multi_reduction` with graph compiler reduce implementation.
- `vector.transpose` with graph compiler transpose implementation.
- `vector.broadcast`.
- `vector.shapecast` with graph compiler reorder implementation.

Performance data:
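For reference, the vector ops named in the task list above look like this in standard MLIR vector-dialect syntax (a self-contained illustrative snippet; the shapes and function name are arbitrary, and `vector.shapecast` here refers to the upstream `vector.shape_cast` op):

```mlir
func.func @vector_op_examples(%src: vector<4x8xf32>, %acc: vector<4xf32>, %s: f32)
    -> (vector<4xf32>, vector<8x4xf32>, vector<4x8xf32>, vector<32xf32>) {
  // Reduce along dimension 1; target of the graph compiler reduce implementation.
  %red = vector.multi_reduction <add>, %src, %acc [1] : vector<4x8xf32> to vector<4xf32>
  // Permute dimensions; target of the graph compiler transpose implementation.
  %t = vector.transpose %src, [1, 0] : vector<4x8xf32> to vector<8x4xf32>
  // Broadcast a scalar into a 2-D vector.
  %b = vector.broadcast %s : f32 to vector<4x8xf32>
  // Reshape without moving data; target of the graph compiler reorder implementation.
  %sc = vector.shape_cast %src : vector<4x8xf32> to vector<32xf32>
  return %red, %t, %b, %sc : vector<4xf32>, vector<8x4xf32>, vector<4x8xf32>, vector<32xf32>
}
```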