
Unnecessary buffer allocated when padding in tile's row dimension #142


Open

EmilySillars opened this issue Apr 11, 2025 · 6 comments

@EmilySillars

Changes made to ConfigureForSnitch.cpp:
Replaced the [0, 40, 100] tiling configuration with [0, 88, 50]:

        if (funcOp.getName() ==
            "main$async_dispatch_1_matmul_transpose_b_1x1200x400_f64") {
          l1Tiles[0] = 0;
          l1Tiles[1] = 88;
          l1Tiles[2] = 50;
        }

Changes made to LowerL1Allocations.cpp:

  • I created a string stream and wrote debugging statements to it (see the sketch after this list).
  • The string stream only gets printed when the kernel does not fit in L1, i.e. when the condition offset >= l1MemoryBytes is met.
  • I am attaching my modified LowerL1Allocations.cpp as a .txt
    (LowerL1Allocations.txt) in case this makes my modifications clearer.
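
A minimal sketch of the logging approach, assuming LLVM's llvm::raw_string_ostream (the function name and the elided pass code are illustrative, not the actual implementation):

    #include "llvm/Support/raw_ostream.h"
    #include <cstdint>
    #include <string>

    // Sketch: accumulate debug output in a string stream while walking the
    // allocOps, and only print it when the kernel does not fit into L1.
    void debugL1Allocations() {
      std::string log;
      llvm::raw_string_ostream debug(log);

      int64_t offset = 0;
      const int64_t l1MemoryBytes = 100000;

      // ... for each allocOp in the dispatch function:
      //   debug << "allocOp with memref shape " << rows << " " << cols << "\n";
      //   offset = alignedOffset + allocSizeInBytes;

      // Flush the accumulated log to stderr only on failure to fit.
      if (offset >= l1MemoryBytes)
        llvm::errs() << debug.str();
    }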

The kernel does not fit in L1:

    <eval_with_key>.0 from /home/hoppip/Quidditch/venv/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py:551 in wrapped:19:0: warning: Let's look at ALL the allocOps before doing ANYTHING!
    allocOp with memref shape 1 1200
    allocOp with memref shape 1 1232
    allocOp with memref shape 1 50
    allocOp with memref shape 88 50
    allocOp with memref shape 88 50
    allocOp with memref shape 1 1200
    allocOp with memref shape 1 1200
    Well, those were all the allocOps... =_=

    allocOp with memref shape 1 1200
    memref size is 8
    allocElements is 1200
    NOW memref size is 9600
    offset is 9600

    allocOp with memref shape 1 1232
    memref size is 8
    allocElements is 1232
    NOW memref size is 9856
    offset is 19456

    allocOp with memref shape 1 50
    memref size is 8
    allocElements is 50
    NOW memref size is 400
    offset is 19856

    allocOp with memref shape 88 50
    memref size is 8
    allocElements is 4400
    NOW memref size is 35200
    offset is 55104

    allocOp with memref shape 88 50
    memref size is 8
    allocElements is 4400
    NOW memref size is 35200
    offset is 90304

    allocOp with memref shape 1 1200
    memref size is 8
    allocElements is 1200
    NOW memref size is 9600
    offset is 99904

    allocOp with memref shape 1 1200
    memref size is 8
    allocElements is 1200
    NOW memref size is 9600
    offset is 109504

    allocElements is 1200
    memref size is 9600
    offset is 109504
    l1MemoryBytes is 100000, so 9504 too much
    kernel does not fit into L1 memory and cannot be compiled
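
The accounting in this log can be reproduced directly: each buffer occupies elementSize × numElements bytes (8 bytes per f64 element), and the running offset appears to be rounded up to a 64-byte boundary before each allocation (the 64-byte alignment is inferred from the jump from 19856 to 19904 before the first 88x50 buffer, and matches the alignment = 64 attribute on the alloca in the IR below; it is not confirmed from the pass source). A standalone sketch:

    #include <cstdio>

    int main() {
      struct Alloc { long long rows, cols; };
      const Alloc allocs[] = {{1, 1200}, {1, 1232}, {1, 50},   {88, 50},
                              {88, 50},  {1, 1200}, {1, 1200}};
      const long long elementSize = 8;       // f64 = 8 bytes per element
      const long long alignment = 64;        // assumed allocation alignment
      const long long l1MemoryBytes = 100000;

      long long offset = 0;
      for (const Alloc &a : allocs) {
        offset = (offset + alignment - 1) / alignment * alignment; // align up
        offset += a.rows * a.cols * elementSize;
        std::printf("offset is %lld\n", offset);
      }
      // Prints 9600, 19456, 19856, 55104, 90304, 99904, 109504 (matching the
      // log), and 109504 - 100000 = 9504 bytes over the L1 budget.
      if (offset >= l1MemoryBytes)
        std::printf("%lld too much\n", offset - l1MemoryBytes);
      return 0;
    }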

When the kernel does not fit in L1, its IR gets dumped to stderr, and we can see an unnecessary buffer get allocated (lines 99-124 of 0-88-56-build-failure-dispatch-1.mlir):

 // allocate a buffer of 1x1200 elements
  %25 = "arith.constant"() <{value = 0 : index}> : () -> index
  %26 = "memref.view"(%0, %25) : (memref<100000xi8>, index) -> memref<1200xf64>
  %27 = "memref.reinterpret_cast"(%26) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0>, static_sizes = array<i64: 1, 1200>, static_strides = array<i64: 1200, 1>}> : (memref<1200xf64>) -> memref<1x1200xf64>
  %28 = "memref.alloca"() <{alignment = 64 : i64, operandSegmentSizes = array<i32: 0, 0>}> 
  : () -> memref<1x1200xf64, #quidditch_snitch.l1_encoding>

  // set this buffer to all zeroes
  %29 = "quidditch_snitch.compute_core_index"() : () -> index
  %30 = "affine.apply"(%29) <{map = affine_map<()[s0] -> (s0 * 150)>}> : (index) -> index
  "scf.for"(%30, %1, %1) ({
  ^bb0(%arg25: index):
    %94 = "memref.subview"(%27, %arg25) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: 0, -9223372036854775808>, static_sizes = array<i64: 1, 150>, static_strides = array<i64: 1, 1>}> : (memref<1x1200xf64>, index) -> memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>
    "quidditch_snitch.memref.microkernel"(%94) ({
    ^bb0(%arg26: memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>):
      %95 = "arith.constant"() <{value = 0.000000e+00 : f64}> : () -> f64
      "linalg.fill"(%95, %arg26) <{operandSegmentSizes = array<i32: 1, 1>}> ({
      ^bb0(%arg27: f64, %arg28: f64):
        "linalg.yield"(%arg27) : (f64) -> ()
      }) : (f64, memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
    }) : (memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
    "quidditch_snitch.microkernel_fence"() : () -> ()
    "scf.yield"() : () -> ()
  }) : (index, index, index) -> ()

  // after this point, the 1x1200 buffer allocated to %28 or equivalently %26 never gets used again!

The buffer assigned to %28 and %26 gets set to zero, and then is never used again.
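
For the stray %28 specifically, which has no uses at all in the dump, a generic dead-allocation rewrite would be enough; a minimal sketch against upstream MLIR (the harder part of this issue is the 1x1200 view buffer, which is used, but only by the zero-fill):

    #include "mlir/Dialect/MemRef/IR/MemRef.h"
    #include "mlir/IR/PatternMatch.h"

    using namespace mlir;

    // Erase a memref.alloca whose result is never used.
    struct EraseDeadAlloca : OpRewritePattern<memref::AllocaOp> {
      using OpRewritePattern::OpRewritePattern;

      LogicalResult matchAndRewrite(memref::AllocaOp op,
                                    PatternRewriter &rewriter) const override {
        if (!op->use_empty())
          return failure();
        rewriter.eraseOp(op);
        return success();
      }
    };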

@superlopuh
Contributor

%26 seems to get used, as it's cast to %27, which is used to create %94, which is used by the microkernel. You're right that the alloca is weird; have you looked into where it's created and which transformation removes its use? I wonder whether it's before or after bufferization.

@EmilySillars
Author

EmilySillars commented Apr 15, 2025

Yes, %26 gets cast to %27, which gets used by the microkernel, but that microkernel only sets the buffer to zero.
The matrix-vector transpose followed by an elementwise addition (the actual meaningful computation from dispatch 1) never uses this zeroed-out buffer.
Markus thought it was happening at the tensor level. I created a partial fix at the tensor level, but it only works for dispatch 1.

@superlopuh
Contributor

Ah I see, that definitely seems like an error happening earlier in the pipeline. Did you try reproducing this behaviour, maybe with a smaller program than NSNet?

@EmilySillars
Author

EmilySillars commented Apr 16, 2025

  • I have not reproduced this on a smaller program, because I struggled to get a single matrix multiplication running on its own with Quidditch. This is documented in RESOURCE_EXHAUSTED error when running linalg.matmul #139.
  • I do know that this problem occurs when tiling other NSNet kernels, not just dispatch 1.
  • Markus thinks it has to do with the padding of the tensors, because a tensor.pad always results in a new allocOp getting generated later. We talked about this on Zulip [here](#Convolve > xDSL Interpreter Cost Model @ 💬).
  • I also think this problem can be solved by changing how Quidditch does padding - or at least, after the padding step, we can go back with a "clean-up" rewrite pattern to get rid of the unnecessary tensor.pad ops created (a simplified sketch of such a rewrite follows this list).
  • My incomplete fix adds a rewrite pattern to Quidditch's padding pass PadToTilingConfig.cpp here. The code is messy and the rule needs to be generalized to more cases, but there is a helpful example of the kind of rewrite that needs to take place on lines 722-802.
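
A hypothetical, simplified sketch of that kind of clean-up rewrite, written against upstream MLIR's tensor dialect (the pattern name and the exact conditions are illustrative; the real fix in PadToTilingConfig.cpp is different and more general): if a tensor.pad's result is only ever read back through tensor.extract_slice ops that recover exactly the unpadded source, the padding is unobservable and the reads can be rewired to the source.

    #include "mlir/Dialect/Tensor/IR/Tensor.h"
    #include "mlir/IR/PatternMatch.h"
    #include "llvm/ADT/STLExtras.h"

    using namespace mlir;

    struct FoldRedundantPad : OpRewritePattern<tensor::PadOp> {
      using OpRewritePattern::OpRewritePattern;

      LogicalResult matchAndRewrite(tensor::PadOp padOp,
                                    PatternRewriter &rewriter) const override {
        // Only handle pads with static, all-zero low padding, so the source
        // sits at offset zero inside the padded tensor.
        if (!llvm::all_of(padOp.getStaticLow(),
                          [](int64_t v) { return v == 0; }))
          return failure();

        RankedTensorType srcType = padOp.getSourceType();
        if (!srcType.hasStaticShape())
          return failure();

        // Every user must be an extract_slice reading exactly the source
        // shape at offset zero with unit strides; otherwise the padded
        // region is observable and the pad must stay.
        for (Operation *user : padOp->getUsers()) {
          auto slice = dyn_cast<tensor::ExtractSliceOp>(user);
          if (!slice || slice.getType() != srcType ||
              !llvm::all_of(slice.getStaticOffsets(),
                            [](int64_t v) { return v == 0; }) ||
              !llvm::all_of(slice.getStaticStrides(),
                            [](int64_t v) { return v == 1; }))
            return failure();
        }

        // Bypass the pad: rewire all slices to the unpadded source, then
        // erase the now-dead tensor.pad.
        for (Operation *user : llvm::make_early_inc_range(padOp->getUsers()))
          rewriter.replaceOp(user, padOp.getSource());
        rewriter.eraseOp(padOp);
        return success();
      }
    };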

@EmilySillars
Author

  • I created a PR that prints out operations before and after padding, to illustrate where this issue can happen: illustrate unnecessary pad operation #147
  • I will create a cleaner draft PR with my partial fix and add it here soon.

@superlopuh
Contributor

Nice! Thanks for opening the PR; it's nice to see the debugging process, and great to hear you've found a fix.
