
Unnecessary buffer allocated when padding in tile's row dimension #142


Open

EmilySillars opened this issue Apr 11, 2025 · 6 comments

@EmilySillars

Changes made to ConfigureForSnitch.cpp:
Replaced the [0, 40, 100] tiling configuration with [0, 88, 50]:

        if (funcOp.getName() ==
            "main$async_dispatch_1_matmul_transpose_b_1x1200x400_f64") {
          l1Tiles[0] = 0;
          l1Tiles[1] = 88;
          l1Tiles[2] = 50;
        }

Changes made to LowerL1Allocations.cpp:

  • I created a string stream and wrote debugging statements to it (see the sketch after this list).
  • The string stream only gets printed when the kernel does not fit in L1, i.e. when the condition offset >= l1MemoryBytes is met.
  • I am attaching my modified LowerL1Allocations.cpp as a .txt
    (LowerL1Allocations.txt) in case this makes my modifications clearer.
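
A minimal sketch of the logging approach, assuming LLVM's llvm::raw_string_ostream (the function name and the elided pass code are illustrative, not the actual implementation):

    #include "llvm/Support/raw_ostream.h"
    #include <cstdint>
    #include <string>

    // Sketch: accumulate debug output in a string stream while walking the
    // allocOps, and only print it when the kernel does not fit into L1.
    void debugL1Allocations() {
      std::string log;
      llvm::raw_string_ostream debug(log);

      int64_t offset = 0;
      const int64_t l1MemoryBytes = 100000;

      // ... for each allocOp in the dispatch function:
      //   debug << "allocOp with memref shape " << rows << " " << cols << "\n";
      //   offset = alignedOffset + allocSizeInBytes;

      // Flush the accumulated log to stderr only on failure to fit.
      if (offset >= l1MemoryBytes)
        llvm::errs() << debug.str();
    }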

The kernel does not fit in L1:

    <eval_with_key>.0 from /home/hoppip/Quidditch/venv/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py:551 in wrapped:19:0: warning: Let's look at ALL the allocOps before doing ANYTHING!
    allocOp with memref shape 1 1200
    allocOp with memref shape 1 1232
    allocOp with memref shape 1 50
    allocOp with memref shape 88 50
    allocOp with memref shape 88 50
    allocOp with memref shape 1 1200
    allocOp with memref shape 1 1200
    Well, those were all the allocOps... =_=

    allocOp with memref shape 1 1200
    memref size is 8
    allocElements is 1200
    NOW memref size is 9600
    offset is 9600

    allocOp with memref shape 1 1232
    memref size is 8
    allocElements is 1232
    NOW memref size is 9856
    offset is 19456

    allocOp with memref shape 1 50
    memref size is 8
    allocElements is 50
    NOW memref size is 400
    offset is 19856

    allocOp with memref shape 88 50
    memref size is 8
    allocElements is 4400
    NOW memref size is 35200
    offset is 55104

    allocOp with memref shape 88 50
    memref size is 8
    allocElements is 4400
    NOW memref size is 35200
    offset is 90304

    allocOp with memref shape 1 1200
    memref size is 8
    allocElements is 1200
    NOW memref size is 9600
    offset is 99904

    allocOp with memref shape 1 1200
    memref size is 8
    allocElements is 1200
    NOW memref size is 9600
    offset is 109504

    allocElements is 1200
    memref size is 9600
    offset is 109504
    l1MemoryBytes is 100000, so 9504 too much
    kernel does not fit into L1 memory and cannot be compiled
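
The accounting in this log can be reproduced directly: each buffer occupies elementSize × numElements bytes (8 bytes per f64 element), and the running offset appears to be rounded up to a 64-byte boundary before each allocation (the 64-byte alignment is inferred from the jump from 19856 to 19904 before the first 88x50 buffer, and matches the alignment = 64 attribute on the alloca in the IR below; it is not confirmed from the pass source). A standalone sketch:

    #include <cstdio>

    int main() {
      struct Alloc { long long rows, cols; };
      const Alloc allocs[] = {{1, 1200}, {1, 1232}, {1, 50},   {88, 50},
                              {88, 50},  {1, 1200}, {1, 1200}};
      const long long elementSize = 8;       // f64 = 8 bytes per element
      const long long alignment = 64;        // assumed allocation alignment
      const long long l1MemoryBytes = 100000;

      long long offset = 0;
      for (const Alloc &a : allocs) {
        offset = (offset + alignment - 1) / alignment * alignment; // align up
        offset += a.rows * a.cols * elementSize;
        std::printf("offset is %lld\n", offset);
      }
      // Prints 9600, 19456, 19856, 55104, 90304, 99904, 109504 (matching the
      // log), and 109504 - 100000 = 9504 bytes over the L1 budget.
      if (offset >= l1MemoryBytes)
        std::printf("%lld too much\n", offset - l1MemoryBytes);
      return 0;
    }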

When the kernel does not fit in L1, its IR gets dumped to stderr, and we can see an unnecessary buffer get allocated (lines 99-124 of 0-88-56-build-failure-dispatch-1.mlir):

 // allocate a buffer of 1x1200 elements
  %25 = "arith.constant"() <{value = 0 : index}> : () -> index
  %26 = "memref.view"(%0, %25) : (memref<100000xi8>, index) -> memref<1200xf64>
  %27 = "memref.reinterpret_cast"(%26) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0>, static_sizes = array<i64: 1, 1200>, static_strides = array<i64: 1200, 1>}> : (memref<1200xf64>) -> memref<1x1200xf64>
  %28 = "memref.alloca"() <{alignment = 64 : i64, operandSegmentSizes = array<i32: 0, 0>}> 
  : () -> memref<1x1200xf64, #quidditch_snitch.l1_encoding>

  // set this buffer to all zeroes
  %29 = "quidditch_snitch.compute_core_index"() : () -> index
  %30 = "affine.apply"(%29) <{map = affine_map<()[s0] -> (s0 * 150)>}> : (index) -> index
  "scf.for"(%30, %1, %1) ({
  ^bb0(%arg25: index):
    %94 = "memref.subview"(%27, %arg25) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: 0, -9223372036854775808>, static_sizes = array<i64: 1, 150>, static_strides = array<i64: 1, 1>}> : (memref<1x1200xf64>, index) -> memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>
    "quidditch_snitch.memref.microkernel"(%94) ({
    ^bb0(%arg26: memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>):
      %95 = "arith.constant"() <{value = 0.000000e+00 : f64}> : () -> f64
      "linalg.fill"(%95, %arg26) <{operandSegmentSizes = array<i32: 1, 1>}> ({
      ^bb0(%arg27: f64, %arg28: f64):
        "linalg.yield"(%arg27) : (f64) -> ()
      }) : (f64, memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
    }) : (memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
    "quidditch_snitch.microkernel_fence"() : () -> ()
    "scf.yield"() : () -> ()
  }) : (index, index, index) -> ()

  // after this point, the 1x1200 buffer allocated to %28 or equivalently %26 never gets used again!

The buffer assigned to %28 and %26 gets set to zero, and then is never used again.
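
For the stray %28 specifically, which has no uses at all in the dump, a generic dead-allocation rewrite would be enough; a minimal sketch against upstream MLIR (the harder part of this issue is the 1x1200 view buffer, which is used, but only by the zero-fill):

    #include "mlir/Dialect/MemRef/IR/MemRef.h"
    #include "mlir/IR/PatternMatch.h"

    using namespace mlir;

    // Erase a memref.alloca whose result is never used.
    struct EraseDeadAlloca : OpRewritePattern<memref::AllocaOp> {
      using OpRewritePattern::OpRewritePattern;

      LogicalResult matchAndRewrite(memref::AllocaOp op,
                                    PatternRewriter &rewriter) const override {
        if (!op->use_empty())
          return failure();
        rewriter.eraseOp(op);
        return success();
      }
    };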

@superlopuh
Contributor

%26 seems to get used, as it's cast to %27, which is used to create %94, which is used by the microkernel. You're right that the alloca is weird; have you looked into where it's created and which transformation removes its use? I wonder whether it's before or after bufferization.

@EmilySillars
Author

EmilySillars commented Apr 15, 2025

Yes, %26 gets cast to %27, which gets used by the microkernel, but that microkernel only sets the buffer to zero.
The matrix-vector transpose followed by an elementwise addition (the actual meaningful computation from dispatch 1) never uses this zeroed-out buffer.
Markus thought it was happening at the tensor level. I created a partial fix at the tensor level, but it only works for dispatch 1.

@superlopuh
Contributor

Ah I see, that definitely seems like an error happening earlier in the pipeline. Did you try reproducing this behaviour, maybe with a smaller program than NSNet?

@EmilySillars
Author

EmilySillars commented Apr 16, 2025

  • I have not reproduced this on a smaller program, because I struggled to get a single matrix multiplication running on its own with Quidditch. This is documented in RESOURCE_EXHAUSTED error when running linalg.matmul #139.
  • I do know that this problem occurs when tiling other NSNet kernels, not just dispatch 1.
  • Markus thinks it has to do with the padding of the tensors, because a tensor.pad always results in a new allocOp getting generated later. We talked about this on Zulip [here](#Convolve > xDSL Interpreter Cost Model @ 💬).
  • I also think this problem can be solved by changing how Quidditch does padding - or at least, after the padding step, we can go back with a "clean-up" rewrite pattern to get rid of the unnecessary tensor.pad ops created (a simplified sketch of such a rewrite follows this list).
  • My incomplete fix adds a rewrite pattern to Quidditch's padding pass PadToTilingConfig.cpp here. The code is messy and the rule needs to be generalized to more cases, but there is a helpful example of the kind of rewrite that needs to take place on lines 722-802.
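
A hypothetical, simplified sketch of that kind of clean-up rewrite, written against upstream MLIR's tensor dialect (the pattern name and the exact conditions are illustrative; the real fix in PadToTilingConfig.cpp is different and more general): if a tensor.pad's result is only ever read back through tensor.extract_slice ops that recover exactly the unpadded source, the padding is unobservable and the reads can be rewired to the source.

    #include "mlir/Dialect/Tensor/IR/Tensor.h"
    #include "mlir/IR/PatternMatch.h"
    #include "llvm/ADT/STLExtras.h"

    using namespace mlir;

    struct FoldRedundantPad : OpRewritePattern<tensor::PadOp> {
      using OpRewritePattern::OpRewritePattern;

      LogicalResult matchAndRewrite(tensor::PadOp padOp,
                                    PatternRewriter &rewriter) const override {
        // Only handle pads with static, all-zero low padding, so the source
        // sits at offset zero inside the padded tensor.
        if (!llvm::all_of(padOp.getStaticLow(),
                          [](int64_t v) { return v == 0; }))
          return failure();

        RankedTensorType srcType = padOp.getSourceType();
        if (!srcType.hasStaticShape())
          return failure();

        // Every user must be an extract_slice reading exactly the source
        // shape at offset zero with unit strides; otherwise the padded
        // region is observable and the pad must stay.
        for (Operation *user : padOp->getUsers()) {
          auto slice = dyn_cast<tensor::ExtractSliceOp>(user);
          if (!slice || slice.getType() != srcType ||
              !llvm::all_of(slice.getStaticOffsets(),
                            [](int64_t v) { return v == 0; }) ||
              !llvm::all_of(slice.getStaticStrides(),
                            [](int64_t v) { return v == 1; }))
            return failure();
        }

        // Bypass the pad: rewire all slices to the unpadded source, then
        // erase the now-dead tensor.pad.
        for (Operation *user : llvm::make_early_inc_range(padOp->getUsers()))
          rewriter.replaceOp(user, padOp.getSource());
        rewriter.eraseOp(padOp);
        return success();
      }
    };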

@EmilySillars
Author

  • I created a PR that prints out operations before and after padding, to illustrate where this issue can happen: illustrate unnecessary pad operation #147
  • I will create a cleaner draft PR with my partial fix and add it here soon.

@superlopuh
Contributor

Nice! Thanks for opening the PR; it's nice to see the debugging process, and great to hear you've found a fix.
