
Do not approximate erf on rocm. #19969

Open
bjacob wants to merge 3 commits into main from no-approx-erf-on-rocm

Conversation

@bjacob (Contributor) commented Feb 12, 2025

On ROCm, we want to use the device library functions, which we link as bitcode and inline. In this PR, we start with math.erf because that's the immediate use case, but this will likely be generalized to other functions in a subsequent PR.
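For illustration, here is a minimal sketch of the kind of gating this PR introduces. The helper name populateMathApproximations and the isROCmTarget flag are assumptions for illustration, not IREE's actual API; populatePolynomialApproximateErfPattern is the upstream MLIR erf polynomial expansion.

    #include "mlir/Dialect/Math/Transforms/Passes.h"
    #include "mlir/IR/PatternMatch.h"

    using namespace mlir;

    // Hypothetical helper: skip the polynomial approximation of math.erf
    // when targeting ROCm, so the op survives to the LLVM conversion and
    // lowers to the __ocml device library call instead.
    static void populateMathApproximations(RewritePatternSet &patterns,
                                           bool isROCmTarget) {
      if (!isROCmTarget)
        populatePolynomialApproximateErfPattern(patterns);
      // ... other approximation patterns are unaffected by this PR ...
    }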

@bjacob marked this pull request as ready for review on February 12, 2025
@bjacob requested a review from hanhanW as a code owner on February 12, 2025
@MaheshRavishankar (Contributor) left a comment

Stamping, but please address comments

@bjacob (Contributor, Author) commented Feb 14, 2025

@MaheshRavishankar, the PkgCI failure on SDXL is minimized to the following testcase. Summary: an unrealized_conversion_cast around math.erf when math.erf is the consumer of a matmul.

Testcase:

func.func @testcase(
 %10 : tensor<64x64xf32>, %11 : tensor<64x64xf32>
) -> tensor<64x64xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  %12 = tensor.empty() : tensor<64x64xf32>
  %16 = linalg.fill ins(%cst : f32) outs(%12 : tensor<64x64xf32>) -> tensor<64x64xf32>
  %17 = linalg.matmul ins(%10, %11 : tensor<64x64xf32>, tensor<64x64xf32>) outs(%16 : tensor<64x64xf32>) -> tensor<64x64xf32>
  %18 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%17 : tensor<64x64xf32>) outs(%12 : tensor<64x64xf32>) {
  ^bb0(%in: f32, %out: f32):
    %23 = math.erf %in : f32
    linalg.yield %23 : f32
  } -> tensor<64x64xf32>
  return %18 : tensor<64x64xf32>
}

Compile (with this PR applied, so that targeting ROCm preserves math.erf instead of approximating it):

 tools/iree-compile -o ~/a.vmfb ~/testcase.mlir --iree-hal-target-backends=rocm --iree-hip-target=gfx942

Result:

Failure after ConvertToLLVM, with IR like:

      %436 = "math.erf"(%83) <{fastmath = #arith.fastmath<none>}> : (vector<4x1x1x1x4x1xf32>) -> vector<4x1x1x1x4x1xf32>
      %437 = "builtin.unrealized_conversion_cast"(%436) : (vector<4x1x1x1x4x1xf32>) -> !llvm.array<4 x array<1 x array<1 x array<1 x array<4 x vector<1xf32>>>>>>

If I drop the linalg.matmul from the testcase then it succeeds.

This sounds like it is complaining that the math dialect is illegal after ConvertToLLVM, but then why does it work when the matmul isn't there? Is it something that the specific codegen pipeline, chosen depending on the root op, does differently?

@MaheshRavishankar (Contributor) commented

Interesting. I think it just worked by chance earlier. I haven't looked deeper, but it might be that the lowering to LLVM for math.erf only supports 1-D vector types. Earlier, the math.erf operations were:

  1. Not fused with GEMMs.
  2. Compiled with a vector size of 1.

(2) is being fixed by #19987, so chances are you would have hit this issue after that PR anyway.

@bjacob (Contributor, Author) commented Feb 14, 2025

@MaheshRavishankar, unit dims strike again.

Without the linalg.matmul, before the ConvertToROCDL pass, the IR is:

    %9 = math.erf %8 : vector<4xf32>

And ConvertToROCDL (which includes ConvertToLLVM) gives:

    %41 = llvm.mlir.constant(0 : i64) : i64
    %42 = llvm.extractelement %39[%41 : i64] : vector<4xf32>
    %43 = llvm.call @__ocml_erf_f32(%42) : (f32) -> f32
    %44 = llvm.insertelement %43, %40[%41 : i64] : vector<4xf32>
    %45 = llvm.mlir.constant(1 : i64) : i64
    %46 = llvm.extractelement %39[%45 : i64] : vector<4xf32>
    %47 = llvm.call @__ocml_erf_f32(%46) : (f32) -> f32
    %48 = llvm.insertelement %47, %44[%45 : i64] : vector<4xf32>
    %49 = llvm.mlir.constant(2 : i64) : i64
    %50 = llvm.extractelement %39[%49 : i64] : vector<4xf32>
    %51 = llvm.call @__ocml_erf_f32(%50) : (f32) -> f32
    %52 = llvm.insertelement %51, %48[%49 : i64] : vector<4xf32>
    %53 = llvm.mlir.constant(3 : i64) : i64
    %54 = llvm.extractelement %39[%53 : i64] : vector<4xf32>
    %55 = llvm.call @__ocml_erf_f32(%54) : (f32) -> f32

With the linalg.matmul, the IR before ConvertToROCDL is:

    %87 = math.erf %86 : vector<1x1x4x1xf32>

and ConvertToROCDL says "unit dims!!! run for your lives!!"

@bjacob (Contributor, Author) commented Feb 14, 2025

The problem is here:

https://github.com/llvm/llvm-project/blob/1435c8ed95fa10a55c2f924984141e427b89c330/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp#L589-L596

Here llvm::IsaPred<VectorType> returns false because operandType is !llvm.array<1 x array<1 x array<4 x vector<1xf32>>>>.

By contrast, when it works (without the linalg.matmul) it sees operandType as vector<4xf32>.

So it seems that the unit dims are not the immediate problem here (or they might be, indirectly). Rather, the problem is that the vector type vector<1x1x4x1xf32> has already been converted to !llvm.array<1 x array<1 x array<4 x vector<1xf32>>>>.

Might be some misuse of TypeConverter?
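For reference, the guard in question is roughly of this shape (a paraphrase, not the exact upstream code at the link):

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/Support/Casting.h"
    #include "mlir/IR/BuiltinTypes.h"
    #include "mlir/IR/TypeRange.h"

    // Paraphrase of the upstream scalarization guard: it only fires when an
    // operand type is still a VectorType. vector<4xf32> passes, but
    // vector<1x1x4x1xf32>, already converted to
    // !llvm.array<1 x array<1 x array<4 x vector<1xf32>>>>, does not.
    static bool hasVectorOperand(mlir::TypeRange operandTypes) {
      return llvm::any_of(operandTypes, llvm::IsaPred<mlir::VectorType>);
    }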

@krzysz00 (Contributor) commented

@bjacob I don't think that's a misuse of the type converter - it's just code that assumes that you'll be using 1D vectors specifically and wasn't written to support multi-dimensional vectors ... probably because no one's complained about that support being missing.

@bjacob (Contributor, Author) commented Feb 18, 2025

@qedawkins @MaheshRavishankar, the new commit 56aa457 fixes it in ConvertToROCDL, as we had discussed, minimally: it just calls populateDropUnitDimWithShapeCastPatterns (see the sketch after the list below). That is enough to fix my immediate issue here, but for future-proofing, do you think I should:

  • Move that elsewhere?
  • Do more than that, e.g. nest the whole DropVectorUnitDims pass or populate the same set of patterns as it does?
  • Do something similar in ConvertToNVVM or any other pipeline? Note that LLVMCPU/Passes is already nesting the DropVectorUnitDims pass.
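For context, a minimal sketch of what the fix described above does inside ConvertToROCDL. Only the populateDropUnitDimWithShapeCastPatterns call is confirmed by the comment; the surrounding wiring is assumed.

    #include "mlir/Dialect/Vector/Transforms/VectorRewritePatterns.h"
    #include "mlir/IR/PatternMatch.h"
    #include "mlir/Transforms/GreedyPatternRewriteDriver.h"

    using namespace mlir;

    // Before the LLVM conversion, drop unit vector dims via shape casts so
    // that math.erf reaches the conversion as a plain 1-D vector.
    static void dropVectorUnitDims(Operation *funcOp) {
      RewritePatternSet patterns(funcOp->getContext());
      vector::populateDropUnitDimWithShapeCastPatterns(patterns);
      (void)applyPatternsAndFoldGreedily(funcOp, std::move(patterns));
    }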

@bjacob force-pushed the no-approx-erf-on-rocm branch from 4c858bb to 56aa457 on February 18, 2025
@qedawkins (Contributor) commented

Looking at the regression tests, you're getting a lot of failures, which doesn't really surprise me. We probably need a different solution.

> Move that elsewhere?

I would lean towards keeping ConvertToROCDL self-contained and running it as a separate pass, but given the current state of ConvertToROCDL I wouldn't block.

> Do more than that, e.g. nest the whole DropVectorUnitDims pass or populate the same set of patterns as it does?

If this works better (e.g. in benchmarks/tests) then yes, I'd run the whole thing.

> Do something similar in ConvertToNVVM or any other pipeline? Note that LLVMCPU/Passes is already nesting the DropVectorUnitDims pass.

No need to worry about NVVM.

@qedawkins (Contributor) commented

I had a branch that tried to add the unit-dim-dropping patterns to TileAndFuse, but I couldn't get it to work well and dropped it: https://github.com/qedawkins/iree/tree/tile_and_fuse_improvements

@bjacob (Contributor, Author) commented Feb 18, 2025

Some of the CI failures were due to an outage. After the CI hosts were rebooted, I retriggered and narrowed it down to real failures on SDXL. Reproduced locally, with this IR:

      %11630 = "builtin.unrealized_conversion_cast"(%11629) : (!llvm.array<4 x vector<4xf32>>) -> vector<4x4xf32>
      %11631 = "math.erf"(%11630) <{fastmath = #arith.fastmath<none>}> : (vector<4x4xf32>) -> vector<4x4xf32>
      %11632 = "builtin.unrealized_conversion_cast"(%11631) : (vector<4x4xf32>) -> !llvm.array<4 x vector<4xf32>>

So now this isn't unit dims anymore. We need a flattening pattern for element-wise ops that flattens all dims, not just unit dims. @qedawkins @MaheshRavishankar
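A minimal sketch of the kind of flattening rewrite being asked for here (the pattern name and its placement are assumptions; this is not the PR's code):

    #include "mlir/Dialect/Math/IR/Math.h"
    #include "mlir/Dialect/Vector/IR/VectorOps.h"
    #include "mlir/IR/PatternMatch.h"

    using namespace mlir;

    // Flatten a multi-dimensional vector math.erf to 1-D via
    // vector.shape_cast so that the existing 1-D scalarization in the GPU
    // lowering can handle it, then cast back to the original shape.
    struct FlattenErfVectorPattern : OpRewritePattern<math::ErfOp> {
      using OpRewritePattern::OpRewritePattern;
      LogicalResult matchAndRewrite(math::ErfOp op,
                                    PatternRewriter &rewriter) const override {
        auto vecTy = dyn_cast<VectorType>(op.getType());
        if (!vecTy || vecTy.getRank() <= 1)
          return failure();
        auto flatTy = VectorType::get({vecTy.getNumElements()},
                                      vecTy.getElementType());
        Location loc = op.getLoc();
        Value flat = rewriter.create<vector::ShapeCastOp>(loc, flatTy,
                                                          op.getOperand());
        Value erf =
            rewriter.create<math::ErfOp>(loc, flat, op.getFastmathAttr());
        rewriter.replaceOpWithNewOp<vector::ShapeCastOp>(op, vecTy, erf);
        return success();
      }
    };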

@qedawkins (Contributor) commented

That is a significantly broader change than initially thought, but it's something we've known about for a long time (handing LLVM big arrays is probably not ideal) and have been kicking down the road. This may be the tipping point, but I'd advocate for finding a temporary way to handle the multi-dimensional cases if we want to land this soon.

@bjacob (Contributor, Author) commented Feb 18, 2025

How about adding a shapecast rewrite in MathTransforms, where we already do math-function-specific rewrites? It won't be as general as it could be (in principle it should apply to other elementwise ops), but by construction it will be exactly as specific as we need to land this in the near term. For example, we can default to doing this only for erf and later generalize it to other math functions.

@MaheshRavishankar (Contributor) commented

> How about adding a shapecast rewrite in MathTransforms, where we already do math-function-specific rewrites? It won't be as general as it could be (in principle it should apply to other elementwise ops), but by construction it will be exactly as specific as we need to land this in the near term. For example, we can default to doing this only for erf and later generalize it to other math functions.

That could work. Could you post a bit more of the IR from the error? I don't think we should even have had a 4x4 vector at this level.
There is also an option 2: a different lowering of math.erf than the current polynomial expansion. In some ways that is closer to what we want than relying on the libdevice implementations. For now, it's just an option to consider.
