
Do not approximate erf on rocm. #19969

Open
bjacob wants to merge 3 commits into main from no-approx-erf-on-rocm

Conversation

@bjacob (Contributor) commented Feb 12, 2025

On ROCm, we want to use the device library functions, which we link as bitcode and inline. In this PR, we start with math.erf because that's the immediate use case, but this will likely be generalized to other functions in a subsequent PR.
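For illustration, here is a minimal sketch of the kind of gating this PR introduces. The helper name populateMathApproximations and the isROCmTarget flag are assumptions for illustration, not IREE's actual API; populatePolynomialApproximateErfPattern is the upstream MLIR erf polynomial expansion.

    #include "mlir/Dialect/Math/Transforms/Passes.h"
    #include "mlir/IR/PatternMatch.h"

    using namespace mlir;

    // Hypothetical helper: skip the polynomial approximation of math.erf
    // when targeting ROCm, so the op survives to the LLVM conversion and
    // lowers to the __ocml device library call instead.
    static void populateMathApproximations(RewritePatternSet &patterns,
                                           bool isROCmTarget) {
      if (!isROCmTarget)
        populatePolynomialApproximateErfPattern(patterns);
      // ... other approximation patterns are unaffected by this PR ...
    }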

@bjacob marked this pull request as ready for review on February 12, 2025
@bjacob requested a review from hanhanW as a code owner on February 12, 2025
@MaheshRavishankar (Contributor) left a comment

Stamping, but please address comments

@bjacob (Contributor, Author) commented Feb 14, 2025

@MaheshRavishankar, the PkgCI failure on SDXL is minimized to the following testcase. Summary: an unrealized_conversion_cast around math.erf when math.erf is the consumer of a matmul.

Testcase:

func.func @testcase(
 %10 : tensor<64x64xf32>, %11 : tensor<64x64xf32>
) -> tensor<64x64xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  %12 = tensor.empty() : tensor<64x64xf32>
  %16 = linalg.fill ins(%cst : f32) outs(%12 : tensor<64x64xf32>) -> tensor<64x64xf32>
  %17 = linalg.matmul ins(%10, %11 : tensor<64x64xf32>, tensor<64x64xf32>) outs(%16 : tensor<64x64xf32>) -> tensor<64x64xf32>
  %18 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%17 : tensor<64x64xf32>) outs(%12 : tensor<64x64xf32>) {
  ^bb0(%in: f32, %out: f32):
    %23 = math.erf %in : f32
    linalg.yield %23 : f32
  } -> tensor<64x64xf32>
  return %18 : tensor<64x64xf32>
}

Compile (with this PR applied, so that targeting ROCm preserves math.erf instead of approximating it):

 tools/iree-compile -o ~/a.vmfb ~/testcase.mlir --iree-hal-target-backends=rocm --iree-hip-target=gfx942

Result:

Failure after ConvertToLLVM, with IR like:

      %436 = "math.erf"(%83) <{fastmath = #arith.fastmath<none>}> : (vector<4x1x1x1x4x1xf32>) -> vector<4x1x1x1x4x1xf32>
      %437 = "builtin.unrealized_conversion_cast"(%436) : (vector<4x1x1x1x4x1xf32>) -> !llvm.array<4 x array<1 x array<1 x array<1 x array<4 x vector<1xf32>>>>>>

If I drop the linalg.matmul from the testcase then it succeeds.

This sounds like it is complaining that the math dialect is illegal after ConvertToLLVM, but then why does it work when the matmul isn't there? Is it something that the specific codegen pipeline, chosen depending on the root op, does differently?

@MaheshRavishankar (Contributor) commented

Interesting. I think it just worked by chance earlier. I haven't looked deeper, but it might be that the lowering to LLVM for math.erf only supports 1-D vector types. Earlier, the math.erf operations were:

  1. Not fused with GEMMs.
  2. Compiled with a vector size of 1.

(2) is being fixed by #19987, so chances are you would have hit this issue after that PR anyway.

@bjacob (Contributor, Author) commented Feb 14, 2025

@MaheshRavishankar, unit dims strike again.

Without the linalg.matmul, before the ConvertToROCDL pass, the IR is:

    %9 = math.erf %8 : vector<4xf32>

And ConvertToROCDL (which includes ConvertToLLVM) gives:

    %41 = llvm.mlir.constant(0 : i64) : i64
    %42 = llvm.extractelement %39[%41 : i64] : vector<4xf32>
    %43 = llvm.call @__ocml_erf_f32(%42) : (f32) -> f32
    %44 = llvm.insertelement %43, %40[%41 : i64] : vector<4xf32>
    %45 = llvm.mlir.constant(1 : i64) : i64
    %46 = llvm.extractelement %39[%45 : i64] : vector<4xf32>
    %47 = llvm.call @__ocml_erf_f32(%46) : (f32) -> f32
    %48 = llvm.insertelement %47, %44[%45 : i64] : vector<4xf32>
    %49 = llvm.mlir.constant(2 : i64) : i64
    %50 = llvm.extractelement %39[%49 : i64] : vector<4xf32>
    %51 = llvm.call @__ocml_erf_f32(%50) : (f32) -> f32
    %52 = llvm.insertelement %51, %48[%49 : i64] : vector<4xf32>
    %53 = llvm.mlir.constant(3 : i64) : i64
    %54 = llvm.extractelement %39[%53 : i64] : vector<4xf32>
    %55 = llvm.call @__ocml_erf_f32(%54) : (f32) -> f32

With the linalg.matmul, the IR before ConvertToROCDL is:

    %87 = math.erf %86 : vector<1x1x4x1xf32>

and ConvertToROCDL says "unit dims!!! run for your lives!!"

@bjacob (Contributor, Author) commented Feb 14, 2025

The problem is here:

https://github.com/llvm/llvm-project/blob/1435c8ed95fa10a55c2f924984141e427b89c330/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp#L589-L596

Here llvm::IsaPred<VectorType> returns false because operandType is !llvm.array<1 x array<1 x array<4 x vector<1xf32>>>>.

By contrast, when it works (without the linalg.matmul) it sees operandType as vector<4xf32>.

So it seems that the unit dims are not the immediate problem here (or they might be, indirectly). Rather, the problem is that the vector type vector<1x1x4x1xf32> has already been converted to !llvm.array<1 x array<1 x array<4 x vector<1xf32>>>>.

Might be some misuse of TypeConverter?
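For reference, the guard in question is roughly of this shape (a paraphrase, not the exact upstream code at the link):

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/Support/Casting.h"
    #include "mlir/IR/BuiltinTypes.h"
    #include "mlir/IR/TypeRange.h"

    // Paraphrase of the upstream scalarization guard: it only fires when an
    // operand type is still a VectorType. vector<4xf32> passes, but
    // vector<1x1x4x1xf32>, already converted to
    // !llvm.array<1 x array<1 x array<4 x vector<1xf32>>>>, does not.
    static bool hasVectorOperand(mlir::TypeRange operandTypes) {
      return llvm::any_of(operandTypes, llvm::IsaPred<mlir::VectorType>);
    }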

@krzysz00 (Contributor) commented

@bjacob I don't think that's a misuse of the type converter - it's just code that assumes that you'll be using 1D vectors specifically and wasn't written to support multi-dimensional vectors ... probably because no one's complained about that support being missing.

@bjacob (Contributor, Author) commented Feb 18, 2025

@qedawkins @MaheshRavishankar, the new commit 56aa457 fixes it in ConvertToROCDL, as we had discussed, minimally: it just calls populateDropUnitDimWithShapeCastPatterns (see the sketch after the list below). That is enough to fix my immediate issue here, but for future-proofing, do you think I should:

  • Move that elsewhere?
  • Do more than that, e.g. nest the whole DropVectorUnitDims pass or populate the same set of patterns as it does?
  • Do something similar in ConvertToNVVM or any other pipeline? Note that LLVMCPU/Passes is already nesting the DropVectorUnitDims pass.
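For context, a minimal sketch of what the fix described above does inside ConvertToROCDL. Only the populateDropUnitDimWithShapeCastPatterns call is confirmed by the comment; the surrounding wiring is assumed.

    #include "mlir/Dialect/Vector/Transforms/VectorRewritePatterns.h"
    #include "mlir/IR/PatternMatch.h"
    #include "mlir/Transforms/GreedyPatternRewriteDriver.h"

    using namespace mlir;

    // Before the LLVM conversion, drop unit vector dims via shape casts so
    // that math.erf reaches the conversion as a plain 1-D vector.
    static void dropVectorUnitDims(Operation *funcOp) {
      RewritePatternSet patterns(funcOp->getContext());
      vector::populateDropUnitDimWithShapeCastPatterns(patterns);
      (void)applyPatternsAndFoldGreedily(funcOp, std::move(patterns));
    }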

@bjacob force-pushed the no-approx-erf-on-rocm branch from 4c858bb to 56aa457 on February 18, 2025
@qedawkins (Contributor) commented

Looking at the regression tests, you're getting a lot of failures, which doesn't really surprise me. We probably need a different solution.

> Move that elsewhere?

I would lean towards keeping ConvertToROCDL self-contained and running it as a separate pass, but given the current state of ConvertToROCDL I wouldn't block.

> Do more than that, e.g. nest the whole DropVectorUnitDims pass or populate the same set of patterns as it does?

If this works better (e.g. in benchmarks/tests) then yes, I'd run the whole thing.

> Do something similar in ConvertToNVVM or any other pipeline? Note that LLVMCPU/Passes is already nesting the DropVectorUnitDims pass.

No need to worry about NVVM.

@qedawkins (Contributor) commented

I had a branch that tried to add the unit-dim-dropping patterns to TileAndFuse, but I couldn't get it to work well and dropped it: https://github.com/qedawkins/iree/tree/tile_and_fuse_improvements

@bjacob (Contributor, Author) commented Feb 18, 2025

Some of the CI failures were due to an outage. After the CI hosts were rebooted, I retriggered and narrowed it down to real failures on SDXL. Reproduced locally, with this IR:

      %11630 = "builtin.unrealized_conversion_cast"(%11629) : (!llvm.array<4 x vector<4xf32>>) -> vector<4x4xf32>
      %11631 = "math.erf"(%11630) <{fastmath = #arith.fastmath<none>}> : (vector<4x4xf32>) -> vector<4x4xf32>
      %11632 = "builtin.unrealized_conversion_cast"(%11631) : (vector<4x4xf32>) -> !llvm.array<4 x vector<4xf32>>

So now this isn't unit dims anymore. We need a flattening pattern for element-wise ops that flattens all dims, not just unit dims. @qedawkins @MaheshRavishankar
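A minimal sketch of the kind of flattening rewrite being asked for here (the pattern name and its placement are assumptions; this is not the PR's code):

    #include "mlir/Dialect/Math/IR/Math.h"
    #include "mlir/Dialect/Vector/IR/VectorOps.h"
    #include "mlir/IR/PatternMatch.h"

    using namespace mlir;

    // Flatten a multi-dimensional vector math.erf to 1-D via
    // vector.shape_cast so that the existing 1-D scalarization in the GPU
    // lowering can handle it, then cast back to the original shape.
    struct FlattenErfVectorPattern : OpRewritePattern<math::ErfOp> {
      using OpRewritePattern::OpRewritePattern;
      LogicalResult matchAndRewrite(math::ErfOp op,
                                    PatternRewriter &rewriter) const override {
        auto vecTy = dyn_cast<VectorType>(op.getType());
        if (!vecTy || vecTy.getRank() <= 1)
          return failure();
        auto flatTy = VectorType::get({vecTy.getNumElements()},
                                      vecTy.getElementType());
        Location loc = op.getLoc();
        Value flat = rewriter.create<vector::ShapeCastOp>(loc, flatTy,
                                                          op.getOperand());
        Value erf =
            rewriter.create<math::ErfOp>(loc, flat, op.getFastmathAttr());
        rewriter.replaceOpWithNewOp<vector::ShapeCastOp>(op, vecTy, erf);
        return success();
      }
    };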

@qedawkins (Contributor) commented

That is a significantly broader change than initially thought, but it's something we've known about for a long time (handing LLVM big arrays is probably not ideal) and have been kicking down the road. This may be the tipping point, but I'd advocate for finding a temporary way to handle the multi-dimensional cases if we want to land this soon.

@bjacob (Contributor, Author) commented Feb 18, 2025

How about adding a shapecast rewrite in MathTransforms, where we already do math-function-specific rewrites? It won't be as general as it could be (in principle it should apply to other elementwise ops), but by construction it will be exactly as specific as we need to land this in the near term. For example, we can default to doing this only for erf and later generalize it to other math functions.

@MaheshRavishankar (Contributor) commented

> How about adding a shapecast rewrite in MathTransforms, where we already do math-function-specific rewrites? It won't be as general as it could be (in principle it should apply to other elementwise ops), but by construction it will be exactly as specific as we need to land this in the near term. For example, we can default to doing this only for erf and later generalize it to other math functions.

That could work. Could you post a bit more of the IR from the error? I don't think we should even have had a 4x4 vector at this level.
There is also an option 2: a different lowering of math.erf than the current polynomial expansion. In some ways that is closer to what we want than relying on the libdevice implementations. For now, it's just an option to consider.
