[BACKEND] Remove special handling for bf16 in fp->int, int->fp handling (#4281)

This PR removes some special handling for int->bf16 and bf16->int
conversions in the TritonNVIDIAGPU->LLVM lowerings, in order to support,
e.g. `cvt.bf16.s32` and `cvt.s32.bf16` instructions that are now
available on Hopper.

Before this PR, there was some special handling for conversions to and
from bf16: for int->bf16, the conversion would be done as an int->fp32
conversion followed by fp32->bf16. Presumably, this was done because, before
sm90, the PTX `cvt` instruction doesn't support conversions to/from bf16.
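As an illustrative sketch (not the actual lowering code — the helper names here are hypothetical), the old two-step int->bf16 path can be modeled in Python, with a bf16 value represented as the top 16 bits of the corresponding fp32 bit pattern and round-to-nearest-even (RTNE) rounding:

```python
import struct

def fp32_to_bf16_rtne(x: float) -> int:
    """Round a value to a bf16 bit pattern with round-to-nearest-even.

    struct.pack("<f", x) first rounds the Python double to fp32, mirroring
    the int->fp32 step of the old lowering for int32 inputs.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # RTNE bias: round up when the dropped low 16 bits exceed half, or are
    # exactly half with an odd kept LSB.
    lsb = (bits >> 16) & 1
    bits = (bits + 0x7FFF + lsb) & 0xFFFFFFFF
    return bits >> 16

def int32_to_bf16_two_step(i: int) -> int:
    # Old lowering: int -> fp32 (sitofp), then fp32 -> bf16 (RTNE).
    return fp32_to_bf16_rtne(float(i))
```

For example, `int32_to_bf16_two_step(3)` yields `0x4040`, the bf16 pattern for 3.0.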

However, sm90 _does_ support direct conversions to/from bf16, so this PR
removes the special handling in order to make use of the direct cvt
instructions. For Ampere, it looks like the special handling is no
longer needed and LLVM handles the details of the different hardware
implementations (perhaps thanks to
llvm/llvm-project#74827?)
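The reverse direction is easier to reason about: widening bf16 to fp32 is exact (every bf16 value is representable in fp32), so the removed bf16->fp32->int detour and a direct truncating bf16->int conversion should agree on all in-range values. A small Python sketch of the old path (helper names are hypothetical, not Triton's):

```python
import struct

def bf16_to_fp32(b: int) -> float:
    # Widening bf16 -> fp32 is exact: place the 16 bits in the high half
    # of an fp32 bit pattern and reinterpret.
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def bf16_to_int32_two_step(b: int) -> int:
    # Old lowering: bf16 -> fp32 (exact), then fptosi (truncate toward zero).
    return int(bf16_to_fp32(b))
```

For example, `bf16_to_int32_two_step(0x4040)` (the bf16 pattern for 3.0) returns 3.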

The core Triton is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with
`[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [x] The `lit` tests I have added follow these [best
    practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)
davidberard98 authored Jul 10, 2024
1 parent b674269 commit 0ac0d2a
Showing 2 changed files with 25 additions and 11 deletions.
24 changes: 24 additions & 0 deletions test/Conversion/tritongpu_to_llvm.mlir
@@ -1643,3 +1643,27 @@ module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 :
     tt.return
   }
 }
+
+// -----
+
+#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0], CTAsPerCGA = [1], CTASplitNum = [1], CTAOrder = [0]}>
+module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {
+  tt.func @int32_to_bf16(%arg0: tensor<256xi32, #blocked>) attributes {noinline = false} {
+    // CHECK-LABEL: @int32_to_bf16
+    // CHECK: llvm.sitofp %{{.*}} : i32 to bf16
+    %a = arith.sitofp %arg0 : tensor<256xi32, #blocked> to tensor<256xbf16, #blocked>
+    tt.return
+  }
+}
+
+// -----
+
+#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0], CTAsPerCGA = [1], CTASplitNum = [1], CTAOrder = [0]}>
+module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {
+  tt.func @bf16_to_int32(%arg0: tensor<256xbf16, #blocked>) attributes {noinline = false} {
+    // CHECK-LABEL: @bf16_to_int32
+    // CHECK: llvm.fptosi %{{.*}} : bf16 to i32
+    %a = arith.fptosi %arg0 : tensor<256xbf16, #blocked> to tensor<256xi32, #blocked>
+    tt.return
+  }
+}
@@ -664,10 +664,6 @@ struct SIToFPOpConversion
       auto outVals = cvtFunc(loc, rewriter, inVals);
       assert(outVals.size() == 4);
       return outVals;
-    } else if (outElemTy.isBF16()) {
-      auto value = rewriter.create<LLVM::SIToFPOp>(loc, f32_ty, operands[0][0]);
-      return {FpToFpOpConversion::convertFp32ToBf16(loc, rewriter, value,
-                                                    RoundingMode::RTNE)};
     } else {
       return {rewriter.create<LLVM::SIToFPOp>(loc, elemTy, operands[0][0])};
     }
@@ -685,13 +681,7 @@ struct FPToSIOpConversion
                                   Type elemTy, MultipleOperandsRange operands,
                                   Location loc) const {
     auto inElemTy = getElementType(op.getIn());
-    if (inElemTy.isBF16()) {
-      auto value =
-          FpToFpOpConversion::convertBf16ToFp32(loc, rewriter, operands[0][0]);
-      return {rewriter.create<LLVM::FPToSIOp>(loc, elemTy, value)};
-    } else {
-      return {rewriter.create<LLVM::FPToSIOp>(loc, elemTy, operands[0][0])};
-    }
+    return {rewriter.create<LLVM::FPToSIOp>(loc, elemTy, operands[0][0])};
   }
 };

