[Backend] Fix predicates for device assert inside reduction/scan region #5033

davidberard98 · 2024-11-01T05:20:48Z

Reductions have special handling for side effectful "combine ops" (e.g. "add" for a sum reduction). In the presence of side effects, a predicate is computed to determine whether a thread should participate in the reduction, to ensure that invalid/uninitialized data is not operated on. See #4811 for more details.

~~Previously, the predicate logic was incorrect for 2D reductions. This PR fixes the logic and adds a python test.~~

Edit: after additional discussion with @peterbell10, we removed the lanePred logic. Here's our thinking on why this is valid:

* lanePred info is computed based entirely on the blocked layout info and properties of the reduction
* the blocked layout won't tell you which threads do or don't have uninitialized data

Instead, it sounds like the motivation for #4811 is based on uninitialized values that can be indicated by the pred variable passed into warpReduce().
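To make the role of `pred` concrete, here is a minimal Python simulation of a butterfly (XOR-shuffle) reduction in which each thread runs the side-effectful combine op only when its own predicate is set. This is a sketch under assumptions, not Triton's implementation: `UNINIT`, `combine`, and the Python `warp_reduce` below are hypothetical stand-ins, and the eight-thread setup is made up for illustration.

```python
UNINIT = -1  # hypothetical sentinel standing in for uninitialized register contents


def combine(a, b):
    # Side-effectful combine op: an assert (like a device assert) followed by the add.
    assert a >= 0 and b >= 0, "combine op saw uninitialized data"
    return a + b


def warp_reduce(values, pred, num_lanes_to_reduce):
    """Simulate an XOR-shuffle reduction; thread i only combines when pred[i] is set."""
    vals = list(values)
    offset = num_lanes_to_reduce // 2
    while offset > 0:
        # Every thread reads its partner's value, as a shuffle-xor would.
        shuffled = [vals[tid ^ offset] for tid in range(len(vals))]
        # Threads with a false predicate skip the combine op and keep their value.
        vals = [combine(v, s) if p else v for v, s, p in zip(vals, shuffled, pred)]
        offset //= 2
    return vals


# Threads 0-3 hold valid partial results; threads 4-7 hold garbage.
values = [1, 2, 3, 4, UNINIT, UNINIT, UNINIT, UNINIT]
pred = [v != UNINIT for v in values]
result = warp_reduce(values, pred, num_lanes_to_reduce=4)
print(result[:4])
```

With every predicate forced to true, the same call trips the assert on threads 4-7, which is the failure mode the predicate is meant to prevent. Note that the predicate here comes from knowing which threads hold valid data, not from the layout.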

davidberard98 · 2024-11-01T05:22:26Z

python/test/unit/language/test_core.py

@@ -5917,7 +5917,7 @@ def test_side_effectful_reduction(device):
    if device != "cuda":
        pytest.skip()

-    @triton.jit(debug=True)
+    @triton.jit


I think debug=True needs to be passed as a kwarg when invoking the triton kernel. Previously I wasn't seeing any asserts in the ttgir.

Thanks for spotting this, I've opened #5037 to fix it

davidberard98 · 2024-11-01T05:23:56Z

lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp

+    // Predicate to ensure we don't read from invalid memory.
+    //   definitions:
+    //     "Lane": the strip of values that are being reduced along.
+    //   relevant variables:
+    //     interleave: for two consecutive elements in a lane, the difference
+    //       between their thread ids is the interleave.
+    //     numLanesToReduce: how many lanes we're reducing across.
+    //     totalNumLanes: how many lanes exist in total. If the reduction
+    //       skips some threads, totalNumLanes might not equal numLanesToReduce.


@peterbell10 is this accurate? tbh I didn't quite understand what scenario requires a predicate - I verified that this fixes my scenario, but I don't know if it regresses the scenario you were initially targeting.

peterbell10 · 2024-11-01T13:56:14Z

lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp

+      Value laneId =
+          urem(udiv(threadId, i32_val(interleave)), i32_val(totalNumLanes));
      Value lanePred = icmp_slt(laneId, i32_val(numLaneToReduce));


The definition of lane is the position of a thread within its warp, so this is a bit confusing. Would it work to do this?

Suggested change

-      Value laneId =
-          urem(udiv(threadId, i32_val(interleave)), i32_val(totalNumLanes));
-      Value lanePred = icmp_slt(laneId, i32_val(numLaneToReduce));
+      Value laneId = urem(threadId, warpSize);
+      Value lanePred = icmp_slt(laneId, i32_val(totalNumLanes * interleave));

@peterbell10 thanks for the suggestion!

Instead I'm using

Value lanePred = icmp_slt(laneId, i32_val(numLaneToReduce * interleave));

since I presume that the reason for predicating is due to the difference between numLaneToReduce vs. totalNumLanes?
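For comparison, the three predicate variants discussed in this thread can be evaluated side by side in plain Python. This is only an illustrative sketch: the function names are hypothetical, and the parameter values (`interleave=2`, `totalNumLanes=8`, `numLaneToReduce=4`, `warpSize=32`) are assumptions chosen to make the differences visible, not values taken from a real layout.

```python
def pred_original(tid, interleave, total_num_lanes, num_lane_to_reduce):
    # laneId = (threadId / interleave) % totalNumLanes; lanePred = laneId < numLaneToReduce
    lane_id = (tid // interleave) % total_num_lanes
    return lane_id < num_lane_to_reduce


def pred_suggested(tid, interleave, total_num_lanes, warp_size):
    # laneId = threadId % warpSize; lanePred = laneId < totalNumLanes * interleave
    return tid % warp_size < total_num_lanes * interleave


def pred_used(tid, interleave, num_lane_to_reduce, warp_size):
    # laneId = threadId % warpSize; lanePred = laneId < numLaneToReduce * interleave
    return tid % warp_size < num_lane_to_reduce * interleave


# Hypothetical parameters, chosen only for illustration.
INTERLEAVE, TOTAL_NUM_LANES, NUM_LANE_TO_REDUCE, WARP_SIZE = 2, 8, 4, 32
threads = range(WARP_SIZE)

active_original = [t for t in threads
                   if pred_original(t, INTERLEAVE, TOTAL_NUM_LANES, NUM_LANE_TO_REDUCE)]
active_suggested = [t for t in threads
                    if pred_suggested(t, INTERLEAVE, TOTAL_NUM_LANES, WARP_SIZE)]
active_used = [t for t in threads
               if pred_used(t, INTERLEAVE, NUM_LANE_TO_REDUCE, WARP_SIZE)]

print("original: ", active_original)
print("suggested:", active_suggested)
print("used:     ", active_used)
```

Under these assumed values the three variants enable different sets of threads, so they are not interchangeable. All three depend only on layout-derived quantities, which is the point made in the PR description for eventually dropping lanePred.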

davidberard98 · 2024-11-01T17:27:06Z

Note: the existing test_side_effectful_reduction and side_effectful_scan tests are failing after #5035, but somehow the test_side_effectful_reduction_2d test added by this PR is not.

In upstream triton, triton-lang/triton#4589 introduces overflow checks. However, overflow checks likely add some overhead, and have some correctness bugs at the moment (e.g. triton-lang/triton#5033). Let's set `sanitize_overflow=False` but keep `debug=True` so that we can keep using device_assert but without the additional asserts added by `sanitize_overflow`.

Pull Request resolved: #139502
Approved by: https://github.com/bertmaher

davidberard98 marked this pull request as ready for review November 1, 2024 05:24

davidberard98 requested a review from ptillet as a code owner November 1, 2024 05:24

davidberard98 mentioned this pull request Nov 1, 2024

[triton 3.2] inductor cumsum w/ upstream triton fails accuracy and device asserts pytorch/pytorch#139348

Closed

peterbell10 reviewed Nov 1, 2024

davidberard98 force-pushed the fix-reduction-assert branch from fe5c1fc to d430559 Compare November 1, 2024 17:24

davidberard98 force-pushed the fix-reduction-assert branch from eb32075 to 2ea66e9 Compare November 1, 2024 18:31

davidberard98 mentioned this pull request Nov 1, 2024

[inductor] set sanitize_overflow=False for triton kernels pytorch/pytorch#139502

Closed

davidberard98 force-pushed the fix-reduction-assert branch from 2614adc to 2ea66e9 Compare November 1, 2024 22:27

davidberard98 force-pushed the fix-reduction-assert branch from 2ea66e9 to 40986be Compare November 2, 2024 00:31

davidberard98 mentioned this pull request Nov 2, 2024

Reduction w/ assertion is incorrect for 2D reduction + broadcast #5045

Closed

peterbell10 force-pushed the fix-reduction-assert branch from 40986be to 4e5ba83 Compare November 2, 2024 03:22

davidberard98 requested a review from peterbell10 November 4, 2024 15:18

davidberard98 force-pushed the fix-reduction-assert branch from 4e5ba83 to 1f3198c Compare November 4, 2024 17:44

peterbell10 approved these changes Nov 5, 2024

peterbell10 merged commit 732aee7 into triton-lang:main Nov 5, 2024
7 checks passed

peterbell10 mentioned this pull request Nov 5, 2024

[BACKEND] Fix asserts in 2d scan and add assert mode to layout tests #5075

Merged

peterbell10 added a commit that referenced this pull request Nov 5, 2024

[BACKEND] Fix asserts in 2d scan and add assert mode to layout tests (#…

11d85f7

…5075) This is a follow up to #5033 but for scan ops, and also improving the testing as it was clearly insufficient before.

bertmaher pushed a commit that referenced this pull request Nov 5, 2024

[BACKEND] Fix asserts in 2d scan and add assert mode to layout tests (#…

9c689f5

…5075) This is a follow up to #5033 but for scan ops, and also improving the testing as it was clearly insufficient before.

antiagainst added a commit to antiagainst/triton that referenced this pull request Nov 5, 2024

Revert "[Backend] Fix predicates for device assert inside reduction/s…

bc80a92

…can region (triton-lang#5033)" This reverts commit 732aee7.

alexbaden mentioned this pull request Nov 6, 2024

PyTorch UT regression (1/2) intel/intel-xpu-backend-for-triton#2579

Open
