[BACKEND] Add a loop unroller pass #4645

htyu · 2024-09-04T18:25:31Z

This adds a loop unroller pass that applies only to loops carrying an unroll annotation.

An annotated loop will look like:

    scf.for %arg5 = %c0_i32 to %arg3 step %c32_i32 : i32 {
      ...
    } {tt.loop_unroll_factor = 2 : i32}
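
As a rough illustration of what the annotation asks for, the sketch below models unrolling by a factor when the upper bound is only known at run time, as with `%arg3` above: a main loop whose step is multiplied by the factor, followed by a remainder loop for any leftover iterations. This is plain Python, not the pass itself. The function names are hypothetical, a positive step is assumed, and the bound arithmetic is an assumption about how a generic unroller would split the loop.

```python
def unrolled_bounds(lb, ub, step, factor):
    """Return (main_ub, main_step) for the unrolled main loop.

    Assumes step > 0 and factor >= 1. All names here are illustrative.
    """
    trip_count = max(0, -(-(ub - lb) // step))      # ceil((ub - lb) / step)
    main_trips = trip_count - trip_count % factor   # largest multiple of factor
    return lb + main_trips * step, step * factor


def run_unrolled(lb, ub, step, factor):
    """Collect the induction-variable values in the order they are visited."""
    main_ub, main_step = unrolled_bounds(lb, ub, step, factor)
    visited = []
    # Main loop: each trip executes `factor` copies of the original body.
    for iv in range(lb, main_ub, main_step):
        for copy in range(factor):
            visited.append(iv + copy * step)
    # Remainder loop: leftover iterations when the trip count is not a
    # multiple of the factor.
    for iv in range(main_ub, ub, step):
        visited.append(iv)
    return visited


if __name__ == "__main__":
    # Five original iterations with step 32; a factor of 2 leaves one
    # iteration for the remainder loop.
    print(unrolled_bounds(0, 160, 32, 2))
    print(run_unrolled(0, 160, 32, 2))
```

The unrolled form visits the same induction-variable values, in the same order, as the original loop. Only the loop structure changes.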

htyu · 2024-09-04T18:26:36Z

#4639 is needed to unblock a test failure where the MLIR loop unroller doesn't work on `for` loops with an integer induction variable (IV).

manman-ren

This diff looks good to me!

manman-ren

Do you currently get performance wins from this without needing other changes?

htyu · 2024-09-05T23:35:19Z

> Do you currently get performance wins from this without needing other changes?

Yes, when used with a kernel override, for kernels where register pressure isn't an issue (e.g. persistent kernels). For an end-to-end run we'll need a frontend change.

ThomasRaoux

I'm curious about the use case and if you considered using tl.static_range?

lib/Dialect/Triton/Transforms/LoopUnroll.cpp

third_party/nvidia/backend/compiler.py

test/Triton/loop-unroll.mlir

htyu · 2024-09-06T15:58:51Z

> I'm curious about the use case and if you considered using tl.static_range?

One typical use case is the GEMM persistent kernel: annotating the flattened loop, as #4662 proposes. An unroll factor of 2 on a 128x128x128 tile gives some speedup, since SMEM and register pressure aren't an issue there. The speedup can be higher when combined with our follow-up work on branch removal, i.e. removing the useless branch checks for the prolog and epilog.

ThomasRaoux · 2024-09-06T16:02:02Z

But GEMM kernels usually don't have a static K loop, do they? It doesn't sound like something that can be used in the general case.

ThomasRaoux · 2024-09-06T16:04:43Z

Actually, looking at the code again, I see that it supports the dynamic case. Is that what you do for GEMM? I don't understand how this helps, since the unrolled iterations will have to be guarded by an IF op, right?

htyu · 2024-09-06T16:15:46Z

> actually looking at the code again I see that it supports dynamic case. Is that what you do for GEMM?

Yes, something like

`for _ in tl.range(0, k_tiles * tiles_per_SM, unroll_factor=F):`

where F comes from the autotuner. So far we've been giving it 2.

> I don't understand how this helps since the unrolled iteration will have to be guarded by an IF op right?

So each original iteration comes with two if-checks: one to identify the start of the original conceptual inner K-loop, and one to identify its last iteration. Once unrolled by 2, there would be four if-checks, two of which are unnecessary.

Yes, the unrolled version comes with an epilog remainder loop, which can be unnecessary too. I think one more hint to the loop unroller should take care of it.
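
One way to see why two of the four checks could be dropped is sketched below in plain Python. It assumes `k_tiles` is even, which the thread does not state. Under that assumption the first unrolled copy always lands on an even K index and can never be the last K-step, while the second copy lands on an odd K index and can never be the first. The total trip count is then even as well, so no remainder loop is needed. The function names and event labels are hypothetical.

```python
def flattened_loop(k_tiles, tiles_per_sm):
    """Reference flattened persistent loop: two checks per iteration."""
    events, checks = [], 0
    for it in range(k_tiles * tiles_per_sm):
        ki = it % k_tiles
        checks += 1
        if ki == 0:                       # first K-step of a tile (prolog)
            events.append(("start", it // k_tiles))
        events.append(("mma", it))
        checks += 1
        if ki == k_tiles - 1:             # last K-step of a tile (epilog)
            events.append(("store", it // k_tiles))
    return events, checks


def unrolled_by_two(k_tiles, tiles_per_sm):
    """Unrolled by 2 with the redundant checks removed.

    Assumes k_tiles is even (an illustrative precondition, not from the PR).
    """
    assert k_tiles % 2 == 0
    events, checks = [], 0
    for it in range(0, k_tiles * tiles_per_sm, 2):
        ki = it % k_tiles                 # always even under the assumption
        # Copy 0: may start a tile, but k_tiles - 1 is odd, so it never ends one.
        checks += 1
        if ki == 0:
            events.append(("start", it // k_tiles))
        events.append(("mma", it))
        # Copy 1: ki + 1 is odd, so it never starts a tile but may end one.
        events.append(("mma", it + 1))
        checks += 1
        if ki + 1 == k_tiles - 1:
            events.append(("store", it // k_tiles))
    return events, checks


if __name__ == "__main__":
    ref_events, ref_checks = flattened_loop(4, 3)
    new_events, new_checks = unrolled_by_two(4, 3)
    print(ref_events == new_events, ref_checks, new_checks)
```

Both versions produce the same sequence of events. The pruned version evaluates half as many checks as the reference loop, and as a straightforward unroll by 2 would.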

ThomasRaoux · 2024-09-06T16:32:49Z

interesting, that makes sense

include/triton/Dialect/Triton/Transforms/Passes.td

lib/Dialect/Triton/Transforms/CMakeLists.txt

htyu · 2024-09-09T21:44:54Z

@ThomasRaoux @antiagainst Thanks for reviewing this change! How does the latest version look to you? Please let me know if you have more comments and I'm happy to address them.

ThomasRaoux

Looks fine to me assuming the FE changes are ready as well

htyu · 2024-09-09T23:45:29Z

> Looks fine to me assuming the FE changes are ready as well

#4662 is for the FE side changes BTW.

This change exposes the scf For Loop attribute used in PR #4645 to the frontend. It does this by adding a field to tl.range (the same as `num_stages`), which allows setting loop unrolling factors like so:

```
import triton
import triton.language as tl

@triton.jit
def _kernel(dst, v):
    pid = tl.program_id(axis=0)
    for i in tl.range(0, 10, loop_unroll_factor=2):
        tl.atomic_add(dst + pid, i + pid)
```

Unroll factors of less than 2 do nothing, but a factor of 2 or more results in the loop body being replicated that number of times (similar to a clang `#pragma unroll`).
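
A plain-Python reference for the kernel in the description above can help check that unrolling leaves results unchanged. The sketch below is an emulation under stated assumptions, not Triton code: `emulate_kernel` and `num_programs` are hypothetical names, the atomic add is modeled as an ordinary in-place add, and factors below 2 are treated as a no-op, as the description says.

```python
def emulate_kernel(num_programs, loop_unroll_factor=1):
    """Emulate `_kernel` on the host: dst[pid] += i + pid for i in 0..9."""
    dst = [0] * num_programs
    # Factors below 2 leave the loop as it is.
    factor = loop_unroll_factor if loop_unroll_factor >= 2 else 1
    n = 10
    main_end = n - n % factor             # iterations covered by full groups
    for pid in range(num_programs):
        for i in range(0, main_end, factor):
            for copy in range(factor):    # the replicated loop body
                dst[pid] += (i + copy) + pid
        for i in range(main_end, n):      # leftover iterations, if any
            dst[pid] += i + pid
    return dst


if __name__ == "__main__":
    print(emulate_kernel(4))
    print(emulate_kernel(4, loop_unroll_factor=2))
```

Each program accumulates the sum of `i + pid` over ten iterations, so `dst[pid]` ends at `45 + 10 * pid` whatever the unroll factor.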

htyu requested a review from ThomasRaoux September 4, 2024 18:25

htyu requested a review from ptillet as a code owner September 4, 2024 18:25

htyu requested review from manman-ren and bertmaher September 5, 2024 22:37

manman-ren reviewed Sep 5, 2024

ThomasRaoux reviewed Sep 6, 2024

htyu force-pushed the hoy/loopunroll branch 2 times, most recently from 5ffcf23 to b9591ad Compare September 6, 2024 16:01

antiagainst reviewed Sep 8, 2024

ThomasRaoux reviewed Sep 9, 2024

antiagainst reviewed Sep 9, 2024

htyu added 7 commits September 9, 2024 15:10

loop unroller

6363630

remove the unrollFactor switch.

2791408

Simplify test

04d2c5d

move the unroll pass to the ttir compilation phase.

58ea24f

Address comments.

e9d3e68

Tweak cmakelist to honor alphabetic order

a14a195

More tweaks and adding a new test.

140db9d

htyu force-pushed the hoy/loopunroll branch from d3d7af9 to 140db9d Compare September 9, 2024 22:12

antiagainst approved these changes Sep 9, 2024

ThomasRaoux approved these changes Sep 9, 2024

htyu merged commit 7df871d into triton-lang:main Sep 9, 2024
7 checks passed

htyu mentioned this pull request Sep 9, 2024

[FRONTEND] Adding unroll loops count to tl.range for scf for #4662

Merged

lezcano mentioned this pull request Sep 24, 2024

Implement scaled_dot(mxfp8, fp8) via mma #4795

Merged
