Cherry pick upstream new passes #390

Closed · wants to merge 7 commits

Commits on Nov 9, 2023

  1. upgrade llvm to b1115f8c (NFC) (triton-lang#2403)

    Co-authored-by: Thomas Raoux <[email protected]>
    Co-authored-by: Keren Zhou <[email protected]>
    Co-authored-by: Phil Tillet <[email protected]>
    4 people authored and zhanglx13 committed Nov 9, 2023
    Commit d2f8b50
  2. Commit b3276e7
  3. [OPTIMIZER] Thread local reduction optimization (triton-lang#2542)

    Co-authored-by: Phil Tillet <[email protected]>
    2 people authored and zhanglx13 committed Nov 9, 2023
    Commit ad3b3ff
  4. [BACKEND] Fix multiple bugs in WGMMA (triton-lang#2457)

    Fix dependencies in the wgmma_wait op to prevent the scheduler from
    moving it past the uses of the wgmma accumulator. We need to
    explicitly represent the dependency between the wait and the
    accumulator uses; otherwise LLVM is free to re-order them. This
    allows us to remove a workaround that prevented the re-ordering, and
    we can also remove the wait op added in the loop during pipelining.

    Also fix the descriptor calculation for wgmma: we should calculate
    the same descriptor for the whole warpgroup.
    Added a workaround for a bug that was exposed by the different timing
    these changes introduce: we shouldn't insert operations between the
    loop and async_wait, or we may get race conditions.
    ThomasRaoux authored and zhanglx13 committed Nov 9, 2023
    Commit c070b98
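
    The fix above works by turning an ordering constraint into a data
    dependency. Here is a minimal Python sketch of the idea, as a
    conceptual model only; the names `wgmma_wait`, `consume`, and `acc`
    are illustrative, not the actual op interface:

    ```python
    # Conceptual model: in the real backend the wait lowers to a hardware
    # wait instruction; here it just threads the value through.
    def wgmma_wait(acc):
        # Returning the accumulator gives every later use an explicit
        # def-use edge on the wait, so a scheduler cannot hoist the use
        # above it. A void-returning wait has no such edge.
        return acc

    def consume(acc):
        return sum(acc)

    acc = [1.0, 2.0, 3.0]        # stand-in for the wgmma accumulator
    acc_ready = wgmma_wait(acc)  # uses of acc_ready are ordered after the wait
    print(consume(acc_ready))
    ```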
  5. [BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (triton-lang#2526)
    
    Now that the attribute-related bug is fixed in MLIR, we can use arith
    ops for the cmp and select ops.
    ThomasRaoux authored and zhanglx13 committed Nov 9, 2023
    Commit ac4ee36
  6. [BACKEND] Dedup elementwise in LLVM IR based on constancy (triton-lang#2512)
    
    ### Summary
    
    When Triton GPU IR is lowered into LLVM IR, we can make use of the
    constancy information about the result of the elementwise ops to
    deduplicate otherwise redundant computation. That is the contribution of
    this PR: the constancy is checked and, if possible, some of the values
    in LLVM IR are reused multiple times instead of computing equal values
    separately.
    
    The change is beneficial for PyTorch 2 / TorchInductor-generated
    Triton code, as the leftmost sub-indices extracted from the flat
    index by div / mod operations can be equal, given a sufficiently
    large 2^n factor in the rightmost dimension(s). This makes the
    computation producing those sub-indices redundant. Consequently,
    under the necessary constancy conditions, the redundant indexing
    arithmetic can be deduplicated. We observe up to a 29% decrease in
    the latency of some of our jagged tensor kernels.
    aakhundov authored and zhanglx13 committed Nov 9, 2023
    Commit 6d45d6c
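
    To make the constancy argument concrete, here is a small hypothetical
    illustration in plain Python (not Triton; the constants are made up):

    ```python
    # 8 "lanes" each hold a flat index base + lane. If the rightmost
    # dimension has size 8 and the base is a multiple of 8, the row index
    # flat // 8 is identical across all lanes (constancy == 8), so the
    # division only needs to be emitted once.
    BASE = 64        # assumed multiple of the rightmost dimension size
    RIGHTMOST = 8

    flat = [BASE + lane for lane in range(8)]

    # Naive lowering: one div and one mod per lane.
    rows_naive = [f // RIGHTMOST for f in flat]
    cols = [f % RIGHTMOST for f in flat]

    # Deduplicated lowering: compute the constant row once and reuse it.
    row_shared = flat[0] // RIGHTMOST
    rows_dedup = [row_shared] * len(flat)

    assert rows_naive == rows_dedup  # same result, one division instead of 8
    ```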
  7. [BACKEND] Pipeliner refactoring (triton-lang#2565)

    Refactor the pipeliner pass to make it more generic. The main change
    is that the pipeliner is now broken into two pieces: one that
    calculates a modulo schedule and creates async ops based on the IR,
    and an expander that generates the pipelined IR based on the modulo
    schedule.
    The advantage of separating the two is that it lets us create
    different schedules without having to change the expander, and it
    allows for more complex schedules.
    For now, the schedule generated for the matmul case roughly matches
    the schedule picked by the previous pipeliner, in order to avoid
    changes.

    This also creates a different sequence of insert/extract slices for
    the alloc. We should probably change shared alloc to use memory
    semantics.
    ThomasRaoux authored and zhanglx13 committed Nov 9, 2023
    Commit d1de5aa
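
    A conceptual Python sketch of the two-piece design described above: a
    modulo schedule maps each op to a pipeline stage, and a separate
    expander turns that schedule into a flat prologue / steady-state /
    epilogue stream. All names here are illustrative, not the actual
    pass API:

    ```python
    def expand(ops, schedule, num_stages, num_iters):
        """Expand a modulo schedule into a flat pipelined op stream."""
        stream = []
        # The expander only reads the schedule, so swapping in a different
        # (possibly more complex) schedule changes the pipeline without
        # touching this code.
        for k in range(num_iters + num_stages - 1):
            for op in ops:
                logical = k - schedule[op]  # stage s runs iteration k - s
                if 0 <= logical < num_iters:
                    stream.append(f"{op}@{logical}")
        return stream

    ops = ["async_load", "dot"]
    schedule = {"async_load": 0, "dot": 1}  # the load runs one stage ahead
    print(expand(ops, schedule, num_stages=2, num_iters=4))
    # ['async_load@0', 'async_load@1', 'dot@0', 'async_load@2', 'dot@1',
    #  'async_load@3', 'dot@2', 'dot@3']
    ```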