August meetup agenda #1
Closed
…riton-lang#1863) This relaxes the restriction in the scan lowering to support layouts where we scan along a dimension that isn't the fastest-moving one. This is done by relaxing how elements are accessed during scanning, allowing them to be strided.
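As an illustrative sketch (plain Python, not the actual lowering code): scanning along a non-fastest-moving axis of a row-major buffer means visiting elements at a fixed stride rather than contiguously.

```python
def strided_cumsum(buf, start, stride, count):
    """In-place inclusive scan (sum) over buf[start], buf[start+stride], ...

    Illustrates scanning along a dimension whose elements are strided
    in memory rather than contiguous.
    """
    acc = 0
    for i in range(count):
        idx = start + i * stride
        acc += buf[idx]
        buf[idx] = acc
    return buf

# A 2x3 row-major matrix [[1, 2, 3], [4, 5, 6]] flattened:
buf = [1, 2, 3, 4, 5, 6]
# Scan along axis 0 (the non-fastest-moving axis): stride == 3.
for col in range(3):
    strided_cumsum(buf, start=col, stride=3, count=2)
# buf is now [1, 2, 3, 5, 7, 9]
```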
…1873) We need to convert the whole slice of operations in order to ensure the types are always consistent.
As mentioned in triton-lang#1769, we set the file name and function name to 'unknown' and the line number to 0 if the frame is None.
We need to split the CI into two jobs: nvidia (PR-blocking) and third-party (PR-non-blocking). This way we can guarantee that artifacts are uploaded for any PR that gets merged into `main`, and that the `compare artifacts` job can simply wait on the artifacts-uploading job.
…on-lang#1886) We've already updated the mapping between names and tensors before visiting each compound statement in the while op. As a result, any overwritten name holds an up-to-date value inside the while loop, and any unchanged live-in name holds its original tensor.
…-lang#1889) Transforms, e.g.:

```
x1 = tl.expand_dims(x0, axis=2)
y1 = tl.expand_dims(y0, axis=0)
z = tl.sum(x1 * y1, axis=1)
```

into

```
z = tl.dot(x0, y0)
```

and uses allowTF32 = True by default.
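A quick NumPy sanity check (with hypothetical shapes) of why this rewrite is valid: expanding x0 along a trailing axis and y0 along a leading axis, multiplying, and summing over the shared axis is exactly a matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 5)).astype(np.float32)  # shape (M, K)
y0 = rng.standard_normal((5, 3)).astype(np.float32)  # shape (K, N)

# The expand_dims / multiply / sum pattern:
x1 = x0[:, :, None]            # like tl.expand_dims(x0, axis=2) -> (M, K, 1)
y1 = y0[None, :, :]            # like tl.expand_dims(y0, axis=0) -> (1, K, N)
z_sum = (x1 * y1).sum(axis=1)  # reduce over K -> (M, N)

# ...is equivalent to a dot product:
z_dot = x0 @ y0
assert np.allclose(z_sum, z_dot, atol=1e-4)
```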
Co-authored-by: Philippe Tillet <[email protected]>
Add and test cumprod. This also allows testing a case of accumulation where 0 is not the identity element. Also add documentation for the scan functions.
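A small plain-Python illustration of why a multiplicative scan exercises a different code path than a sum: the identity element is 1 rather than 0, and a single 0 in the input zeroes out the rest of the scan.

```python
from itertools import accumulate
import operator

xs = [2, 3, 0, 4]
cumprod = list(accumulate(xs, operator.mul))
# [2, 6, 0, 0]: once a 0 appears, every later prefix product is 0,
# and the scan's identity element is 1 (not 0, as it is for a sum).
```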
Recent changes made TritonGPU dialect depend on transform utils (`isExpensiveCat()`), and Triton ops depend on TritonGPU dialect (`DotOperandEncodingAttr`). This works fine with CMake but circular dependencies are not ideal and Bazel builds (which we use internally at Google) try hard to prevent them. Would it be acceptable to move the `isExpensiveCat()` function back to TritonGPU dialect (where it was before), and split the TritonGPU attributes into a separate header? This would avoid diverging our internal version or creating over-sized bazel targets to avoid circular dependencies. Co-authored-by: Keren Zhou <[email protected]>
…ng#1897 (triton-lang#1898) Change the logic to be able to re-order more cases. Instead of pulling the convert of operand 0, we sink the convert of operand 1, so that even if operand 0 has some dependency that doesn't dominate operand 1 we can still apply the re-ordering.
…g#1880) `arith::SelectOp` supports a form where the condition argument is a scalar and the result is a tensor. This isn't generated from `tl.where`, but it can still show up from canonicalization of `scf.if`. Currently, if this happens, the conversion to GPU IR fails because `triton_gpu.select` doesn't support this form. For example:

```python
import triton
import triton.language as tl
import torch

@triton.jit
def _triton_test(in_ptr, out_ptr, cond, XBLOCK: tl.constexpr):
    xindex = tl.arange(0, XBLOCK)
    tmp = tl.load(in_ptr + xindex)
    if cond:
        a = tl.zeros_like(tmp)
    else:
        a = tmp
    tl.store(out_ptr + xindex, a)

t = torch.randn(128, device="cuda")
out = torch.empty(128, device="cuda")
_triton_test[(1,)](t, out, True, t.numel())
```

fails with the error

```
error: 'triton_gpu.select' op requires the same shape for all operands and results
```

Co-authored-by: Keren Zhou <[email protected]>
Run most of the pytest suite in parallel. This speeds up CI from 36 min to 10 min on A100, and from 22 min to 6 min on H100. Some tests, such as the runtime tests, still need to run serially.
…n-lang#1907) Co-authored-by: Philippe Tillet <[email protected]>
This un-reverts commit triton-lang@d4c9411.
- Created folder for meetups to store video/meeting notes. - Created file for agenda and notes.
…lang#1902) `export TRITON_DISABLE_LINE_INFO=1` to disable the feature.
…n layout (triton-lang#1913) Fix the calculation of the number of unique threads within a warp. We need to consider the number of elements per thread in the calculation. Also change the layout test to use an integer sum in order to catch bugs with unique data, since a max reduction may hide those kinds of problems.
Calling `tl.full` with an unsigned dtype currently fails with the error: ``` AttributeError("'triton._C.libtriton.triton.ir.builder' object has no attribute 'get_uint8'") ``` This PR defines those functions rather than changing the calls to the signed versions so that we can use an unsigned argument type in C++ and avoid overflow for large uint64 values.
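To see why routing unsigned constants through signed builder functions is a problem, note that large uint64 values exceed the signed 64-bit range. A pure-Python sketch (not the Triton builder API):

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1

def to_signed64(value):
    """Reinterpret a 64-bit pattern as a signed int (two's complement)."""
    value &= UINT64_MAX
    return value - 2**64 if value > INT64_MAX else value

# A large uint64 constant does not fit in a signed 64-bit argument:
assert UINT64_MAX > INT64_MAX
assert to_signed64(UINT64_MAX) == -1  # the bit pattern reads back as -1
```

Defining dedicated unsigned builder entry points avoids this reinterpretation entirely.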
…riton-lang#1811) This adds a pass that tries to reduce the shape of tensor arguments to element-wise operations by moving splat and broadcast operations later in the graph. So, for example, say we have:

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (0))
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
    tmp2 = 0.017453292519943295
    tmp3 = tmp1 * tmp2
    tmp4 = tl.sin(tmp3)
    tl.store(out_ptr0 + (x0), tmp4, None)
```

Today this results in duplicate `sin` calls:

```
%27 = llvm.fmul %26, %3 : f32
%28 = llvm.call @__nv_sinf(%27) : (f32) -> f32
%29 = llvm.call @__nv_sinf(%27) : (f32) -> f32
```

The duplicate `llvm.fmul` calls are eliminated via CSE, but `llvm.call` doesn't get CSE'd because it might be impure. After this change, the sin is computed on a scalar value in the Triton IR and splatted at the very end, so no duplicate calculation happens within a thread.

---------

Co-authored-by: Keren Zhou <[email protected]> Co-authored-by: Philippe Tillet <[email protected]>
Added point for Windows Support.
Also fixes a bug exposed in the convertLayout lowering for float16. We shouldn't be using cvt.pack.sat.u16.s32 to pack 16-bit values, as it needs to take a 32-bit register. This also prevented optimization at the LLVM IR level.
Issue triton-lang#1973 Co-authored-by: Philippe Tillet <[email protected]>
…on-lang#2040) Make sure that other threads within CTA do not operate on mbarrier until it is initialized by thread 0. Co-authored-by: Philippe Tillet <[email protected]>
Use camel case accessors ("getStaticOffsets" etc.) for `ExtractSliceOp`. This change works with and without the changes from D156857. After D156857 has landed, only camel case accessors work for ops that implement the `OffsetSizeAndStrideOpInterface`. https://reviews.llvm.org/D156857 Co-authored-by: Philippe Tillet <[email protected]>
We are interested in having python wheels for triton built for Linux arm64 platforms, such as NVIDIA's Grace CPU. This change is fairly simple, however: - It requires a linux arm64 build of LLVM to be available (see MR here: ptillet/triton-llvm-releases#15) - For now my changes use the LLVM build hosted here: https://github.com/acollins3/triton-llvm-releases/releases/tag/llvm-17.0.0-c5dede880d17 - The Triton release process will need to be updated to include arm64 wheels. Is this something you have time to work on @ptillet? It would be difficult for me to update this part without more access permissions. With these changes, I managed to build a set of python wheels and have hosted them here for us to use in the meantime: https://github.com/acollins3/triton/releases/tag/triton-2.1.0-arm64
Co-authored-by: Philippe Tillet <[email protected]>
…r than Q's (triton-lang#2033) Implemented this situation with and without a causal mask. My implementation with the causal mask looks like:

```
111000
111100
111110
```

where only the upper-right triangular part is masked. I added `P_SEQ` to denote the extra sequence length for KV. Co-authored-by: Philippe Tillet <[email protected]>
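A sketch of that mask shape in plain Python: when KV is longer than Q, each query row can see a diagonal band shifted right by some offset (how the offset relates to `P_SEQ` depends on the implementation's indexing convention; offset = 2 reproduces the 3-row, 6-column pattern shown above).

```python
def causal_mask(q_len, kv_len, offset):
    """Row i keeps KV positions j with j <= i + offset (1 = keep, 0 = mask)."""
    return [[1 if j <= i + offset else 0 for j in range(kv_len)]
            for i in range(q_len)]

mask = causal_mask(3, 6, offset=2)
# [[1, 1, 1, 0, 0, 0],
#  [1, 1, 1, 1, 0, 0],
#  [1, 1, 1, 1, 1, 0]]
```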
This allows the AOT client to tune the number of stages for the generated kernel. Set the default to 3 to match the Triton compiler.
…in hopper tests (triton-lang#2041) Co-authored-by: goostavz <[email protected]> Co-authored-by: Philippe Tillet <[email protected]> Co-authored-by: ben-zhang-609 <[email protected]>
Co-authored-by: Allen Zhao <[email protected]>
Improve error messaging for block shape and value shape mismatch.
…#2050) Co-authored-by: Philippe Tillet <[email protected]>
Rename "rocm" -> "hip", to comply with other uses in compiler.py.
…riton-lang#2057) Co-authored-by: Biao Wang <[email protected]>
…m. (triton-lang#2068) No functional changes intended, and it might slightly speed up the build. This allows a downstream Bazel build of Triton to avoid building a number of dialects and passes that Triton doesn't need.
`getScratchSizeInBytes` was assuming that the size of every type, in bits, is a multiple of 8. If it was not, it would return 0. This caused a bug for the boolean (i1) type, where the reduction lowering would attempt to use shared memory that was never assigned to the op. Fix this issue by setting the number of bytes per element to `ceil(bits / 8)`.
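The fix amounts to rounding the per-element size up to whole bytes instead of truncating; in Python:

```python
def bytes_per_element(bits):
    """Round a bit width up to whole bytes: ceil(bits / 8).

    With floor division, a 1-bit type (i1) would get 0 bytes of
    scratch space; rounding up gives it the 1 byte it needs.
    """
    return (bits + 7) // 8

assert bytes_per_element(1) == 1   # i1 (boolean)
assert bytes_per_element(8) == 1   # i8
assert bytes_per_element(16) == 2  # f16
```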
libtriton.so is pretty large these days, and hashing it is slow. Switching the hash from md5 to sha1 shaves close to 300 ms off the time for me (as well as being a better hash, for whatever that's worth). As far as I could tell, sha1 is the fastest stable hash in the Python standard library, including alternatives like zlib.crc32.
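A minimal sketch of the kind of change described (the helper name, file path, and chunk size are illustrative, not the actual Triton code):

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Hash a potentially large file (e.g. a shared library) incrementally
    with sha1 instead of md5."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```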
Realised I could do this right after my first PR got merged. This saves another 100 ms.
…ng#2075) Remove unnecessary skips. Decompose the unit tests in persistent-warp-specialized-gemm into vintage and stylish variants.
Agenda and minutes for Triton monthly meetups.
Agenda items and announcements.