August meetup agenda #1
Closed
…riton-lang#1863) This relaxes the restriction in the scan lowering to support layouts where we scan along a dimension that isn't the fastest-moving one. This is done by relaxing how elements are accessed during scanning, allowing them to be strided.
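As an illustrative sketch (plain Python, not the actual lowering code): scanning along a non-fastest-moving axis of a row-major buffer means visiting elements at a fixed stride rather than contiguously.

```python
def strided_cumsum(buf, start, stride, count):
    """In-place inclusive scan (sum) over buf[start], buf[start+stride], ...

    Illustrates scanning along a dimension whose elements are strided
    in memory rather than contiguous.
    """
    acc = 0
    for i in range(count):
        idx = start + i * stride
        acc += buf[idx]
        buf[idx] = acc
    return buf

# A 2x3 row-major matrix [[1, 2, 3], [4, 5, 6]] flattened:
buf = [1, 2, 3, 4, 5, 6]
# Scan along axis 0 (the non-fastest-moving axis): stride == 3.
for col in range(3):
    strided_cumsum(buf, start=col, stride=3, count=2)
# buf is now [1, 2, 3, 5, 7, 9]
```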
…1873) We need to convert the whole slice of operations in order to ensure the types are always consistent.
As mentioned in triton-lang#1769, we set the file name and function name to 'unknown' and the line number to 0 if the frame is None.
We need to split the CI into two jobs: nvidia (PR-blocking) and third-party (PR-non-blocking). This way we can guarantee that artifacts are uploaded for any PR that gets merged into `main`, and that the `compare artifacts` job can simply wait on the artifacts-uploading job.
…on-lang#1886) We've already updated the mapping between names and tensors before visiting each compound statement in the while op. As a result, any overwritten name holds an up-to-date value inside the while loop, and any unchanged live-in name holds its original tensor.
…-lang#1889) Transforms, e.g.:

```
x1 = tl.expand_dims(x0, axis=2)
y1 = tl.expand_dims(y0, axis=0)
z = tl.sum(x1 * y1, axis=1)
```

into

```
z = tl.dot(x0, y0)
```

and uses allowTF32 = True by default.
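A quick NumPy sanity check (with hypothetical shapes) of why this rewrite is valid: expanding x0 along a trailing axis and y0 along a leading axis, multiplying, and summing over the shared axis is exactly a matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 5)).astype(np.float32)  # shape (M, K)
y0 = rng.standard_normal((5, 3)).astype(np.float32)  # shape (K, N)

# The expand_dims / multiply / sum pattern:
x1 = x0[:, :, None]            # like tl.expand_dims(x0, axis=2) -> (M, K, 1)
y1 = y0[None, :, :]            # like tl.expand_dims(y0, axis=0) -> (1, K, N)
z_sum = (x1 * y1).sum(axis=1)  # reduce over K -> (M, N)

# ...is equivalent to a dot product:
z_dot = x0 @ y0
assert np.allclose(z_sum, z_dot, atol=1e-4)
```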
Co-authored-by: Philippe Tillet <[email protected]>
Add and test cumprod. This also allows testing a case of accumulation where 0 is not the identity element. Also add documentation for the scan functions.
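A small plain-Python illustration of why a multiplicative scan exercises a different code path than a sum: the identity element is 1 rather than 0, and a single 0 in the input zeroes out the rest of the scan.

```python
from itertools import accumulate
import operator

xs = [2, 3, 0, 4]
cumprod = list(accumulate(xs, operator.mul))
# [2, 6, 0, 0]: once a 0 appears, every later prefix product is 0,
# and the scan's identity element is 1 (not 0, as it is for a sum).
```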
Recent changes made TritonGPU dialect depend on transform utils (`isExpensiveCat()`), and Triton ops depend on TritonGPU dialect (`DotOperandEncodingAttr`). This works fine with CMake but circular dependencies are not ideal and Bazel builds (which we use internally at Google) try hard to prevent them. Would it be acceptable to move the `isExpensiveCat()` function back to TritonGPU dialect (where it was before), and split the TritonGPU attributes into a separate header? This would avoid diverging our internal version or creating over-sized bazel targets to avoid circular dependencies. Co-authored-by: Keren Zhou <[email protected]>
…ng#1897 (triton-lang#1898) Change the logic to be able to re-order more cases. Instead of pulling the convert of operand 0, we sink the convert of operand 1, so that even if operand 0 has some dependency that doesn't dominate operand 1 we can still apply the re-ordering.
…g#1880) `arith::SelectOp` supports a form where the condition argument is a scalar and the result is a tensor. This isn't generated from `tl.where`, but it can still show up from canonicalization of `scf.if`. Currently, if this happens, the conversion to GPU IR fails because `triton_gpu.select` doesn't support this form. For example:

```python
import triton
import triton.language as tl
import torch

@triton.jit
def _triton_test(in_ptr, out_ptr, cond, XBLOCK: tl.constexpr):
    xindex = tl.arange(0, XBLOCK)
    tmp = tl.load(in_ptr + xindex)
    if cond:
        a = tl.zeros_like(tmp)
    else:
        a = tmp
    tl.store(out_ptr + xindex, a)

t = torch.randn(128, device="cuda")
out = torch.empty(128, device="cuda")
_triton_test[(1,)](t, out, True, t.numel())
```

fails with the error

```
error: 'triton_gpu.select' op requires the same shape for all operands and results
```

Co-authored-by: Keren Zhou <[email protected]>
Run most of the pytest suite in parallel. This speeds up CI from 36 min to 10 min on A100, and from 22 min to 6 min on H100. Some tests, such as the runtime tests, still need to run serially.
…n-lang#1907) Co-authored-by: Philippe Tillet <[email protected]>
This un-reverts commit triton-lang@d4c9411.
- Created folder for meetups to store video/meeting notes. - Created file for agenda and notes.
…lang#1902) `export TRITON_DISABLE_LINE_INFO=1` to disable the feature.
…n layout (triton-lang#1913) Fix the calculation of the number of unique threads within a warp. We need to consider the number of elements per thread in the calculation. Also change the layout test to use an integer sum in order to catch bugs with unique data, since a max reduction may hide those kinds of problems.
Calling `tl.full` with an unsigned dtype currently fails with the error: ``` AttributeError("'triton._C.libtriton.triton.ir.builder' object has no attribute 'get_uint8'") ``` This PR defines those functions rather than changing the calls to the signed versions so that we can use an unsigned argument type in C++ and avoid overflow for large uint64 values.
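To see why routing unsigned constants through signed builder functions is a problem, note that large uint64 values exceed the signed 64-bit range. A pure-Python sketch (not the Triton builder API):

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1

def to_signed64(value):
    """Reinterpret a 64-bit pattern as a signed int (two's complement)."""
    value &= UINT64_MAX
    return value - 2**64 if value > INT64_MAX else value

# A large uint64 constant does not fit in a signed 64-bit argument:
assert UINT64_MAX > INT64_MAX
assert to_signed64(UINT64_MAX) == -1  # the bit pattern reads back as -1
```

Defining dedicated unsigned builder entry points avoids this reinterpretation entirely.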
…riton-lang#1811) This adds a pass that tries to reduce the shape of tensor arguments to element-wise operations by moving splat and broadcast operations later in the graph. So, for example, say we have:

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (0))
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
    tmp2 = 0.017453292519943295
    tmp3 = tmp1 * tmp2
    tmp4 = tl.sin(tmp3)
    tl.store(out_ptr0 + (x0), tmp4, None)
```

Today this results in duplicate `sin` calls:

```
%27 = llvm.fmul %26, %3 : f32
%28 = llvm.call @__nv_sinf(%27) : (f32) -> f32
%29 = llvm.call @__nv_sinf(%27) : (f32) -> f32
```

The duplicate `llvm.fmul` calls are eliminated via CSE, but `llvm.call` doesn't get CSE'd because it might be impure. After this change, the sin is computed on a scalar value in the Triton IR and splatted at the very end, so no duplicate calculation happens within a thread.

---------

Co-authored-by: Keren Zhou <[email protected]> Co-authored-by: Philippe Tillet <[email protected]>
Added point for Windows Support.
Also fixes a bug exposed in the convertLayout lowering for float16. We shouldn't be using cvt.pack.sat.u16.s32 to pack 16-bit values, as it needs to take a 32-bit register. This also prevented optimization at the LLVM IR level.
Issue triton-lang#1973 Co-authored-by: Philippe Tillet <[email protected]>
…on-lang#2040) Make sure that other threads within CTA do not operate on mbarrier until it is initialized by thread 0. Co-authored-by: Philippe Tillet <[email protected]>
Use camel case accessors ("getStaticOffsets" etc.) for `ExtractSliceOp`. This change works with and without the changes from D156857. After D156857 has landed, only camel case accessors work for ops that implement the `OffsetSizeAndStrideOpInterface`. https://reviews.llvm.org/D156857 Co-authored-by: Philippe Tillet <[email protected]>
We are interested in having python wheels for triton built for Linux arm64 platforms, such as NVIDIA's Grace CPU. This change is fairly simple, however: - It requires a linux arm64 build of LLVM to be available (see MR here: ptillet/triton-llvm-releases#15) - For now my changes use the LLVM build hosted here: https://github.com/acollins3/triton-llvm-releases/releases/tag/llvm-17.0.0-c5dede880d17 - The Triton release process will need to be updated to include arm64 wheels. Is this something you have time to work on @ptillet? It would be difficult for me to update this part without more access permissions. With these changes, I managed to build a set of python wheels and have hosted them here for us to use in the meantime: https://github.com/acollins3/triton/releases/tag/triton-2.1.0-arm64
Co-authored-by: Philippe Tillet <[email protected]>
…r than Q's (triton-lang#2033) Implemented this situation with and without a causal mask. My implementation with the causal mask looks like:

```
111000
111100
111110
```

where only the upper-right triangular part is masked. I added `P_SEQ` to denote the extra sequence length for KV. Co-authored-by: Philippe Tillet <[email protected]>
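A sketch of that mask shape in plain Python: when KV is longer than Q, each query row can see a diagonal band shifted right by some offset (how the offset relates to `P_SEQ` depends on the implementation's indexing convention; offset = 2 reproduces the 3-row, 6-column pattern shown above).

```python
def causal_mask(q_len, kv_len, offset):
    """Row i keeps KV positions j with j <= i + offset (1 = keep, 0 = mask)."""
    return [[1 if j <= i + offset else 0 for j in range(kv_len)]
            for i in range(q_len)]

mask = causal_mask(3, 6, offset=2)
# [[1, 1, 1, 0, 0, 0],
#  [1, 1, 1, 1, 0, 0],
#  [1, 1, 1, 1, 1, 0]]
```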
This allows the AOT client to tune the number of stages for the generated kernel. Set the default to 3 to match the Triton compiler.
…in hopper tests (triton-lang#2041) Co-authored-by: goostavz <[email protected]> Co-authored-by: Philippe Tillet <[email protected]> Co-authored-by: ben-zhang-609 <[email protected]>
Co-authored-by: Allen Zhao <[email protected]>
Improve error messaging for block shape and value shape mismatch.
…#2050) Co-authored-by: Philippe Tillet <[email protected]>
Rename "rocm" -> "hip", to comply with other uses in compiler.py.
…riton-lang#2057) Co-authored-by: Biao Wang <[email protected]>
…m. (triton-lang#2068) No functional changes intended, and it might slightly speed up the build. This allows a downstream Bazel build of Triton to avoid building a number of dialects and passes that Triton doesn't need.
`getScratchSizeInBytes` was assuming that the size of every type, in bits, is a multiple of 8. If it was not, it would return 0. This caused a bug for the boolean (i1) type, where the reduction lowering would attempt to use shared memory that was never assigned to the op. Fix this issue by setting the number of bytes per element to `ceil(bits / 8)`.
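The fix amounts to rounding the per-element size up to whole bytes instead of truncating; in Python:

```python
def bytes_per_element(bits):
    """Round a bit width up to whole bytes: ceil(bits / 8).

    With floor division, a 1-bit type (i1) would get 0 bytes of
    scratch space; rounding up gives it the 1 byte it needs.
    """
    return (bits + 7) // 8

assert bytes_per_element(1) == 1   # i1 (boolean)
assert bytes_per_element(8) == 1   # i8
assert bytes_per_element(16) == 2  # f16
```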
libtriton.so is pretty large these days, and hashing it is slow. Switching the hash from md5 to sha1 shaves close to 300 ms off the time for me (as well as being a better hash, for whatever that's worth). As far as I could tell, sha1 is the fastest stable hash in the Python standard library, including alternatives like zlib.crc32.
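A minimal sketch of the kind of change described (the helper name, file path, and chunk size are illustrative, not the actual Triton code):

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Hash a potentially large file (e.g. a shared library) incrementally
    with sha1 instead of md5."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```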
Realised I could do this right after my first PR got merged. This saves another 100 ms.
…ng#2075) Remove unnecessary skips. Decompose the unit tests in persistent-warp-specialized-gemm into vintage and stylish variants.
Agenda and minutes for Triton monthly meetups.
Agenda items and announcements.