
August meetup agenda #1

Closed
wants to merge 113 commits into from

Conversation

kshama-msft
Owner

Agenda items and announcements.

ThomasRaoux and others added 30 commits June 30, 2023 12:40
…riton-lang#1863)

This relaxes the restriction in the scan lowering to support layouts where we scan along a dimension that isn't the fastest-moving one. This is done by relaxing how we access elements during scanning, allowing the elements to be strided.
…1873)

We need to convert the whole slice of operations in order to keep the types consistent throughout.
As mentioned in triton-lang#1769, we set the file name and function name to 'unknown' and lineno to 0 if the frame is None.
We need to split the CI into two jobs: nvidia (PR-blocking) and third-party (PR-non-blocking). This way we can guarantee that artifacts are uploaded for any PR that gets merged into `main`, and the `compare artifacts` job can simply wait on the artifacts-uploading job.
…on-lang#1886)

We've already updated the mapping between names and tensors before visiting each compound statement in the while op. As a result, any overwritten name holds its up-to-date value inside the while loop, and any unchanged live-in names keep their original tensors.
…-lang#1889)

transforms e.g.:
```
x1 = tl.expand_dims(x0, axis=2)
y1 = tl.expand_dims(y0, axis=0)
z = tl.sum(x1 * y1, axis=1)
```

into
```
z = tl.dot(x0, y0) 
```

uses allowTF32 = True by default.
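The equivalence this rewrite exploits can be checked without Triton: broadcasting the expanded operands, multiplying, and reducing over the shared axis is exactly a matrix product. A pure-Python sketch (illustrative only, not the pass itself):

```python
def matmul_via_broadcast(x, y):
    # Emulates x1 = expand_dims(x, 2), y1 = expand_dims(y, 0),
    # z = sum(x1 * y1, axis=1) for an M x K times K x N product.
    M, K, N = len(x), len(y), len(y[0])
    return [[sum(x[m][k] * y[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]

x = [[1.0, 2.0], [3.0, 4.0]]
y = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_via_broadcast(x, y))  # [[19.0, 22.0], [43.0, 50.0]]
```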
Add and test cumprod. This also allows testing a case of accumulation where 0 is not the identity element.
Also add documentation for the scan functions.
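The stdlib `itertools.accumulate` shows the same semantics being tested here: once a 0 enters a running product, the rest of the scan stays 0, unlike cumsum, for which 0 is the identity:

```python
from itertools import accumulate
import operator

xs = [2, 5, 0, 3]
print(list(accumulate(xs, operator.mul)))  # cumprod: [2, 10, 0, 0]
print(list(accumulate(xs)))                # cumsum:  [2, 7, 7, 10]
```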
Recent changes made TritonGPU dialect depend on transform utils
(`isExpensiveCat()`), and Triton ops depend on TritonGPU dialect
(`DotOperandEncodingAttr`). This works fine with CMake but circular
dependencies are not ideal and Bazel builds (which we use internally at
Google) try hard to prevent them.

Would it be acceptable to move the `isExpensiveCat()` function back to
TritonGPU dialect (where it was before), and split the TritonGPU
attributes into a separate header? This would avoid diverging our
internal version or creating over-sized bazel targets to avoid circular
dependencies.

Co-authored-by: Keren Zhou <[email protected]>
…ng#1897 (triton-lang#1898)

Change the logic to be able to re-order more cases. Instead of pulling up the convert of operand 0, we sink the convert of operand 1, so that even if operand 0 has some dependency that doesn't dominate operand 1, we can still apply the re-ordering.
…g#1880)

`arith::SelectOp` supports a form where the condition argument is a
scalar and the result is a tensor. This isn't generated from `tl.where`,
but can still show up from canonicalization of `scf.if`.

Currently, if this happens, the conversion to GPU IR fails because `triton_gpu.select` doesn't support this form. For example,
```python
import triton
import triton.language as tl
import torch

@triton.jit
def _triton_test(
    in_ptr, out_ptr, cond, XBLOCK: tl.constexpr
):
    xindex = tl.arange(0, XBLOCK)
    tmp = tl.load(in_ptr + xindex)
    if cond:
        a = tl.zeros_like(tmp)
    else:
        a = tmp
    tl.store(out_ptr + xindex, a)

t = torch.randn(128, device="cuda")
out = torch.empty(128, device="cuda")
_triton_test[(1,)](t, out, True, t.numel())
```

Fails with the error
```
error: 'triton_gpu.select' op requires the same shape for all operands and results
```

Co-authored-by: Keren Zhou <[email protected]>
Run most of the pytest suite in parallel; this speeds up CI from 36 min to 10 min on A100 and from 22 min to 6 min on H100. Some tests, such as the runtime tests, still need to run serially.
- Created folder for meetups to store video/meeting notes.
- Created file for agenda and notes.
…lang#1902)

`export TRITON_DISABLE_LINE_INFO=1` to disable the feature.
…n layout (triton-lang#1913)

Fix the calculation of the unique number of threads within a warp: we need to take the number of elements per thread into account. Also change the layout test to an integer sum in order to catch bugs with unique data, since a max reduction may hide that kind of problem.
)

This also adds rather extensive testing for mixed-precision mode, including `float8e4b15 x float8e5` and `float8e5 x float16`.
Calling `tl.full` with an unsigned dtype currently fails with the error:
```
AttributeError("'triton._C.libtriton.triton.ir.builder' object has no attribute
'get_uint8'")
```

This PR defines those functions rather than changing the calls to the signed versions, so that we can use an unsigned argument type in C++ and avoid overflow for large uint64 values.
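The overflow being avoided can be demonstrated with the stdlib `ctypes` (a quick illustration, unrelated to the Triton C++ code itself):

```python
import ctypes

big = 2**64 - 1  # a value only representable as uint64
print(ctypes.c_uint64(big).value)  # 18446744073709551615: preserved
print(ctypes.c_int64(big).value)   # -1: wraps around if routed through a signed path
```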
…riton-lang#1811)

This adds a pass that tries to reduce the shape of tensor arguments to element-wise operations by moving splat and broadcast operations later in the graph. For example, say we have:

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset  + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (0))
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
    tmp2 = 0.017453292519943295
    tmp3 = tmp1 * tmp2
    tmp4 = tl.sin(tmp3)
    tl.store(out_ptr0 + (x0), tmp4, None)
```

Today this results in duplicate `sin` calls:
```
    %27 = llvm.fmul %26, %3  : f32
    %28 = llvm.call @__nv_sinf(%27) : (f32) -> f32
    %29 = llvm.call @__nv_sinf(%27) : (f32) -> f32
```

The duplicate `llvm.fmul` calls are eliminated via CSE, but `llvm.call`
doesn't get CSE'd because it might be impure.

After this change, the sin is computed on a scalar value in the Triton IR and splatted at the very end, so no duplicate computation happens within a thread.

---------

Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Philippe Tillet <[email protected]>
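The effect of the pass can be illustrated in plain Python (a sketch of the idea, not the actual pass; `XBLOCK` and the constant are taken from the kernel above):

```python
import math

XBLOCK = 4
tmp0 = 2.0  # stands in for the scalar loaded from in_ptr0 + 0

# Before: splat first, then every lane computes sin redundantly.
before = [math.sin(t * 0.017453292519943295) for t in [tmp0] * XBLOCK]

# After the pass: compute on the scalar once, splat at the very end.
scalar = math.sin(tmp0 * 0.017453292519943295)
after = [scalar] * XBLOCK

print(before == after)  # True
```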
Added point for Windows Support.
ThomasRaoux and others added 28 commits August 7, 2023 15:55
Also fixes a bug exposed in the convertLayout lowering for float16: we shouldn't use cvt.pack.sat.u16.s32 to pack 16-bit values, as it needs to take a 32-bit register. This also prevented optimization at the LLVM IR level.
…on-lang#2040)

Make sure that other threads within CTA do not operate on mbarrier until
it is initialized by thread 0.

Co-authored-by: Philippe Tillet <[email protected]>
Use camel case accessors ("getStaticOffsets" etc.) for `ExtractSliceOp`.
This change works with and without the changes from D156857. After
D156857 has landed, only camel case accessors work for ops that
implement the `OffsetSizeAndStrideOpInterface`.

https://reviews.llvm.org/D156857

Co-authored-by: Philippe Tillet <[email protected]>
We are interested in having python wheels for triton built for Linux
arm64 platforms, such as NVIDIA's Grace CPU.

This change is fairly simple, however:
- It requires a linux arm64 build of LLVM to be available (see MR here:
ptillet/triton-llvm-releases#15)
- For now my changes use the LLVM build hosted here:
https://github.com/acollins3/triton-llvm-releases/releases/tag/llvm-17.0.0-c5dede880d17
- The Triton release process will need to be updated to include arm64
wheels. Is this something you have time to work on @ptillet? It would be
difficult for me to update this part without more access permissions.

With these changes, I managed to build a set of python wheels and have
hosted them here for us to use in the meantime:
https://github.com/acollins3/triton/releases/tag/triton-2.1.0-arm64
…r than Q's (triton-lang#2033)

Implemented this situation with and without a causal mask.
My implementation with a causal mask looks like:
111000
111100
111110
where only the upper-right triangular part is masked.
I added `P_SEQ` to denote the extra sequence length for KV.

Co-authored-by: Philippe Tillet <[email protected]>
This allows the AOT client to tune the number of stages for the generated kernel. Set the default to 3 to match the Triton compiler.
…in hopper tests (triton-lang#2041)

Co-authored-by: goostavz <[email protected]>
Co-authored-by: Philippe Tillet <[email protected]>
Co-authored-by: ben-zhang-609 <[email protected]>
Improve error messaging for block shape and value shape mismatch.
Rename "rocm" -> "hip", to comply with other uses in compiler.py.
…m. (triton-lang#2068)

No functional changes intended, and it might slightly speed up the
build.

This allows a downstream Bazel build of Triton to avoid building a
number of dialects and passes that Triton doesn't need.
`getScratchSizeInBytes` was assuming that the size in bits of every type is a multiple of 8; if not, it returned 0. This caused a bug for the boolean (i1) type, where the reduction lowering would attempt to use shared memory that was never assigned to the op.

Fix this issue by setting the number of bytes per element to `ceil(bits / 8)`.
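The fixed computation is ordinary ceiling division; a tiny Python sketch of the formula (the function name is illustrative):

```python
def bytes_per_element(bits: int) -> int:
    # ceil(bits / 8) without floating point: i1 now maps to 1 byte, not 0.
    return (bits + 7) // 8

print(bytes_per_element(1), bytes_per_element(8), bytes_per_element(33))  # 1 1 5
```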
libtriton.so is pretty large these days and hashing it is slow.
Switching the hash from md5 to sha1 shaves close to 300ms off the time
for me (as well as being a better hash, for whatever that's worth).

As far as I could tell, sha1 is the fastest stable hash available in the Python standard library, even compared to things like zlib.crc32.
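The switch itself is a one-liner with `hashlib`; a minimal sketch of hashing a binary blob the way a shared library's contents would be hashed (the helper name is illustrative):

```python
import hashlib

def hash_blob(data: bytes) -> str:
    """sha1 hex digest of a blob (e.g. the bytes of libtriton.so)."""
    return hashlib.sha1(data).hexdigest()

blob = b"\x00" * (1 << 20)  # stand-in for a large shared library
print(len(hash_blob(blob)))  # 40 hex chars (md5 gives 32)
```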
Realised I could do this right after my first PR got merged. This saves
another 100ms
…ng#2075)

Remove unnecessary skips. Decompose the UTs in persistent-warp-specialized-gemm into vintage and stylish variants.
Agenda and minutes for Triton monthly meetups.
@kshama-msft kshama-msft deleted the kshama-msft-patch-1 branch August 10, 2023 23:14