
feat: Dev Container for consistent dev setup #175

Open
wants to merge 139 commits into base: main

Conversation


@maryamtahhan commented Nov 5, 2024

Added a Dev Container configuration to streamline development and onboarding. This setup ensures a consistent, isolated environment with all necessary tools and dependencies for building and running Triton-CPU. The configuration supports use in both VS Code locally and GitHub Codespaces.
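For illustration only, a minimal `.devcontainer/devcontainer.json` along these lines is enough for VS Code or Codespaces to pick up a container; the image, extensions, and post-create command below are hypothetical placeholders, not the actual contents of this PR.

```
{
  "name": "triton-cpu-dev",
  // Hypothetical base image; the PR may instead build its own image from a Dockerfile.
  "image": "mcr.microsoft.com/devcontainers/cpp:ubuntu-22.04",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python", "ms-vscode.cpptools"]
    }
  },
  // Example only: install the Python package in editable mode once the container is created.
  "postCreateCommand": "pip install -e python"
}
```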

The core Triton team is small, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (fewer than 3 merged PRs) we ask that you complete the following tasks and include the filled-out checklist in your PR description.

Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them.

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because it's not a code change per se.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

bertmaher and others added 30 commits October 23, 2024 14:53
* [BACKEND][CPU] Implement the empty cpu backend

* Run clang-format

* Fix yapf error

A quick addition on how to use it.
Summary: This is still mostly boilerplate and the basic lowering for the first milestone (compiling vector addition). This PR first lowers `tt.func` and `tt.return`.


Test Plan: It can safely compile an empty kernel.

```
@triton.jit
def add_kernel(x_ptr,  y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return
```

> TRITON_ENABLE_LLVM_DEBUG=1 TRITON_CPU_BACKEND=1 python3 empty_kernel.py

```
//===-------------------------------------------===//
Legalizing operation : 'tt.func'(0x73be2a0) {
  * Fold {
  } -> FAILURE : unable to fold

  * Pattern : 'tt.func -> ()' {
Trying to match "(anonymous namespace)::FuncOpConversion"
    ** Insert  : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Erase   : 'tt.func'(0x73be2a0)
"(anonymous namespace)::FuncOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.func'(0x6c04c70) {
    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//

...

//===-------------------------------------------===//
Legalizing operation : 'tt.return'(0x73efeb0) {
  "tt.return"() : () -> ()

  * Fold {
  } -> FAILURE : unable to fold

  * Pattern : 'tt.return -> ()' {
Trying to match "(anonymous namespace)::ReturnOpConversion"
    ** Insert  : 'llvm.return'(0x73c0f00)
    ** Replace : 'tt.return'(0x73efeb0)
"(anonymous namespace)::ReturnOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.return'(0x73c0f00) {
      "llvm.return"() : () -> ()

    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//
  } -> SUCCESS : pattern applied successfully
```
…riton-lang#1)

Summary: As the title says, `tl.program_id` needs to be supported first. For now, we assume the pid will be provided as additional function arguments to the kernel, so getting the program_id maps to reading one of the last three arguments.

I also quickly implemented `tl.device_print` (or `print`), only for scalar types, for quick "Hello, World!" testing.

Test Plan: Tested with a simple example:

```
@triton.jit
def add_kernel(...):
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0.
    foo = pid + 42
    tl.device_print("Hello, World!", foo, pid)
```

The resulting .llir is valid:
```
@printfFormat_1 = internal constant [31 x i8] c"pid (%u, %u, %u) test: %u, %u\0A\00"

declare !dbg !3 i32 @printf(ptr, ...)

define void @add_kernel(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i32 %3, i32 %4, i32 %5, i32 %6) !dbg !7 {
  %8 = add i32 %4, 42, !dbg !8
  %9 = call i32 (ptr, ...) @printf(ptr @printfFormat_0, i32 %4, i32 %5, i32 %6, i32 %8, i32 %4)
  ret void, !dbg !9
}
```

Tried to compile with a fake main function:
```
> % cat main.c
extern void add_kernel(float*, float*, float*, int, int, int, int);

int main() {
    add_kernel(0, 0, 0, 4, 5, 6, 7);
}

> % llc -filetype=obj add_kernel.llir && clang -o a.out add_kernel.llir.o main.c
> % ./a.out
pid (5, 6, 7) Hello, World!: 47, 5
```
…ion flows (triton-lang#6)

* Support basic lowering through vector dialect in CPU backend.

Signed-off-by: Ilya Enkovich <[email protected]>

* Use axis info in memory op lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Mark test_ptx_cast as enabled for CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support umulhi operation.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.clamp, tl.minimum, tl.maximum.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add enable_fp_fusion opt for CPU (only affects ASM dump now).

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix kernel args passing for propagated constants.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add permutations support.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 2-D transfer_read/transfer_write lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Introduce shape info analysis and use it for loads/stores by block pointers.

Delay scalar pointers lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 'other' arg for loads.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.join.

Signed-off-by: Ilya Enkovich <[email protected]>

* Minor renaming.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
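As a rough sketch of the kind of kernel the lowering work above targets, the hypothetical example below exercises a masked load with an `other` value and a clamp built from `tl.minimum`/`tl.maximum`; it is illustrative only, assuming the standard `triton` Python API, and is not a test from these commits.

```
import triton
import triton.language as tl


@triton.jit
def clamp_kernel(x_ptr, out_ptr, lo, hi, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # 'other' supplies a value for masked-out lanes (one of the features listed above).
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    # Clamp via minimum/maximum, also covered by the lowering above.
    y = tl.maximum(tl.minimum(x, hi), lo)
    tl.store(out_ptr + offsets, y, mask=mask)
```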
…ent (triton-lang#8)

* [BACKEND][CPU] Make it buildable and runnable in a different environment

* Revert seemingly inconsistent python code formatting
Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Minjang Kim <[email protected]>
…iton-lang#11)

* [CPU] Support flexible active driver + update vector-add tutorial

* Update vector-add to run CPU always + optional GPU

* Update do_bench for CPU
…ng#17)

* Fixed yaml syntax

Signed-off-by: Gregory Shimansky <[email protected]>

* Removed cpu label from run-on

Signed-off-by: Gregory Shimansky <[email protected]>

* Added missing zlib-dev

Signed-off-by: Gregory Shimansky <[email protected]>

* Added missing apt-get update

Signed-off-by: Gregory Shimansky <[email protected]>

* Remove pip cache because on self-hosted runner it slows things down

Signed-off-by: Gregory Shimansky <[email protected]>

* Corrected path to tests

Signed-off-by: Gregory Shimansky <[email protected]>

* Added installation of torch==2.1.2

Signed-off-by: Gregory Shimansky <[email protected]>

---------

Signed-off-by: Gregory Shimansky <[email protected]>
* [CPU] Add OpenMP launcher

* Address the comments

* Fix induction variable type

* Always use preallocated output buffer for CPU with torch.add
* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR

* Don't touch the main compiler.py
…-lang#23)

* add un-masked tiled matrix-multiplication for triton-cpu

* clean and add comment

* move test under tutorials
* Fix RelWithDebInfo build.

Signed-off-by: Ilya Enkovich <[email protected]>

* Skip fp8 cast tests on CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix segfault.

Signed-off-by: Ilya Enkovich <[email protected]>

* [BACKEND] Update LLVM version to llvm/llvm-project@765206e (triton-lang#4059)

* Add -s option to pytest run.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add a workaround for LLVM bug causing test failure on Skylake CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add a workaround for LLVM fpext bug causing test failure on Skylake CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix formatting.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Pablo Zimmermann <[email protected]>
Junyi Mei and others added 16 commits October 23, 2024 16:15
* Lower memory ops with vector gather and scatter

This commit adds lowerToGather and lowerToScatter for load and store
conversion. Memory ops with the pointer computed from splat and addptr
can be lowered with vector.gather or vector.scatter.

For architectures with scatter and gather support (like SVE and RVV),
the code generated with this approach might be more efficient.

Two options are added to scalarization and memory op conversion to
enable lowering with gather and scatter operations.

Signed-off-by: Junyi Mei <[email protected]>

* Fix incorrect rank and type in gather and scatter

Signed-off-by: Junyi Mei <[email protected]>

* Lower store op with 1-D vector scatter ops

Signed-off-by: Junyi Mei <[email protected]>

---------

Signed-off-by: Junyi Mei <[email protected]>
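For context, the splat-plus-addptr pattern described above typically comes from indirect indexing, as in the hypothetical kernel below; this is an illustrative sketch, not code from the commit.

```
import triton
import triton.language as tl


@triton.jit
def gather_kernel(src_ptr, idx_ptr, out_ptr, BLOCK_SIZE: tl.constexpr):
    offsets = tl.arange(0, BLOCK_SIZE)
    # Data-dependent indices make the load addresses splat(src_ptr) + idx,
    # the splat-and-addptr pattern that can map to vector.gather.
    idx = tl.load(idx_ptr + offsets)
    x = tl.load(src_ptr + idx)
    tl.store(out_ptr + offsets, x)
```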
* Add DotOp lowering to AMX operations.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support direct tiles store to output memory.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add lit tests for amx.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix review comments.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
…ang#161)

This PR adds support for libdevice functions that don't map cleanly to a
MathOp. We implement them using tt.extern_elementwise instead,
indicating which Sleef function to use.

While tt.extern_elementwise contains fields for the library path and
name, the CUDA backend ignores those fields as it always uses
NVIDIA's libdevice library. We take a similar approach here and assume
all extern calls go to the Sleef library.

One difference though is that we need to select our Sleef function based
on the number of elements of the vector, which is done by interpolating
this number into the symbol name. To indicate where this interpolation
should occur, I have made `%(numel)` into a special string value. This
allows us to reuse tt.extern_elementwise without adding any extra
attributes.
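To make the `%(numel)` convention concrete, here is a minimal sketch of the interpolation described above; the helper name and the Sleef symbol template are hypothetical examples of the naming scheme, not the exact strings used by the backend.

```
# Hypothetical helper illustrating the "%(numel)" placeholder described above.
def resolve_sleef_symbol(template: str, numel: int) -> str:
    # Substitute the special placeholder with the vector element count.
    return template.replace("%(numel)", str(numel))


# e.g. an 8-element float32 exp maps to a symbol like "Sleef_expf8_u10"
print(resolve_sleef_symbol("Sleef_expf%(numel)_u10", 8))
```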
…iton-lang#162)

libsleef does not implement 2-element functions, so for libdevice
functions implemented via extern_elementwise ops that rely wholly on
libsleef implementations (as opposed to MathOps which can be lowered to
native instructions), we need to pad those vectors to size 4.

This allows us to enable test_math.py for all the functions introduced
in triton-lang#161.
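A minimal sketch of the sizing rule described above, with a hypothetical helper name; only the 2-element case is stated explicitly in the commit, the rest is illustrative.

```
def sleef_vector_width(numel: int) -> int:
    # libsleef has no 2-element entry points, so widths below 4 are padded
    # up to 4; larger widths are used as-is.
    return max(numel, 4)


assert sleef_vector_width(2) == 4
assert sleef_vector_width(8) == 8
```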
…ng#169)

This commit skips tests on non-CUDA devices that expect to use use_cuda_graph.

Signed-off-by: Dmitrii Makarenko <[email protected]>
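A minimal sketch of the kind of guard such a skip typically uses; the helper name and exact condition are hypothetical, not taken from this commit.

```
import pytest
import torch


def maybe_skip_cuda_graph(device: str) -> None:
    # Hypothetical guard: CUDA graphs only exist on CUDA devices.
    if device != "cuda" or not torch.cuda.is_available():
        pytest.skip("use_cuda_graph requires a CUDA device")
```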
…riton-lang#160)

This commit adds the Memref type to the possible inputs of print. Memrefs carry strides and other supporting information that allow printing multidimensional tensors (2-D, 3-D, etc.). Such a print will be added in the next PR.
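For context, printing a whole 2-D block as in the hypothetical kernel below is the use case that needs shape and stride information on the runtime side, which is what the memref input provides; this is an illustrative sketch, not code from the commit.

```
import triton
import triton.language as tl


@triton.jit
def print_tile_kernel(x_ptr, stride0, BLOCK: tl.constexpr):
    rows = tl.arange(0, BLOCK)[:, None]
    cols = tl.arange(0, BLOCK)[None, :]
    tile = tl.load(x_ptr + rows * stride0 + cols)
    # Printing a 2-D block needs shape/stride info at runtime,
    # which is what passing a memref to the print support provides.
    tl.device_print("tile", tile)
```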
* Enable num_threads in autotuner and use hooks for tuning on CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add vector-add example for CPU with autotuner.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
@maryamtahhan marked this pull request as ready for review November 5, 2024 16:37
@maryamtahhan requested a review from ptillet as a code owner November 5, 2024 16:37
@maryamtahhan force-pushed the devcontainer branch 3 times, most recently from 1979e0e to 82213ea November 5, 2024 16:48
@minjang (Collaborator) commented Nov 8, 2024

@maryamtahhan Thanks for putting up a PR! I have a quick question: are you proposing this PR only for triton-cpu, not triton? Can this PR be for triton as well? We'd want to minimize the divergence from the triton upstream. If a change can also be made for triton, we'd recommend it.

@maryamtahhan (Author)

> @maryamtahhan Thanks for putting up a PR! I have a quick question: are you proposing this PR only for triton-cpu, not triton? Can this PR be for triton as well? We'd want to minimize the divergence from the triton upstream. If a change can also be made for triton, we'd recommend it.

No problem. I'm going to do a PR for Triton also. It will be a separate folder under the .devcontainer directory with a GPU option enabled. I just need to test it next week.

In that case there would be two separate directories covering the GPU-enabled and non-GPU (CPU) containers.

I can move the README to the Triton PR so that there's an easier merge for you when the time comes.

@minjang (Collaborator) commented Nov 11, 2024

> @maryamtahhan Thanks for putting up a PR! I have a quick question: are you proposing this PR only for triton-cpu, not triton? Can this PR be for triton as well? We'd want to minimize the divergence from the triton upstream. If a change can also be made for triton, we'd recommend it.
>
> No problem. I'm going to do a PR for Triton also. It will be a separate folder under the .devcontainer directory with a GPU option enabled. I just need to test it next week.
>
> In that case there would be two separate directories covering the GPU-enabled and non-GPU (CPU) containers.
>
> I can move the README to the Triton PR so that there's an easier merge for you when the time comes.

I see. Since triton-cpu is a fork, we're pretty flexible about taking your changes. But I'd recommend landing this PR upstream first.

@maryamtahhan (Author)

Going to update this PR based on triton-lang#5143

Added a Dev Container configuration to streamline
development and onboarding. This setup ensures a
consistent, isolated environment with all necessary
tools and dependencies for building and running
Triton-CPU. The configuration supports use in both
VS Code locally and GitHub Codespaces.

Signed-off-by: Maryam Tahhan <[email protected]>