feat: Dev Container for consistent dev setup #175
base: main
Conversation
* [BACKEND][CPU] Implement the empty CPU backend
* Run clang-format
* Fix yapf error
A quick addition on how to use it.
Summary: This is still a kind of boilerplate and basic lowering for the first milestone (compiling vector addition). This PR first lowers `tt.func` and `tt.return`.

Test Plan: It can safely compile an empty kernel.

```
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return
```

> TRITON_ENABLE_LLVM_DEBUG=1 TRITON_CPU_BACKEND=1 python3 empty_kernel.py

```
//===-------------------------------------------===//
Legalizing operation : 'tt.func'(0x73be2a0) {
  * Fold {
  } -> FAILURE : unable to fold
  * Pattern : 'tt.func -> ()' {
    Trying to match "(anonymous namespace)::FuncOpConversion"
    ** Insert : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Erase : 'tt.func'(0x73be2a0)
    "(anonymous namespace)::FuncOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.func'(0x6c04c70) {
    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//
...
//===-------------------------------------------===//
Legalizing operation : 'tt.return'(0x73efeb0) {
  "tt.return"() : () -> ()
  * Fold {
  } -> FAILURE : unable to fold
  * Pattern : 'tt.return -> ()' {
    Trying to match "(anonymous namespace)::ReturnOpConversion"
    ** Insert : 'llvm.return'(0x73c0f00)
    ** Replace : 'tt.return'(0x73efeb0)
    "(anonymous namespace)::ReturnOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.return'(0x73c0f00) {
      "llvm.return"() : () -> ()
    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//
  } -> SUCCESS : pattern applied successfully
```
…riton-lang#1)

Summary: As the title says, `tl.program_id` needs to be supported first. As of now, we think pid will be provided as additional function arguments to the kernel, so getting program_id is mapped to reading one of the last three arguments. I also quickly implemented `tl.device_print` (or `print`), only for scalar types, for a quick "Hello, World!" test.

Test Plan: Tested with a simple example:

```
@triton.jit
def add_kernel(...):
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0.
    foo = pid + 42
    tl.device_print("Hello, World!", foo, pid)
```

The resulting .llir is valid:

```
@printfFormat_1 = internal constant [31 x i8] c"pid (%u, %u, %u) test: %u, %u\0A\00"

declare !dbg !3 i32 @printf(ptr, ...)

define void @add_kernel(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i32 %3, i32 %4, i32 %5, i32 %6) !dbg !7 {
  %8 = add i32 %4, 42, !dbg !8
  %9 = call i32 (ptr, ...) @printf(ptr @printfFormat_0, i32 %4, i32 %5, i32 %6, i32 %8, i32 %4)
  ret void, !dbg !9
}
```

Tried to compile with a fake main function:

```
> % cat main.c
extern void add_kernel(float*, float*, float*, int, int, int, int);
int main() { add_kernel(0, 0, 0, 4, 5, 6, 7); }
> % llc -filetype=obj add_kernel.llir && clang -o a.out add_kernel.llir.o main.c
> % ./a.out
pid (5, 6, 7) Hello, World!: 47, 5
```
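The calling convention described above (program ids appended as the kernel's last three arguments) can be sketched with a hypothetical pure-Python launcher; `launch_grid` and its signature are illustrative assumptions, not the backend's actual API:

```python
def launch_grid(kernel, grid, *args):
    """Call `kernel` once per grid point, appending the three program
    ids (x, y, z) as the kernel's last three arguments -- a sketch of
    the convention described above, not the real launcher."""
    gx, gy, gz = grid
    for z in range(gz):
        for y in range(gy):
            for x in range(gx):
                kernel(*args, x, y, z)
```

Under this convention, `tl.program_id(axis=0)` inside the kernel simply reads the third-from-last argument.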
Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Shane Nay <[email protected]>
…n-lang#4) Signed-off-by: Ilya Enkovich <[email protected]>
…ion flows (triton-lang#6)

* Support basic lowering through vector dialect in CPU backend.
* Use axis info in memory op lowering.
* Mark test_ptx_cast as enabled for CPU.
* Support umulhi operation.
* Support tl.clamp, tl.minimum, tl.maximum.
* Add enable_fp_fusion opt for CPU (only affects ASM dump now).
* Fix kernel args passing for propagated constants.
* Add permutations support.
* Support 2-D transfer_read/transfer_write lowering.
* Introduce shape info analysis and use it for loads/stores by block pointers. Delay scalar pointers lowering.
* Support 'other' arg for loads.
* Support tl.join.
* Minor renaming.

Signed-off-by: Ilya Enkovich <[email protected]>
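Of the ops listed, `tl.clamp`, `tl.minimum`, and `tl.maximum` have simple elementwise semantics; a scalar Python sketch (function names are mine, not the backend's):

```python
def minimum(a, b):
    return a if a < b else b

def maximum(a, b):
    return a if a > b else b

def clamp(x, lo, hi):
    # clamp(x, lo, hi) == minimum(maximum(x, lo), hi)
    return minimum(maximum(x, lo), hi)
```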
Signed-off-by: Ilya Enkovich <[email protected]>
…ent (triton-lang#8)

* [BACKEND][CPU] Make it buildable and runnable in a different environment
* Revert seemingly inconsistent Python code formatting
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]> Co-authored-by: Minjang Kim <[email protected]>
…iton-lang#11)

* [CPU] Support flexible active driver + update vector-add tutorial
* Update vector-add to always run on CPU + optional GPU
* Update do_bench for CPU
…ng#16) Signed-off-by: Gregory Shimansky <[email protected]>
…ng#17)

* Fixed yaml syntax
* Removed cpu label from runs-on
* Added missing zlib-dev
* Added missing apt-get update
* Remove pip cache because on self-hosted runners it slows things down
* Corrected path to tests
* Added installation of torch==2.1.2

Signed-off-by: Gregory Shimansky <[email protected]>
* [CPU] Add OpenMP launcher
* Address the comments
* Fix induction variable type
* Always use preallocated output buffer for CPU with torch.add
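The actual launcher uses OpenMP in C++; as a rough analogue, the launch grid can be iterated in parallel across threads. The sketch below substitutes Python's `ThreadPoolExecutor` for OpenMP and assumes a hypothetical kernel signature:

```python
from concurrent.futures import ThreadPoolExecutor

def launch_parallel(kernel, grid0, args, num_threads=4):
    # Each grid point becomes one task; OpenMP's parallel-for over the
    # launch grid is approximated with a thread pool here.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(lambda pid: kernel(*args, pid), range(grid0)))
```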
Signed-off-by: Ilya Enkovich <[email protected]>
* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR
* Don't touch the main compiler.py
Signed-off-by: Gregory Shimansky <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
…-lang#23)

* Add un-masked tiled matrix multiplication for triton-cpu
* Clean up and add comments
* Move test under tutorials
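An un-masked tiled matmul of the kind the tutorial adds can be sketched in plain Python (dimensions are assumed to divide evenly by the block sizes, which is what "un-masked" implies; names and block sizes are illustrative):

```python
def matmul_tiled(A, B, M, N, K, BM=2, BN=2, BK=2):
    """C = A @ B over row-major flat lists, computed block by block.
    Assumes M, N, K are multiples of BM, BN, BK (no masking)."""
    C = [0.0] * (M * N)
    for m0 in range(0, M, BM):
        for n0 in range(0, N, BN):
            for k0 in range(0, K, BK):
                # Accumulate one BM x BN tile's partial products.
                for m in range(m0, m0 + BM):
                    for n in range(n0, n0 + BN):
                        acc = C[m * N + n]
                        for k in range(k0, k0 + BK):
                            acc += A[m * K + k] * B[k * N + n]
                        C[m * N + n] = acc
    return C
```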
* Fix RelWithDebInfo build.
* Skip fp8 cast tests on CPU.
* Fix segfault.
* [BACKEND] Update LLVM version to llvm/llvm-project@765206e (triton-lang#4059)
* Add -s option to pytest run.
* Add a workaround for an LLVM bug causing a test failure on Skylake CPU.
* Add a workaround for an LLVM fpext bug causing a test failure on Skylake CPU.
* Fix formatting.

Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Pablo Zimmermann <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
* Lower memory ops with vector gather and scatter

  This commit adds lowerToGather and lowerToScatter for load and store conversion. Memory ops whose pointer is computed from splat and addptr can be lowered to vector.gather or vector.scatter. For architectures with gather/scatter support (like SVE and RVV), the code generated with this approach may be more efficient. Two options are added to scalarization and memory op conversion to enable lowering with gather and scatter operations.

* Fix incorrect rank and type in gather and scatter
* Lower store op with 1-D vector scatter ops

Signed-off-by: Junyi Mei <[email protected]>
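The gather/scatter lowering above can be illustrated with a pure-Python model of the semantics (names are illustrative; the real lowering emits MLIR `vector.gather`/`vector.scatter` ops):

```python
def gather(memory, base, offsets, mask, other):
    # vector.gather: indexed masked read; masked-off lanes take `other`.
    return [memory[base + o] if m else other
            for o, m in zip(offsets, mask)]

def scatter(memory, base, offsets, mask, values):
    # vector.scatter: indexed masked write; masked-off lanes are skipped.
    for o, m, v in zip(offsets, mask, values):
        if m:
            memory[base + o] = v
```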
* Add DotOp lowering to AMX operations.
* Support direct tile stores to output memory.
* Add lit tests for AMX.
* Fix review comments.

Signed-off-by: Ilya Enkovich <[email protected]>
…ang#161) This PR adds support for libdevice functions that don't map cleanly to a MathOp. We implement them using tt.extern_elementwise instead, indicating which Sleef function to use. While tt.extern_elementwise contains fields for the library path and name, the CUDA backend ignores those fields, as it always uses NVIDIA's libdevice library. We take a similar approach here and assume all extern calls go to the Sleef library. One difference is that we need to select the Sleef function based on the number of elements in the vector, which is done by interpolating that number into the symbol name. To indicate where this interpolation should occur, I have made `%(numel)` a special string value. This allows us to reuse tt.extern_elementwise without adding any extra attributes.
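The `%(numel)` interpolation can be sketched as a simple string substitution; the template below is illustrative, not the exact symbol naming the backend emits:

```python
def resolve_symbol(template: str, numel: int) -> str:
    # Replace the special "%(numel)" marker with the vector's element
    # count to select a lane-width-specific library entry point.
    return template.replace("%(numel)", str(numel))
```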
…iton-lang#162) libsleef does not implement 2-element functions, so for libdevice functions implemented via extern_elementwise ops that rely wholly on libsleef implementations (as opposed to MathOps, which can be lowered to native instructions), we need to pad those vectors to size 4. This allows us to enable test_math.py for all the functions introduced in triton-lang#161.
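The padding described above can be modeled in Python: a 2-element vector is widened to 4 lanes before the lane-wise call and the padded lanes are dropped afterwards (a sketch; the helper name and padding value are assumptions):

```python
import math

def call_padded(fn, vec, min_lanes=4):
    # Pad short vectors up to `min_lanes`, apply the function lane-wise,
    # then truncate the result back to the original length.
    n = len(vec)
    padded = vec + [0.0] * max(0, min_lanes - n)
    return [fn(x) for x in padded][:n]
```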
…ng#169) This commit skips tests on non-CUDA devices that expect to use use_cuda_graph. Signed-off-by: Dmitrii Makarenko <[email protected]>
…riton-lang#170) Signed-off-by: Ilya Enkovich <[email protected]>
…riton-lang#160) This commit adds the Memref type to the possible inputs of print. Memrefs have strides and other supporting information that allow printing multidimensional tensors (2D, 3D, etc.). Such printing will be added in the next PR.
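Memref-style printing relies on strides to address elements of a multidimensional tensor; the addressing rule can be sketched as follows (the helper name is mine, not the backend's):

```python
def strided_offset(indices, strides, base=0):
    # Memref addressing: element offset = base + sum(i_k * stride_k).
    # This is what makes walking a 2-D or 3-D tensor for printing possible.
    for i, s in zip(indices, strides):
        base += i * s
    return base
```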
* Enable num_threads in autotuner and use hooks for tuning on CPU.
* Add vector-add example for CPU with autotuner.

Signed-off-by: Ilya Enkovich <[email protected]>
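A hypothetical sketch of tuning over `num_threads` candidates by timing each configuration; the function and config shape are assumptions, not the real autotuner API:

```python
import time

def autotune(kernel, configs, *args):
    # Run the kernel once per candidate config, keep the fastest.
    best_cfg, best_t = None, float("inf")
    for cfg in configs:
        start = time.perf_counter()
        kernel(*args, **cfg)
        elapsed = time.perf_counter() - start
        if elapsed < best_t:
            best_cfg, best_t = cfg, elapsed
    return best_cfg
```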
) Signed-off-by: Ilya Enkovich <[email protected]>
Force-pushed 1979e0e to 82213ea
@maryamtahhan Thanks for putting up a PR! I have a quick question: are you proposing this PR only for triton-cpu, not triton? Could this PR be for triton as well? We'd want to minimize the divergence from triton upstream. If a change can also be made to triton, we'd recommend it.
No problem. I'm going to do a PR for Triton also; it will be a separate folder. In that case there would be two separate directories covering the GPU-enabled and non-GPU (CPU) containers. I can move the README to the Triton PR so there's an easier merge for you when the time comes.
I see. As triton-cpu is a fork, we're pretty flexible about taking your changes. But I'd recommend you land this PR upstream first.
Going to update this PR based on triton-lang#5143 |
Added a Dev Container configuration to streamline development and onboarding. This setup ensures a consistent, isolated environment with all necessary tools and dependencies for building and running Triton-CPU. The configuration supports use in both VS Code locally and GitHub Codespaces. Signed-off-by: Maryam Tahhan <[email protected]>
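For illustration, a minimal Dev Container configuration might look like the following; every field value here is an assumption for sketch purposes, not the contents of the PR's actual `.devcontainer/devcontainer.json`:

```json
{
  "name": "triton-cpu-dev",
  "image": "mcr.microsoft.com/devcontainers/cpp:ubuntu",
  "postCreateCommand": "pip install ninja cmake wheel",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python", "llvm-vs-code-extensions.vscode-clangd"]
    }
  }
}
```

The same file drives both a local VS Code "Reopen in Container" session and a GitHub Codespace, which is what gives the consistent environment the PR describes.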
Force-pushed 899b376 to 2027217
The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [ ] I am not making a trivial change, such as fixing a typo in a comment.
- [ ] I have written a PR description following these rules.
- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] It's not a code change per se.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)