feat: Dev Container for consistent dev setup #175
base: main
Conversation
* [BACKEND][CPU] Implement the empty CPU backend
* Run clang-format
* Fix yapf error
A quick addition on how to use it.
Summary: This is still a kind of boilerplate and basic lowering for the first milestone (compiling vector addition). This PR first lowers `tt.func` and `tt.return`.

Test Plan: It can safely compile an empty kernel.

```
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return
```

> TRITON_ENABLE_LLVM_DEBUG=1 TRITON_CPU_BACKEND=1 python3 empty_kernel.py

```
//===-------------------------------------------===//
Legalizing operation : 'tt.func'(0x73be2a0) {
  * Fold {
  } -> FAILURE : unable to fold
  * Pattern : 'tt.func -> ()' {
    Trying to match "(anonymous namespace)::FuncOpConversion"
    ** Insert : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Erase : 'tt.func'(0x73be2a0)
    "(anonymous namespace)::FuncOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.func'(0x6c04c70) {
    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//
...
//===-------------------------------------------===//
Legalizing operation : 'tt.return'(0x73efeb0) {
  "tt.return"() : () -> ()
  * Fold {
  } -> FAILURE : unable to fold
  * Pattern : 'tt.return -> ()' {
    Trying to match "(anonymous namespace)::ReturnOpConversion"
    ** Insert : 'llvm.return'(0x73c0f00)
    ** Replace : 'tt.return'(0x73efeb0)
    "(anonymous namespace)::ReturnOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.return'(0x73c0f00) {
      "llvm.return"() : () -> ()
    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//
  } -> SUCCESS : pattern applied successfully
```
…riton-lang#1)

Summary: As the title says, `tl.program_id` needs to be supported first. As of now, we think pid will be provided as additional function arguments to the kernel, so getting program_id is mapped to reading one of the last three arguments. I also quickly implemented `tl.device_print` (or `print`), only for scalar types, for a quick "Hello, World!" test.

Test Plan: Tested with a simple example:

```
@triton.jit
def add_kernel(...):
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0.
    foo = pid + 42
    tl.device_print("Hello, World!", foo, pid)
```

The resulting .llir is valid:

```
@printfFormat_1 = internal constant [31 x i8] c"pid (%u, %u, %u) test: %u, %u\0A\00"

declare !dbg !3 i32 @printf(ptr, ...)

define void @add_kernel(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i32 %3, i32 %4, i32 %5, i32 %6) !dbg !7 {
  %8 = add i32 %4, 42, !dbg !8
  %9 = call i32 (ptr, ...) @printf(ptr @printfFormat_0, i32 %4, i32 %5, i32 %6, i32 %8, i32 %4)
  ret void, !dbg !9
}
```

Tried to compile with a fake main function:

```
> % cat main.c
extern void add_kernel(float*, float*, float*, int, int, int, int);
int main() { add_kernel(0, 0, 0, 4, 5, 6, 7); }
> % llc -filetype=obj add_kernel.llir && clang -o a.out add_kernel.llir.o main.c
> % ./a.out
pid (5, 6, 7) Hello, World!: 47, 5
```
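The calling convention described above (program ids appended as the kernel's last three arguments) can be sketched with a hypothetical pure-Python launcher; `launch_grid` and its signature are illustrative assumptions, not the backend's actual API:

```python
def launch_grid(kernel, grid, *args):
    """Call `kernel` once per grid point, appending the three program
    ids (x, y, z) as the kernel's last three arguments -- a sketch of
    the convention described above, not the real launcher."""
    gx, gy, gz = grid
    for z in range(gz):
        for y in range(gy):
            for x in range(gx):
                kernel(*args, x, y, z)
```

Under this convention, `tl.program_id(axis=0)` inside the kernel simply reads the third-from-last argument.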
Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Shane Nay <[email protected]>
…n-lang#4) Signed-off-by: Ilya Enkovich <[email protected]>
…ion flows (triton-lang#6)

* Support basic lowering through vector dialect in CPU backend.
* Use axis info in memory op lowering.
* Mark test_ptx_cast as enabled for CPU.
* Support umulhi operation.
* Support tl.clamp, tl.minimum, tl.maximum.
* Add enable_fp_fusion opt for CPU (only affects ASM dump now).
* Fix kernel args passing for propagated constants.
* Add permutations support.
* Support 2-D transfer_read/transfer_write lowering.
* Introduce shape info analysis and use it for loads/stores by block pointers. Delay scalar pointers lowering.
* Support 'other' arg for loads.
* Support tl.join.
* Minor renaming.

Signed-off-by: Ilya Enkovich <[email protected]>
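Of the ops listed, `tl.clamp`, `tl.minimum`, and `tl.maximum` have simple elementwise semantics; a scalar Python sketch (function names are mine, not the backend's):

```python
def minimum(a, b):
    return a if a < b else b

def maximum(a, b):
    return a if a > b else b

def clamp(x, lo, hi):
    # clamp(x, lo, hi) == minimum(maximum(x, lo), hi)
    return minimum(maximum(x, lo), hi)
```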
Signed-off-by: Ilya Enkovich <[email protected]>
…ent (triton-lang#8)

* [BACKEND][CPU] Make it buildable and runnable in a different environment
* Revert seemingly inconsistent Python code formatting
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]> Co-authored-by: Minjang Kim <[email protected]>
…iton-lang#11)

* [CPU] Support flexible active driver + update vector-add tutorial
* Update vector-add to always run on CPU + optional GPU
* Update do_bench for CPU
…ng#16) Signed-off-by: Gregory Shimansky <[email protected]>
…ng#17)

* Fixed yaml syntax
* Removed cpu label from runs-on
* Added missing zlib-dev
* Added missing apt-get update
* Remove pip cache because on self-hosted runners it slows things down
* Corrected path to tests
* Added installation of torch==2.1.2

Signed-off-by: Gregory Shimansky <[email protected]>
* [CPU] Add OpenMP launcher
* Address the comments
* Fix induction variable type
* Always use preallocated output buffer for CPU with torch.add
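The actual launcher uses OpenMP in C++; as a rough analogue, the launch grid can be iterated in parallel across threads. The sketch below substitutes Python's `ThreadPoolExecutor` for OpenMP and assumes a hypothetical kernel signature:

```python
from concurrent.futures import ThreadPoolExecutor

def launch_parallel(kernel, grid0, args, num_threads=4):
    # Each grid point becomes one task; OpenMP's parallel-for over the
    # launch grid is approximated with a thread pool here.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(lambda pid: kernel(*args, pid), range(grid0)))
```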
Signed-off-by: Ilya Enkovich <[email protected]>
* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR
* Don't touch the main compiler.py
Signed-off-by: Gregory Shimansky <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
…-lang#23)

* Add un-masked tiled matrix multiplication for triton-cpu
* Clean up and add comments
* Move test under tutorials
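An un-masked tiled matmul of the kind the tutorial adds can be sketched in plain Python (dimensions are assumed to divide evenly by the block sizes, which is what "un-masked" implies; names and block sizes are illustrative):

```python
def matmul_tiled(A, B, M, N, K, BM=2, BN=2, BK=2):
    """C = A @ B over row-major flat lists, computed block by block.
    Assumes M, N, K are multiples of BM, BN, BK (no masking)."""
    C = [0.0] * (M * N)
    for m0 in range(0, M, BM):
        for n0 in range(0, N, BN):
            for k0 in range(0, K, BK):
                # Accumulate one BM x BN tile's partial products.
                for m in range(m0, m0 + BM):
                    for n in range(n0, n0 + BN):
                        acc = C[m * N + n]
                        for k in range(k0, k0 + BK):
                            acc += A[m * K + k] * B[k * N + n]
                        C[m * N + n] = acc
    return C
```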
* Fix RelWithDebInfo build.
* Skip fp8 cast tests on CPU.
* Fix segfault.
* [BACKEND] Update LLVM version to llvm/llvm-project@765206e (triton-lang#4059)
* Add -s option to pytest run.
* Add a workaround for an LLVM bug causing a test failure on Skylake CPU.
* Add a workaround for an LLVM fpext bug causing a test failure on Skylake CPU.
* Fix formatting.

Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Pablo Zimmermann <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
* Lower memory ops with vector gather and scatter

  This commit adds lowerToGather and lowerToScatter for load and store conversion. Memory ops whose pointer is computed from splat and addptr can be lowered to vector.gather or vector.scatter. For architectures with gather/scatter support (like SVE and RVV), the code generated with this approach may be more efficient. Two options are added to scalarization and memory op conversion to enable lowering with gather and scatter operations.

* Fix incorrect rank and type in gather and scatter
* Lower store op with 1-D vector scatter ops

Signed-off-by: Junyi Mei <[email protected]>
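The gather/scatter lowering above can be illustrated with a pure-Python model of the semantics (names are illustrative; the real lowering emits MLIR `vector.gather`/`vector.scatter` ops):

```python
def gather(memory, base, offsets, mask, other):
    # vector.gather: indexed masked read; masked-off lanes take `other`.
    return [memory[base + o] if m else other
            for o, m in zip(offsets, mask)]

def scatter(memory, base, offsets, mask, values):
    # vector.scatter: indexed masked write; masked-off lanes are skipped.
    for o, m, v in zip(offsets, mask, values):
        if m:
            memory[base + o] = v
```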
* Add DotOp lowering to AMX operations.
* Support direct tile stores to output memory.
* Add lit tests for AMX.
* Fix review comments.

Signed-off-by: Ilya Enkovich <[email protected]>
…ang#161) This PR adds support for libdevice functions that don't map cleanly to a MathOp. We implement them using tt.extern_elementwise instead, indicating which Sleef function to use. While tt.extern_elementwise contains fields for the library path and name, the CUDA backend ignores those fields, as it always uses NVIDIA's libdevice library. We take a similar approach here and assume all extern calls go to the Sleef library. One difference is that we need to select the Sleef function based on the number of elements in the vector, which is done by interpolating that number into the symbol name. To indicate where this interpolation should occur, I have made `%(numel)` a special string value. This allows us to reuse tt.extern_elementwise without adding any extra attributes.
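The `%(numel)` interpolation can be sketched as a simple string substitution; the template below is illustrative, not the exact symbol naming the backend emits:

```python
def resolve_symbol(template: str, numel: int) -> str:
    # Replace the special "%(numel)" marker with the vector's element
    # count to select a lane-width-specific library entry point.
    return template.replace("%(numel)", str(numel))
```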
…iton-lang#162) libsleef does not implement 2-element functions, so for libdevice functions implemented via extern_elementwise ops that rely wholly on libsleef implementations (as opposed to MathOps, which can be lowered to native instructions), we need to pad those vectors to size 4. This allows us to enable test_math.py for all the functions introduced in triton-lang#161.
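The padding described above can be modeled in Python: a 2-element vector is widened to 4 lanes before the lane-wise call and the padded lanes are dropped afterwards (a sketch; the helper name and padding value are assumptions):

```python
import math

def call_padded(fn, vec, min_lanes=4):
    # Pad short vectors up to `min_lanes`, apply the function lane-wise,
    # then truncate the result back to the original length.
    n = len(vec)
    padded = vec + [0.0] * max(0, min_lanes - n)
    return [fn(x) for x in padded][:n]
```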
…ng#169) This commit skips tests on non-CUDA devices that expect to use use_cuda_graph. Signed-off-by: Dmitrii Makarenko <[email protected]>
…riton-lang#170) Signed-off-by: Ilya Enkovich <[email protected]>
…riton-lang#160) This commit adds the Memref type to the possible inputs of print. Memrefs have strides and other supporting information that allow printing multidimensional tensors (2D, 3D, etc.). Such printing will be added in the next PR.
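Memref-style printing relies on strides to address elements of a multidimensional tensor; the addressing rule can be sketched as follows (the helper name is mine, not the backend's):

```python
def strided_offset(indices, strides, base=0):
    # Memref addressing: element offset = base + sum(i_k * stride_k).
    # This is what makes walking a 2-D or 3-D tensor for printing possible.
    for i, s in zip(indices, strides):
        base += i * s
    return base
```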
* Enable num_threads in autotuner and use hooks for tuning on CPU.
* Add vector-add example for CPU with autotuner.

Signed-off-by: Ilya Enkovich <[email protected]>
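A hypothetical sketch of tuning over `num_threads` candidates by timing each configuration; the function and config shape are assumptions, not the real autotuner API:

```python
import time

def autotune(kernel, configs, *args):
    # Run the kernel once per candidate config, keep the fastest.
    best_cfg, best_t = None, float("inf")
    for cfg in configs:
        start = time.perf_counter()
        kernel(*args, **cfg)
        elapsed = time.perf_counter() - start
        if elapsed < best_t:
            best_cfg, best_t = cfg, elapsed
    return best_cfg
```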
) Signed-off-by: Ilya Enkovich <[email protected]>
Force-pushed 1979e0e to 82213ea
@maryamtahhan Thanks for putting up a PR! I have a quick question: are you proposing this PR only for triton-cpu, not triton? Could this PR be for triton as well? We'd want to minimize the divergence from triton upstream. If a change can also be made to triton, we'd recommend it.
No problem. I'm going to do a PR for Triton also; it will be a separate folder. In that case there would be two separate directories covering the GPU-enabled and non-GPU (CPU) containers. I can move the README to the Triton PR so there's an easier merge for you when the time comes.
I see. As triton-cpu is a fork, we're pretty flexible about taking your changes. But I'd recommend you land this PR upstream first.
Going to update this PR based on triton-lang#5143 |
Added a Dev Container configuration to streamline development and onboarding. This setup ensures a consistent, isolated environment with all necessary tools and dependencies for building and running Triton-CPU. The configuration supports use in both VS Code locally and GitHub Codespaces. Signed-off-by: Maryam Tahhan <[email protected]>
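For illustration, a minimal Dev Container configuration might look like the following; every field value here is an assumption for sketch purposes, not the contents of the PR's actual `.devcontainer/devcontainer.json`:

```json
{
  "name": "triton-cpu-dev",
  "image": "mcr.microsoft.com/devcontainers/cpp:ubuntu",
  "postCreateCommand": "pip install ninja cmake wheel",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python", "llvm-vs-code-extensions.vscode-clangd"]
    }
  }
}
```

The same file drives both a local VS Code "Reopen in Container" session and a GitHub Codespace, which is what gives the consistent environment the PR describes.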
Force-pushed 899b376 to 2027217
The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [ ] I am not making a trivial change, such as fixing a typo in a comment.
- [ ] I have written a PR description following these rules.
- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] It's not a code change per se.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)