
[OneDNN] Ukernel Backend interface #197

Merged (7 commits) on Feb 18, 2025

Conversation

@Devjiu (Collaborator) commented Dec 16, 2024:

This PR introduces a ukernels API to allow the use of third-party libraries such as oneDNN. These libraries provide efficient implementations for brgemm/transform and some other ops. Where possible, the triton_cpu.dot op is replaced with a call to a kernel from such a library.
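For context, here is a minimal Triton matmul kernel sketch (illustrative only, not code from this PR). The tl.dot below is the kind of op that lowers to triton_cpu.dot and, with ukernels enabled, becomes a candidate for replacement with a library brgemm call instead of the FMA loop nest.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak,
                  stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Assumes M, N, K are multiples of the block sizes (no masking, for brevity).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        offs_k = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
        b = tl.load(b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
        # This dot is what the new pass looks for; when profitable it can be
        # replaced by a brgemm ukernel call from the library.
        acc = tl.dot(a, b, acc)
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc)
```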

@Devjiu marked this pull request as ready for review on January 31, 2025 and requested reviews from ptillet (code owner) and ienkovich.
@ienkovich (Collaborator) left a comment:

Overall looks good! I have only minor comments, mostly related to wording.

@Devjiu changed the title from "OneDNN ukernel" to "[OneDNN] Ukernel Backend interface" on Feb 18, 2025.
@ienkovich (Collaborator) left a comment:

LGTM

@Devjiu merged commit 9ae3f67 into triton-lang:main on Feb 18, 2025 (3 checks passed) and deleted the dmitriim/onednn_ukernel_rebased branch.
@digantdesai (Collaborator) commented:

Can you share any performance data before/after? Just curious.

@ienkovich (Collaborator) commented:

> Can you share any performance data before/after? Just curious.

Ukernels are still disabled by default, so there is no performance impact yet. We want to do more experiments and tuning before enabling ukernels by default.

@Devjiu Could you please share the current perf numbers for the FP32 case in comparison with the FMA lowering?

@Devjiu (Collaborator, Author) commented Feb 19, 2025:

>> Can you share any performance data before/after? Just curious.
>
> Ukernels are still disabled by default, so there is no performance impact yet. We want to do more experiments and tuning before enabling ukernels by default.
>
> @Devjiu Could you please share the current perf numbers for the FP32 case in comparison with the FMA lowering?

[plot: matmul performance for the configurations listed below, size M=N=K on the x axis]

These results are from an SPR (Sapphire Rapids) server; the configuration is described below. The benchmarks target matmuls.

Everything marked OneDNN uses the ukernels API. The x axis is the size (M=N=K); the y axis is the performance number reported by the Triton benchmark. See python/tutorials/cpu-blocked-matmul.py.

  1. Torch (PyTorch version 2.5.1) - torch.matmul
  2. Triton - with FMA (these passes are currently enabled by default)
  3. Triton OneDNN - with ukernels enabled
  4. Triton Blocked Transposed - with FMA; B uses a blocked layout, and its block traversal order is transposed (see the packing sketch after this list)
  5. Triton Blocked Transposed OneDNN - same as the previous, but with OneDNN
  6. Triton Blocked Transposed Prepacked - with FMA, not accounting for the time spent packing B
  7. Triton Blocked Transposed Prepacked OneDNN - with OneDNN, not accounting for the time spent packing B
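As an illustrative sketch (not the PR's packing code) of what "B uses a blocked layout with transposed block traversal order" roughly means: B (K x N) is cut into BK x BN blocks, and the blocks are stored with the N-block index outermost instead of the K-block index. The block sizes here are assumptions chosen for the example.

```python
import numpy as np

def pack_b_blocked_transposed(B, BK=32, BN=32):
    K, N = B.shape
    assert K % BK == 0 and N % BN == 0  # assume divisible sizes for simplicity
    # Result layout: [n_block][k_block][BK][BN] -- the block traversal order is
    # transposed (N-blocks outermost) relative to a plain row-of-blocks layout.
    packed = np.empty((N // BN, K // BK, BK, BN), dtype=B.dtype)
    for nb in range(N // BN):
        for kb in range(K // BK):
            packed[nb, kb] = B[kb * BK:(kb + 1) * BK, nb * BN:(nb + 1) * BN]
    return packed

B = np.random.rand(128, 128).astype(np.float32)
Bp = pack_b_blocked_transposed(B)   # shape (4, 4, 32, 32)
```

The environment configuration used for these runs: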
x86_energy_perf_policy --turbo-enable 0
cpupower frequency-set -g performance
cpupower frequency-set -u/-d 2.4GHz
# hyperthreading off
echo off > /sys/devices/system/cpu/smt/control

export OMP_NUM_THREADS=48
export TRITON_CPU_MAX_THREADS=48

export KMP_AFFINITY=granularity=fine,compact,$SKIP,0
export TRITON_ALWAYS_COMPILE=1

# libiomp5 was used as the OpenMP runtime
export LD_PRELOAD=./.venv/lib/libiomp5.so

# numactl was used to bind execution to a single NUMA node
numactl -m 1 --physcpubind=48-95 ...
# or
numactl -m 0 --physcpubind=0-47 ...

# FYI numactl --hardware:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
# node 0 size: 128540 MB
# node 0 free: 80373 MB
# node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
# node 1 size: 128983 MB
# node 1 free: 83254 MB
# node distances:
# node   0   1
#  0:  10  21
#  1:  21  10

Raw numbers:

| M | N | K | Triton BlockedB Transposed | Triton BlockedB Transposed Prepacked | Triton | Triton BlockedB Transposed OneDNN | Triton BlockedB Transposed Prepacked OneDNN | Triton OneDNN | Torch |
|---|---|---|---|---|---|---|---|---|---|
| 256 | 256 | 256 | 582.997337 | 749.585374 | 582.440977 | 487.901336 | 608.216957 | 549.064624 | 264.92255 |
| 384 | 384 | 384 | 1109.71808 | 1285.340057 | 1005.33739 | 963.596618 | 1148.553291 | 1006.212962 | 1092.377276 |
| 512 | 512 | 512 | 1542.120823 | 1715.921907 | 1381.124879 | 1225.35792 | 1362.312358 | 1285.271849 | 617.525973 |
| 640 | 640 | 640 | 1784.339427 | 2123.54311 | 1689.3498 | 1709.342885 | 1894.651312 | 1748.702259 | 1770.071265 |
| 768 | 768 | 768 | 2109.535842 | 2288.067893 | 1841.600158 | 1993.021346 | 2187.29172 | 2053.640401 | 1454.382408 |
| 896 | 896 | 896 | 2322.054486 | 2783.101281 | 2177.711602 | 2241.367457 | 2536.81192 | 2460.587319 | 2750.34769 |
| 1024 | 1024 | 1024 | 2702.491642 | 3001.693603 | 2053.031934 | 2576.533172 | 2806.133622 | 2603.398551 | 2447.057406 |
| 1152 | 1152 | 1152 | 2935.23937 | 3523.081974 | 2434.045314 | 2814.329123 | 3274.634599 | 3193.065578 | 3971.904476 |
| 1280 | 1280 | 1280 | 3264.110755 | 3952.27442 | 2569.61732 | 3210.528869 | 3655.020576 | 3378.973134 | 3061.593831 |
| 1408 | 1408 | 1408 | 3502.608628 | 4313.297189 | 2591.573074 | 3449.938943 | 4078.506199 | 3838.335196 | 3397.773614 |
| 1536 | 1536 | 1536 | 3394.507056 | 4013.127916 | 1973.125762 | 3740.932411 | 4317.183944 | 2933.082821 | 4258.110415 |
| 1664 | 1664 | 1664 | 3696.898638 | 4511.583055 | 2702.702346 | 3884.808416 | 4519.542384 | 4213.101518 | 4437.534407 |
| 1792 | 1792 | 1792 | 3868.221722 | 4615.750468 | 2204.697298 | 3985.876191 | 4627.957237 | 3468.416425 | 4661.674481 |
| 1920 | 1920 | 1920 | 4088.809543 | 4737.63265 | 2724.335724 | 4149.607656 | 4836.419976 | 4320.152246 | 4772.06468 |
| 2048 | 2048 | 2048 | 4113.491517 | 4856.847561 | 2042.690598 | 4170.95743 | 4818.65665 | 2872.900643 | 4793.52649 |
| 2176 | 2176 | 2176 | 4306.553077 | 5013.202009 | 2684.692487 | 4368.11607 | 5058.120415 | 3782.222807 | 5053.879139 |
| 2304 | 2304 | 2304 | 4389.836268 | 5059.019761 | 2063.024088 | 4471.491058 | 5107.19834 | 2904.336887 | 5242.693199 |
| 2432 | 2432 | 2432 | 4517.970333 | 5260.741368 | 2637.035691 | 4489.647695 | 5170.42324 | 3471.355926 | 5345.872229 |
| 2560 | 2560 | 2560 | 4647.118714 | 5329.220552 | 1951.348148 | 4498.518676 | 5109.621689 | 2817.21201 | 5269.853532 |
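A quick way to read the table (values copied from the rows above): the ratio of the "Triton BlockedB Transposed Prepacked OneDNN" column to the Torch column for a few sizes.

```python
# Values taken directly from the table above.
sizes = [1024, 2048, 2560]
prepacked_onednn = [2806.133622, 4818.65665, 5109.621689]
torch_eager = [2447.057406, 4793.52649, 5269.853532]
for s, a, b in zip(sizes, prepacked_onednn, torch_eager):
    print(f"M=N=K={s}: prepacked-oneDNN / torch = {a / b:.2f}")
# -> roughly 1.15, 1.01, and 0.97: close to or above Torch for most sizes.
```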

@Devjiu (Collaborator, Author) commented Feb 19, 2025:

The BF16 scenario is not as good: there are several limitations in the oneDNN ukernels API and its execution approach that result in lower performance than can be achieved with the AMX dialect.

@digantdesai (Collaborator) commented:

This is awesome! Thanks for sharing the data.

If I am reading the graph correctly, Triton BlockedB Transposed Prepacked OneDNN is almost as good as or better than Torch in most cases; am I reading that right?

A couple of quick questions: does Torch here mean compile or eager? And am I right to assume that on SPR we are using AMX kernels on the Torch side as well?

@Devjiu (Collaborator, Author) commented Feb 20, 2025:

> This is awesome! Thanks for sharing the data.
>
> If I am reading the graph correctly, Triton BlockedB Transposed Prepacked OneDNN is almost as good as or better than Torch in most cases; am I reading that right?
>
> A couple of quick questions: does Torch here mean compile or eager? And am I right to assume that on SPR we are using AMX kernels on the Torch side as well?

@ienkovich Please correct me if I am wrong anywhere.

For Torch it's just eager mode (a plain torch.matmul call). As far as I understand, PyTorch also uses the oneDNN library.

The plot shows the FP32 case; AMX only works with floating-point types of bit width <= 16, so in this case Triton uses the FMA lowering.

Prepacked is not an entirely fair way to compare performance: in these configurations we do not account for the time spent on data preparation (transpose, block packing, VNNI packing).

In the real world, the most relevant case is 'Triton Blocked Transposed', with or without OneDNN.
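As an aside, here is an illustrative numpy sketch of the "VNNI packing" data-preparation step mentioned above: for 16-bit types, pairs of K elements are interleaved so the hardware dot instructions can consume them. This is the common VNNI-style layout, not code taken from the PR.

```python
import numpy as np

def vnni_pack_16bit(B):
    K, N = B.shape
    assert K % 2 == 0  # VNNI for 16-bit types groups K in pairs
    # Reorder (K, N) -> (K/2, N, 2): elements (k, n) and (k+1, n) become adjacent.
    return B.reshape(K // 2, 2, N).transpose(0, 2, 1).copy()

B = np.arange(8 * 4, dtype=np.float32).reshape(8, 4)  # stand-in for bf16 data
Bv = vnni_pack_16bit(B)
print(Bv.shape)  # (4, 4, 2)
```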

@ienkovich (Collaborator) commented:

> A couple of quick questions: does Torch here mean compile or eager? And am I right to assume that on SPR we are using AMX kernels on the Torch side as well?

We use Torch 2.5 in eager mode for comparison. This version uses MKL for FP32 matmul and oneDNN for BF16 matmul. So yes, it utilizes AMX when possible.
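For reference, a quick way to check which math libraries a local PyTorch build can dispatch to (standard torch.backends queries, not specific to this PR):

```python
import torch

print(torch.__version__)
print("MKL available:   ", torch.backends.mkl.is_available())
print("oneDNN available:", torch.backends.mkldnn.is_available())
# torch.__config__.show() returns the full build configuration string,
# including BLAS/oneDNN details.
```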

@digantdesai (Collaborator) commented:

Thanks. Is there a reason not to also compare against Inductor? IIUC, there has been quite a lot of effort from Intel to push performance there as well.

@ienkovich (Collaborator) commented:

Inductor just generates a library call for matmuls. In previous runs it was always slower than eager mode, so we excluded it from the measurements and compare against the fastest option.

Devjiu added commits to Devjiu/triton-cpu referencing this pull request between Feb 20 and Mar 3, 2025, each carrying the same description as this PR (Signed-off-by: Dmitrii Makarenko <[email protected]>; Co-authored-by: Ilya Enkovich <[email protected]>).