
[OneDNN] Ukernel Backend interface #197

Merged (7 commits) on Feb 18, 2025

Conversation

@Devjiu (Collaborator) commented Dec 16, 2024:

This PR introduces a ukernels API to allow the use of third-party libraries such as oneDNN. These libraries provide efficient implementations for brgemm/transform and some other ops. Where possible, the triton_cpu.dot op is replaced with a call to a kernel from such a library.
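For context, here is a minimal Triton matmul kernel sketch (illustrative only, not code from this PR). The tl.dot below is the kind of op that lowers to triton_cpu.dot and, with ukernels enabled, becomes a candidate for replacement with a library brgemm call instead of the FMA loop nest.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak,
                  stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Assumes M, N, K are multiples of the block sizes (no masking, for brevity).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        offs_k = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
        b = tl.load(b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
        # This dot is what the new pass looks for; when profitable it can be
        # replaced by a brgemm ukernel call from the library.
        acc = tl.dot(a, b, acc)
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc)
```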

@Devjiu marked this pull request as ready for review on January 31, 2025 and requested reviews from ptillet (code owner) and ienkovich.
@ienkovich (Collaborator) left a comment:

Overall looks good! I have only minor comments, mostly related to wording.

@Devjiu changed the title from "OneDNN ukernel" to "[OneDNN] Ukernel Backend interface" on Feb 18, 2025.
@ienkovich (Collaborator) left a comment:

LGTM

@Devjiu merged commit 9ae3f67 into triton-lang:main on Feb 18, 2025 (3 checks passed) and deleted the dmitriim/onednn_ukernel_rebased branch.
@digantdesai (Collaborator) commented:

Can you share any performance data before/after? Just curious.

@ienkovich (Collaborator) commented:

> Can you share any performance data before/after? Just curious.

Ukernels are still disabled by default, so there is no performance impact yet. We want to do more experiments and tuning before enabling ukernels by default.

@Devjiu Could you please share the current perf numbers for the FP32 case in comparison with the FMA lowering?

@Devjiu (Collaborator, Author) commented Feb 19, 2025:

>> Can you share any performance data before/after? Just curious.
>
> Ukernels are still disabled by default, so there is no performance impact yet. We want to do more experiments and tuning before enabling ukernels by default.
>
> @Devjiu Could you please share the current perf numbers for the FP32 case in comparison with the FMA lowering?

[plot: matmul performance for the configurations listed below, size M=N=K on the x axis]

These results are from an SPR (Sapphire Rapids) server; the configuration is described below. The benchmarks target matmuls.

Everything marked OneDNN uses the ukernels API. The x axis is the size (M=N=K); the y axis is the performance number reported by the Triton benchmark. See python/tutorials/cpu-blocked-matmul.py.

  1. Torch (PyTorch version 2.5.1) - torch.matmul
  2. Triton - with FMA (these passes are currently enabled by default)
  3. Triton OneDNN - with ukernels enabled
  4. Triton Blocked Transposed - with FMA; B uses a blocked layout, and its block traversal order is transposed (see the packing sketch after this list)
  5. Triton Blocked Transposed OneDNN - same as the previous, but with OneDNN
  6. Triton Blocked Transposed Prepacked - with FMA, not accounting for the time spent packing B
  7. Triton Blocked Transposed Prepacked OneDNN - with OneDNN, not accounting for the time spent packing B
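As an illustrative sketch (not the PR's packing code) of what "B uses a blocked layout with transposed block traversal order" roughly means: B (K x N) is cut into BK x BN blocks, and the blocks are stored with the N-block index outermost instead of the K-block index. The block sizes here are assumptions chosen for the example.

```python
import numpy as np

def pack_b_blocked_transposed(B, BK=32, BN=32):
    K, N = B.shape
    assert K % BK == 0 and N % BN == 0  # assume divisible sizes for simplicity
    # Result layout: [n_block][k_block][BK][BN] -- the block traversal order is
    # transposed (N-blocks outermost) relative to a plain row-of-blocks layout.
    packed = np.empty((N // BN, K // BK, BK, BN), dtype=B.dtype)
    for nb in range(N // BN):
        for kb in range(K // BK):
            packed[nb, kb] = B[kb * BK:(kb + 1) * BK, nb * BN:(nb + 1) * BN]
    return packed

B = np.random.rand(128, 128).astype(np.float32)
Bp = pack_b_blocked_transposed(B)   # shape (4, 4, 32, 32)
```

The environment configuration used for these runs: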
x86_energy_perf_policy --turbo-enable 0
cpupower frequency-set -g performance
cpupower frequency-set -u/-d 2.4GHz
# hyperthreading off
echo off > /sys/devices/system/cpu/smt/control

export OMP_NUM_THREADS=48
export TRITON_CPU_MAX_THREADS=48

export KMP_AFFINITY=granularity=fine,compact,$SKIP,0
export TRITON_ALWAYS_COMPILE=1

# libiomp5 was used as the OpenMP runtime
export LD_PRELOAD=./.venv/lib/libiomp5.so

# numactl was used to bind execution to a single NUMA node
numactl -m 1 --physcpubind=48-95 ...
# or
numactl -m 0 --physcpubind=0-47 ...

# FYI numactl --hardware:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
# node 0 size: 128540 MB
# node 0 free: 80373 MB
# node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
# node 1 size: 128983 MB
# node 1 free: 83254 MB
# node distances:
# node   0   1
#  0:  10  21
#  1:  21  10

Raw numbers:

| M | N | K | Triton BlockedB Transposed | Triton BlockedB Transposed Prepacked | Triton | Triton BlockedB Transposed OneDNN | Triton BlockedB Transposed Prepacked OneDNN | Triton OneDNN | Torch |
|---|---|---|---|---|---|---|---|---|---|
| 256 | 256 | 256 | 582.997337 | 749.585374 | 582.440977 | 487.901336 | 608.216957 | 549.064624 | 264.92255 |
| 384 | 384 | 384 | 1109.71808 | 1285.340057 | 1005.33739 | 963.596618 | 1148.553291 | 1006.212962 | 1092.377276 |
| 512 | 512 | 512 | 1542.120823 | 1715.921907 | 1381.124879 | 1225.35792 | 1362.312358 | 1285.271849 | 617.525973 |
| 640 | 640 | 640 | 1784.339427 | 2123.54311 | 1689.3498 | 1709.342885 | 1894.651312 | 1748.702259 | 1770.071265 |
| 768 | 768 | 768 | 2109.535842 | 2288.067893 | 1841.600158 | 1993.021346 | 2187.29172 | 2053.640401 | 1454.382408 |
| 896 | 896 | 896 | 2322.054486 | 2783.101281 | 2177.711602 | 2241.367457 | 2536.81192 | 2460.587319 | 2750.34769 |
| 1024 | 1024 | 1024 | 2702.491642 | 3001.693603 | 2053.031934 | 2576.533172 | 2806.133622 | 2603.398551 | 2447.057406 |
| 1152 | 1152 | 1152 | 2935.23937 | 3523.081974 | 2434.045314 | 2814.329123 | 3274.634599 | 3193.065578 | 3971.904476 |
| 1280 | 1280 | 1280 | 3264.110755 | 3952.27442 | 2569.61732 | 3210.528869 | 3655.020576 | 3378.973134 | 3061.593831 |
| 1408 | 1408 | 1408 | 3502.608628 | 4313.297189 | 2591.573074 | 3449.938943 | 4078.506199 | 3838.335196 | 3397.773614 |
| 1536 | 1536 | 1536 | 3394.507056 | 4013.127916 | 1973.125762 | 3740.932411 | 4317.183944 | 2933.082821 | 4258.110415 |
| 1664 | 1664 | 1664 | 3696.898638 | 4511.583055 | 2702.702346 | 3884.808416 | 4519.542384 | 4213.101518 | 4437.534407 |
| 1792 | 1792 | 1792 | 3868.221722 | 4615.750468 | 2204.697298 | 3985.876191 | 4627.957237 | 3468.416425 | 4661.674481 |
| 1920 | 1920 | 1920 | 4088.809543 | 4737.63265 | 2724.335724 | 4149.607656 | 4836.419976 | 4320.152246 | 4772.06468 |
| 2048 | 2048 | 2048 | 4113.491517 | 4856.847561 | 2042.690598 | 4170.95743 | 4818.65665 | 2872.900643 | 4793.52649 |
| 2176 | 2176 | 2176 | 4306.553077 | 5013.202009 | 2684.692487 | 4368.11607 | 5058.120415 | 3782.222807 | 5053.879139 |
| 2304 | 2304 | 2304 | 4389.836268 | 5059.019761 | 2063.024088 | 4471.491058 | 5107.19834 | 2904.336887 | 5242.693199 |
| 2432 | 2432 | 2432 | 4517.970333 | 5260.741368 | 2637.035691 | 4489.647695 | 5170.42324 | 3471.355926 | 5345.872229 |
| 2560 | 2560 | 2560 | 4647.118714 | 5329.220552 | 1951.348148 | 4498.518676 | 5109.621689 | 2817.21201 | 5269.853532 |
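A quick way to read the table (values copied from the rows above): the ratio of the "Triton BlockedB Transposed Prepacked OneDNN" column to the Torch column for a few sizes.

```python
# Values taken directly from the table above.
sizes = [1024, 2048, 2560]
prepacked_onednn = [2806.133622, 4818.65665, 5109.621689]
torch_eager = [2447.057406, 4793.52649, 5269.853532]
for s, a, b in zip(sizes, prepacked_onednn, torch_eager):
    print(f"M=N=K={s}: prepacked-oneDNN / torch = {a / b:.2f}")
# -> roughly 1.15, 1.01, and 0.97: close to or above Torch for most sizes.
```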

@Devjiu (Collaborator, Author) commented Feb 19, 2025:

The BF16 scenario is not as good: there are several limitations in the oneDNN ukernels API and its execution approach that result in lower performance than can be achieved with the AMX dialect.

@digantdesai (Collaborator) commented:

This is awesome! Thanks for sharing the data.

If I am reading the graph correctly, Triton BlockedB Transposed Prepacked OneDNN is almost as good as or better than Torch in most cases; am I reading that right?

A couple of quick questions: does Torch here mean compile or eager? And am I right to assume that on SPR we are using AMX kernels on the Torch side as well?

@Devjiu (Collaborator, Author) commented Feb 20, 2025:

> This is awesome! Thanks for sharing the data.
>
> If I am reading the graph correctly, Triton BlockedB Transposed Prepacked OneDNN is almost as good as or better than Torch in most cases; am I reading that right?
>
> A couple of quick questions: does Torch here mean compile or eager? And am I right to assume that on SPR we are using AMX kernels on the Torch side as well?

@ienkovich Please correct me if I am wrong anywhere.

For Torch it's just eager mode (a plain torch.matmul call). As far as I understand, PyTorch also uses the oneDNN library.

The plot shows the FP32 case; AMX only works with floating-point types of bit width <= 16, so in this case Triton uses the FMA lowering.

Prepacked is not an entirely fair way to compare performance: in these configurations we do not account for the time spent on data preparation (transpose, block packing, VNNI packing).

In the real world, the most relevant case is 'Triton Blocked Transposed', with or without OneDNN.
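As an aside, here is an illustrative numpy sketch of the "VNNI packing" data-preparation step mentioned above: for 16-bit types, pairs of K elements are interleaved so the hardware dot instructions can consume them. This is the common VNNI-style layout, not code taken from the PR.

```python
import numpy as np

def vnni_pack_16bit(B):
    K, N = B.shape
    assert K % 2 == 0  # VNNI for 16-bit types groups K in pairs
    # Reorder (K, N) -> (K/2, N, 2): elements (k, n) and (k+1, n) become adjacent.
    return B.reshape(K // 2, 2, N).transpose(0, 2, 1).copy()

B = np.arange(8 * 4, dtype=np.float32).reshape(8, 4)  # stand-in for bf16 data
Bv = vnni_pack_16bit(B)
print(Bv.shape)  # (4, 4, 2)
```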

@ienkovich (Collaborator) commented:

> A couple of quick questions: does Torch here mean compile or eager? And am I right to assume that on SPR we are using AMX kernels on the Torch side as well?

We use Torch 2.5 in eager mode for comparison. This version uses MKL for FP32 matmul and oneDNN for BF16 matmul. So yes, it utilizes AMX when possible.
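For reference, a quick way to check which math libraries a local PyTorch build can dispatch to (standard torch.backends queries, not specific to this PR):

```python
import torch

print(torch.__version__)
print("MKL available:   ", torch.backends.mkl.is_available())
print("oneDNN available:", torch.backends.mkldnn.is_available())
# torch.__config__.show() returns the full build configuration string,
# including BLAS/oneDNN details.
```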

@digantdesai (Collaborator) commented:

Thanks. Is there a reason not to also compare against Inductor? IIUC, there has been quite a lot of effort from Intel to push performance there as well.

@ienkovich (Collaborator) commented:

Inductor just generates a library call for matmuls. In previous runs it was always slower than eager mode, so we excluded it from the measurements and compare against the fastest option.

Devjiu added commits to Devjiu/triton-cpu referencing this pull request between Feb 20 and Mar 3, 2025, each carrying the same description as this PR (Signed-off-by: Dmitrii Makarenko <[email protected]>; Co-authored-by: Ilya Enkovich <[email protected]>).