[OneDNN] Ukernel Backend interface #197
Conversation
Overall looks good! I have only minor comments mostly related to wording.
third_party/cpu/lib/TritonCPUTransforms/ConvertDotOp/ConvertDotToUkernels.cpp (review comments resolved)
LGTM
Can you share any performance data before/after? Just curious.
Ukernels are still disabled by default, so there is no performance impact yet. We want to run more experiments and tuning before enabling ukernels by default. @Devjiu Could you please share the current perf numbers for the FP32 case in comparison with the FMA lowering?
These results are from an SPR server; the configuration is described below. The benchmarks target matmuls, and everything marked OneDNN uses the ukernels API. The x axis is the size (M=N=K), the y axis is the performance number reported by the Triton benchmark. Refer to the configuration used:
x86_energy_perf_policy --turbo-enable 0
cpupower frequency-set -g performance
cpupower frequency-set -u/-d 2.4GHz
# hyperthreading off
echo off > /sys/devices/system/cpu/smt/control
export OMP_NUM_THREADS=48
export TRITON_CPU_MAX_THREADS=48
export KMP_AFFINITY=granularity=fine,compact,$SKIP,0
export TRITON_ALWAYS_COMPILE=1
# libiomp was used as the OpenMP runtime
export LD_PRELOAD=./.venv/lib/libiomp5.so
# and numactl used
numactl -m 1 --physcpubind=48-95 ...
# or
numactl -m 0 --physcpubind=0-47 ...
# FYI numactl --hardware:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
# node 0 size: 128540 MB
# node 0 free: 80373 MB
# node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
# node 1 size: 128983 MB
# node 1 free: 83254 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10

Raw numbers:
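For reference, plots of this kind can be produced with a triton.testing harness along the following lines. This is a minimal sketch, not the exact benchmark used here: the shapes, the provider list, and the hand-rolled timing loop are assumptions (a real run would add Triton FMA and OneDNN ukernel providers as extra lines).

```python
import time
import torch
import triton
import triton.testing


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["M", "N", "K"],                 # square matmuls, M = N = K
        x_vals=[256 * i for i in range(1, 17)],
        line_arg="provider",
        line_vals=["torch"],                     # real runs add Triton FMA / OneDNN ukernel providers here
        line_names=["Torch (eager)"],
        ylabel="TFLOPS",
        plot_name="matmul-performance",
        args={},
    )
)
def benchmark(M, N, K, provider):
    a = torch.randn((M, K), dtype=torch.float32)
    b = torch.randn((K, N), dtype=torch.float32)
    torch.matmul(a, b)                           # warm up
    reps = 10
    t0 = time.perf_counter()
    for _ in range(reps):
        torch.matmul(a, b)
    ms = (time.perf_counter() - t0) / reps * 1e3
    return 2 * M * N * K * 1e-12 / (ms * 1e-3)   # TFLOPS


benchmark.run(print_data=True, show_plots=False)
```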
The BF16 scenario is not as good: there are several limitations in the OneDNN Ukernels API and its execution approach that result in lower performance than what can be achieved using the AMX dialect.
This is awesome! Thanks for sharing the data. If I am reading the graph correctly, a couple of quick questions: does
@ienkovich Please correct me if I am wrong anywhere. For torch it's just eager mode (a plain call to torch.matmul), and as far as I understand PyTorch also uses the oneDNN library. The plot shows the FP32 case; AMX only works with floating-point types of bit width <= 16, so in this case Triton uses the FMA lowering. "Prepacked" is not an entirely fair way to compare performance: in that configuration we do not take into account the time spent on data preparation (transpose, block packing, VNNI packing). In the real world the most relevant case is "Triton Blocked Transposed", with or without OneDNN.
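To illustrate the point about data preparation, here is a minimal sketch (not the actual benchmark code) of how a "prepacked" measurement hides the packing cost while an end-to-end measurement includes it; the pack() helper is a hypothetical stand-in for the real transpose/block/VNNI packing.

```python
import time
import torch

M = N = K = 2048
a = torch.randn(M, K, dtype=torch.float32)
b = torch.randn(K, N, dtype=torch.float32)

def pack(x):
    # Stand-in for the real data preparation (transpose, block packing, VNNI packing).
    return x.t().contiguous().t()

# "Prepacked": preparation runs before the timer, so its cost is hidden.
b_packed = pack(b)
t0 = time.perf_counter()
c = torch.matmul(a, b_packed)
prepacked_ms = (time.perf_counter() - t0) * 1e3

# End-to-end: preparation is inside the timed region (the more realistic case).
t0 = time.perf_counter()
c = torch.matmul(a, pack(b))
end_to_end_ms = (time.perf_counter() - t0) * 1e3

print(f"prepacked: {prepacked_ms:.2f} ms, end-to-end: {end_to_end_ms:.2f} ms")
```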
We use Torch 2.5 in eager mode for comparison. This version uses MKL for FP32 matmul and oneDNN for BF16 matmul. So yes, it utilizes AMX when possible.
Thanks. Is there a reason to not also compare against Inductor? IIUC, there has been quite a lot of effort from Intel to push performance there as well.
Inductor just generates a library call for matmuls. In previous runs it was always slower than eager mode, so we excluded it from the measurements in order to compare against the fastest option.
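For anyone who wants to repeat the eager vs. Inductor comparison, a rough sketch is below (warmup and averaging are simplified compared to a proper benchmark; the matrix size is an arbitrary example).

```python
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

compiled_matmul = torch.compile(torch.matmul)   # Inductor backend by default
compiled_matmul(a, b)                           # warm up / trigger compilation

for name, fn in [("eager", torch.matmul), ("inductor", compiled_matmul)]:
    t0 = time.perf_counter()
    for _ in range(10):
        fn(a, b)
    print(f"{name}: {(time.perf_counter() - t0) / 10 * 1e3:.2f} ms")
```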
This PR introduces a Ukernels API to allow the usage of third-party libraries such as OneDNN. These libraries allow calling efficient implementations of brgemm/transform and some other ops. So the triton_cpu.dot op is replaced, when possible, with a call to a kernel from the library.

Signed-off-by: Dmitrii Makarenko <[email protected]>
Co-authored-by: Ilya Enkovich <[email protected]>
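For illustration, here is a minimal Triton matmul kernel of the kind this pass targets. It is a generic sketch, not a test from this PR, and it assumes M, N, K are divisible by the block sizes (no masking). The tl.dot below is lowered to triton_cpu.dot, which the new pass can replace with a brgemm ukernel call when possible.

```python
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + (k + offs_k)[None, :] * stride_ak)
        b = tl.load(b_ptr + (k + offs_k)[:, None] * stride_bk + offs_n[None, :] * stride_bn)
        # tl.dot becomes triton_cpu.dot on the CPU backend; the pass can convert it
        # to a brgemm ukernel call from OneDNN when the shapes/types are supported.
        acc += tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc)
```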