
help needed for improving conv with given shape #2088

Closed
shawnxhong opened this issue Sep 9, 2024 · 4 comments
Labels: platform:cpu-x64, question

Comments


shawnxhong commented Sep 9, 2024

Hi dear team,

Is there any other way to accelerate this conv (ic=16, oc=16, height=208, width=32, stride=1, kernel=3) on a single core?

ONEDNN_VERBOSE=1 numactl -C 1 -m 0 ./benchdnn --mode=P --conv --dt=bf16:bf16:bf16 --stag=any --wtag=any --dtag=any --dir=FWD_B --attr-post-ops=eltwise_relu:0:1 --alg=AUTO mb1_ic16oc16_ih208oh208kh3sh1dh0ph1_iw32ow32kw3sw1dw0pw1

The result is:

onednn_verbose,v1,primitive,exec,cpu,convolution,brg_conv_fwd:avx10_1_512_amx,forward_training,src:bf16:a:blocked:acdb::f0 wei:bf16:a:blocked:AcdB16a2b::f0 bia:bf16:a:blocked:a::f0 dst:bf16:a:blocked:acdb::f0,attr-post-ops:eltwise_relu:0:1,alg:convolution_direct,mb1_ic16oc16_ih208oh208kh3sh1dh0ph1_iw32ow32kw3sw1dw0pw1,0.0310059

The branch is rls-v3.6 and I built with CC=icx and CXX=icpx.
Fusing conv + relu, upgrading to the latest version, and using the Intel C++ compiler are the only useful methods I can think of.
I tried other memory tags, but they did not help.
Please kindly suggest any other methods that can accelerate this conv.
Thanks a lot for your great help in advance.
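As a back-of-the-envelope check (my own arithmetic, not from the thread; it assumes the benchdnn `--mode=P` time above is reported in milliseconds), the achieved throughput for this shape can be estimated from the MAC count and the 0.0310059 ms time:

```python
# Rough arithmetic intensity estimate for the conv in question.
# Shape from the benchdnn problem descriptor; time assumed to be in ms.
mb, ic, oc = 1, 16, 16
oh, ow = 208, 32   # stride 1, pad 1 keeps the spatial size
kh, kw = 3, 3

macs = mb * oc * oh * ow * ic * kh * kw   # multiply-accumulates
flops = 2 * macs
time_s = 0.0310059e-3

print(f"MACs: {macs:,}")
print(f"Effective GFLOP/s: {flops / time_s / 1e9:.0f}")
```

With only ~15M MACs, the problem is tiny for a single AMX core, so per-call overhead and memory layout can matter as much as the compute kernel itself.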

shu1chen added the platform:cpu-x64 label Sep 9, 2024
shawnxhong (Author) commented

Dear team,
As the kernel is 3x3, would it help to modify the input (208, 32) so that Winograd convolution can be used? What input sizes does Winograd need?
Or would it help to use the ukernel API to improve the blocking?

asirvaiy commented

Hi @shawnxhong,
Thanks for posting your query.
Winograd can work with small kernels like 3x3. The speedup depends on how much the number of scalar multiplications is reduced, and that depends on the input shape.

Winograd is supported on GPU (fp16 and fp32) and AArch64 CPUs. Which platform are you targeting here?
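To give a feel for the potential, the classic multiplication-count estimate for Winograd F(m x m, 3 x 3) can be sketched as below (this ignores the input/output transform overhead, which is why real speedups are lower than these ratios):

```python
# Winograd F(m x m, r x r) produces an m x m output tile using
# (m + r - 1)^2 multiplies, versus r^2 * m^2 for direct convolution.
def winograd_speedup(m: int, r: int = 3) -> float:
    direct = r * r * m * m        # multiplies per tile, direct conv
    wino = (m + r - 1) ** 2       # multiplies per tile, Winograd
    return direct / wino

print(winograd_speedup(2))  # F(2x2,3x3): 36/16 = 2.25
print(winograd_speedup(4))  # F(4x4,3x3): 144/36 = 4.0
```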


shawnxhong commented Sep 11, 2024


Thanks for your answer. I am targeting x86 with AMX. So does that mean Winograd only works on GPU and AArch64?
What other methods are feasible to improve this conv of this shape on x86 with AMX? Is using the ukernel API helpful?

asirvaiy commented


Hi,
The ukernel API is useful if you don't have a BRGEMM/brgconv implementation, since it can be used to implement BRGEMM yourself.
However, oneDNN already has BRGEMM and brgconv implemented, so there is no need for it here.

Going for int8 precision is one way to utilize AMX and get better conv performance.
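As a sketch of the int8 suggestion, the original benchdnn command could be rerun with int8 data types. This is a hedged example, not from the thread: the u8:s8:u8 combination and its support on a given build are assumptions to verify, and quantization accuracy must of course be validated separately.

```shell
# Same conv shape as above, but with int8 data types (u8 src, s8 weights,
# u8 dst) so AMX int8 tiles can be used. Verify this datatype combination
# is supported by your oneDNN build before comparing the numbers.
ONEDNN_VERBOSE=1 numactl -C 1 -m 0 ./benchdnn --mode=P --conv \
  --dt=u8:s8:u8 --stag=any --wtag=any --dtag=any --dir=FWD_B \
  --attr-post-ops=eltwise_relu:0:1 --alg=AUTO \
  mb1_ic16oc16_ih208oh208kh3sh1dh0ph1_iw32ow32kw3sw1dw0pw1
```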

4 participants