
src: cpu: aarch64: injectors: eltwise_injector - improve gelu performance for block size 16 #2072

Open · wants to merge 1 commit into main

Conversation

@nikhilfujitsu (Contributor) commented Sep 3, 2024

Description

Improvement: gelu performance for block size 16 in jit_uni_eltwise_injector.
This commit improves the performance of the gelu_erf function in jit_uni_eltwise_injector for block size 16.

Major code changes:

• Added a new function gelu_erf_minimax_approx_compute_vector_fwd(const TRegS &vmm_src) for computing gelu_erf for block size 16.
• Added new gelu_minimax constants and a polynomial constants table.
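
For context, the erf formulation of GELU computes gelu(x) = 0.5 · x · (1 + erf(x/√2)), and a minimax polynomial can replace a general exp-based erf evaluation. The scalar sketch below only illustrates the shape of such an approximation: the degree, coefficients, and saturation bound are placeholders (the two nonzero coefficients are simply erf's leading Taylor terms), not the constants this commit adds, and the actual implementation is vectorized SVE JIT code, not C++.

```cpp
// Scalar sketch of a minimax-style GELU-erf: approximate erf(x / sqrt(2))
// with a polynomial on a bounded interval, saturate outside it, then form
// gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
// All constants here are illustrative placeholders, not the commit's table.
float gelu_erf_minimax_sketch(float x) {
    const float t = x * 0.70710678f; // x / sqrt(2)
    const float bound = 3.92f;       // hypothetical saturation point
    if (t >= bound) return x;        // erf -> +1, so gelu(x) -> x
    if (t <= -bound) return 0.0f;    // erf -> -1, so gelu(x) -> 0
    // Horner evaluation; coeffs[i] is the degree-i coefficient.
    const float coeffs[] = {0.0f, 1.12837917f, 0.0f, -0.37612639f};
    float p = 0.0f;
    for (int i = 3; i >= 0; --i)
        p = p * t + coeffs[i];
    return 0.5f * x * (1.0f + p);
}
```

On 512-bit SVE, a vector register holds sixteen f32 lanes, so block size 16 is a single-register case and the polynomial and saturation logic map to a short sequence of predicated vector instructions.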

Checklist

General

[✓] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit? Yes
Test output is the same with and without this commit.
make test summary:

95% tests passed, 11 tests failed out of 200

Total Test time (real) = 3750.50 sec

The following tests FAILED:
55 - test_convolution_backward_data_f32 (Subprocess aborted)
123 - test_graph_c_api_compile_parametrized_usm_cpu (Failed)
153 - test_graph_unit_dnnl_conv_usm_cpu (Failed)
157 - test_graph_unit_dnnl_group_norm_usm_cpu (Failed)
159 - test_graph_unit_dnnl_large_partition_usm_cpu (Failed)
160 - test_graph_unit_dnnl_layer_norm_usm_cpu (Failed)
161 - test_graph_unit_dnnl_matmul_usm_cpu (Failed)
162 - test_graph_unit_dnnl_mqa_decomp_usm_cpu (Failed)
163 - test_graph_unit_dnnl_pool_usm_cpu (Failed)
168 - test_graph_unit_dnnl_sdp_decomp_usm_cpu (Failed)
169 - test_graph_unit_dnnl_softmax_usm_cpu (Failed)
Errors while running CTest
Output from these tests are in: /home/nikhil/oneDNN/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
[✓] Have you formatted the code using clang-format? Yes

@nikhilfujitsu nikhilfujitsu requested a review from a team as a code owner September 3, 2024 10:15
@vpirogov vpirogov added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Sep 3, 2024
@vpirogov vpirogov added this to the v3.6 milestone Sep 3, 2024
@abhijain1204fujitsu

@vpirogov, @jondea, kindly help review this PR.

@jondea (Contributor) commented Sep 6, 2024

Do you have any specific benchdnn calls which will exercise this path?

@jondea (Contributor) commented Sep 6, 2024

Also, I'm not seeing the same failures on this patch (or before) as you. E.g., out of the CI failures, I can only see test_benchdnn_modeC_graph_ci_cpu. Would you be able to investigate this please?

Also, do you have any measurements of the speedup of this optimization?

@nikhilfujitsu (Contributor, Author) commented Sep 9, 2024

> Do you have any specific benchdnn calls which will exercise this path?

Hi, please check out these logs; the machine is an A64FX and uses jit_sve_512.

I used:

./benchdnn --eltwise --batch=inputs/eltwise/test_eltwise_all | grep eltwise_gelu_erf

to extract these logs.

A64_FX_benchdnn_eltwise_gelu.log

Here, export ONEDNN_VERBOSE=1 was set before running the command.
A64_FX_benchdnn_VERBOSE_eltwise_gelu.log

@nikhilfujitsu (Contributor, Author) commented

> Also, I'm not seeing the same failures on this patch (or before) as you. E.g., out of the CI failures, I can only see test_benchdnn_modeC_graph_ci_cpu. Would you be able to investigate this please?
>
> Also, do you have any measurements of the speedup of this optimization?
Eltwise_results.xlxs.pdf

@jondea (Contributor) left a comment

Thanks for the logs and numbers, they are really useful. I ran this on a Graviton 3, and there was no effect on performance. Just to check: is this what you expected? I'm guessing so, because the bulk of the added code is guarded by SVE length == 512.

But, given that you got a ~5x speedup, I was curious whether this optimization could be applied to the Graviton 3, so I removed the check and measured the performance. Surprisingly, I got a ~1.5x slowdown on Graviton 3. I don't think this should block this PR because you have added the guard, but it is surprising. Could it be that exp_compute_vector_fwd is slower than it could be for some reason on SVE 512?

Anyways, in summary, I'm happy to approve once you've investigated the extra unit test failures.

@jondea (Contributor) left a comment

Sorry, accidentally approved

@nikhilfujitsu (Contributor, Author) commented

> Thanks for the logs and numbers, they are really useful. I ran this on a Graviton 3, and there was no effect on performance. Just to check: is this what you expected? I'm guessing so, because the bulk of the added code is guarded by SVE length == 512.
>
> But, given that you got a ~5x speedup, I was curious whether this optimization could be applied to the Graviton 3, so I removed the check and measured the performance. Surprisingly, I got a ~1.5x slowdown on Graviton 3. I don't think this should block this PR because you have added the guard, but it is surprising. Could it be that exp_compute_vector_fwd is slower than it could be for some reason on SVE 512?
>
> Anyways, in summary, I'm happy to approve once you've investigated the extra unit test failures.

It is true that we get a 1.5x slowdown on G3 machines (SVE_256); that's why the optimization is limited to SVE_512.

The extra unit tests were already failing on SVE_512 machines in the main branch; adding my changes has no effect on them. The failing tests occur on SVE_512 machines, not on Graviton/SVE_256 machines, if that helps.
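
To make the dispatch being discussed concrete, here is a minimal sketch of this kind of vector-length guard. The helper and enum names are hypothetical, invented for illustration; in oneDNN the decision comes from the JIT injector's ISA and vector-length checks.

```cpp
#include <cstdint>

// Hypothetical helper: returns the SVE vector length in bits. Stubbed here
// so the sketch compiles stand-alone; real code queries the CPU.
static uint32_t sve_vector_length_bits() { return 512; }

// Illustrative path selection: the minimax polynomial path is taken only
// on 512-bit SVE (block size 16 f32 lanes); 256-bit machines such as
// Graviton 3 keep the exp-based erf path, which measured faster there.
enum class gelu_erf_path { minimax_poly, exp_based };

gelu_erf_path select_gelu_erf_path() {
    return sve_vector_length_bits() == 512 ? gelu_erf_path::minimax_poly
                                           : gelu_erf_path::exp_based;
}
```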

@nikhilfujitsu (Contributor, Author) commented

I am also checking with the latest changes in main. Will update you soon.

@nikhilfujitsu (Contributor, Author) commented

@jondea After merging the latest changes from main, the errors are resolved, so I have also updated the description. Please consider approving. Thanks.

@vpirogov (Member) commented Sep 9, 2024

@nikhilfujitsu, merge commits are not allowed in production branches. Please rebase your changes.

@nikhilfujitsu (Contributor, Author) commented

> @nikhilfujitsu, merge commits are not allowed in production branches. Please rebase your changes.

Hi, I am struggling to get this rebased; could you please help me here? What I did: I pressed the "Sync fork" button, which merged the main branch changes into my branch. How can I now rebase it and push it back here? Should I revert my merge first and then rebase?

@vpirogov (Member) commented Sep 9, 2024

> Hi, I am struggling to get this rebased; could you please help me here?

This operation can be done from the console:

git checkout gelu_erf
git rebase main
git push --force

@vpirogov vpirogov modified the milestones: v3.6, v3.7 Sep 9, 2024
@nikhilfujitsu (Contributor, Author) commented

Thank you. Means a lot to me.

@nikhilfujitsu (Contributor, Author) commented

@jondea @vpirogov Please approve the changes. Thank you.

@nikhilfujitsu (Contributor, Author) left a comment

Changes reviewed, and the description has been updated accordingly.

@abhijain1204fujitsu commented

@jondea @vpirogov Could you please check the changes made per the feedback received?

@abhijain1204fujitsu commented

@vpirogov Kindly let us know if any other change is required, and kindly support the merge.

Labels: platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64)
Projects: None yet
4 participants