gpu: generic: sycl: lnorm Intel GPU precision issues #2071

kala855 · 2024-09-02T13:46:25Z

Description

When used in Intel PVC, the layer normalization SYCL kernel implementation faces some precision issues in variance computation.

create: --lnorm --engine=gpu --inplace=true 30x300
run: --lnorm --engine=gpu --inplace=true 30x300
[   0][VAR][0] exp_f32:   0.0202823 exp:   0.0202823 got:   0.0202823 diff:1.86265e-09 rdiff:9.1836e-08
[   1][VAR][1] exp_f32:   0.0804944 exp:   0.0804944 got:   0.0804944 diff:7.45058e-09 rdiff:9.25603e-08
[   2][VAR][2] exp_f32:    0.327062 exp:    0.327062 got:    0.327062 diff:2.98023e-08 rdiff:9.11213e-08
[   6][VAR][6] exp_f32:     82.4779 exp:     82.4779 got:     82.4779 diff:7.62939e-06 rdiff:9.25023e-08
[  17][VAR][17] exp_f32:      1.3052 exp:      1.3052 got:      1.3052 diff:1.19209e-07 rdiff:9.1334e-08
[  22][VAR][22] exp_f32:   0.0829537 exp:   0.0829537 got:   0.0829537 diff:7.45058e-09 rdiff:8.98161e-08
[  23][VAR][23] exp_f32:    0.324289 exp:    0.324289 got:    0.324289 diff:2.98023e-08 rdiff:9.19006e-08
[  29][VAR][29] exp_f32:   0.0822034 exp:   0.0822034 got:   0.0822034 diff:7.45058e-09 rdiff:9.0636e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:7.62939e-06 err_max_rdiff:9.25603e-08 all_max_diff:7.62939e-06 all_max_rdiff:9.25603e-08
0:FAILED (errors:8 total:9060) __REPRO: --lnorm --engine=gpu --inplace=true 30x300

[  14][VAR][14] exp_f32: 2.68618e-05 exp: 2.68618e-05 got: 2.68618e-05 diff:1.81899e-12 rdiff:6.77165e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:1.81899e-12 err_max_rdiff:6.77165e-08 all_max_diff:1.81899e-12 all_max_rdiff:6.77165e-08
8471:FAILED (errors:1 total:75) __REPRO: --lnorm --engine=gpu --dt=f32:bf16 --tag=axb --stat_tag=abx --flags=CH 15x3_n"lnorm_ci_0d:0"

The previous are just a couple of failing examples.

As a proposal, the variance threshold is modified to pass the failing tests.

kala855 · 2024-09-02T13:55:53Z

To give a little bit of additional context. I found that in this line. The division is making v_variance slightly different from the reference that is computed in benchdnn (there is a thr=0 for these cases). Maybe that is happening because of some compiler optimization.

t4c1

Looks like an improvement. However I would just set the threshold to 5e-7 for all the cases, as 0 does not make sense as soon as floating point calculations are involved.

mgouicem · 2024-09-03T10:16:42Z

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device?
We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

vpirogov · 2024-09-03T16:37:54Z

@kala855, benchdnn is very sensitive to numerical issues by design. So in cases like this it's important to understand where the difference is coming from before considering the threshold change.

kala855 · 2024-09-13T08:10:26Z

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device? We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

We have been checking carefully these days to see what is happening here. Doing a comparison between the assembly generated by icpx on the OCL and SYCL versions of the implementations we found that:

The OCL version generates an instruction math.invm which is IEEE standard compliant and is used to compute the division.
The SYCL version generates an instruction math.inv which is not IEEE standard compliant and is the instruction that generates the precision issues.
We were looking at different compiler flags to try to generate the same behavior on the SYCL side but was not possible.
We are thinking that the problem is due to some compiler bug that does not take into account the flags to generate the expected code. (-fno-fast-math, -ffp-model=precise, etc.)
Info about math.inv and math.invm instructions could be found here pages 24 and 26 respectively.

Any suggestions or feedback will be more than welcome. Thanks.

@mgouicem @t4c1 @sgeor255

mgouicem · 2024-09-16T08:41:07Z

Thanks for checking. I would suggest:

to open an issue against the sycl compiler with your findings,
add a condition on that threshold increase on DNNL_WITH_SYCL
add a comment to your patch so that we can revert the threshold as the issue is resolved on sycl compiler side.

gpu: generic: sycl: lnorm: Intel GPU precision issues

08fd0ee

kala855 requested a review from a team as a code owner September 2, 2024 13:46

kala855 requested a review from a team September 2, 2024 13:49

kala855 changed the title ~~gpu: generic: sycl: lnorm: Intel GPU precision issues~~ gpu: generic: sycl: lnorm Intel GPU precision issues Sep 2, 2024

t4c1 approved these changes Sep 3, 2024

View reviewed changes

sgeor255 approved these changes Sep 3, 2024

View reviewed changes

kala855 requested review from t4c1 and sgeor255 September 13, 2024 08:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

kala855 commented Sep 2, 2024

kala855 commented Sep 2, 2024

t4c1 left a comment

mgouicem commented Sep 3, 2024

vpirogov commented Sep 3, 2024

kala855 commented Sep 13, 2024

mgouicem commented Sep 16, 2024

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

Are you sure you want to change the base?

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

Conversation

kala855 commented Sep 2, 2024

Description

kala855 commented Sep 2, 2024

t4c1 left a comment

Choose a reason for hiding this comment

mgouicem commented Sep 3, 2024

vpirogov commented Sep 3, 2024

kala855 commented Sep 13, 2024

mgouicem commented Sep 16, 2024