Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kala855
Copy link
Contributor

@kala855 kala855 commented Sep 2, 2024

Description

When used in Intel PVC, the layer normalization SYCL kernel implementation faces some precision issues in variance computation.

create: --lnorm --engine=gpu --inplace=true 30x300
run: --lnorm --engine=gpu --inplace=true 30x300
[   0][VAR][0] exp_f32:   0.0202823 exp:   0.0202823 got:   0.0202823 diff:1.86265e-09 rdiff:9.1836e-08
[   1][VAR][1] exp_f32:   0.0804944 exp:   0.0804944 got:   0.0804944 diff:7.45058e-09 rdiff:9.25603e-08
[   2][VAR][2] exp_f32:    0.327062 exp:    0.327062 got:    0.327062 diff:2.98023e-08 rdiff:9.11213e-08
[   6][VAR][6] exp_f32:     82.4779 exp:     82.4779 got:     82.4779 diff:7.62939e-06 rdiff:9.25023e-08
[  17][VAR][17] exp_f32:      1.3052 exp:      1.3052 got:      1.3052 diff:1.19209e-07 rdiff:9.1334e-08
[  22][VAR][22] exp_f32:   0.0829537 exp:   0.0829537 got:   0.0829537 diff:7.45058e-09 rdiff:8.98161e-08
[  23][VAR][23] exp_f32:    0.324289 exp:    0.324289 got:    0.324289 diff:2.98023e-08 rdiff:9.19006e-08
[  29][VAR][29] exp_f32:   0.0822034 exp:   0.0822034 got:   0.0822034 diff:7.45058e-09 rdiff:9.0636e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:7.62939e-06 err_max_rdiff:9.25603e-08 all_max_diff:7.62939e-06 all_max_rdiff:9.25603e-08
0:FAILED (errors:8 total:9060) __REPRO: --lnorm --engine=gpu --inplace=true 30x300
[  14][VAR][14] exp_f32: 2.68618e-05 exp: 2.68618e-05 got: 2.68618e-05 diff:1.81899e-12 rdiff:6.77165e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:1.81899e-12 err_max_rdiff:6.77165e-08 all_max_diff:1.81899e-12 all_max_rdiff:6.77165e-08
8471:FAILED (errors:1 total:75) __REPRO: --lnorm --engine=gpu --dt=f32:bf16 --tag=axb --stat_tag=abx --flags=CH 15x3_n"lnorm_ci_0d:0"

The previous are just a couple of failing examples.

As a proposal, the variance threshold is modified to pass the failing tests.

@kala855 kala855 requested a review from a team as a code owner September 2, 2024 13:46
@kala855 kala855 requested a review from a team September 2, 2024 13:49
@kala855
Copy link
Contributor Author

kala855 commented Sep 2, 2024

To give a little bit of additional context. I found that in this line. The division is making v_variance slightly different from the reference that is computed in benchdnn (there is a thr=0 for these cases). Maybe that is happening because of some compiler optimization.

@kala855 kala855 changed the title gpu: generic: sycl: lnorm: Intel GPU precision issues gpu: generic: sycl: lnorm Intel GPU precision issues Sep 2, 2024
Copy link
Contributor

@t4c1 t4c1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an improvement. However I would just set the threshold to 5e-7 for all the cases, as 0 does not make sense as soon as floating point calculations are involved.

@mgouicem
Copy link
Contributor

mgouicem commented Sep 3, 2024

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device?
We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

@vpirogov
Copy link
Member

vpirogov commented Sep 3, 2024

@kala855, benchdnn is very sensitive to numerical issues by design. So in cases like this it's important to understand where the difference is coming from before considering the threshold change.

@kala855
Copy link
Contributor Author

kala855 commented Sep 13, 2024

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device? We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

We have been checking carefully these days to see what is happening here. Doing a comparison between the assembly generated by icpx on the OCL and SYCL versions of the implementations we found that:

  1. The OCL version generates an instruction math.invm which is IEEE standard compliant and is used to compute the division.
  2. The SYCL version generates an instruction math.inv which is not IEEE standard compliant and is the instruction that generates the precision issues.
  3. We were looking at different compiler flags to try to generate the same behavior on the SYCL side but was not possible.
  4. We are thinking that the problem is due to some compiler bug that does not take into account the flags to generate the expected code. (-fno-fast-math, -ffp-model=precise, etc.)
  5. Info about math.inv and math.invm instructions could be found here pages 24 and 26 respectively.

Any suggestions or feedback will be more than welcome. Thanks.

@mgouicem @t4c1 @sgeor255

@mgouicem
Copy link
Contributor

Thanks for checking. I would suggest:

  • to open an issue against the sycl compiler with your findings,
  • add a condition on that threshold increase on DNNL_WITH_SYCL
  • add a comment to your patch so that we can revert the threshold as the issue is resolved on sycl compiler side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants