Auto-vectorization of pointer indirection expr. and explicit layout #64026
Tagging subscribers to this area: @JulieLeeMSFT

Issue Details

Consider two forms of writing the fabs() math function in C:

#include <inttypes.h>
double fabs_pointer(double value)
{
uint64_t u = *(uint64_t*)&value & 0x7FFFFFFFFFFFFFFF;
return *(double*)&u;
}
double fabs_union(double value)
{
union { double f; uint64_t i; } u = { value };
u.i &= -1ULL / 2;
return u.f;
}

This code modifies the raw bits of the double input in-place and returns the modified value. Clang with -O2 -march=ivybridge gives identical codegen:

.LCPI0_0:
.quad 0x7fffffffffffffff # double NaN
.quad 0x7fffffffffffffff # double NaN
fabs_pointer: # @fabs_pointer
vandps xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
ret
.LCPI1_0:
.quad 0x7fffffffffffffff # double NaN
.quad 0x7fffffffffffffff # double NaN
fabs_union: # @fabs_union
vandps xmm0, xmm0, xmmword ptr [rip + .LCPI1_0]
ret

We can write the same code in C# in four different forms:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
class C
{
static double fabs(double v)
{
ulong u = BitConverter.DoubleToUInt64Bits(v);
return BitConverter.UInt64BitsToDouble(u & 0x7FFFFFFFFFFFFFFF);
}
static unsafe double fabs_pointer(double v)
{
ulong u = *(ulong*)&v & 0x7FFFFFFFFFFFFFFF;
return *(double*)&u;
}
static double fabs_manual_vectorization(double v)
{
return Sse2.And(
Vector128.CreateScalarUnsafe(v),
Vector128.Create(0x7FFFFFFFFFFFFFFF).AsDouble()).ToScalar();
}
[StructLayout(LayoutKind.Explicit)]
struct S
{
[FieldOffset(0)] public double d;
[FieldOffset(0)] public ulong u;
}
static double fabs_struct(double v)
{
var s = new S { d = v };
s.u &= 0x7FFFFFFFFFFFFFFF;
return s.d;
}
}

but RyuJIT's codegen varies a lot:

; Core CLR 6.0.21.52210 on amd64
C..ctor()
L0000: ret
C.fabs(Double)
L0000: vzeroupper
L0003: vmovq rax, xmm0
L0008: mov rdx, 0x7fffffffffffffff
L0012: and rax, rdx
L0015: vmovq xmm0, rax
L001a: ret
C.fabs_pointer(Double)
L0000: vzeroupper
L0003: vmovsd [rsp+8], xmm0
L0009: mov rax, 0x7fffffffffffffff
L0013: and rax, [rsp+8]
L0018: vmovq xmm0, rax
L001d: ret
C.fabs_manual_vectorization(Double)
L0000: vzeroupper
L0003: vandpd xmm0, xmm0, [0x7ffb39e304b0]
L000b: ret
C.fabs_struct(Double)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovsd [rsp+8], xmm0
L000d: mov rax, [rsp+8]
L0012: mov [rsp+0x10], rax
L0017: lea rax, [rsp+0x10]
L001c: mov rdx, 0x7fffffffffffffff
L0026: and [rax], rdx
L0029: vmovsd xmm0, [rsp+0x10]
L002f: add rsp, 0x18
L0033: ret

In an ideal world, pointer indirection and explicit layout would give the same codegen as the manually vectorized variant.
I think we need a fix for #11413 first.

I would say "not really". The […]
Found a minor inefficiency in […]

Setup (NativeAOT):

$ dotnet7 publish --use-current-runtime -c Release -o dist \
-p:PublishAot=true -p:AllowUnsafeBlocks=true -p:NativeLib=shared -p:IlcInstructionSet=avx2
$ gdb dist/nativelib1.so -batch \
-ex "set disassembly-flavor intel" \
-ex "disassemble nativelib1_C__fabs" \
-ex "disassemble nativelib1_C__fabs_pointer" \
-ex "disassemble nativelib1_C__fabs_manual_vectorization" \
-ex "disassemble nativelib1_C__fabs_struct" Disassembly: Dump of assembler code for function nativelib1_C__fabs:
0x0000000000217590 <+0>: vzeroupper
0x0000000000217593 <+3>: vmovq rax,xmm0
0x0000000000217598 <+8>: movabs rdi,0x7fffffffffffffff
0x00000000002175a2 <+18>: and rax,rdi
0x00000000002175a5 <+21>: vmovq xmm0,rax
0x00000000002175aa <+26>: ret
End of assembler dump.
Dump of assembler code for function nativelib1_C__fabs_pointer:
0x00000000002175b0 <+0>: push rax
0x00000000002175b1 <+1>: vzeroupper
0x00000000002175b4 <+4>: vmovsd QWORD PTR [rsp],xmm0
0x00000000002175b9 <+9>: movabs rax,0x7fffffffffffffff
0x00000000002175c3 <+19>: and rax,QWORD PTR [rsp]
0x00000000002175c7 <+23>: vmovq xmm0,rax
0x00000000002175cc <+28>: add rsp,0x8
0x00000000002175d0 <+32>: ret
End of assembler dump.
Dump of assembler code for function nativelib1_C__fabs_manual_vectorization:
0x00000000002175e0 <+0>: vzeroupper
0x00000000002175e3 <+3>: vandpd xmm0,xmm0,XMMWORD PTR [rip+0x10fd35] # 0x327320
0x00000000002175eb <+11>: ret
End of assembler dump.
Dump of assembler code for function nativelib1_C__fabs_struct:
0x00000000002175f0 <+0>: sub rsp,0x18
0x00000000002175f4 <+4>: vzeroupper
0x00000000002175f7 <+7>: vmovsd QWORD PTR [rsp+0x8],xmm0
0x00000000002175fd <+13>: mov rax,QWORD PTR [rsp+0x8]
0x0000000000217602 <+18>: mov QWORD PTR [rsp+0x10],rax
0x0000000000217607 <+23>: lea rax,[rsp+0x10]
0x000000000021760c <+28>: movabs rdi,0x7fffffffffffffff
0x0000000000217616 <+38>: and QWORD PTR [rax],rdi
0x0000000000217619 <+41>: vmovsd xmm0,QWORD PTR [rsp+0x10]
0x000000000021761f <+47>: add rsp,0x18
0x0000000000217623 <+51>: ret
End of assembler dump.
I'm not sure it's worth spending cycles optimizing a pattern which has a better alternative we can push users towards instead. This is very much a case where I think we should push devs toward using the built-in APIs, as these will, across the board, be optimized and handled better in a multitude of contexts. They also won't have other issues that might crop up from the local being "address taken" or similar.
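A minimal sketch of the built-in route for this example (my illustration; the comment's API list did not survive the scrape, but Math.Abs and the BitConverter methods already used in this issue are the natural candidates):

using System;

static class BuiltIn
{
    // Math.Abs is a JIT intrinsic; RyuJIT lowers it to a single
    // vandps/vandpd against a sign-clearing mask constant.
    static double fabs(double v) => Math.Abs(v);

    // BitConverter reinterprets the bits without taking the address
    // of a local, so the value never has to spill to the stack.
    static double fabs_bits(double v) =>
        BitConverter.UInt64BitsToDouble(
            BitConverter.DoubleToUInt64Bits(v) & 0x7FFF_FFFF_FFFF_FFFF);
}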
I agree that […]. However, […]. Splitting […]

Devs should never need to […]. The optimal implementation, if you were to do it "manually", would be […]

There is probably room to improve the handling around […]
Indeed, "possible auto-vectorization opportunities" was the actual intent of this thread; fabs is used as an example because it is the simplest way to demonstrate the codegen differences across four implementations of the same algorithm. In general, this is about operating on the raw/integral representation of double or float values.
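As a hypothetical further instance of that class of code (my example, not from the thread): negating a float by toggling its sign bit on the integral representation, which would ideally lower to a single vxorps against a constant, the way Clang compiles the equivalent C:

using System;

static class RawBits
{
    // Flip the sign bit via the integral representation; the ideal
    // codegen is a single vxorps with a 0x80000000 mask.
    static float negate(float v) =>
        BitConverter.UInt32BitsToSingle(
            BitConverter.SingleToUInt32Bits(v) ^ 0x8000_0000);
}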
From PR #78741,

// related issues:
// https://github.com/dotnet/runtime/issues/10315
// https://github.com/dotnet/runtime/issues/55357
static class C
{
static unsafe ushort narrow(ulong v) => *(ushort*)&v;
static unsafe ulong widen(ushort v) => *(ulong*)&v;
static unsafe ushort unchanged(ushort v) => *(ushort*)&v;
}

- ; current codegen
+ ; proposed codegen
; https://godbolt.org/z/r1acfxao6
C.narrow(UInt64)
- mov [rsp+0x10], rdx
- movzx eax, word ptr [rsp+0x10]
+ movzx eax, dx
ret
C.widen(UInt16)
- mov [rsp+0x10], edx
- mov rax, [rsp+0x10]
+ movzx rax, dx
ret
C.unchanged(UInt16)
movzx eax, cx
ret
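The transformation the PR proposes amounts to recognizing that, on a little-endian target, these reinterpretations are plain integer truncation/zero-extension; a sketch of the semantically equivalent C# (my illustration, assuming little-endian):

static class Equivalent
{
    // On little-endian, *(ushort*)&v reads the low 16 bits of v,
    // i.e. plain truncation: movzx eax, dx.
    static ushort narrow(ulong v) => (ushort)v;

    // The proposed codegen for widen is zero-extension (movzx rax, dx);
    // the pointer version nominally reads past the 2-byte local, so
    // this is the well-defined equivalent.
    static ulong widen(ushort v) => v;
}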
Labels: category:cq, theme:vector-codegen, skill-level:expert, cost:small, impact:small