Signal SIGILL (Illegal instruction) code ILL_ILLOPN (Illegal operand) after migrate to .net9 #112897
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
@tannergooding @dotnet/jit-contrib any idea how |
ah, maybe it's actually VXSort in GC? |
@Typhon226 could you please upload it to https://developercommunity.visualstudio.com/ (you can configure privacy level for attachments there if needed) |
I'm not aware of anything that would cause this JIT side, as we should be guarding and even asserting that AVX512 specific instructions/nodes aren't being introduced if |
|
@jkotas as someone who often views dumps uploaded by the community - do I need a special permission to view them from there? Because I get Error 403 |
You do not need special permissions. It looks like the link with restricted permissions was copied and pasted into the description from somewhere. @Typhon226 Could you please attach the dump to the developercommunity issue via the paper clip icon so that we are able to access it? |
@EgorBo I re-uploaded the file. |
@Typhon226 Thanks! Looks like it's definitely inside the GC (VXSort). The GC uses AVX512 to accelerate the sort, but that is expected to be guarded by a run-time check.
|
Tagging subscribers to this area: @dotnet/gc |
Adding @cshung, since Linux support for vxsort was added in 9. If it was using AVX512-specific registers it probably would have failed in other environments. Is this specific to the CPU specification listed in the OP? |
The code that is responsible for detecting AVX512 in run-time in GC: https://github.com/dotnet/runtime/blob/main/src/coreclr/gc/vxsort/isa_detection.cpp#L85-L121 |
I have no clue if The Linux support should really be nearly identical to what Windows is doing there maybe we can even just use the GC/EE interface and mirror the jit flags |
The information used by It would be useful to check what the |
Per https://github.com/dotnet/runtime/blob/main/docs/project/linux-build-methodology.md#security-related-servicing, we are statically linking low-level C library helpers from Ubuntu 16. The low-level C library includes the helper to initialize __cpu_model. The copy of the helper that we are linking in seems to be missing this fix to handle AVX512 correctly: gcc-mirror/gcc@059cc8a We either need to patch the low-level C library in our build containers (cc @sbomer) or to switch to our copy of the AVX512 detection logic as @tannergooding suggested. |
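As a rough illustration of what a run-time AVX512 check has to cover, here is a minimal sketch. It is not the code in isa_detection.cpp, the PAL helper, or libgcc; the names `CpuState`, `Avx512fUsable` and `QueryHost` are hypothetical. The assumption is the usual x86-64 rule: the CPU must report AVX512F in CPUID leaf 7, and the OS must have enabled the AVX and AVX512 register state in XCR0. A helper that skips the second part can report AVX512 as usable when the OS has it disabled.

```cpp
#include <cassert>
#include <cstdint>

// Raw inputs to the decision, so the logic can be tested without the hardware.
struct CpuState {
  std::uint32_t leaf1_ecx;  // CPUID.(EAX=1):ECX
  std::uint32_t leaf7_ebx;  // CPUID.(EAX=7,ECX=0):EBX
  std::uint64_t xcr0;       // XGETBV(0); only meaningful when OSXSAVE is set
};

constexpr std::uint32_t kOsxsaveBit = 1u << 27;  // leaf 1 ECX: OS uses XSAVE
constexpr std::uint32_t kAvx512fBit = 1u << 16;  // leaf 7 EBX: AVX512F
// XCR0 bits 1-2 (SSE, AVX state) and 5-7 (opmask, ZMM_Hi256, Hi16_ZMM).
constexpr std::uint64_t kXcr0Avx512Mask = 0xE6;

// True only when both the CPU and the OS allow AVX512F instructions.
bool Avx512fUsable(const CpuState& s) {
  if ((s.leaf1_ecx & kOsxsaveBit) == 0) return false;
  if ((s.xcr0 & kXcr0Avx512Mask) != kXcr0Avx512Mask) return false;
  return (s.leaf7_ebx & kAvx512fBit) != 0;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <cpuid.h>
// Reads the real values on the current machine (GCC/Clang on x86-64 only).
CpuState QueryHost() {
  CpuState s{};
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) s.leaf1_ecx = ecx;
  if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) s.leaf7_ebx = ebx;
  if (s.leaf1_ecx & kOsxsaveBit) {  // XGETBV faults when OSXSAVE is clear
    unsigned lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    s.xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
  }
  return s;
}
#endif
```

Calling `Avx512fUsable(QueryHost())` on the affected machine and comparing the result with what the statically linked helper reports would be one way to narrow down which of the two checks goes wrong.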
switching to using the PAL helper would be good, assuming it works for standalone GC too. |
It can be made to work. I assume that we would link the PAL helper statically. We tend to avoid communicating these types of details over GC/EE interface. Note that this bug is likely .NET 9 specific. It should be fixed in .NET 10 as a side-effect of updating our dependencies (#109939). |
I have the same issue after migrating from dotnet8 to dotnet9 in some backend services on Debian 12. The systemd log has these entries:
CPU Info:
|
Patching the C helpers in the container might be a riskier fix than using the PAL check since it involves switching to a source-built version (although it would be a good test of our servicing strategy). Using the PAL check also seems better long-term since it removes a dependency on the C helpers. Any opinions @jkotas? Happy to help patch the dependency if that's what we decide. |
@jkotas I don't think that is the issue here, the referenced commit seems to be for situations where the OS has AVX512 support completely disabled, but according to Intel docs, the CPU in the description lacks it totally: |
We may be missing more GCC bug fixes there, or something else is off. The dump that I have looked at had In any case, the fix implemented by #113032 is switching to our helper that should not have these bugs. |
The 'fix' for the issue is merged as #113032, I am wondering if you can help with testing it? |
Of course. |
Yes, sure. How could we get the patched version? |
Hi @angelMachin and @Typhon226, I have added the GC-specific binaries built in release for you to test with: https://github.com/mrsharm/GCBinaries/tree/main/112897 To make use of these, you'll need to copy these binaries to the same location as libcoreclr.so (depending on where you run dotnet from, it could live in |
@Typhon226, I wonder if you can check in your dump whether or not the crashing process is loading the |
@cshung It looks like libclrgcexp.so is loaded. |
This should be |
My guess is that it is not loading the libclrgcexp.so that @mrsharm shared above. |
@jkotas I changed the copy command to the path you provided.
I was not sure if I had to set DOTNET_GCName, so I tried without setting it and then got the illegal instruction error. |
This means that the @mrsharm Binaries built on regular Linux are not portable between Linux distros. They are only guaranteed to work on the distro they were built on, which is presumably not Ubuntu 22. You need to use the official build image https://github.com/dotnet/runtime/blob/main/docs/workflow/using-docker.md#the-official-runtime-docker-images to build a binary that is portable between distros. |
I have built some new binaries using the Docker instructions above; let's hope they work this time. https://github.com/cshung/GCBinaries/tree/main/112897
Build command line:
docker run --rm \
-v ~/git/runtime:/runtime \
-w /runtime \
-e ROOTFS_DIR=/crossrootfs/x64/ \
mcr.microsoft.com/dotnet-buildtools/prereqs:azurelinux-3.0-net10.0-cross-amd64 \
./build.sh -s clr --cross -c Release
Dependencies:
andrewau@aa-helium:~/dev/GCBinaries/112897$ ldd -v ./libclrgcexp.so
linux-vdso.so.1 (0x00007ffcfaa9a000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2d0c451000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2d0c302000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2d0c2e7000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2d0c0f5000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2d0c749000)
Version information:
./libclrgcexp.so:
libstdc++.so.6 (GLIBCXX_3.4) => /lib/x86_64-linux-gnu/libstdc++.so.6
libstdc++.so.6 (CXXABI_1.3) => /lib/x86_64-linux-gnu/libstdc++.so.6
libm.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libm.so.6
libm.so.6 (GLIBC_2.27) => /lib/x86_64-linux-gnu/libm.so.6
libgcc_s.so.1 (GCC_3.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.6) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.17) => /lib/x86_64-linux-gnu/libc.so.6
ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
/lib/x86_64-linux-gnu/libstdc++.so.6:
libm.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libm.so.6
ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
libgcc_s.so.1 (GCC_4.2.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
libgcc_s.so.1 (GCC_3.4) => /lib/x86_64-linux-gnu/libgcc_s.so.1
libgcc_s.so.1 (GCC_3.3) => /lib/x86_64-linux-gnu/libgcc_s.so.1
libgcc_s.so.1 (GCC_3.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.6) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.18) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.16) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.3) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.17) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
/lib/x86_64-linux-gnu/libm.so.6:
ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_PRIVATE) => /lib/x86_64-linux-gnu/libc.so.6
/lib/x86_64-linux-gnu/libgcc_s.so.1:
libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
/lib/x86_64-linux-gnu/libc.so.6:
ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2 |
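One way to read the listing above for portability is to look for the newest glibc symbol version the binary asks for. The sketch below scans `ldd -v` text for `GLIBC_x.y[.z]` tags and returns the highest one, comparing components numerically (a plain string compare would rank 2.4 above 2.17). The function names are hypothetical, and symbol versions are only one part of distro portability, so treat the result as a hint rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split a dotted version such as "2.3.4" into {2, 3, 4}.
std::vector<int> ParseVersion(const std::string& text) {
  std::vector<int> parts;
  int current = 0;
  bool have_digit = false;
  for (char c : text) {
    if (c >= '0' && c <= '9') {
      current = current * 10 + (c - '0');
      have_digit = true;
    } else if (c == '.') {
      parts.push_back(current);
      current = 0;
      have_digit = false;
    } else {
      break;  // stop at the first character that is not part of a version
    }
  }
  if (have_digit) parts.push_back(current);
  return parts;
}

// Return the highest numeric GLIBC_* version found in the text, or "" if
// there is none. Non-numeric tags such as GLIBC_PRIVATE are skipped.
std::string HighestGlibcVersion(const std::string& ldd_output) {
  const std::string marker = "GLIBC_";
  std::string best;
  std::vector<int> best_parts;
  std::size_t pos = 0;
  while ((pos = ldd_output.find(marker, pos)) != std::string::npos) {
    pos += marker.size();
    std::size_t end = pos;
    while (end < ldd_output.size() &&
           ((ldd_output[end] >= '0' && ldd_output[end] <= '9') ||
            ldd_output[end] == '.')) {
      ++end;
    }
    const std::string version = ldd_output.substr(pos, end - pos);
    const std::vector<int> parts = ParseVersion(version);
    // std::vector compares lexicographically, which orders versions correctly.
    if (!parts.empty() && parts > best_parts) {
      best = version;
      best_parts = parts;
    }
    pos = end;
  }
  return best;
}
```

Fed the listing above, the highest tag it would find is 2.27 (from the libm.so.6 requirement).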
It works. My app has now been running for more than 10 minutes without crashing. |
I have started the process of getting this fix backported to dotnet 9. Just to be safe, would you mind helping us to test the new binaries once more? The new binaries are shared at the same location as before. |
@cshung The new binaries are working too. |
Hello, Update: |
Can you show us some diagnostic information what is going on? |
The big problem is that the server is simply gone and this is the only information I get: |
Can you try capturing a crash dump? |
I first tried it with .NET 10.0, same error. |
Description
We are currently migrating to .net9, and after some time our application crashes without any exception.
At first I found this message in the journal of our Ubuntu 22.04 system:
kernel: traps: .NET Server GC[1867769] trap invalid opcode ip:7f2eb7bdc9d1 sp:7f2eb0de1e90 error:0 in libcoreclr.so[7f2eb7784000+4ed000]
Then, after some digging, I was able to generate a dump with the binary of this bug

I used WinDbg to open it and saw the following error:
Signal SIGILL (Illegal instruction) code ILL_ILLOPN (Illegal operand) at 0x7fcae8ee19d1
Looking at the address, I saw this:
The crash does not always happen after the same amount of time. It takes around 20-40 seconds until it happens.
The application is running on a small Kubernetes system with only one node.
If I host everything on my local machine, everything works fine.
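The kernel trap line quoted above already says where inside libcoreclr.so the fault happened: the bracketed part is the start address and size of the mapping, so `ip - start` is the offset into that mapping. A minimal sketch of that arithmetic follows. `TrapLocation` and `ParseTrapLine` are hypothetical names, and the parser assumes the line has exactly the `ip:<hex> ... in <module>[<start>+<size>]` shape shown in the journal. The offset is relative to the mapping the kernel reports, which may not equal the file offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

struct TrapLocation {
  std::string module;      // e.g. "libcoreclr.so"
  std::uint64_t ip = 0;    // faulting instruction pointer
  std::uint64_t base = 0;  // start address of the mapping
  std::uint64_t size = 0;  // size of the mapping
  std::uint64_t Offset() const { return ip - base; }
};

// Parse a "traps:" journal line. Returns false if the expected fields are
// missing or if the instruction pointer does not fall inside the mapping.
bool ParseTrapLine(const std::string& line, TrapLocation* out) {
  const std::size_t npos = std::string::npos;
  const std::size_t ip_pos = line.find(" ip:");
  const std::size_t in_pos = line.rfind(" in ");
  if (ip_pos == npos || in_pos == npos) return false;
  const std::size_t name_pos = in_pos + 4;
  const std::size_t lb = line.find('[', name_pos);
  if (lb == npos) return false;
  const std::size_t plus = line.find('+', lb);
  if (plus == npos) return false;
  const std::size_t rb = line.find(']', plus);
  if (rb == npos) return false;
  try {
    // stoull with base 16 stops at the first non-hex character.
    out->ip = std::stoull(line.substr(ip_pos + 4), nullptr, 16);
    out->base = std::stoull(line.substr(lb + 1, plus - lb - 1), nullptr, 16);
    out->size = std::stoull(line.substr(plus + 1, rb - plus - 1), nullptr, 16);
  } catch (const std::exception&) {
    return false;
  }
  out->module = line.substr(name_pos, lb - name_pos);
  return out->ip >= out->base && out->ip - out->base < out->size;
}
```

For the line above this gives offset 0x4589d1 within the libcoreclr.so mapping, which can then be compared against a disassembly of the same build.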
Reproduction Steps
Sadly, I don't know how to provide reproduction steps without giving access to the server.
I can provide a dump if needed.
Expected behavior
No crash
Actual behavior
As written in the description.
If needed, I can provide a dump (~730MB uncompressed, ~65MB compressed)
Regression?
No response
Known Workarounds
Going back to .net8
Configuration
Other information
No response