
Signal SIGILL (Illegal instruction) code ILL_ILLOPN (Illegal operand) after migrate to .net9 #112897


Open
Typhon226 opened this issue Feb 25, 2025 · 48 comments · Fixed by #113032
Assignees
Labels: area-GC-coreclr; avx512 (Related to the AVX-512 architecture); in-pr (There is an active PR which will close this issue when it is merged); untriaged (New issue has not been triaged by the area owner)

Comments

@Typhon226

Description

We are currently migrating to .NET 9, and after some time our application crashes without any exception.
At first I found this message in the journal of our Ubuntu 22.04 host:
kernel: traps: .NET Server GC[1867769] trap invalid opcode ip:7f2eb7bdc9d1 sp:7f2eb0de1e90 error:0 in libcoreclr.so[7f2eb7784000+4ed000]

Then, after some digging, I was able to generate a dump of this crash.
I used WinDbg to open it and saw the following error:
Signal SIGILL (Illegal instruction) code ILL_ILLOPN (Illegal operand) at 0x7fcae8ee19d1
Looking at the address, I saw this:
Image

The crash does not always happen after the same amount of time. It takes around 20-40 seconds until it happens.

The application is running on a small Kubernetes system with only one node.
If I host everything on my local machine, everything works fine.
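
The kernel trap line above carries enough to locate the faulting instruction inside the library: subtracting the mapping start shown in the brackets from `ip` gives the offset within that mapping. A minimal sketch of the arithmetic, assuming a POSIX shell with 64-bit arithmetic. The helper name `module_offset` is hypothetical, and the result is relative to the mapping the kernel printed, which is not necessarily the start of the file.

```shell
#!/bin/sh
# Hypothetical helper (not part of any .NET tooling): offset of a faulting
# instruction pointer within the mapping reported by the kernel.
# Both arguments are hex strings without the 0x prefix.
module_offset() {
    ip=$1
    base=$2
    printf '0x%x\n' "$(( 0x$ip - 0x$base ))"
}

# ip and mapping start copied from the trap line in the description.
module_offset 7f2eb7bdc9d1 7f2eb7784000
```

For the values in the description this prints 0x4589d1, which lies inside the reported mapping size of 0x4ed000.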

Reproduction Steps

Sadly I don't know how to provide reproduction steps without giving access to the server.
I can provide a dump if needed.

Expected behavior

No crash

Actual behavior

As written in the description.
If needed I can provide a dump (~730MB uncompressed, ~65MB compressed).

Regression?

No response

Known Workarounds

Going back to .NET 8

Configuration

dotnet --info output of the Kubernetes pod:
.NET SDK:
 Version:           9.0.200
 Commit:            90e8b202f2
 Workload version:  9.0.200-manifests.b4a8049f
 MSBuild version:   17.13.8+cbc39bea8

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  22.04
 OS Platform: Linux
 RID:         linux-x64
 Base Path:   /usr/share/dotnet/sdk/9.0.200/

.NET workloads installed:
There are no installed workloads to display.
Configured to use loose manifests when installing new manifests.

Host:
  Version:      9.0.2
  Architecture: x64
  Commit:       80aa709f5d

.NET SDKs installed:
  9.0.200 [/usr/share/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 9.0.2 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 9.0.2 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found
Host CPU info:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
stepping        : 1
microcode       : 0xb000036
cpu MHz         : 2199.998
cache size      : 16384 KB
physical id     : 0
siblings        : 6
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat umip md_clear arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data bhi
bogomips        : 4399.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
CPU info of the Pod:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
stepping        : 1
microcode       : 0xb000036
cpu MHz         : 2199.998
cache size      : 16384 KB
physical id     : 0
siblings        : 6
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat umip md_clear arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data bhi
bogomips        : 4399.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Other information

No response

@ghost ghost added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Feb 25, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Feb 25, 2025
@EgorBo EgorBo added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Feb 25, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member

EgorBo commented Feb 25, 2025

@tannergooding @dotnet/jit-contrib any idea how vpbroadcastq could appear in a non-avx512 environment?

@EgorBo
Member

EgorBo commented Feb 25, 2025

kernel: traps: .NET Server GC[1867769] trap invalid opcode ip:7f2eb7bdc9d1 sp:7f2eb0de1e90 error:0 in libcoreclr.so[7f2eb7784000+4ed000]

ah, maybe it's actually VXSort in GC?

@EgorBo
Member

EgorBo commented Feb 25, 2025

I can provide a dump if needed.

@Typhon226 could you please upload it to https://developercommunity.visualstudio.com/ (you can configure privacy level for attachments there if needed)

@tannergooding
Member

vpbroadcastq isn't AVX512 specific, it also exists on AVX2. The issue here looks to be specific to it using ZMM, which is AVX512F only.

I'm not aware of anything that would cause this JIT side, as we should be guarding and even asserting that AVX512 specific instructions/nodes aren't being introduced if compSupports reports false. So it could be GC or something else instead.

@JulieLeeMSFT
Member

vpbroadcastq isn't AVX512 specific, it also exists on AVX2. The issue here looks to be specific to it using ZMM, which is AVX512F only.

I'm not aware of anything that would cause this JIT side, as we should be guarding and even asserting that AVX512 specific instructions/nodes aren't being introduced if compSupports reports false. So it could be GC or something else instead.

CC @mangod9 @Maoni0.

@Typhon226
Author

@EgorBo
Dump is attached here:
https://developercommunity.visualstudio.com/t/Dump-for-Issue-112897-Signal-SIGILL-Il/10858257

@EgorBo
Member

EgorBo commented Feb 26, 2025

@EgorBo Dump is attached here: https://developercommunity.visualstudio.com/t/Dump-for-Issue-112897-Signal-SIGILL-Il/10858257

@jkotas as someone who often views dumps uploaded by the community - do I need a special permission to view them from there? Because I get Error 403

@jkotas
Member

jkotas commented Feb 26, 2025

You do not need special permissions. It looks like the link with restricted permissions was copy-pasted into the description from somewhere.

@Typhon226 Could you please attach the dump to the developercommunity issue via the paper clip icon so that we are able to access it?

Image

@Typhon226
Author

@EgorBo I re-uploaded the file.
The error was that I closed the upload popup right after the progress bar reached 100%.
This time I waited a little and it closed by itself; now I was also able to download it.

@EgorBo
Member

EgorBo commented Feb 26, 2025

@Typhon226 Thanks! Looks like it's definitely inside the GC (VXSort). The GC uses AVX512 to accelerate the sort, but that is expected to be guarded by a run-time check.

[0x5]   libcoreclr!vxsort::vxsort_machine_traits<long, (vxsort::vector_machine)2>::shift_n_sub<3>+0xd   (Inline Function)   (Inline Function)   
[0x6]   libcoreclr!vxsort::packer<long, int, (vxsort::vector_machine)2, 3, 2, 128, false>::pack+0xd   (Inline Function)   (Inline Function)   
[0x7]   libcoreclr!vxsort::vxsort<long, (vxsort::vector_machine)2, 8, 3>::sort+0x111   0x7fcae20e1e90   0x7fcae8ee18b9   
[0x8]   libcoreclr!vxsort::vxsort<long, (vxsort::vector_machine)2, 8, 3>::sort+0x59   0x7fcae20e2460   0x7fcae8ee1838   
[0x9]   libcoreclr!do_vxsort_avx512+0x58   0x7fcae20e2480   0x7fcae8cdbcc2   
[0xa]   libcoreclr!SVR::do_vxsort+0x97   (Inline Function)   (Inline Function)   
[0xb]   libcoreclr!SVR::gc_heap::sort_mark_list+0x252   0x7fcae20e2a20   0x7fcae8cf687c   
[0xc]   libcoreclr!SVR::gc_heap::mark_phase+0x1b5c   0x7fcae20e2a70   0x7fcae8cefc8a   
[0xd]   libcoreclr!SVR::gc_heap::gc1+0x2ca   0x7fcae20e2b50   0x7fcae8cd7181   
[0xe]   libcoreclr!SVR::gc_heap::garbage_collect+0xbb1   0x7fcae20e2c20   0x7fcae8cd381d   
[0xf]   libcoreclr!SVR::gc_heap::gc_thread_function+0x1abd   0x7fcae20e2cd0   0x7fcae8cd1d56   
[0x10]   libcoreclr!SVR::gc_heap::gc_thread_stub+0x31   0x7fcae20e2d60   0x7fcae8bf752e   

@EgorBo EgorBo added area-GC-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Feb 26, 2025
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9
Member

mangod9 commented Feb 26, 2025

adding @cshung, since Linux support for vxsort was added in 9. If it was using avx512 specific registers it probably would have failed in other environments. Is this specific to the CPU specification listed in the OP?

@EgorBo
Member

EgorBo commented Feb 26, 2025

The code that is responsible for detecting AVX512 at run time in the GC: https://github.com/dotnet/runtime/blob/main/src/coreclr/gc/vxsort/isa_detection.cpp#L85-L121

@tannergooding
Member

tannergooding commented Feb 26, 2025

I have no clue if __builtin_cpu_supports is doing the "right stuff".
My "guess" is that it's only checking the CPUID bit
and not also checking xsave, xgetbv, or other necessary bits like the OS requires.

The Linux support should really be nearly identical to what Windows is doing there,
but we already have an xplat helper in PAL (used by the VM and NAOT/Crossgen), so let's just not duplicate it and reuse that instead.

Maybe we can even just use the GC/EE interface and mirror the JIT flags,
as the JIT/EE interface does, and rely on the VM reporting rather than querying it independently in the GC?
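
Independently of how the runtime detects ISA support, the `flags` line of /proc/cpuinfo shows what the kernel exposes to the process, which makes for a quick sanity check on an affected machine. A minimal sketch, assuming a Linux x86 host. `has_cpu_flag` is a hypothetical helper and does not reproduce the CPUID/XGETBV logic discussed in this comment.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the whitespace-separated flag list in $2
# contains the exact flag name in $1 (so "avx" does not match "avx2").
has_cpu_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example: inspect the first "flags" line of the running machine, if present.
if [ -r /proc/cpuinfo ]; then
    flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 | tr '\t' ' ')
    for f in avx2 avx512f; do
        if has_cpu_flag "$f" "$flags"; then
            echo "$f: reported by the kernel"
        else
            echo "$f: not reported by the kernel"
        fi
    done
fi
```

On the CPU listed in the description, the flags line contains avx2 but no avx512f entry, so a process that still executes ZMM instructions there points at a detection problem rather than at the hardware.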

@jkotas
Member

jkotas commented Feb 26, 2025

The information used by __builtin_cpu_supports is populated here: https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/cpu_model/x86.c#L867-L918 . It does check xsave, xgetbv and other necessary bits.

It would be useful to check what the __cpu_model static got populated to, and what code ran to populate it. The process may have an old or buggy version of these checks.

@jkotas
Member

jkotas commented Feb 26, 2025

Per https://github.com/dotnet/runtime/blob/main/docs/project/linux-build-methodology.md#security-related-servicing, we are statically linking low-level C library helpers from Ubuntu 16. The low-level C library includes the helper to initialize __cpu_model.

The copy of the helper that we are linking in seems to be missing this fix to handle AVX512 correctly: gcc-mirror/gcc@059cc8a

We either need to patch the low-level C library in our build containers (cc @sbomer) or to switch to our copy of the AVX512 detection logic as @tannergooding suggested.

@mangod9
Member

mangod9 commented Feb 26, 2025

switching to using the PAL helper would be good, assuming it works for standalone GC too.

@jkotas
Member

jkotas commented Feb 26, 2025

switching to using the PAL helper would be good, assuming it works for standalone GC too.

It can be made to work. I assume that we would link the PAL helper statically. We tend to avoid communicating these types of details over GC/EE interface.

Note that this bug is likely .NET 9 specific. It should be fixed in .NET 10 as a side-effect of updating our dependencies (#109939).

@angelMachin

angelMachin commented Feb 27, 2025

I have the same issue after migrating from .NET 8 to .NET 9 in some backend services on Debian 12.
It is a random issue; sometimes it happens after hours of execution.

The systemd log has these entries:

kernel: traps: .NET Server GC[1957508] trap invalid opcode ip:7fcba04281ea sp:7fcb203c7360 error:0 in libcoreclr.so[7fcb9ffd3000+4ec000]

 Main process exited, code=killed, status=4/ILL

CPU Info:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
stepping	: 7
microcode	: 0x5003302
cpu MHz		: 2400.082
cache size	: 36608 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 movbe popcnt aes f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault intel_ppin ssbd ibrs ibpb stibp fsgsbase bmi1 bmi2 erms rdseed adx clflushopt clwb md_clear
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_stale_data retbleed gds bhi ibpb_no_ret
bogomips	: 4800.16
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

@mrsharm
Member

mrsharm commented Feb 27, 2025

@sbomer: can you take a look please? Seems like a dependency issue as JanK mentioned here

@sbomer
Member

sbomer commented Feb 27, 2025

Patching the C helpers in the container might be a riskier fix than using the PAL check since it involves switching to a source-built version (although it would be a good test of our servicing strategy). Using the PAL check also seems better long-term since it removes a dependency on the C helpers. Any opinions @jkotas?

Happy to help patch the dependency if that's what we decide.

@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Mar 1, 2025
@MichalPetryka
Contributor

Per https://github.com/dotnet/runtime/blob/main/docs/project/linux-build-methodology.md#security-related-servicing, we are statically linking low-level C library helpers from Ubuntu 16. The low-level C library includes the helper to initialize __cpu_model.

The copy of the helper that we are linking in seems to be missing this fix to handle AVX512 correctly: gcc-mirror/gcc@059cc8a

We either need to patch the low-level C library in our build containers (cc @sbomer) or to switch to our copy of the AVX512 detection logic as @tannergooding suggested.

@jkotas I don't think that is the issue here, the referenced commit seems to be for situations where the OS has AVX512 support completely disabled, but according to Intel docs, the CPU in the description lacks it totally:

Image

@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Mar 4, 2025
@jkotas
Member

jkotas commented Mar 4, 2025

@jkotas I don't think that is the issue here, the referenced commit seems to be for situations where the OS has AVX512 support completely disabled, but according to Intel docs, the CPU in the description lacks it totally:

We may be missing more GCC bug fixes there, or something else is off.

The dump that I have looked at had FEATURE_AVX512F bit set in the __cpu_model static, and the code to initialize that static was missing gcc-mirror/gcc@059cc8a bug fix.

In any case, the fix implemented by #113032 is switching to our helper that should not have these bugs.

@cshung
Contributor

cshung commented Mar 4, 2025

@Typhon226, @angelMachin

The 'fix' for the issue is merged as #113032, I am wondering if you can help with testing it?

@Typhon226
Author

Of course.
Could you please provide the patched binaries?
Or tell me how I can get them?

@angelMachin

@Typhon226, @angelMachin

The 'fix' for the issue is merged as #113032, I am wondering if you can help with testing it?

Yes, sure. How could we get the patched version?

@mrsharm
Member

mrsharm commented Mar 5, 2025

Hi @angelMachin and @Typhon226, I have added the GC-specific binaries, built in release, for you to test with: https://github.com/mrsharm/GCBinaries/tree/main/112897

To make use of these, you'll need to copy the binaries to the same location as libcoreclr.so (depending on where you run dotnet from, it could live in /usr/lib/dotnet/shared/Microsoft.NETCore.App/<Version>/) and then set the following environment variable: export DOTNET_GCName=libclrgcexp.so. More details about the configuration and more details about clrgc.

@Typhon226
Author

Typhon226 commented Mar 6, 2025

Sadly I got the same error as before.
What I have done:

  • Created a folder named "gcTool" in my project and copied the two files into it
  • Modified my Dockerfile and added "COPY ./gcTool/ /usr/share/dotnet/sdk/9.0.200/" to it
  • After starting the pod I checked the directory
    Image
  • I ran "export DOTNET_GCName=libclrgcexp.so"
  • Checking "echo $DOTNET_GCName" outputs "libclrgcexp.so"
  • Ran my application with "dotnet MyApp.dll"

The dump shows "(156.15b): Signal SIGILL (Illegal instruction) code ILL_ILLOPN (Illegal operand) at 0x7f699bb88391" again.
And the address seems to be the same as before:

Image

I also checked my DOTNET_GCPath variable which is empty.

@cshung
Contributor

cshung commented Mar 7, 2025

@Typhon226, I wonder if you can check in your dump whether or not the crashing process is loading the libclrgcexp.so?

@Typhon226
Author

@cshung It looks like the libclrgcexp.so is loaded

Image

@jkotas
Member

jkotas commented Mar 10, 2025

Modified my Dockerfile and added "COPY ./gcTool/ /usr/share/dotnet/sdk/9.0.200/" to it

This should be COPY ./gcTool/ /usr/share/dotnet/shared/Microsoft.NETCore.App/9.0.2/ so that it overwrites the default libclrgcexp.so
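
The target directory is the one that holds libcoreclr.so for the runtime version in use. `dotnet --list-runtimes` prints lines in the same shape as the "Microsoft.NETCore.App 9.0.2 [...]" entry in the description, so the path can be derived rather than typed by hand. A minimal sketch: `runtime_dir` is a hypothetical helper, and the copy step is shown only as a comment because the file names in the gcTool folder are an assumption.

```shell
#!/bin/sh
# Hypothetical helper: read `dotnet --list-runtimes` style lines on stdin and
# print "<path>/<version>" for each Microsoft.NETCore.App entry.
runtime_dir() {
    sed -n 's|^Microsoft\.NETCore\.App \([^ ]*\) \[\(.*\)]$|\2/\1|p'
}

# Line copied from the configuration section of the description.
echo 'Microsoft.NETCore.App 9.0.2 [/usr/share/dotnet/shared/Microsoft.NETCore.App]' | runtime_dir

# On a real machine (assumes dotnet is on PATH and the file name below):
#   dir=$(dotnet --list-runtimes | runtime_dir | tail -n 1)
#   cp ./gcTool/libclrgcexp.so "$dir"/
#   export DOTNET_GCName=libclrgcexp.so
```

For the sample line this prints /usr/share/dotnet/shared/Microsoft.NETCore.App/9.0.2, the directory named in the corrected COPY command.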

@jkotas
Member

jkotas commented Mar 10, 2025

It looks like the libclrgcexp.so is loaded

My guess is that it is not loading the libclrgcexp.so that @mrsharm shared above.

@Typhon226
Author

Typhon226 commented Mar 11, 2025

@jkotas I changed the copy command to the path you provided.
Then I ran "export DOTNET_GCName=libclrgcexp.so".
Now the app will not start because it is no longer able to initialize the GC.

GC initialization failed with error 0x8007007E
Failed to create CoreCLR, HRESULT: 0x8007007E

I was not sure if I had to set DOTNET_GCName, so I tried without setting it and then got the illegal instruction error.

@jkotas
Member

jkotas commented Mar 11, 2025

GC initialization failed with error 0x8007007E
Failed to create CoreCLR, HRESULT: 0x8007007E

This means that the provided libclrgcexp.so is broken. I can see that the binary depends on GLIBC_2.38 (run ldd -v libclrgcexp.so), but Ubuntu 22 has 2.35 (https://launchpad.net/ubuntu/jammy/+source/glibc).

@mrsharm Binaries built on regular Linux are non-portable between Linux distros. They are only guaranteed to work on the distro they were built on, which is presumably not Ubuntu 22.

You need to use the official build image https://github.com/dotnet/runtime/blob/main/docs/workflow/using-docker.md#the-official-runtime-docker-images to build a binary that is portable between distros.
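
One way to spot this kind of mismatch before deploying is to compare the highest GLIBC symbol version a binary references with the glibc on the target. A minimal sketch, assuming GNU `sort -V` and binutils `objdump` are available. `max_glibc_version` is a hypothetical helper, and the sample tags below are illustrative input rather than real tool output.

```shell
#!/bin/sh
# Hypothetical helper: read text containing GLIBC_x.y[.z] version tags on
# stdin (for example from `objdump -T` or `ldd -v`) and print the highest.
max_glibc_version() {
    grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -u -V | tail -n 1
}

# Illustrative input: three version tags, one per line.
printf '%s\n' '(GLIBC_2.2.5)' '(GLIBC_2.38)' '(GLIBC_2.17)' | max_glibc_version

# On a real machine:
#   objdump -T ./libclrgcexp.so | max_glibc_version   # what the binary needs
#   ldd --version | head -n 1                         # what the target provides
```

Version sort matters here: a plain lexical sort would rank 2.4 above 2.17.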

@cshung
Contributor

cshung commented Mar 12, 2025

I have got some new binaries built using the docker instruction above, let's hope it works this time.

https://github.com/cshung/GCBinaries/tree/main/112897

Build command line:

docker run --rm \
  -v ~/git/runtime:/runtime \
  -w /runtime \
  -e ROOTFS_DIR=/crossrootfs/x64/ \
  mcr.microsoft.com/dotnet-buildtools/prereqs:azurelinux-3.0-net10.0-cross-amd64 \
  ./build.sh -s clr --cross -c Release

Dependencies:

andrewau@aa-helium:~/dev/GCBinaries/112897$ ldd -v ./libclrgcexp.so
        linux-vdso.so.1 (0x00007ffcfaa9a000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2d0c451000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2d0c302000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2d0c2e7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2d0c0f5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2d0c749000)

        Version information:
        ./libclrgcexp.so:
                libstdc++.so.6 (GLIBCXX_3.4) => /lib/x86_64-linux-gnu/libstdc++.so.6
                libstdc++.so.6 (CXXABI_1.3) => /lib/x86_64-linux-gnu/libstdc++.so.6
                libm.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libm.so.6
                libm.so.6 (GLIBC_2.27) => /lib/x86_64-linux-gnu/libm.so.6
                libgcc_s.so.1 (GCC_3.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.6) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.17) => /lib/x86_64-linux-gnu/libc.so.6
                ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
        /lib/x86_64-linux-gnu/libstdc++.so.6:
                libm.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libm.so.6
                ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
                libgcc_s.so.1 (GCC_4.2.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libgcc_s.so.1 (GCC_3.4) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libgcc_s.so.1 (GCC_3.3) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libgcc_s.so.1 (GCC_3.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.6) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.18) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.16) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.17) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libm.so.6:
                ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
                libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_PRIVATE) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libgcc_s.so.1:
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libc.so.6:
                ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
                ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2

@Typhon226
Author

Typhon226 commented Mar 13, 2025

It works. My app now runs for more than 10 minutes without crashing.
Thanks for the help!

@cshung
Contributor

cshung commented Mar 14, 2025

It works. My app now runs for more than 10 minutes without crashing. Thanks for the help!

I have started the process of getting this fix backported to .NET 9.
The port, however, was not trivial. There are some build system changes that altered the way the binary is built.
I have figured it out and produced some new binaries using the older build system.
In principle, that shouldn't change anything; in practice, you never really know.

Just to be safe, would you mind helping us to test the new binaries once more? The new binaries are shared at exactly the same place here as before.

https://github.com/cshung/GCBinaries/tree/main/112897

@Typhon226
Author

@cshung The new binaries are working too.

@CyborgDE

CyborgDE commented Apr 2, 2025

Hello,
The new binaries worked with Debian in 9.0.1 and 9.0.2, but I got the error again in 9.0.3...
RedHat no problems at the moment with 9.0.3.

Update:
9.0.2 with new binaries failed: Main process exited, code=killed, status=4/ILL

@jkotas jkotas reopened this Apr 3, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 3, 2025
@cshung
Contributor

cshung commented Apr 5, 2025

Hello, The new binaries worked with Debian in 9.0.1 and 9.0.2, but I got the error again in 9.0.3... RedHat no problems at the moment with 9.0.3.

Update: 9.0.2 with new binaries failed: Main process exited, code=killed, status=4/ILL

Can you share some diagnostic information about what is going on?

@CyborgDE

CyborgDE commented Apr 5, 2025

The big problem is that the server is simply gone and this is the only information I get:
Main process exited, code=killed, status=4/ILL

@cshung
Contributor

cshung commented Apr 6, 2025

The big problem is that the server is simply gone and this is the only information I get: Main process exited, code=killed, status=4/ILL

Can you try capturing a crash dump?
https://learn.microsoft.com/en-us/dotnet/core/diagnostics/collect-dumps-crash
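
For a process that dies without any managed exception, the runtime can write a dump on crash when the environment variables described in the linked article are set before launch. A minimal sketch: the output path is an assumption and must be writable by the account running the service. For a systemd-managed service the variables would have to be set in the service's environment rather than in an interactive shell.

```shell
#!/bin/sh
# Ask the .NET runtime to write a dump when the process crashes.
export DOTNET_DbgEnableMiniDump=1
# 4 = full dump (1 = mini, 2 = heap, 3 = triage).
export DOTNET_DbgMiniDumpType=4
# Assumed location; %p is replaced with the process ID.
export DOTNET_DbgMiniDumpName=/tmp/coredump.%p

# Then start the application from the same environment, for example:
#   dotnet MyApp.dll
```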

@CyborgDE

CyborgDE commented Apr 7, 2025

I first tried it with .NET 10.0, same error.
It's a production server, so I have to be careful...
A process starts, the server restarts, and then the process starts again and terminates correctly.

@CyborgDE

CyborgDE commented Apr 9, 2025

Image

@cshung
Contributor

cshung commented Apr 13, 2025

Using some online services, I translated the German to English. Unfortunately, looking at a few objects in the dump doesn't really help much.

Image

Since you can capture a dump (hopefully at the point of crash), I guess we can look at the native stack and try to figure out what is causing the crash.
