Process not responsive dump indicates garbage collection #110350

Closed
tornie2 opened this issue Dec 3, 2024 · 40 comments · Fixed by #110589 or #112809
Labels: area-VM-coreclr, in-pr, os-windows

@tornie2

tornie2 commented Dec 3, 2024

Description

After upgrading to .NET 9, random processes freeze and become completely unresponsive.
The processes run as Windows services on a Windows VM.

I have a dump file that I could send to you.
I would rather not make it public, as it probably contains passwords.

Analyzing the dump indicates a possible problem in the garbage collector

0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 1484

    Key  : Analysis.Elapsed.mSec
    Value: 5300

    Key  : Analysis.IO.Other.Mb
    Value: 0

    Key  : Analysis.IO.Read.Mb
    Value: 1

    Key  : Analysis.IO.Write.Mb
    Value: 1

    Key  : Analysis.Init.CPU.mSec
    Value: 781

    Key  : Analysis.Init.Elapsed.mSec
    Value: 120611

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 223

    Key  : Analysis.Version.DbgEng
    Value: 10.0.27725.1000

    Key  : Analysis.Version.Description
    Value: 10.2408.27.01 amd64fre

    Key  : Analysis.Version.Ext
    Value: 1.2408.27.1

    Key  : CLR.Engine
    Value: CORECLR

    Key  : CLR.Version
    Value: 9.0.24.52809

    Key  : Failure.Bucket
    Value: BREAKPOINT_80000003_coreclr.dll!WKS::GCHeap::WaitUntilGCComplete

    Key  : Failure.Hash
    Value: {54e9a6da-d4d0-d004-574b-4219b46bdb8d}

    Key  : Failure.Source.FileLine
    Value: 265

    Key  : Failure.Source.FilePath
    Value: D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

    Key  : Failure.Source.SourceServerCommand
    Value: raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gcee.cpp

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 896327

    Key  : Timeline.Process.Start.DeltaSec
    Value: 17922

    Key  : WER.OS.Branch
    Value: rs5_release

    Key  : WER.OS.Version
    Value: 10.0.17763.1

    Key  : WER.Process.Version
    Value: 1.0.0.0


FILE_IN_CAB:  SmfHaircuts.Service-2024-12-03-YB6213.DMP

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 0000000000000000
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD:  00000f6c

PROCESS_NAME:  SmfHaircuts.Service.dll

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE_STR:  80000003

STACK_TEXT:  
0000008c`61d7e028 00007ffc`bdba0f33     : 00000000`00000000 000002be`90a8dab0 000002be`90a8d9f0 0000008c`61d7e170 : ntdll!NtWaitForSingleObject+0x14
0000008c`61d7e030 00007ffb`364f1c30     : 00000000`00000000 00004612`d1730f35 00000000`00000000 00000000`00000284 : KERNELBASE!WaitForSingleObjectEx+0x93
0000008c`61d7e0d0 00007ffb`36416915     : 00000000`00000000 0000008c`61d7e2d0 00000000`00000804 0000008c`61d7e1b0 : coreclr!WKS::GCHeap::WaitUntilGCComplete+0x30
0000008c`61d7e100 00007ffb`364e8328     : 00007ffa`d68130c0 00000000`00000000 00000000`00000000 0000027d`f97fc570 : coreclr!Thread::RareDisablePreemptiveGC+0x9d
0000008c`61d7e190 00007ffb`3659ea2d     : 00007ffa`d68130c0 00000000`00000000 000002be`906cedf0 00000001`00000000 : coreclr!JIT_ReversePInvokeEnterRare2+0x18
0000008c`61d7e1c0 00007ffa`d7e5b718     : 00000000`00000004 0000008c`61d7e260 00000000`00000000 00007ffc`c166598d : coreclr!JIT_ReversePInvokeEnterTrackTransitions+0x9d13d
0000008c`61d7e1f0 00000000`00000004     : 0000008c`61d7e260 00000000`00000000 00007ffc`c166598d 00000000`00000000 : 0x00007ffa`d7e5b718
0000008c`61d7e1f8 0000008c`61d7e260     : 00000000`00000000 00007ffc`c166598d 00000000`00000000 0000008c`61d7e1f0 : 0x4
0000008c`61d7e200 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0000008c`61d7e260


STACK_COMMAND:  ~0s; .ecxr ; kb

FAULTING_SOURCE_LINE:  D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

FAULTING_SOURCE_FILE:  D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

FAULTING_SOURCE_LINE_NUMBER:  265

FAULTING_SOURCE_SRV_COMMAND:  https://raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gcee.cpp

FAULTING_SOURCE_CODE:  
No source found for 'D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp'


SYMBOL_NAME:  coreclr!WKS::GCHeap::WaitUntilGCComplete+30

MODULE_NAME: coreclr

IMAGE_NAME:  coreclr.dll

FAILURE_BUCKET_ID:  BREAKPOINT_80000003_coreclr.dll!WKS::GCHeap::WaitUntilGCComplete

OS_VERSION:  10.0.17763.1

BUILDLAB_STR:  rs5_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

IMAGE_VERSION:  9.0.24.52809

FAILURE_ID_HASH:  {54e9a6da-d4d0-d004-574b-4219b46bdb8d}

Followup:     MachineOwner

Reproduction Steps

Not possible. Happens randomly.

Expected behavior

Not freezing

Actual behavior

Process completely unresponsive

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

ghost added the needs-area-label label Dec 3, 2024
dotnet-policy-service bot added the untriaged label Dec 3, 2024
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

vcsjones removed the needs-area-label label Dec 3, 2024
@mangod9
Member

mangod9 commented Dec 3, 2024

Does this happen during startup?

This feels similar to #105780. Are you able to try disabling the new GC mode with DOTNET_GCDynamicAdaptationMode=0?

The fix for this issue should be included in the Jan servicing release for 9.0.

@tornie2
Author

tornie2 commented Dec 3, 2024

It is not during startup. The processes usually run for days before this happens.

I have added this to the csproj of the exe for the process where we have seen this most often.
Will this work as a temporary workaround until the fix is released?

<ItemGroup>
	<RuntimeHostConfigurationOption Include="DOTNET_GCDynamicAdaptationMode" Value="0" />
</ItemGroup>

@mangod9
Member

mangod9 commented Dec 3, 2024

If it's not during startup it could be a different issue. If it has been reproing frequently then yeah disabling DOTNET_GCDynamicAdaptationMode would be worth a try. If you are able to share a dump privately that would help in confirming if it's the same issue.

@tornie2
Author

tornie2 commented Dec 4, 2024

I hope you got an e-mail with a link to the dump file

@mangod9
Member

mangod9 commented Dec 5, 2024

Thanks for sharing the dump. This isn't related to the DATAS issue I pointed to earlier. In fact the app is using WKS GC. The dump shows something similar to #107800, but in this case it's a thread shutdown racing with GC trying to create a BGC thread, so it looks to be a deadlock between DetachThread + CreateThread? @VSadov @kouvel @jkotas have you seen something similar before?

 # Child-SP          RetAddr               Call Site
00 000000de`8877f298 00007ff8`dce10f33     ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 000000de`8877f2a0 00007ff8`5a3f1c30     KERNELBASE!WaitForSingleObjectEx+0x93 [minkernel\kernelbase\synch.c @ 1328] 
02 (Inline Function) --------`--------     coreclr!GCEvent::Impl::Wait+0xf [D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp @ 1381] 
03 (Inline Function) --------`--------     coreclr!GCEvent::Wait+0x16 [D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp @ 1431] 
04 000000de`8877f340 00007ff8`5a316915     coreclr!WKS::GCHeap::WaitUntilGCComplete+0x30 [D:\a\_work\1\s\src\coreclr\gc\gcee.cpp @ 265] 
05 000000de`8877f370 00007ff8`5a2cc828     coreclr!Thread::RareDisablePreemptiveGC+0x9d [D:\a\_work\1\s\src\coreclr\vm\threadsuspend.cpp @ 2212] 
06 (Inline Function) --------`--------     coreclr!Thread::DisablePreemptiveGC+0x1f [D:\a\_work\1\s\src\coreclr\vm\threads.h @ 1297] 
07 (Inline Function) --------`--------     coreclr!GCHolderBase::EnterInternalCoop+0x37 [D:\a\_work\1\s\src\coreclr\vm\threads.h @ 4712] 
08 000000de`8877f400 00007ff8`5a3ac7b8     coreclr!GCCoop::GCCoop+0x54 [D:\a\_work\1\s\src\coreclr\vm\threads.h @ 4832] 
09 000000de`8877f430 00007ff8`5a3ac6ea     coreclr!Thread::CooperativeCleanup+0x24 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 2737] 
0a 000000de`8877f480 00007ff8`5a3ac60e     coreclr!Thread::DetachThread+0x9a [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 936] 
0b 000000de`8877f4b0 00007ff8`5a408c83     coreclr!TlsDestructionMonitor::~TlsDestructionMonitor+0x62 [D:\a\_work\1\s\src\coreclr\vm\ceemain.cpp @ 1744] 
0c 000000de`8877f4f0 00007ff8`dfe75d37     coreclr!__dyn_tls_dtor+0x63 [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\tls\tlsdtor.cpp @ 119] 
0d 000000de`8877f520 00007ff8`dfe75e6b     ntdll!LdrpCallInitRoutine+0x6f [minkernel\ntdll\ldr.c @ 212] 
0e 000000de`8877f590 00007ff8`dfe733e1     ntdll!LdrpCallTlsInitializers+0x87 [minkernel\ntdll\ldrtls.c @ 1067] 
0f 000000de`8877f610 00007ff8`dfeaa92e     ntdll!LdrShutdownThread+0x141 [minkernel\ntdll\ldrinit.c @ 6354] 
10 000000de`8877f710 00007ff8`dfe66f06     ntdll!RtlExitUserThread+0x3e [minkernel\ntdll\rtlstrt.c @ 2110] 
11 000000de`8877f750 00007ff8`df897ac4     ntdll!TppWorkerThread+0xbe6 [minkernel\threadpool\ntdll\worker.c @ 1286] 
12 000000de`8877fa40 00007ff8`dfeaa8c1     kernel32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64] 
13 000000de`8877fa70 00000000`00000000     ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1163] 

00 000000de`8727f128 00007ff8`dce10f33     ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 000000de`8727f130 00007ff8`5a3895d4     KERNELBASE!WaitForSingleObjectEx+0x93 [minkernel\kernelbase\synch.c @ 1328] 
02 (Inline Function) --------`--------     coreclr!CLREventWaitHelper2+0x6 [D:\a\_work\1\s\src\coreclr\vm\synch.cpp @ 372] 
03 000000de`8727f1d0 00007ff8`5a3ab8f2     coreclr!CLREventWaitHelper+0x20 [D:\a\_work\1\s\src\coreclr\vm\synch.cpp @ 397] 
04 (Inline Function) --------`--------     coreclr!CLREventBase::WaitEx+0x10 [D:\a\_work\1\s\src\coreclr\vm\synch.cpp @ 466] 
05 (Inline Function) --------`--------     coreclr!CLREventBase::Wait+0x10 [D:\a\_work\1\s\src\coreclr\vm\synch.cpp @ 412] 
06 000000de`8727f230 00007ff8`5a3abb98     coreclr!`anonymous namespace'::CreateSuspendableThread+0x10e [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 1481] 
07 000000de`8727f300 00007ff8`5a400ac8     coreclr!GCToEEInterface::CreateThread+0x154 [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 1568] 
08 (Inline Function) --------`--------     coreclr!WKS::gc_heap::create_bgc_thread+0x18 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 39484] 
09 000000de`8727f4e0 00007ff8`5a314860     coreclr!WKS::gc_heap::prepare_bgc_thread+0x4c [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 39443] 
0a 000000de`8727f510 00007ff8`5a31756e     coreclr!WKS::gc_heap::garbage_collect+0x2e4 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 24384] 
0b 000000de`8727f560 00007ff8`5a5ab2a0     coreclr!WKS::GCHeap::GarbageCollectGeneration+0x13e [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 51065] 
0c 000000de`8727f5c0 00007ff8`5a4751b3     coreclr!WKS::GCHeap::GarbageCollect+0x110 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 50191] 
0d (Inline Function) --------`--------     coreclr!ThreadStore::TriggerGCForDeadThreadsIfNecessary+0xec2cc [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5483] 
0e 000000de`8727f600 00007ff8`5a38942a     coreclr!Thread::DoExtraWorkForFinalizer+0xec3af [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7051] 
0f 000000de`8727f670 00007ff8`5a385131     coreclr!FinalizerThread::FinalizerThreadWorker+0xca [D:\a\_work\1\s\src\coreclr\vm\finalizerthread.cpp @ 407] 
10 (Inline Function) --------`--------     coreclr!ManagedThreadBase_DispatchInner+0xd [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7110] 
11 000000de`8727f8c0 00007ff8`5a38504b     coreclr!ManagedThreadBase_DispatchMiddle+0x81 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7154] 
12 000000de`8727f970 00007ff8`5a3c4201     coreclr!ManagedThreadBase_DispatchOuter+0xab [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7313] 
13 (Inline Function) --------`--------     coreclr!ManagedThreadBase_NoADTransition+0x28 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7382] 
14 (Inline Function) --------`--------     coreclr!ManagedThreadBase::FinalizerBase+0x28 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7401] 
15 000000de`8727fa10 00007ff8`df897ac4     coreclr!FinalizerThread::FinalizerThreadStart+0x91 [D:\a\_work\1\s\src\coreclr\vm\finalizerthread.cpp @ 464] 
16 000000de`8727fb20 00007ff8`dfeaa8c1     kernel32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64] 
17 000000de`8727fb50 00000000`00000000     ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1163] 

@jkotas
Member

jkotas commented Dec 5, 2024

have you seen something similar before?

I believe that this is one of the reasons why native AOT uses FiberDetachCallback for thread shutdown notifications. (FiberDetachCallback does not run under loader lock.)

@VSadov
Member

VSadov commented Dec 5, 2024

But from the stacks it looks like both threads wait on CLR events/locks. So maybe this is not related to OS lock?

@mangod9
Member

mangod9 commented Dec 5, 2024

Here is the stack of the BGC thread. It has started but doesn't get to gc_heap::bgc_thread_stub due to the loader lock. There are other threads in a similar state.

  22  Id: 55ec.2e90 Suspend: 0 Teb: 000000de`86728000 Unfrozen ".NET BGC"
 # RetAddr               : Args to Child                                                           : Call Site
00 00007ff8`dfe783f5     : 00007ff8`dffb52b0 000000de`8664d000 000000de`8757f3c0 00000000`00002000 : ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 00007ff8`dfe735f7     : 00000000`00000000 00000000`00000000 000000de`8664d000 00000000`0000000f : ntdll!LdrpDrainWorkQueue+0x15d [minkernel\ntdll\ldrmap.c @ 3142] 
02 00007ff8`dfec8b25     : 00000000`00000000 00000000`00000000 00000000`00000001 00000000`00000000 : ntdll!LdrpInitializeThread+0x8b [minkernel\ntdll\ldrinit.c @ 6528] 
03 00007ff8`dfec8703     : 00000000`00000000 00007ff8`dfe50000 00000000`00000000 000000de`86728000 : ntdll!_LdrpInitialize+0x409 [minkernel\ntdll\ldrinit.c @ 1838] 
04 00007ff8`dfec86ae     : 000000de`8757f3c0 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrpInitialize+0x3b [minkernel\ntdll\ldrinit.c @ 1435] 
05 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrInitializeThunk+0xe [minkernel\ntdll\ldrstart.c @ 91] 

  23  Id: 55ec.6370 Suspend: 0 Teb: 000000de`8672a000 Unfrozen
 # RetAddr               : Args to Child                                                           : Call Site
00 00007ff8`dfe783f5     : 00007ff8`dffb52b0 000000de`8664d000 000000de`87b7f460 00000000`00002000 : ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 00007ff8`dfe735f7     : 00000000`00000000 00000000`00000000 000000de`8664d000 00000000`0000000f : ntdll!LdrpDrainWorkQueue+0x15d [minkernel\ntdll\ldrmap.c @ 3142] 
02 00007ff8`dfec8b25     : 00000000`00000000 00000000`00000000 00000000`00000001 00000000`00000000 : ntdll!LdrpInitializeThread+0x8b [minkernel\ntdll\ldrinit.c @ 6528] 
03 00007ff8`dfec8703     : 00000000`00000000 00007ff8`dfe50000 00000000`00000000 000000de`8672a000 : ntdll!_LdrpInitialize+0x409 [minkernel\ntdll\ldrinit.c @ 1838] 
04 00007ff8`dfec86ae     : 000000de`87b7f460 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrpInitialize+0x3b [minkernel\ntdll\ldrinit.c @ 1435] 
05 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrInitializeThunk+0xe [minkernel\ntdll\ldrstart.c @ 91] 

@VSadov
Member

VSadov commented Dec 5, 2024

Ah, I see - CreateNonSuspendableThread waits on an event that will be set by the spawned thread. It just needs to see progress from the spawned thread to declare that thread creation was successful.

Thus it is indeed possible that the spawned thread creation/progress needs the same loader lock that the thread that is shutting down is holding.
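
The handshake described above can be sketched in a few lines. This is only an illustration in Python, not the runtime's code: `create_thread_with_handshake` and `startup_gate` are hypothetical names, and the gate lock merely stands in for whatever the new thread needs before it can make progress (the loader lock, in this dump).

```python
import threading


def create_thread_with_handshake(startup_gate, timeout=0.5):
    """Spawn a thread and wait until it signals that it has started.

    startup_gate is a hypothetical stand-in for a lock the new thread must
    pass through before it can run. Returns (started_in_time, thread).
    """
    started = threading.Event()

    def child():
        # The child cannot report progress until it gets through the gate.
        with startup_gate:
            started.set()

    thread = threading.Thread(target=child, daemon=True)
    thread.start()
    # The creator blocks here until the child reports progress.
    # A timeout is used only so that this sketch cannot hang.
    return started.wait(timeout), thread


gate = threading.Lock()

# Nobody holds the gate: the child starts and the creator sees progress.
ok_free, t1 = create_thread_with_handshake(gate)
t1.join()

# Another thread holds the gate (as the exiting thread would):
# the child is stuck, so the creator's wait times out.
gate.acquire()
ok_held, t2 = create_thread_with_handshake(gate)
gate.release()
t2.join()

print(ok_free, ok_held)  # True False
```

If the holder of the gate is itself waiting for the creator to finish, nobody can move, which is the shape of the hang discussed here.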

@mangod9
Member

mangod9 commented Dec 5, 2024

Thread::CooperativeCleanup is new in 9, perhaps it should synchronize whether GC is in progress (well mainly if it's trying to spawn a BGC thread)?

@mangod9
Member

mangod9 commented Dec 5, 2024

@tornie2, as a temporary workaround you can disable background GC and check whether that avoids the issue.

mangod9 removed the untriaged label Dec 6, 2024
mangod9 added this to the 10.0.0 milestone Dec 6, 2024
@jkotas
Member

jkotas commented Dec 6, 2024

Thread::CooperativeCleanup is new in 9

This method is new in .NET 9 (I have introduced it in https://github.com/dotnet/runtime/pull/103877/files#diff-f5835c4b5fd134e52b4127bb4ffb7e5ad439673a429dc7ea46d53e7a5bca0529R2734). There was code on thread shutdown path that switched to cooperative mode before .NET 9 as well, so the fundamental problem is not new.

perhaps it should synchronize whether GC is in progress

This is a classic A-B B-A deadlock. The two locks in question are our thread store lock, which is taken when the runtime is suspended for the GC, and the Windows OS loader lock, which is taken when threads are created/destroyed by the Windows OS. We would either need to get these locks ordered (that's pretty hard) or avoid taking them in conflicting order. The FiberDetachCallback that's used for thread shutdown notifications in native AOT does the latter.
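
To make the A-B B-A shape concrete, here is a small Python sketch. It is only an illustration, not runtime code: the two `threading.Lock` objects are hypothetical stand-ins for the thread store lock and the loader lock, and the timeout and barrier exist only so that the demo reports the conflict instead of hanging.

```python
import threading


def opposite_order_attempt(timeout=0.2):
    """Two threads take two locks in opposite order.

    Returns [thread 0 got its second lock, thread 1 got its second lock].
    """
    thread_store_lock = threading.Lock()  # stand-in, not the real lock
    loader_lock = threading.Lock()        # stand-in, not the real lock
    barrier = threading.Barrier(2)
    results = [None, None]

    def worker(index, first, second):
        with first:
            barrier.wait()  # both threads now hold their first lock
            got = second.acquire(timeout=timeout)
            if got:
                second.release()
            results[index] = got
            barrier.wait()  # keep holding `first` until both attempts end

    threads = [
        threading.Thread(target=worker, args=(0, thread_store_lock, loader_lock)),
        threading.Thread(target=worker, args=(1, loader_lock, thread_store_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def same_order_attempt():
    """Both threads take the locks in the same order, so each one finishes."""
    thread_store_lock = threading.Lock()
    loader_lock = threading.Lock()
    results = [None, None]

    def worker(index):
        with thread_store_lock:
            with loader_lock:
                results[index] = True

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(opposite_order_attempt())  # [False, False]
print(same_order_attempt())      # [True, True]
```

With plain blocking acquires instead of timeouts, the first function would never return. The second function shows the "ordered locks" option; moving the work out from under one of the locks is the other way to avoid the conflicting order.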

Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented Dec 6, 2024

We are doing more work in cooperative mode during thread shutdown in .NET 9, which makes the deadlock more likely to be hit. I think that this issue is a .NET 9 servicing candidate. @VSadov Is this something that you can take a look at?

@VSadov
Member

VSadov commented Dec 6, 2024

Another way to look at it - When a thread terminates, it needs to make its TLAB (Thread-Local Allocation Buffer) parseable, to not leave a portion of GC heap in unparseable state. That is done in FixAllocContext call and cannot be done while GC is in progress, so at least that part of thread termination needs COOP mode. It would be hard to get around.

At the same time, if the terminating thread waits for GC to complete (while holding the loader lock) and GC needs to launch a thread to make progress, there could be a deadlock.

This is indeed not new at all. I wonder a bit why we did not see this earlier. Perhaps threads terminating while GC is in progress was not common for some reason.

@VSadov
Member

VSadov commented Dec 6, 2024

@VSadov Is this something that you can take a look at?

Yes, I can take a look.
Using FiberDetachCallback like in NativeAOT should not have this issue as it does not run user code while holding loader lock.

VSadov self-assigned this Dec 6, 2024
@mangod9
Member

mangod9 commented Dec 6, 2024

Thanks @VSadov for thinking about a fix.

I think that this issue is .NET 9 servicing candidate

Yeah, it would certainly be good to fix this in the next servicing release.

@tornie2
Author

tornie2 commented Dec 6, 2024

I have added this to the project where we have seen the problem the most.
If this service can run for a week without failing, I would be optimistic that this is the issue.

<PropertyGroup>
	<ConcurrentGarbageCollection>false</ConcurrentGarbageCollection>
</PropertyGroup>

@mangod9
Member

mangod9 commented Dec 12, 2024

It's been fixed in main and will be backported to 9. The fix should be available in the Feb servicing release.

@mangod9
Member

mangod9 commented Dec 12, 2024

@tornie2 since you are able to repro this frequently, would you be able to try a private build to ensure the fix resolves the issue for you? Thx

@tornie2
Author

tornie2 commented Dec 13, 2024

Can you deliver the build as an SDK?

This is in our pipeline. It takes the SDK from our local JFrog Artifactory
I could add a private build SDK to our JFrog Artifactory

steps:
- task: DotNetCoreInstaller@0
  displayName: 'Use latest .NET Core sdk'
  inputs:
    version: 9.0.100

@jkotas
Member

jkotas commented Dec 18, 2024

The fix was reverted in #110801 since it broke WinForms tests.

jkotas reopened this Dec 18, 2024
@mdonatas-trafi

We don't have a dump, so we are not completely sure, but two of our micro-services that experience high load seem to be suffering from this under Debian-based images. We had to roll back to net8.0.
This turns out to be more disruptive than a sudden crash: due to our health-check configuration, the process is terminated after 5 minutes, leaving the service unable to serve requests for ~6 minutes.

P.S. To get a memory dump, it would have to be done somehow in ECS (AWS), maybe with a script running in parallel and making the same health-check calls? This part is also not clear at the moment.

P.P.S. This comment is basically a +1 so you can gauge the impact.

@tornie2
Author

tornie2 commented Feb 20, 2025

It's been a couple of months since last update
Is there any progress on this issue?

@mangod9
Member

mangod9 commented Feb 20, 2025

The previous fix had to be reverted since it was affecting other scenarios. @VSadov is still working on a potential fix; we should have an update soon.

@rgroenewoudt

We are also running into a process hang. Our main thread is not blocked and we are using Server GC with DATAS but the BGC thread looks a bit different:

[Image: screenshot of the BGC thread stack]
Is our hang the same problem as this issue or something different?

@mangod9
Member

mangod9 commented Feb 24, 2025

This appears to be the same issue. If you can share the dump, we could look to ensure it's the same. A fix is currently being worked on.

@VSadov
Member

VSadov commented Feb 25, 2025

I will keep the bug open to track backporting of the fix to 9.0

(I'll give it a few days before backporting - in case we see regressions. Thread life-time management is a fairly delicate area.)

@AnthonyLloyd

AnthonyLloyd commented Mar 17, 2025

Hi, we've been seeing this issue with the GC hanging on WaitForSingleObject in .NET 9. I've tried turning off DATAS, but I still see the same problem, possibly less frequently. Could this be some other problem? I've searched, but there seems to be only this issue.
We've not tried turning off background GC, as our services allocate quite a lot and we are worried about pauses.

@mangod9
Member

mangod9 commented Mar 17, 2025

@AnthonyLloyd, have you tried disabling background GC to check if that works around your issue ?

@AnthonyLloyd

@AnthonyLloyd, have you tried disabling background GC to check if that works around your issue ?

We were worried about doing that as our services allocate a lot and may pause. Do you mean just for diagnostics?

@mangod9
Member

mangod9 commented Mar 17, 2025

Yeah, just to determine whether it's related to BGC. The fix (assuming you are on Windows) should be included in an upcoming servicing release.

@AnthonyLloyd

yeah just to determine whether its related to BGC. The fix (assuming you are on windows) should be included in an upcoming servicing release.

Will do, thanks. It may take a couple of days to know.

@VSadov
Member

VSadov commented Mar 19, 2025

The backport of the fix to 9.0 has been merged in #113055

@tornie2
Author

tornie2 commented Apr 9, 2025

Was this bug fixed in the SDK v9.0.203 release?
We would like to enable concurrent garbage collection again

@jkotas
Member

jkotas commented Apr 9, 2025

Was this bug fixed in the SDK v9.0.203 release?

It was not. This fix is expected to ship in 9.0.5 runtime. You can find it in the milestone field of the backport PR - #113055.

@adamjones2
Contributor

@jkotas was this shipped in 9.0.5?

@jkotas
Member

jkotas commented May 15, 2025

Yes, this shipped in 9.0.5.

10 participants