Process not responsive dump indicates garbage collection #110350
Comments
Tagging subscribers to this area: @dotnet/gc
Does this happen during startup? This feels similar to #105780. Are you able to try disabling the new GC mode with DOTNET_GCDynamicAdaptationMode=0? The fix for that issue should be included in the Jan servicing release for 9.0.
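For anyone wanting to try this, here is a minimal sketch of how the setting is usually applied; the MSBuild property and runtimeconfig knob names below come from the general .NET 9 GC configuration documentation and are an assumption, not something spelled out in this thread:

```xml
<!-- In the .csproj of the executable: turn off dynamic adaptation (DATAS). -->
<!-- Equivalent to setting DOTNET_GCDynamicAdaptationMode=0 in the environment -->
<!-- or "System.GC.DynamicAdaptationMode": 0 in runtimeconfig.json. -->
<PropertyGroup>
  <GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>
</PropertyGroup>
```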
It is not during startup. Usually the processes will run for days before this happens. I have added this to the csproj of the exe of the process where we have seen this most often.
If it's not during startup, it could be a different issue. If it has been reproing frequently then yeah, disabling DOTNET_GCDynamicAdaptationMode would be worth a try. If you are able to share a dump privately, that would help in confirming whether it's the same issue.
I hope you got an e-mail with a link to the dump file.
Thanks for sharing the dump. This isn't related to the DATAS issue I pointed to earlier; in fact the app is using WKS GC. The dump shows something similar to #107800, but in this case it's a thread shutdown racing with the GC trying to create a BGC thread, so it looks to be a deadlock between DetachThread + CreateThread? @VSadov @kouvel @jkotas have you seen something similar before?
I believe that this is one of the reasons why native AOT uses FiberDetachCallback for thread shutdown notifications.
But from the stacks it looks like both threads wait on CLR events/locks. So maybe this is not related to the OS lock?
Here is the stack of the BGC thread. It's started but doesn't get any further.
Ah, I see - CreateNonSuspendableThread waits on an event that will be set by the spawned thread. It just needs to see progress from the spawned thread to declare that thread creation was successful. Thus it is indeed possible that the spawned thread's creation/progress needs the same loader lock that the thread that is shutting down is holding.
@tornie2, as a temporary workaround you can disable background GC to check whether that avoids the issue.
This method is new in .NET 9 (I have introduced it in https://github.com/dotnet/runtime/pull/103877/files#diff-f5835c4b5fd134e52b4127bb4ffb7e5ad439673a429dc7ea46d53e7a5bca0529R2734). There was code on the thread shutdown path that switched to cooperative mode before .NET 9 as well, so the fundamental problem is not new.
This is a classic A-B B-A deadlock. The two locks in question are our thread store lock that's taken when the runtime is suspended for the GC, and the Windows OS loader lock that's taken when threads are created/destroyed by the Windows OS. We would either need to get these locks ordered (that's pretty hard) or avoid taking them in conflicting order. The FiberDetachCallback that's used for thread shutdown notifications in native AOT does the latter.
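To make the A-B / B-A shape concrete, here is a small, purely illustrative C# sketch in which two ordinary monitors stand in for the thread store lock and the OS loader lock; this is not the runtime's code, just the lock-ordering pattern being described:

```csharp
// Purely illustrative: two plain locks standing in for the runtime's thread store
// lock and the Windows loader lock. This is NOT runtime code, just the A-B / B-A shape.
using System;
using System.Threading;

class LockOrderDeadlock
{
    static readonly object LoaderLock = new object();      // taken by the OS on thread create/exit
    static readonly object ThreadStoreLock = new object(); // taken while the runtime is suspended for GC

    static void ExitingThread()
    {
        lock (LoaderLock)              // the OS holds the loader lock during thread detach
        {
            Thread.Sleep(100);         // widen the race window for the demo
            lock (ThreadStoreLock)     // thread shutdown needs to coordinate with the GC
            {
            }
        }
    }

    static void GcThread()
    {
        lock (ThreadStoreLock)         // GC suspension holds the thread store lock
        {
            Thread.Sleep(100);
            lock (LoaderLock)          // ...but creating the BGC thread needs the loader lock
            {
            }
        }
    }

    static void Main()
    {
        var a = new Thread(ExitingThread);
        var b = new Thread(GcThread);
        a.Start();
        b.Start();
        a.Join();                      // with the sleeps above, this typically never returns
        b.Join();
        Console.WriteLine("No deadlock this time.");
    }
}
```

Run it and both threads block forever: each holds the lock the other needs next, which is exactly the conflicting acquisition order described above.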
Tagging subscribers to this area: @mangod9
We are doing more work in cooperative mode during thread shutdown in .NET 9, which makes the deadlock more likely to be hit. I think that this issue is a .NET 9 servicing candidate. @VSadov Is this something that you can take a look at?
Another way to look at it: when a thread terminates, it needs to make its TLAB (Thread-Local Allocation Buffer) parseable, so as not to leave a portion of the GC heap in an unparseable state. That is done in the FixAllocContext call and cannot be done while a GC is in progress, so at least that part of thread termination needs COOP mode. It would be hard to get around. At the same time, if the terminating thread waits for the GC to complete (while holding the loader lock) and the GC needs to launch a thread to make progress, there could be a deadlock. This is indeed not new at all. I wonder a bit why we did not see this earlier. Perhaps threads terminating while a GC is in progress was not common for some reason.
Yes, I can take a look. |
Thanks @VSadov for thinking about a fix.
yeah, certainly would be good to fix in the next servicing release.
I have added this to the project where we have seen the problem the most.
It's been fixed in main, will be backported to 9, and should be available in the Feb servicing release.
@tornie2, since you are able to repro this frequently, would you be able to try a private build to ensure the fix resolves the issue for you? Thx
Can you deliver the build as an SDK? Our pipeline takes the SDK from our local JFrog Artifactory.
The fix was reverted in #110801 since it broke WinForms tests.
We don't have a dump, so we are not completely sure, but two of our micro-services which experience high load seem to be suffering from this under Debian-based images, and we had to roll back. P.S. To get a mem-dump, it would have to be done somehow in ECS (AWS), maybe based on a script running in parallel and doing the same health-check calls? This part is also not clear at the moment. P.P.S. This comment is basically a +1.
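On the "script running in parallel and doing the same health-check calls" idea, here is a rough, hypothetical sketch of such a watchdog. It assumes the Microsoft.Diagnostics.NETCore.Client package, a placeholder /health endpoint, and a known target PID; none of these details come from the thread, and `dotnet-dump collect -p <pid>` from a sidecar script would achieve much the same thing:

```csharp
// Hypothetical watchdog: polls the service's health endpoint and, when it stops
// responding, captures a dump of the target process with DiagnosticsClient.
// The health URL, PID argument, and dump path are placeholders, not from the report.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Diagnostics.NETCore.Client;

class DumpWatchdog
{
    static async Task Main(string[] args)
    {
        int targetPid = int.Parse(args[0]);                 // PID of the suspect service
        var healthUrl = "http://localhost:8080/health";     // placeholder health-check URL
        using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

        while (true)
        {
            try
            {
                var response = await http.GetAsync(healthUrl);
                response.EnsureSuccessStatusCode();
            }
            catch (Exception)
            {
                // Health check failed or timed out: assume the process is hung and dump it.
                var client = new DiagnosticsClient(targetPid);
                var dumpPath = $"/tmp/hang-{DateTime.UtcNow:yyyyMMdd-HHmmss}.dmp";
                client.WriteDump(DumpType.WithHeap, dumpPath);
                Console.WriteLine($"Wrote dump to {dumpPath}");
                break;
            }
            await Task.Delay(TimeSpan.FromSeconds(30));
        }
    }
}
```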
It's been a couple of months since the last update.
The previous fix had to be reverted since it was affecting other scenarios. @VSadov is still working on a potential fix; we should have an update soon.
This appears to be the same issue. If you can share the dump, we could take a look to confirm it's the same. A fix is currently being worked on.
I will keep the bug open to track backporting of the fix to 9.0. (I'll give it a few days before backporting, in case we see regressions. Thread lifetime management is a fairly delicate area.)
Hi, we've been seeing this issue with the GC hanging on WaitForSingleObject in .NET 9. I've tried turning off DATAS, but I still see the same problem, possibly less frequently. Is there some other problem this could be? I've searched, but there seems to be only this one.
@AnthonyLloyd, have you tried disabling background GC to check if that works around your issue?
We were worried about doing that, as our services allocate a lot and may pause. Do you mean just for diagnostics?
yeah, just to determine whether it's related to BGC. The fix (assuming you are on Windows) should be included in an upcoming servicing release.
Will do, thanks. It may take a couple of days to know.
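For reference, a hedged sketch of what disabling background GC looks like; the property and environment-variable names below are the standard GC configuration knobs rather than anything quoted in this thread:

```xml
<!-- In the .csproj: use non-concurrent (blocking) GC instead of background GC. -->
<!-- Equivalent to DOTNET_gcConcurrent=0 or "System.GC.Concurrent": false in runtimeconfig.json. -->
<PropertyGroup>
  <ConcurrentGarbageCollection>false</ConcurrentGarbageCollection>
</PropertyGroup>
```

As noted above, allocation-heavy services may see longer pauses with background GC off, so this is best treated as a diagnostic step rather than a permanent setting.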
The backport of the fix to 9.0 has been merged in #113055.
Was this bug fixed in the SDK v9.0.203 release?
It was not. The fix is expected to ship in the 9.0.5 runtime. You can find it in the milestone field of the backport PR, #113055.
@jkotas, was this shipped in 9.0.5?
Yes, this shipped in 9.0.5.
Description
After upgrading to .NET 9, we have random processes which just freeze, becoming completely unresponsive.
The processes are run as Windows services on a Windows VM.
I have a dump file which I could send to you.
I would just rather not make it public, as it probably contains passwords.
Analyzing the dump indicates a possible problem in the garbage collector.
Reproduction Steps
Not possible. Happens randomly.
Expected behavior
Not freezing
Actual behavior
Process completely unresponsive
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response