Skip to content

[NativeAOT] Asynchronous thread suspension at GC safe points on Windows #70316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 9, 2022

Conversation

VSadov
Copy link
Member

@VSadov VSadov commented Jun 6, 2022

Support for asynchronous suspension at GC safe points on Windows.

  • ICodeManager::IsSafepoint API (Windows only, stubbed for Unix)
  • asynchronous suspension via redirection of OS-suspended threads
  • AVX support
  • ARM64 support (not tested, not enabled)

Contributes to: #67805

@VSadov VSadov requested a review from MichalStrehovsky as a code owner June 6, 2022 23:33
@ghost ghost assigned VSadov Jun 6, 2022
@VSadov VSadov requested a review from jkotas June 7, 2022 00:08
if (!frameIterator.IsValid())
return false;

frameIterator.CalculateCurrentMethodState();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why call CalculateCurrentMethodState? frameIterator is not used after this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was copied from InternalHijack initially. I now think the entire frameIterator deal is unnecessary here.

@VSadov
Copy link
Member Author

VSadov commented Jun 7, 2022

Hmm. tests seem to pass locally.

@jkotas
Copy link
Member

jkotas commented Jun 7, 2022

Hmm. tests seem to pass locally.

You may want to think about how to stress test this before it gets merged.

@VSadov
Copy link
Member Author

VSadov commented Jun 8, 2022

It was quite reproducible and simple and library tests catch the issues (I only ran smoke tests locally, so I did not see).

GC decoder expects an off-by-one skew for nonleaf frames and we used to apply the skew unconditionally. Now that we have leaf frames to scan, we need to be more intentional with that.

@VSadov
Copy link
Member Author

VSadov commented Jun 8, 2022

You may want to think about how to stress test this before it gets merged.

Running library tests in a loop seems to be a good stress for this feature. The tests themselves and xunit present all kinds of situations for stackwalks.

I have run the entire set 30 times in Release and Debug each.
I do not see any new failures, but there is an occasional assert that happens every 5 runs or so with or without this change. The typical scenario is when we have a lot of threads running and one throws cancellation exception, while another thread starts GC suspension.

Then we hit an assert in:

ASSERT_MSG(ThreadStore::GetSuspendingThread() == NULL, "Not allowed when suspended for GC.");

  Message: Not allowed when suspended for GC.
  Expression: 'ThreadStore::GetSuspendingThread() == NULL'


[0x0]   KERNELBASE!RaiseFailFastException + 0x152   
[0x1]   System_Runtime_Tests!PalRaiseFailFastException + 0x28   
[0x2]   System_Runtime_Tests!Assert + 0x156   
[0x3]   System_Runtime_Tests!Thread::GetTransitionFrameForStackTrace + 0x34   
[0x4]   System_Runtime_Tests!StackFrameIterator::InternalInitForStackTrace + 0x4d   
[0x5]   System_Runtime_Tests!RhpSfiInit + 0x3a   
[0x6]   System_Runtime_Tests!S_P_CoreLib_System_Runtime_RuntimeExports__RhpCalculateStackTraceWorker + 0x48   
[0x7]   System_Runtime_Tests!RhpGetCurrentThreadStackTrace + 0x33   
[0x8]   System_Runtime_Tests!S_P_CoreLib_System_Runtime_RuntimeExports__RhGetCurrentThreadStackTrace + 0x90   
[0x9]   System_Runtime_Tests!S_P_CoreLib_System_Diagnostics_StackTrace__InitializeForCurrentThread + 0x20   
[0xa]   System_Runtime_Tests!S_P_CoreLib_System_Diagnostics_StackTrace___ctor_0 + 0xf   
[0xb]   System_Runtime_Tests!S_P_CoreLib_System_Exception__SetCurrentStackTrace + 0x54   

I think the assert is bogus. We are in native code in coop mode, we certainly can scan our own stack. Besides, suspension may happen right after we check anyways.

I am going to remove the assert. Otherwise, I think this PR is good to merge.

@VSadov
Copy link
Member Author

VSadov commented Jun 8, 2022

Also CreateThreadLocalContentionCountObject can be reentrant on NativeAOT. Creating a sub-counter takes locks. If we contend again, we will assert and deadlock.

This is fairly rare (I saw it just once) and completely unrelated to this change since it is entirely in managed code.
I will log an issue to follow up with that separately.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@VSadov VSadov merged commit dc3a6ac into dotnet:main Jun 9, 2022
@VSadov VSadov deleted the nAotRedir branch June 9, 2022 00:28
@VSadov
Copy link
Member Author

VSadov commented Jun 9, 2022

Thanks!!

@VSadov
Copy link
Member Author

VSadov commented Jun 9, 2022

@LakshanF hangs in GC suspension should be fixed on Windows x64 now.

@VSadov VSadov mentioned this pull request Jun 15, 2022
9 tasks
@ghost ghost locked as resolved and limited conversation to collaborators Jul 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants