RtlFindNextForwardRunClear holds NTFS lock for multiple seconds #64
Comments
Thanks for filing here Bruce. |
Update: This bug has now been fixed internally and will work its way through our engineering system into a future OS release. Summary of the fix:
Thanks for reporting this issue. Glad to have fixed it! Will update this thread with details of which builds this fix will first arrive in. |
Any updates yet on what build this will be fixed in? We still regularly hit this issue when imaging new machines, so I'm assuming it's not live yet? |
I've sent a note to the team and am waiting on a status update. I'll post here as soon as I get an answer |
Any updates on when this will be out? |
a year later, any update? |
When will this be fixed? I have seen several incidents of this issue in the field and even have one repro. When will this be backported to all still-supported OS editions? |
Hey folks! I just pinged the team for an update on this. Thanks for your patience and I'll post here as soon as I hear back. |
Hey folks! It looks like this issue should have been fixed in Windows 11. If running Win11 and still experiencing the issue please comment and I can give these data points to the team. |
pov: your device doesn't support windows 11 with its high requirements |
So Windows Server 2022 should also not exhibit the issue? Is there a ticket ID I could reference for a backport to Server 2016? Is there any chance of that, or was the fix part of a complete rewrite of the storage subsystem, as happened for memory management or the UI subsystem? |
@bitcrazed: Could I get some information on which server OS versions have the fix? Would a backport to earlier OS versions be possible? This happens too often, with bad effects. |
Hey folks. To help us better understand the scope of impact of this issue, could you share where this is impacting you, how many users/machines, what workloads are impacted, etc? Many thanks in advance. |
This issue caused a severe performance impact at one of our customer sites. Multiple users were affected (<=10). |
At least for me this is affecting servers with large RAIDs where formatting the drive is not an option for production machines. If the issue keeps coming back it is a bad situation to be in. Which of these mitigations are actually helping?
It would be good to have some tool to trace the duration of NtfsFreeRecentlyDeallocated so one can easily check if the issue is there. Currently there is nothing except ETW profiling available. NTFS uses WPP where one would need to author a custom TMF file or have private symbols. |
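As a rough user-mode stand-in for such a tool, the sketch below times small synchronous writes and records any that take longer than a threshold. It is a sketch under stated assumptions, not a diagnostic for this specific bug: `probe_write_stalls`, its parameters, and the 0.5-second threshold are hypothetical names and values chosen for illustration, and a slow write cannot be attributed to NtfsFreeRecentlyDeallocated without an ETW trace.

```python
import os
import time


def probe_write_stalls(path, duration_s=10.0, threshold_s=0.5, chunk=b"x" * 4096):
    """Append small chunks to `path` and return a list of slow writes.

    Each entry is (wall_clock_timestamp, elapsed_seconds) for a write plus
    flush that took at least `threshold_s`. All names and defaults here are
    illustrative assumptions, not part of any Windows tooling.
    """
    stalls = []
    deadline = time.monotonic() + duration_s
    # buffering=0 gives an unbuffered binary file, so write() reaches the OS.
    with open(path, "wb", buffering=0) as f:
        while time.monotonic() < deadline:
            t0 = time.perf_counter()
            f.write(chunk)
            os.fsync(f.fileno())  # ask the OS to flush the data to disk
            elapsed = time.perf_counter() - t0
            if elapsed >= threshold_s:
                stalls.append((time.time(), elapsed))
            time.sleep(0.01)  # keep the probe itself lightweight
    return stalls
```

Pointing `path` at a scratch file on the affected volume and comparing the recorded timestamps with an ETW trace taken over the same period would show whether the slow writes line up with the lock being held.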
Can someone please have a look into the source code? |
I did a little digging with the dev team and confirmed it was fixed in Windows 11 and is fixed in Windows Server 2025 - you can validate the preview here: https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2025?msockid=0eceedf674a061483924f949751a6064. However, it is not fixed in Windows Server 2022. To backport, I would need customer impact information. If your company has an MS field contact, please have them contact me, or you can DM me on twitter/x @TheAdamBr if you don't feel comfortable sharing info publicly. |
@AdamBraden: The bug seems to be related to VSS (Volume Shadow Copy), which was not needed on that data drive. Disabling it solved the issue for us. But it was a long journey which needed several calls to get a definitive answer. On the C drive I have not seen this in the wild yet. |
So, the issue was found in October of 2019, it was formally reported in November of 2020 (shouldn't really be necessary, but okay), it was "fixed" promptly, but the fix doesn't ship to server SKUs (where it is needed most?) until late 2024. Given the severity of the bug, in some scenarios, this seems disappointingly slow. And communication could have been better - I still can't really tell where it is fixed and where it isn't. I'm glad it never affected me. I'm just the random Internet person who identified the problem using trace data from a third party. I guess I'm cynically wondering why I was needed to help resolve this process, and what this apparent need says about Microsoft's performance culture, and why Microsoft didn't find and fix the problem using their own trace data. |
@randomascii: Support was better some years ago, before MS turned all of its technical support staff (at least in Germany) into Cloud Solution Architects. In parallel, they outsourced support to countries with cheaper labor costs, such as Egypt. In the end the new staff call back to HQ in Seattle, where even more overworked people (at least that is my impression) are handling too many tickets. The support process usually boils down to
This script collects everything although some more streamlined |
In some cases on some machines if system restore is enabled then RtlFindNextForwardRunClear may end up spinning in a seven-instruction loop for multiple seconds while holding a lock. This prevents basic operations like WriteFile from completing. In the case where this was first hit this caused a 64-processor machine to grind to a halt, repeatedly.
A full explanation can be found here:
https://randomascii.wordpress.com/2019/10/20/63-cores-blocked-by-seven-instructions/
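To make the failure mode concrete, here is a minimal sketch of a forward scan for the next run of clear bits in a bitmap, performed while holding a lock. It is not the Windows implementation: the function names, the bit-by-bit loop, the least-significant-bit-first layout, and `volume_lock` are all assumptions made for illustration. The point is only that the scan time grows with the number of set bits it has to skip, and every other thread that needs the lock waits for the whole scan.

```python
import threading


def bit_is_set(bitmap: bytes, index: int) -> bool:
    # Assumed layout: bit 0 is the least significant bit of byte 0.
    return (bitmap[index >> 3] >> (index & 7)) & 1 == 1


def find_next_forward_run_clear(bitmap: bytes, nbits: int, from_index: int):
    """Return (start, length) of the first run of clear bits at or after
    from_index. The length is 0 if no clear bit is found."""
    i = from_index
    # Skip over set bits; this is the part that is slow on a nearly full bitmap.
    while i < nbits and bit_is_set(bitmap, i):
        i += 1
    start = i
    # Measure the run of clear bits that follows.
    while i < nbits and not bit_is_set(bitmap, i):
        i += 1
    return start, i - start


# Hypothetical lock standing in for the one held during the scan.
volume_lock = threading.Lock()


def scan_under_lock(bitmap: bytes, nbits: int, from_index: int):
    # Anything else that needs volume_lock is blocked until the scan returns.
    with volume_lock:
        return find_next_forward_run_clear(bitmap, nbits, from_index)
```

For a 16-bit bitmap `bytes([0x0F, 0xF0])`, bits 0 to 3 and 12 to 15 are set, so a scan from index 0 finds the clear run starting at bit 4 with length 8.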
I have heard that this bug has been fixed but was asked to file an issue to formally track it:
https://twitter.com/richturn_ms/status/1330947602129448961