RtlFindNextForwardRunClear holds NTFS lock for multiple seconds #64

Open
randomascii opened this issue Nov 23, 2020 · 20 comments

@randomascii

In some cases, on some machines, if System Restore is enabled, RtlFindNextForwardRunClear may end up spinning in a seven-instruction loop for multiple seconds while holding a lock. This prevents basic operations like WriteFile from completing. Where this was first hit, it caused a 64-processor machine to grind to a halt, repeatedly.

A full explanation can be found here:

https://randomascii.wordpress.com/2019/10/20/63-cores-blocked-by-seven-instructions/

I have heard that this bug has been fixed but was asked to file an issue to formally track it:
https://twitter.com/richturn_ms/status/1330947602129448961

@bitcrazed
Contributor

Thanks for filing here Bruce.

@bitcrazed
Contributor

Update: This bug has now been fixed internally and will work its way through our engineering system into a future OS release.

Summary of the fix:

Fixed perf bug in VspQueryCopyFreeBitmap.

Bruce Dawson on the Google Chrome team pointed out a bug where Chromium builds had several hiccups in I/O. Others have also hit the same bug.
The root cause is that the search for free regions of the volsnap CoW bitmap was incorrectly unbounded and could take multiple seconds on a 1TB drive.
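
To illustrate the root cause described in this summary, here is a minimal sketch of a forward search for a run of clear bits, with an optional cap on how many bits are examined. This is not the volsnap or Rtl code: the function name `find_next_clear_run`, the `max_scan` parameter, and the list-of-booleans bitmap are all hypothetical, chosen only to show why an unbounded scan over a mostly-set bitmap costs time proportional to the bitmap size.

```python
from typing import Optional, Sequence, Tuple


def find_next_clear_run(
    bitmap: Sequence[bool],
    start: int,
    max_scan: Optional[int] = None,
) -> Tuple[Optional[Tuple[int, int]], int]:
    """Search forward from `start` for the next run of clear (False) bits.

    Returns ((run_start, run_length), bits_examined), or
    (None, bits_examined) if no clear bit was found within the scan budget.
    `max_scan` is a hypothetical cap; None means the search is unbounded.
    """
    end = len(bitmap)
    if max_scan is not None:
        end = min(end, start + max_scan)

    i = start
    # Skip set bits. With no cap, this loop can walk the entire bitmap.
    while i < end and bitmap[i]:
        i += 1
    if i >= end:
        return None, max(i - start, 0)

    # Measure the clear run that starts here.
    run_start = i
    while i < end and not bitmap[i]:
        i += 1
    return (run_start, i - run_start), i - start
```

On a bitmap of one million set bits followed by eight clear bits, the unbounded call examines every bit before it finds the run, while a call with `max_scan=4096` gives up after 4096 bits and lets the caller decide what to do next. If such a search runs while a lock is held, the cap also bounds how long the lock is held.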

Thanks for reporting this issue. Glad to have fixed it!

Will update this thread with details of which builds this fix will first arrive in.

@zjturner

zjturner commented Apr 6, 2021

Any updates yet on what build this will be fixed in? We still regularly hit this issue when imaging new machines, so I'm assuming it's not live yet?

@AvriMSFT
Contributor

I've sent a note to the team and am waiting on a status update. I'll post here as soon as I get an answer.

@klauroblox

Any updates on when this will be out?

@ghost

ghost commented Dec 21, 2021

a year later, any update?

@Alois-xx

When will this be fixed? I have seen several incidents of this issue in the field and even have one repro. When will this be backported to all still-supported OS editions?

@AvriMSFT
Contributor

Hey folks! I just pinged the team for an update on this. Thanks for your patience and I'll post here as soon as I hear back.

@AvriMSFT
Contributor

AvriMSFT commented Feb 1, 2022

Hey folks! It looks like this issue should have been fixed in Windows 11. If running Win11 and still experiencing the issue please comment and I can give these data points to the team.

@thatsofia

pov: your device doesn't support windows 11 with its high requirements

@Alois-xx

Alois-xx commented Feb 2, 2022

> Hey folks! It looks like this issue should have been fixed in Windows 11. If running Win11 and still experiencing the issue please comment and I can give these data points to the team.

So Windows Server 2022 should also not exhibit the issue? Is there a ticket ID I could reference for a backport to Server 2016? Is there any chance of that, or was this fix part of a complete rewrite of the storage subsystem, as happened for memory management and the UI subsystem?

@AloisKraus

@bitcrazed: Could I get some information on which server OS versions have the fix? Would a backport to earlier OS versions be possible? The issue happens too often, with bad effects.

@bitcrazed
Contributor

Hey folks. To help us better understand the scope of impact of this issue, could you share where this is impacting you, how many users/machines, what workloads are impacted, etc?

Many thanks in advance.

@karthikkbgl

> Hey folks. To help us better understand the scope of impact of this issue, could you share where this is impacting you, how many users/machines, what workloads are impacted, etc?
>
> Many thanks in advance.

This issue caused a severe performance impact at one of our customer sites. Multiple users were affected (<=10).

@AloisKraus

At least for me this is affecting servers with large RAIDs where formatting the drive is not an option for production machines. If the issue keeps coming back it is a bad situation to be in.

Which of these mitigations are actually helping?

  • Disable Shadow Space
    • vssadmin delete shadowstorage /for=E:
  • Run DevNodeClean
  • Disable TRIM
    • fsutil behavior set DisableDeleteNotify 1

It would be good to have some tool to trace the duration of NtfsFreeRecentlyDeallocated so one can easily check if the issue is there. Currently there is nothing except ETW profiling available. NTFS uses WPP where one would need to author a custom TMF file or have private symbols.
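
Short of an ETW trace, one rough way to at least notice the symptom is to time small synced writes on the affected volume and flag the slow ones. The sketch below does only that. It cannot attribute a stall to NtfsFreeRecentlyDeallocated or to this bug, so ETW profiling is still needed to confirm the cause. The function name `probe_write_latency` and the sample count, interval, and 0.5-second threshold are arbitrary assumptions, not values from this thread.

```python
import os
import tempfile
import time
from typing import List, Tuple


def probe_write_latency(
    directory: str,
    samples: int = 100,
    interval_s: float = 0.05,
    threshold_s: float = 0.5,
) -> Tuple[List[float], List[float]]:
    """Time small synced writes to a temporary file in `directory`.

    Returns (all_latencies, stalls), where stalls are the latencies
    at or above `threshold_s` seconds.
    """
    latencies: List[float] = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="probe_", suffix=".tmp")
    try:
        for _ in range(samples):
            t0 = time.perf_counter()
            os.write(fd, b"x" * 4096)
            os.fsync(fd)  # push the write through to the file system
            latencies.append(time.perf_counter() - t0)
            time.sleep(interval_s)
    finally:
        # Always clean up the probe file, even if a write fails.
        os.close(fd)
        os.remove(path)
    stalls = [t for t in latencies if t >= threshold_s]
    return latencies, stalls
```

Running it against a directory on the suspect data drive, for example `probe_write_latency("E:\\")`, while the workload is active gives a quick yes/no signal on multi-second write stalls before investing in a full trace.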

@AndreasDiet

Can someone please have a look into the source code?
Which build number did the fix go into?
Server 2022 is build number 20348, which is a Windows 10 build.
Windows 11 build numbers are 22000 and above.

@AdamBraden
Collaborator

I did a little digging with the dev team and confirmed it was fixed in Windows 11 and is fixed in Windows Server 2025 - you can validate the preview here: https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2025?msockid=0eceedf674a061483924f949751a6064.

However, it is not fixed in Windows Server 2022. To backport, I would need customer impact information. If your company has an MS field contact, please have them contact me, or you can DM me on twitter/x @TheAdamBr if you don't feel comfortable sharing info publicly.
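
For anyone who wants to check a machine against the statements in this thread, here is a small sketch that compares the running build number with the first Windows 11 build. It rests on an assumption that nobody in this thread has confirmed: that every build from 22000 upward carries the fix. The 26100 build number for Windows Server 2025 mentioned in the comments is also not taken from this thread, and the helper names `current_build` and `fix_expected` are hypothetical.

```python
import sys
from typing import Optional

# First Windows 11 build number. Server 2022 (build 20348) is below this
# and is reported in this thread as NOT fixed. Windows Server 2025
# (build 26100, an outside fact, not from this thread) is above it.
WINDOWS_11_FIRST_BUILD = 22000


def current_build() -> Optional[int]:
    """Return the running Windows build number, or None on other platforms."""
    if sys.platform != "win32":
        return None
    return sys.getwindowsversion().build


def fix_expected(build: int) -> bool:
    """Assumption: builds at or above the first Windows 11 build have the fix."""
    return build >= WINDOWS_11_FIRST_BUILD


if __name__ == "__main__":
    build = current_build()
    if build is None:
        print("Not running on Windows")
    else:
        print(f"Build {build}: fix expected = {fix_expected(build)}")
```

A result of `False` only means the build predates Windows 11. Whether a given servicing update backports the fix to an older build is exactly the open question in this thread.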

@AloisKraus

@AdamBraden: The bug seems to be related to VSS (Volume Shadow Copy), which was not needed on that data drive. Disabling it solved the issue for us. But it was a long journey that needed several calls to get a definitive answer. On the C drive I have not seen this in the wild yet.
Thanks for confirming that the issue is finally solved on Windows Server 2025 and Windows 11.

@randomascii
Author

So, the issue was found in October of 2019, it was formally reported in November of 2020 (shouldn't really be necessary, but okay), it was "fixed" promptly, but the fix doesn't ship to server SKUs (where it is needed most?) until late 2024.

Given the severity of the bug, in some scenarios, this seems disappointingly slow. And communication could have been better - I still can't really tell where it is fixed and where it isn't.

I'm glad it never affected me. I'm just the random Internet person who identified the problem using trace data from a third party. I guess I'm cynically wondering why I was needed to help resolve this process, and what this apparent need says about Microsoft's performance culture, and why Microsoft didn't find and fix the problem using their own trace data.

@AloisKraus

@randomascii: Support was better some years ago, before MS turned all of its technical support staff (at least in Germany) into Cloud Solution Architects. In parallel, they outsourced support to, e.g., Egypt and other countries where labor costs are lower. In the end the new guys call back to HQ in Seattle, where now even more overworked (at least that is my impression) guys are handling too many tickets.

The support process usually boils down to

  1. Repro issue
  2. Download https://learn.microsoft.com/en-us/troubleshoot/windows-client/windows-tss/introduction-to-troubleshootingscript-toolset-tss
  3. Run issue under this tool which takes hours to complete, or never completes.
  4. Send data back and hope for the best.

This script collects everything, although more streamlined settings could collect the data much faster (especially ETW), and you can never talk to the actual guy who looks at the data.
