Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows - wait to file to unlock for removal #10629

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cderv
Copy link
Collaborator

@cderv cderv commented Aug 28, 2024

while removing dir or file, it happens that EBUSY error could happen with file being used and lock. This is a simple way to wait for file to be unlocked, with a fail safe of max try to avoid infinite loop.

On windows, while working in VS CODE with Quarto Dev version, I often encounter this error

ERROR: Le processus ne peut pas accéder au fichier car ce fichier est utilisé par un autre processus. (os error 32): remove 'C:\Users\chris\AppData\Local\Temp\quarto\test-subdir\about_files\mediabag'

Example of stacktrace when this happens

Stack trace:
    at Object.removeSync (ext:deno_fs/30_fs.js:250:3)
    at safeRemoveSync (file:///C:/Users/chris/Documents/DEV_R/quarto-cli/src/core/path.ts:51:10)
    at removeIfEmptyDir (file:///C:/Users/chris/Documents/DEV_R/quarto-cli/src/core/path.ts:67:7)
    at renderCleanup (file:///C:/Users/chris/Documents/DEV_R/quarto-cli/src/command/render/cleanup.ts:106:3)
    at file:///C:/Users/chris/Documents/DEV_R/quarto-cli/src/command/render/render.ts:363:11
    at withTiming (file:///C:/Users/chris/Documents/DEV_R/quarto-cli/src/core/timing.ts:37:20)
    at Object.complete (file:///C:/Users/chris/Documents/DEV_R/quarto-cli/src/command/render/render.ts:362:9)
    at eventLoopTick (ext:core/01_core.js:153:7)
    at async Object.onPostProcess (file:///C:/Users/chris/Documents/DEV_R/quarto-cli/src/command/render/render-files.ts:720:28)
    at async renderFileInternal (file:///C:/Users/chris/Documents/DEV_R/quarto-cli/src/command/render/render-files.ts:689:3)

Debugging this, I found this is due to EBUSY code from Deno removeSync(), and means somehow file is locked on windows and prevents Deno to remove it.

This is quite annoying so I wanted to find a solution. My guess is that this is race constraint, and file will be unlock quickly but not quickly enough for Deno to remove it safely.

Here is a suggestion to add a retry until this works. max try is high because each retry is very quick and it happens I encounter the issue even with 100 retry.

What do you think of something like this ?

Other way would be to use await Deno.remove() and see if this has the same issue as Deno.removeSync(). If not, we could use it, but we need to replace everywhere with non-Sync version which is bigger change and impact

Copy link
Collaborator

@cscheid cscheid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to do it, I think we should do this in an async version of the function, and add some timeouts while this is going on.

@cderv
Copy link
Collaborator Author

cderv commented Aug 28, 2024

That was my other solution, but it felt bigger change as safeRemoveSync is used in several functions that would all become async (IIUC). I can also first try apply this in where I think the problem is (removeIfEmptyDir() is my first guess).

I saw we have a withRetry() function which does add a timeout

export async function withRetry<T = void>(
fn: () => Promise<T>,
options?: RetryOptions,
) {
const {
attempts = 10,
minWait = 1000,
maxWait = 4000,
retry = () => true,
} = (options || {});
let attempt = 0;
while (true) {
try {
return fn();
} catch (err) {
if ((attempt++ >= attempts) || (retry && !retry(err))) {
throw err;
}
const delay = minWait + Math.floor(Math.random() * (maxWait - minWait));
await sleep(delay);
}
}
}

So I was going to use that, but I went the other way

@cderv cderv marked this pull request as draft August 28, 2024 17:29
@cscheid
Copy link
Collaborator

cscheid commented Aug 28, 2024

That was my other solution, but it felt bigger change as safeRemoveSync is used in several functions that would all become async (IIUC).

You're probably right that the change is too big. It's not a problem for sync functions to become async, but it's indeed annoying.

Do you have a reproducible example where this fixes the problem? Just to see the impact, could you

  • print out the number of times we go through the loop before succeeding
  • the total time that takes? performance.now() is the JS function to get the number of milliseconds since the process started, so you could do something like const before = performance.now(); try {}... console.log(performance.now() - before)

@cderv
Copy link
Collaborator Author

cderv commented Aug 28, 2024

Do you have a reproducible example where this fixes the problem?

Yes I had one today that allowed me to drive this solution. I'll rerun it but I remember it was something ~50, other time ~70 and sometimes more than 100.

I'll add the performance: thanks for the trick - I did not know about this JS code.

while removing dir or file, it happens that EBUSY error could happen with file being used and lock. This is a simple way to wait for file to be unlocked, with a fail safe of max try to avoid infinite loop.
@cderv
Copy link
Collaborator Author

cderv commented Aug 30, 2024

print out the number of times we go through the loop before succeeding

I did 10 try to see number we get

Number of attempts to remove / total duration

  1. 182 / 44ms
  2. 164 / 38ms
  3. 148 / 34ms
  4. 300 / 70ms
  5. 74 / 18ms
  6. 299 / 70ms
  7. 74 / 20ms
  8. 86 / 20ms
  9. 166 / 38 ms
  10. 230 / 52ms

So it is pretty quick, and we are talking about some ms delay between Deno trying to remove and file being locked as OS busy on it.

Does it help you assess the impact ?

@cderv cderv requested a review from cscheid August 30, 2024 09:24
@cscheid
Copy link
Collaborator

cscheid commented Aug 30, 2024

Does it help you assess the impact ?

It does. ~50ms is not that fast, though. I'd hate to add a 50ms wait at every call to safeRemove. Do you have a sense of how often this happens?

I'm now a bit more inclined to make this call async, so that we have a chance to (eventually) hide this latency in the rest of Quarto's processing. If we made this call async, then we could have a timeout instead of max number of tries. (the algorithm would be: pause for 10ms, try again, and give up after, say, 100ms).

@cderv
Copy link
Collaborator Author

cderv commented Aug 30, 2024

Do you have a sense of how often this happens?

On the project I manage to make a reproducible example, It happens at every render to try clean the mediabag folder.

I wonder why this happens. I'll do some more testing, but maybe this is only happening to because I am running dev quarto while the source code is also open in VSCODE, which does a bunch of scanning of all the folders. I'll investigate this route, because maybe I scope the fix more, or even push my research with some VSCODE tweaks config instead.

Anyhow, I am not in a hurry - I can leave with it, though I get some failed rendering quite often in this scenario.

(the algorithm would be: pause for 10ms, try again, and give up after, say, 100ms).

For the give up part, just so it's clear, currently it will make the whole quarto render fails because Deno return an error while not being able to remove this empty directory.
So maybe give up could mean here - leave the directory and not delete it instead of just failing completely. It is not that critical to leave a folder behind that would justify a failure.

@cscheid
Copy link
Collaborator

cscheid commented Aug 30, 2024

So maybe give up could mean here - leave the directory and not delete it instead of just failing completely. It is not that critical to leave a folder behind that would justify a failure.

In that case, though, this would need to be an option. We don't want a timeout to silently fail to remove a file by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants