Fix file write concurrency #578

Open
wants to merge 2 commits into
base: master

Conversation

JulianPinzaru

Fixes #565.

Proposed changes

Use Python's built-in fcntl module to lock a file while data is being written to it. This prevents multiple processes from writing to the same file simultaneously and erasing each other's data.

return written_characters


def append_to_file(filepath=None, data=None, mode='a', **kwargs):
Contributor


Since this is strictly an append, it might be good to take out the mode kwarg and just pass mode='a' to util.write_to_file.

Comment on lines +14 to +25
def write_to_file(filepath=None, data=None, mode='w', **kwargs) -> int:
    '''
    Concurrency safe function for writing data to the file.
    Param mode: file open mode ('a' for appending, 'w' for writing)
    '''
    assert mode == 'w' or mode == 'a'

    with open(filepath, mode, **kwargs) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        written_characters = f.write(data)
        fcntl.flock(f, fcntl.LOCK_UN)
    return written_characters
Contributor


This question comes from my lack of knowledge:
If another process tries to access the file while it is locked, does that raise an IOError, or does it automatically wait for the lock to be released?
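For reference on the question above: a plain fcntl.flock(f, fcntl.LOCK_EX) call waits until the lock is released, and only raises if fcntl.LOCK_NB is OR-ed into the flags. The lock is also advisory, so a process that never calls flock can still open and write the file. Below is a minimal sketch of the non-blocking case; the helper name second_lock_would_block is hypothetical and not part of this PR, and it uses two separate open() calls in one process to stand in for two processes.

```python
import fcntl


def second_lock_would_block(path):
    """Return True if a second exclusive flock on `path` cannot be taken at once.

    Hypothetical helper for illustration only (not part of this PR).
    """
    holder = open(path, 'a')
    fcntl.flock(holder, fcntl.LOCK_EX)  # first lock: granted immediately
    contender = open(path, 'a')  # independent open file description
    try:
        # LOCK_NB turns "wait for the lock" into "raise right away".
        fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
        would_block = False
    except OSError:
        # Typically a BlockingIOError, which is a subclass of OSError.
        would_block = True
    finally:
        contender.close()
        fcntl.flock(holder, fcntl.LOCK_UN)
        holder.close()
    return would_block
```

Without LOCK_NB, the second flock call in this sketch would simply wait, which is what the code under review does.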

    with open(filepath, mode, **kwargs) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        written_characters = f.write(data)
        fcntl.flock(f, fcntl.LOCK_UN)
Contributor


Maybe worth explicitly closing the file.

@bmblumenfeld
Contributor

@JulianPinzaru awesome work! I left a couple comments. Should be good to go after addressing those! Let me know if you have any questions!

@youngj
Contributor

youngj commented Feb 27, 2020

I did some tests with this approach. Although it ensures that one process finishes writing before the other process starts, the output file could still contain data from both processes if the first process writes more data than the second process.

Here's an example script to show how this can happen:

foo.py:

import argparse
import fcntl
import time

def write_to_file(filepath=None, data=None, mode='w', **kwargs) -> None:
    # Repro helper: write `data` many times while holding an exclusive flock.
    with open(filepath, mode, **kwargs) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        for _ in range(1, 1000):
            f.write(data)
        # Keep the lock held long enough for a second process to open the file.
        time.sleep(3)
        for _ in range(1, 1000):
            f.write(data)
        fcntl.flock(f, fcntl.LOCK_UN)

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('--foo', required=True, help='foo')
    args = parser.parse_args()

    # Each line of foo.txt is the --foo value.
    write_to_file('foo.txt', args.foo + '\n')

Test two processes writing at once; the first process that opens the file writes more data than the second:
(python foo.py --foo=aaaaaaaaaaaaaaaaaaaaaa &); sleep 1; python foo.py --foo=b

The output file contains a bunch of b's followed by a bunch of \0 characters, followed by a bunch of a's.
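The reason is that open() with mode 'w' truncates the file before flock is ever called, so the second writer empties the file while the first writer still holds the lock and keeps writing at its old offset. A minimal single-process sketch of the same sequence follows; the function name reproduce_stale_data and the tiny payloads are made up for illustration, and two file objects stand in for the two processes.

```python
import fcntl


def reproduce_stale_data(path):
    """Replay the two-writer sequence with two handles; return the file's bytes.

    Hypothetical helper for illustration only (not part of this PR).
    """
    first = open(path, 'w')
    fcntl.flock(first, fcntl.LOCK_EX)
    first.write('a' * 10)
    first.flush()  # first writer's offset is now 10

    # The second writer opens the file while the first still holds the lock.
    # Mode 'w' truncates here, before any lock is requested.
    second = open(path, 'w')

    first.write('a' * 10)
    first.flush()  # lands at offset 10 of the now-empty file, leaving a gap
    fcntl.flock(first, fcntl.LOCK_UN)
    first.close()

    # Only now can the second writer take the lock; it writes from offset 0.
    fcntl.flock(second, fcntl.LOCK_EX)
    second.write('bbb')
    second.flush()
    fcntl.flock(second, fcntl.LOCK_UN)
    second.close()

    with open(path, 'rb') as f:
        return f.read()
```

On a typical POSIX filesystem this returns b'bbb', then seven \0 bytes, then ten a's, which matches the shape of the output described above.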

Instead of using flock, I think the best approach is to write a temporary file with a unique name in the same directory, then rename it to the desired filename when it is complete. With this approach, flock isn't necessary because each writer has a unique filename to write to, and the filesystem will ensure that the rename is atomic.

This approach also ensures that read_from_file will always see a consistent version of the file without needing to acquire a lock even if it is being written to concurrently by another process.
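A minimal sketch of that temp-file-then-rename idea, assuming a POSIX filesystem and that the temporary file lives in the same directory as the target. The name atomic_write_to_file and the '.tmp-' prefix are hypothetical choices, not code from this PR.

```python
import contextlib
import os
import tempfile


def atomic_write_to_file(filepath, data, **kwargs):
    """Write `data` to a uniquely named temp file, then rename it over `filepath`.

    Hypothetical sketch for illustration only (not part of this PR).
    """
    directory = os.path.dirname(os.path.abspath(filepath))
    # mkstemp picks a unique name, so concurrent writers never share a file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix='.tmp-')
    try:
        with open(fd, 'w', **kwargs) as f:
            written_characters = f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, filepath)
    except BaseException:
        # Do not leave a partial temp file behind on failure.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    return written_characters
```

One thing to check before adopting something like this: mkstemp creates the file readable and writable only by the owner, so the final file keeps those permissions unless they are changed before the rename.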

@hathix hathix self-requested a review June 5, 2020 12:37
Member

@hathix hathix left a comment


Please review feedback from this thread.


Successfully merging this pull request may close these issues.

Avoid concurrency errors when saving cache files
4 participants