Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add pidlock to public functions #45

Merged
merged 13 commits into from
Feb 9, 2025

Conversation

willow-ahrens
Copy link
Contributor

@willow-ahrens willow-ahrens commented Jan 6, 2025

This PR locks the basic functions of pyjuliapkg. I used FileLock instead of pidlock because I needed a reentrant lock. I also used a thread lock. There is a test which I have verified does show that this fix avoids redundant resolves. fixes #19.

@amilsted
Copy link
Contributor

Do you know if this will work on a cluster if the shared filesystem does not support unix file locking? The "pidfile" approach, which stores the hostname and process ID of the acquirer in the lock file, does not depend on filesystem support for locking. I don't think filelock uses this approach, so I'm a bit worried about robustness.

@willow-ahrens
Copy link
Contributor Author

Filelock advertises itself as platform-independent, and has windows and unix specific functions. I haven't been able to test this on windows, however.

@willow-ahrens
Copy link
Contributor Author

One question I have about this PR is where to write the lock. I have it being written to the source directory of pyjuliapkg, but I suspect it would be better to use some location which is specified by the STATE dict. I just don't know which one makes sense to use here.

@amilsted
Copy link
Contributor

Filelock advertises itself as platform-independent, and has windows and unix specific functions.

It can be platform independent and still not work well across multiple hosts on a shared filesystem.

@willow-ahrens
Copy link
Contributor Author

I see what you're asking. The filelock library seems to advise the use of the SoftFileLock type in the event that there are multiple hosts, but this may require sometimes deleting the lock file when someone interrupts the setup. The julia developers seem to have had similar concerns in the design of Pkg, and ultimately I'm not sure of the extent to which this is solved. The solution in this PR should fix bugs on not-shared filesystems, and I'm open to input on how to ensure this also fixes bugs on shared ones. I chose the PID file package off of PyPI with the greatest support that offered a simple reentrant lock, but I'm open to suggestions on others?

@amilsted
Copy link
Contributor

amilsted commented Jan 16, 2025

Yeah, I understand this is a nontrivial choice. I shopped around for a Python package that ticked all the boxes (basically replicating the Julia solution) but didn't find one that worked well out of the... well, box. Any locking is definitely better than none here and I appreciate your taking this on!

Edit: And I don't mean to sound like I'm a maintainer here. I'm not! Just a user quite invested in making it easier to have python packages that use julia (including for use on clusters).

@willow-ahrens
Copy link
Contributor Author

willow-ahrens commented Jan 16, 2025

Looking at package managers in python:

@amilsted
Copy link
Contributor

In case it's useful, I used a somewhat modified version of pidlock to externally lock juliapkg operations. This worked okay on a cluster, but it again probably doesn't do everything right. For one thing, it's missing a mechanism to delete stale lock files.

Also the github version has a defect, in that it checks for file existence and then, if the file exists, opens the file without exception handling. The delay between existence check and open allows other processes or hosts to delete the file and cause a crash. This can be fixed through handling exceptions on open instead of checking for existence separately.

@amilsted
Copy link
Contributor

Looks like the conda approach is quite similar to pidlock, but using a directory instead of a file. It also doesn't seem to include the hostname at the moment. Maybe worth stealing?

@willow-ahrens
Copy link
Contributor Author

willow-ahrens commented Jan 23, 2025

I'm happy to choose whichever pidlock we would like, but it seems out of scope to try to do better than e.g. virtualenv. @cjdoris, do you have a preference for which file lock you prefer? Can we take on the maintenance burden of maintaining our own lock here, or is this best effort choice of FileLock satisfactory?

@willow-ahrens
Copy link
Contributor Author

willow-ahrens commented Jan 23, 2025

I suppose I would also like to see an example of a failing test with the current setup before I roll my own file lock, so that we can verify that a different choice of file lock fixes the test. @amilsted, do you have additional tests to contribute to the PR?

@amilsted
Copy link
Contributor

Very fair. This is tricky, because I think it's hard to break FileLock on a single host and a non-network filesystem. I will think about it. I suppose one could argue that juliapkg just shouldn't be in online mode when running on distributed systems like HPC clusters, so that these cases should not be considered relevant for the locking here. Perhaps one should just always set PYTHON_JULIAPKG_OFFLINE=yes when running a batch job that would load juliacall in parallel on many hosts.

@amilsted
Copy link
Contributor

Counterpoint: Julia core uses https://github.com/vtjnash/Pidfile.jl to do hostname/PID-based locking to protect precompilation and Pkg operations, because this type of locking is robust on HPC clusters.

@willow-ahrens
Copy link
Contributor Author

willow-ahrens commented Jan 29, 2025

I can't import Pidlock.jl because it's written in Julia, and I don't have the resources to implement and maintain a new pidfile solution and test that it works on HPC clusters, at least in the near term. I'd like this PR to merge because my package depends on pyjuliapkg, causing an error in multithreaded pytest in CI.

FileLock is good enough for virtualenv, could we use it for now for the purposes of this PR? You bring up several good issues regarding the robustness of PID locking in cluster environments. I've filed a separate issue to track HPC cluster file locking issues here: #46

I propose to use FileLock as a best-effort initial implementation that fixes demonstrated bugs, and open an issue to improve the lock software as we find development effort. We can also, if you wish, add a warning about running in offline mode in networked environments.

@cjdoris
Copy link
Collaborator

cjdoris commented Feb 9, 2025

Hiya, thanks for the PR! I have taken the liberty of making some changes, the main ones being:

  • Only add locking to resolve() - all the other functions are either internal or are only intended to be called interactively, so don't need locking.
  • Only do process-locking, not thread-locking - this package does not promise to be thread-safe, you should only call its functions from one Python thread.
  • Move the lock file to somewhere project-specific.
  • Print a message if waiting for the lock for more than 3 seconds informing the user why resolving is paused.

Thanks for adding the test.

I'll leave this PR open to comments before merging.

@willow-ahrens
Copy link
Contributor Author

willow-ahrens commented Feb 9, 2025

Thank you for making these changes! I have one question: if the user calls add or rm programatically on startup, should they be using their own process lock? For example, should the following behavior be considered unsafe? https://github.com/finch-tensor/finch-tensor-python/blob/38d0df4138f663c67010296172c77bb933fbaaa7/src/finch/julia.py#L6-L29

(Note: the above failed in multithreaded contexts on julia installation during resolve, the add and rm weren't causing issues, but now I'm beginning to be convinced that I should avoid modifying the dependencies at runtime, and I should add a install-time way to set those variables).

@cjdoris
Copy link
Collaborator

cjdoris commented Feb 9, 2025

Yeah, you're not expected to use add when loading a module. Dependencies should be declared entirely in juliapkg.json files.

You can do what you do there but it will require an extra resolve() in addition to the one that happens when juliacall is loaded. And yeah if you're expecting this to be used in a distributed setting you should put a lock around it.

I see you allow the user to override where Finch is installed from. Is this purely for dev purposes? There is no mechanism currently for dev dependencies or dynamic dependencies. What I do in juliacall is have a separate juliapkg-dev.json that I copy into juliapkg.json when developing.

@willow-ahrens
Copy link
Contributor Author

Thanks! This makes sense, I'll adopt that workflow as well.

@cjdoris cjdoris merged commit 6b93da5 into JuliaPy:main Feb 9, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible concurrency issues with resolve on a cluster
3 participants