Dangerous race condition? #166

Open
bitc opened this issue Mar 8, 2022 · 6 comments

bitc commented Mar 8, 2022

I was reading the internal concurrency guidelines of cabal-install, located here:

https://github.com/haskell/cabal/blob/df6de322c5f36dae48282ebae55985935eaced37/cabal-install/src/Distribution/Client/Store.hs#L53

This paragraph in particular looks very important, regarding multiple concurrent builds of the same package:

Note that because builds are not reproducible in general (nor even necessarily ABI compatible) then it is essential that the loser abandon their build and use the one installed by the winner, so that subsequent packages are built against the exact package from the store rather than some morally equivalent package that may not be ABI compatible.

The regular cabal-install uses locking in order to protect against concurrent builds, but cabal-cache does not do any locking.

Therefore (if I am not mistaken) there is a danger that we can violate the above rule and get corrupt builds.

Here is an example scenario:

  1. S3 bucket is empty
  2. Builder 1 locally builds package A₁
  3. Builder 2 locally builds package A₂ (same package-id as A₁)
  4. Builder 1 uploads package A₁ to the bucket
  5. Builder 3 downloads package A₁ from bucket
  6. Builder 3 builds package B₃ (which depends on A₁)
  7. Builder 3 uploads package B₃ to the bucket
  8. Builder 2 uploads package A₂ to the bucket (overwriting A₁)
  9. Builder 4 downloads A₂ and B₃ from the bucket and now gets a corrupt build: B₃ was built against A₁ but is now being used against A₂. A₁ and A₂ are supposed to be "morally equivalent" but, as the quote above explains, "may not be ABI compatible". (A toy sketch of the resulting bucket state follows below.)
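To make the failure mode concrete, here is a toy model (not cabal-cache code) of the bucket as a plain key-value map where every upload is an unconditional, last-write-wins put, which is essentially what a bare S3 PutObject gives you:

```haskell
import qualified Data.Map.Strict as Map

-- Toy model: keys are package-ids, values record which builder's artefact
-- is stored. An upload is an unconditional put, so the last writer wins.
type Bucket = Map.Map String String

upload :: String -> String -> Bucket -> Bucket
upload pid artefact = Map.insert pid artefact

-- Steps 4, 7 and 8 of the scenario above (composition applies right to left):
scenario :: Bucket
scenario =
    upload "A" "A2"                      -- step 8: silently overwrites A1
  . upload "B" "B3 (built against A1)"   -- step 7
  . upload "A" "A1"                      -- step 4
  $ Map.empty
-- => fromList [("A","A2"),("B","B3 (built against A1)")]
--    B3 was linked against A1, but only A2 remains: the pair is inconsistent.
```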

newhoggy commented Mar 9, 2022

Thanks for the detailed report.

I appreciate the concern, but I don't know that there is a solution to this outside of implementing cache functionality directly in cabal.

If it's any reassurance, I've not seen the concurrency bug cause a problem for the entire time I've used the tool.

But if it remains a concern, the CI script can mitigate it with additional measures, for example by refusing to upload the cache if it has been updated since the last restore.

A coarse-grained lock would add protection at the cost of more pessimistic builds.
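For example, here is a minimal sketch of that guard, assuming the script records the archive's S3 ETag at restore time and looks it up again just before uploading; how the ETags are obtained (e.g. an S3 HEAD request) is left out, and only the decision logic is shown:

```haskell
-- Sketch of the "refuse to upload if the cache changed since the last
-- restore" guard.
data UploadDecision = Upload | SkipUpload
  deriving (Show, Eq)

decideUpload
  :: Maybe String     -- ETag recorded when the cache was restored (Nothing if the key was absent)
  -> Maybe String     -- ETag observed just before uploading
  -> UploadDecision
decideUpload etagAtRestore etagNow
  | etagAtRestore == etagNow = Upload      -- nothing changed underneath us
  | otherwise                = SkipUpload  -- someone else updated the cache; abandon our upload
```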


bitc commented Mar 9, 2022

Thank you for your response. It is good to hear that any potential issues are rare and possibly non-existent in the real world.

My motivation for opening this issue is that the interaction between the components involved in this tool is complex. I may be misunderstanding something or my analysis may be wrong.

Therefore I opened this issue in the interest of hearing from others whether my analysis is correct, and whether there is indeed the possibility of a (rare, perhaps only theoretical) race condition that can cause corrupt builds or other problems.


newhoggy commented Mar 9, 2022

Yes, I appreciate you raising the issue.

This is worth documenting in the README, at the very least.

For users who use this on GitHub, I think a GitHub Action can provide coarse-grained concurrency protection by default.


bitc commented Mar 10, 2022

For those who want to ensure 100% safety: please be aware that a lock will provide additional safety, but not 100%.

One additional issue (I believe) is with the way the sync-to-archive command works. It uploads the packages to S3 in arbitrary (random) order. If it fails (or is killed) part-way through, then it may have uploaded a dependent package without its dependencies. A later worker may then build and upload said dependency, which could lead to the same problems originally described.

I think fixing this issue is actually fairly easy and worth doing in the code here. The fix is to upload the packages to S3 ordered by their dependency topology (carefully accounting for concurrent uploads).
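As a sketch of that ordering, assuming we know each store entry's package-id and the package-ids it was built against (the Pkg type below is hypothetical, not a cabal-cache type), Data.Graph from containers can do the topological sort:

```haskell
import Data.Graph (graphFromEdges, topSort)

-- Hypothetical description of a store entry: its package-id and the
-- package-ids it was built against.
data Pkg = Pkg { pkgId :: String, pkgDeps :: [String] }

-- Order entries so that every dependency comes before anything that
-- depends on it. topSort places a vertex before everything reachable
-- from it, and our edges point at dependencies, so we reverse the result.
uploadOrder :: [Pkg] -> [Pkg]
uploadOrder pkgs = reverse (map toPkg (topSort graph))
  where
    (graph, fromVertex, _) =
      graphFromEdges [ (p, pkgId p, pkgDeps p) | p <- pkgs ]
    toPkg v = let (p, _, _) = fromVertex v in p
```

If the uploads then proceed in that order, a sync that dies part-way through can leave a dependency without its dependents, but never a dependent without its dependencies.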

@newhoggy

> The fix is to upload the packages to S3 ordered by their dependency topology (carefully accounting for concurrent uploads).

Good idea! Thanks!


newhoggy commented Aug 9, 2023

I had an idea on how to solve this.

Store files in S3 as content-addressed objects.

Track the objects and their filenames in git. Use git as a way to atomically push a new snapshot and adjudicate which build "wins".
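A sketch of the content-addressing half, assuming each store entry's archive is hashed with SHA-256 before upload (the key prefix is made up for illustration, and the hashing uses the cryptonite package):

```haskell
import           Crypto.Hash          (Digest, SHA256, hashlazy)
import qualified Data.ByteString.Lazy as LBS

-- Derive an immutable, content-addressed S3 key for a package archive.
-- Two builders producing different (possibly ABI-incompatible) builds of
-- the "same" package-id get different keys, so neither can silently
-- overwrite the other; the git-tracked snapshot decides which one "wins".
contentAddressedKey :: FilePath -> IO String
contentAddressedKey archivePath = do
  bytes <- LBS.readFile archivePath
  let digest = hashlazy bytes :: Digest SHA256
  pure ("store/sha256/" <> show digest)
```

The git repository would then map package-ids to these immutable keys, and a rejected push would tell the losing builder to adopt the winner's snapshot instead, matching the "loser abandons their build" rule quoted at the top of this issue.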
