Concurrency issues with package cache in validator #1641
The information in this file is only used when displaying the cache to a user and advising when an entry in the cache was last used. There is a tool that does this, but not all tools set the usage date, so it's not reliable. We should surround this code with a try..catch and ignore any errors from it; they simply don't matter.
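For illustration, a minimal sketch of that suggestion: swallow any error around the usage-date update, since the data is purely informational. The helper name and the use of java.util.Properties are assumptions for the example, not the actual core code.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

public class UsageTracking {
  // Hypothetical helper: records the "last used" date for a cached package folder.
  static void recordUsageDate(File packageFolder) {
    try {
      Properties usage = new Properties();
      usage.setProperty("date", new SimpleDateFormat("yyyyMMddHHmmss").format(new Date()));
      try (OutputStream out = new FileOutputStream(new File(packageFolder, "usage.ini"))) {
        usage.store(out, null);
      }
    } catch (Exception e) {
      // This file only feeds informational "last used" display, so any error here can be ignored.
    }
  }
}
```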
Update: there is one other error we've seen multiple times; it looks like it's related to one thread trying to read a file while another is trying to delete it. I haven't had time to try to replicate this yet, but this is the log snippet:
Yes, this is a big concurrency issue. Scheduled to look at it next week.
I've done some initial work on this. I have the beginnings of a solution, which gets me repeated passes: master...do-20240722-thread-safety

The issue is that there is no read lock: we check to make sure two engines aren't writing the same data, but never that one engine might be reading something that is still in the process of being written. What I've done in the branch above is to first ask for a write lock when we need to read. Then, if the write lock is granted (no one is writing), the lock is downgraded to a read lock. Multiple read threads are allowed on these locks, but only one writer, so this gives us a bit more efficiency than only allowing a single thread to read at a time.

This is by no means a complete solution, just something to start with. Thoughts?
Hah, and I see that there's some DDoS protection on the server, because my repeated tests are displaying 'Too Many Requests' errors.
Oh... and this is weird. Dylan's test STILL fails, but for a different reason; the txCache is getting repeatedly cleared because at some point
Thanks @dotasek, I haven't dug into the details of how files get read here, but conceptually this sounds good. One thought, though - it sounds like the issue is more that there is currently no read-locking being done, rather than an existing read-lock mechanism being broken. Is the always-get-a-write-lock-to-read approach needed? Since the write lock is exclusive, it seems like that will result in read locks essentially being made exclusive too.
@dehall I'm thinking along similar lines. I ran this several times this morning only to get different failures for other files, which have no locks on them at all.

The lock downgrade mechanism actually releases the write lock, so it doesn't make read locks exclusive. It gets a write lock, then gets a read lock, then releases the write lock, leaving only the read lock active while it does its work. The intent there was to prevent reads while writing is happening. This should work OK for the package manager, where it only writes a file once. I do have to check if there's a case where certain files will be re-written; I'm thinking particularly of packages.ini.

And, looking at how the terminology cache works, this concurrency will likely be a big issue there too, and a more complex one, because that cache appends to files instead of just creating and writing them once.
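For anyone following along, this is roughly the downgrade pattern being described, sketched with java.util.concurrent.locks.ReentrantReadWriteLock (which supports write-to-read downgrading, but not the reverse); class and method names are illustrative, not the branch's actual code.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class DowngradingPackageReader {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  public <T> T readWithDowngrade(Supplier<T> reader) {
    // Ask for the write lock first: this only succeeds once no writer is active.
    lock.writeLock().lock();
    try {
      // Downgrade: acquire the read lock while still holding the write lock...
      lock.readLock().lock();
    } finally {
      // ...then release the write lock; other readers can now run concurrently.
      lock.writeLock().unlock();
    }
    try {
      return reader.get();
    } finally {
      lock.readLock().unlock();
    }
  }
}
```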
Some very odd errors cropping up today that I don't recall seeing before:
I don't know if this is related to concurrency at all. Even reverting to a state before my changes results in this behaviour.
To get beyond the issue I was experiencing, I started running my test using local packages, so at least I'm not hammering the package server as much. I made some changes that seem to get @dehall's test to repeatedly pass (I set it to loop until failure in IntelliJ, and got to 4 passes before writing this). The main changes were ensuring that there's only one lock management system per package cache directory (a rough sketch of that idea follows below), and adding a special lock for changes to the whole package cache directory (and, I guess, the package.ini file).

This is NOT in any way complete. There are so many other places it could crap out, and I removed the

Dylan, could you take a look at this and see if what I did makes any sense?
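A hedged sketch of the "one lock management system per package cache directory" idea, keyed by canonical path so that separate manager instances pointing at the same folder share locks within the JVM; all names here are hypothetical, not the branch's actual code.

```java
import java.io.File;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CacheDirectoryLocks {
  private static final Map<String, CacheDirectoryLocks> INSTANCES = new ConcurrentHashMap<>();

  // One lock for structural changes to the whole cache directory (e.g. packages.ini),
  // plus one lock per package folder.
  private final ReentrantReadWriteLock cacheWideLock = new ReentrantReadWriteLock();
  private final Map<String, ReentrantReadWriteLock> packageLocks = new ConcurrentHashMap<>();

  public static CacheDirectoryLocks forDirectory(File cacheFolder) throws IOException {
    // Canonicalize so two managers pointing at the same folder share one instance.
    return INSTANCES.computeIfAbsent(cacheFolder.getCanonicalPath(), k -> new CacheDirectoryLocks());
  }

  public ReentrantReadWriteLock getCacheWideLock() {
    return cacheWideLock;
  }

  public ReentrantReadWriteLock getPackageLock(String packageId) {
    return packageLocks.computeIfAbsent(packageId, k -> new ReentrantReadWriteLock());
  }
}
```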
Sure thing, will take a look tomorrow AM. Thanks so much for digging into this.
No problem. This is a good find, and the replicable test is great. Staring at this now, I realize that maybe that bit of 'cleverness' with lock downgrading isn't so clever. I think it will wait for read locks to be released before allowing a write lock, so functionally, I might as well just have a write lock there. I'll need to revisit that.
Overall I think this makes sense, but that's with the caveat that I haven't looked at the existing and surrounding code nearly as much as you and Grahame have, and I'm not an expert in concurrency. In terms of the specifics of the locks, from reading the docs on ReadWriteLock https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReadWriteLock.html the write lock is completely exclusive but multiple read locks may be held at the same time - that sounds good.

I'm wondering if this "Manager" approach, where a function is passed into the Manager to call after acquiring the lock, is the best way to do this. Another way to do it might be to pass the Lock around into the specific methods that do file operations and let them get the readLock or writeLock as appropriate. That way the caller doesn't have to know the specifics of what the method is going to do. But that may be a major refactoring and more trouble than it's worth.
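To make the two options concrete, here's a hedged sketch of both styles side by side; the class and method names are illustrative, not the actual core API.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class PackageCacheLockManager {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // "Manager" style: the caller hands over the work, and the manager brackets it
  // with the right lock, so callers never touch readLock()/writeLock() directly.
  public <T> T doWithReadLock(Supplier<T> work) {
    lock.readLock().lock();
    try {
      return work.get();
    } finally {
      lock.readLock().unlock();
    }
  }

  public <T> T doWithWriteLock(Supplier<T> work) {
    lock.writeLock().lock();
    try {
      return work.get();
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Alternative style: expose the lock and let each file-handling method pick
  // readLock() or writeLock() itself. The caller no longer needs to know what
  // the method does, but correctness then depends on every method choosing well.
  public ReentrantReadWriteLock getLock() {
    return lock;
  }
}
```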
I think I spoke entirely too soon. I'm looking at it again today, and I think yesterday I got the cache into a 'lucky' state. I'm getting all sorts of errors now.
This is kind of the problem with the cache; it mixes reading and writing in numerous locations instead of clearly distinguishing between the two. I think sorting THAT out first might make concurrency much easier to solve. What disturbs me about taking another approach (passing locks around) is precisely that read/write locking would be left up to individual methods. The moment we do that, we'll end up with
Definitely agree. If the methods can be cleanly split out then the specifics of the locking approach don't matter as much.
After talking with Grahame, we agreed on a few things to make this easier. There are two tracking mechanisms which are unused:

- usage.ini (ironically) is completely unused to date. It can be totally removed.
- packages.ini was also slated for total removal, except that it contains the cache version. This is important; if we update the cache implementation, this is the mechanism that allows us to identify obsolete caches and initiate a cleanup. However, the tracking mechanism for package install timestamps is totally unused, and THAT was causing file rewrites mid package install. I will remove that.

I'm going to try to do two things:
The last round of commits to this branch are pretty much what I outlined in my last comment: master...do-20240722-thread-safety

Still to be done:
This comment was marked as outdated.
I hid the last comment, as it was too soon. I have a consistently failing Windows test now that can't delete the lock files for some reason. Will try to fix...
Apologies, I didn't have time to take a close look at this today as I'm wrapping a lot of things up before I go on vacation (back 8/28). At a glance the tests look good. Maybe I'm missing something, but I don't see much in terms of new Exceptions. "Error reading package" and "Timeout waiting for lock file deletion" I think are fine. More specific exceptions for situations like this are tough, because if you could explain to the user exactly what broke, you could potentially handle it anyway.

When you're ready for a stress test, let me know and we (the Inferno team) can try it out in the scenario we've seen this in. It doesn't happen 100% of the time, but it happens often enough that we could tell pretty quickly if it's resolved.
@dehall I'm not ready for a stress test, unfortunately. The Ubuntu tests are failing, and I don't know why. Windows and Mac both appear to be working just fine, but I think I may have to debug in WSL or something to find out what's going on.
After lots of debugging, it turns out the problem was universal, but somehow Ubuntu caught it quite often. I think this is now solid. @grahamegrieve, the linked PR should have a review before I clean it up. @dehall, the branch in the PR would be ideal for stress tests now. Please take a look at #1715. In reviewing, it would be good to verify the following:
There's also lots of other System.out garbage I need to clean up; I'll do that after you've finished your reviews. This DOES NOT ADDRESS TerminologyCache concurrency. I want to keep this PR limited in scope to just the package cache. TerminologyCache can wait until this PR is finalized.
Apologies I wasn't able to test that PR earlier, as I was out most of the past two weeks; I see it's already been merged and a new release is up. Still, I'll give it a shot tonight and hopefully confirm that everything looks good.
@dehall no worries. The more thorough testing picked up a bunch of issues that needed repair. Hopefully the fixes for those also fixed the one you initially reported. There are other problems with concurrency that those tests picked up, but those will be outside the scope of this issue, and I will work on them separately.
Ok I tested it using the validator wrapper docker image
It's tough to tell from the logs what number corresponds to what since so many threads are printing in parallel, but I see most packages load pretty quickly:
some of them take 3+ minutes:
Or worse:
Looks like the default timeout is 60 seconds, but there is another constructor that takes a different timeout, so I'll keep looking into that. This might also just be a worst-case scenario, where it's trying to spin up 10 sessions concurrently - 2 each for 5 US Core versions, plus the presets are loading. This is also just a quick test, so there may be a way we could take advantage of those presets; I'll need to spend more time with it.
@dehall even as a worst-case scenario, this should be addressed. The timeout may need to be extended, or have some kind of customization. Is it practical for you to hack a longer timeout into your scenario by building core? If some loads are taking that long, I'll look into the size of the vsac package and see what else compares among the known packages (if the server APIs will let me).
Yes I can build locally with a longer timeout and give it a shot, I should have time for that next week.
Quick comment: us.nlm.vsac is indeed quite large (584MB), compared to the core spec for r4 (60MB). I wonder if the vsac load is consistently that slow; more than 6 minutes is an awfully long time to wait for a timeout in a performant application. I can put together some local code to test this. Which other constructor uses a timeout other than 60 seconds? Maybe I used one in the tests, but in all other cases it should be the default.
Ah, I wasn't looking closely at the constructor the first time. Quickly scrolling through, I saw this one but didn't notice it's private and that the one method that calls it is protected.

Line 49 in 041b6e8
I went through some of the logs for the validator wrapper and found most package loads to be quite fast, including vsac, BUT with some pretty extreme outliers (one in days, which is bizarre). Those times aren't necessarily representative of the run time of addPackageToCache(...), though; they are timed by another mechanism within the validation code. I've added some temporary code in my own local version to test with a more direct approach.

In FilesystemPackageCacheManagerLocks:

```java
long start = System.currentTimeMillis();
result = f.get();
long end = System.currentTimeMillis();
System.out.println("Time taken for package write: " + (end - start) + "ms");
```

In FilesystemPackageManagerTests:

```java
if (randomOperation == 0) {
  HTTPResult res = ManagedWebAccess.get("https://www.fhir.org/packages/us.nlm.vsac/0.19.0/package.tgz");
  res.checkThrowException();
  InputStream igStream = new ByteArrayInputStream(res.getContent());
  pcm.addPackageToCache("us.nlm.vsac", "0.19.0", igStream, "http://fhir.org/packages/us.nlm.vsac");
}
```

I'm loading a package from the server and installing it into the cache, and even vsac takes ~12 seconds.
The full server approach doesn't add anything. Still taking 12 seconds for me, on my Mac, over WiFi. @dehall maybe you got unlucky with a weird bit of network slowdown; either way, it's probably something we could handle more gracefully. I'll take a look at two things next week:
Ok, update from my side. TL;DR: things look good!

My earlier test was on my personal machine, Windows running Docker in WSL2. I re-ran that same test Saturday AM and saw similar results. This morning I tried it on my work machine, a Mac M3 Pro running the validator-wrapper natively, with no changes to the timeout yet, and everything seems to be working OK. I did see an exception the first time I ran it, but clearing the local package cache fixed that, and on one occasion I accidentally sent 30 simultaneous requests instead of 10 (no exceptions on the service, but it did slow to a halt and the HTTP requests timed out); once I got back into realistic test scenarios it all worked great.

I didn't consider my personal machine to be "low performance" but maybe it is - it's some years old now but was well-specced when new. I'll keep poking at why it's so much slower; maybe it's a suboptimal WSL or Docker setting. Thanks so much for all your work on this!
Thanks for the feedback and additional testing @dehall This is coming close to a stopping point, but there are still two items:

**Process timeout**

The timeout configuration is actually a tiny bit problematic. We want all locks on those directories to apply across the whole JVM, or else they won't be useful, and they must be essentially 'final' too, so changing that timeout would be ugly in tests. I think this leaves us with exposing the timeout as a parameter to be passed in. Getting this functionality is crucial, though, because the next item needs it for testing purposes.

**Orphaned .lock files**

The .lock files themselves are already causing some people headaches, too. People cancel long-running builds, or they fail at some point, and then a .lock file is left in their cache that they're forced to manually delete (and they should be clearing the installed package too, because it could be incomplete). node.js projects had a bunch of similar problems early on.

A fix for this is performing 'cleanup' for locked package directories: if a .lock file exists, but is not locked by another OS process, delete both the .lock file and the installed package.

When should this be done, though? On initialization of the FPCM is one place. For validator CLI and ig-publisher, that's fine, and I'm working on it right now. But for long-running instances, that wouldn't help; any failures after startup would be persistent for that package. We could attempt to clean and re-install whenever an orphaned .lock file is encountered, but that's tricky because it switches between reading/writing. It's probably the best way to go, though.
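As an illustration of the orphaned-lock cleanup idea (not the actual implementation), one way to test whether another OS process still holds a .lock file is FileChannel.tryLock; the names and the commons-io dependency below are assumptions for the sketch.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import org.apache.commons.io.FileUtils;

public class OrphanedLockCleanup {
  /** Deletes the lock file and the possibly-incomplete package folder if no process holds the lock. */
  static boolean cleanUpIfOrphaned(File lockFile, File packageFolder) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
         FileChannel channel = raf.getChannel()) {
      FileLock fileLock = channel.tryLock();
      if (fileLock == null) {
        return false; // another OS process still holds it: a real install is in progress
      }
      fileLock.release();
    } catch (OverlappingFileLockException e) {
      return false; // this JVM already holds it (another thread is installing)
    }
    // Nobody holds the OS-level lock, so the install was abandoned: remove the leftovers.
    FileUtils.deleteDirectory(packageFolder);
    if (!lockFile.delete()) {
      throw new IOException("Unable to delete orphaned lock file " + lockFile);
    }
    return true;
  }
}
```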
Regarding the above: #1745 is where I'm at now. @dehall can you look at how the lock timeout is set there? I don't think it's exposed enough yet, so any comments would be helpful. I was thinking FhirSettings would be the most obvious place.
I believe my branch is at a good review point now: #1745 It fixes the primary issue we were having with undeleted .lock files, so it would likely benefit users to have it merged and released. @dehall I could still use feedback on how the lock timeout is set here, though something like FhirSettings could be added later.
Apologies @dotasek you've caught me on travel again. Meant to comment last Friday, but too many other things were happening. I don't have a computer to run any of this on, but I'm looking at the code. At a glance I agree: I don't think the lock timeout is exposed enough yet for it to be used in the validator-wrapper, where, if I recall correctly, the main entry point back into the core validator is the ValidationService. I don't remember seeing FhirSettings before, but it looks like a reasonable spot to me to avoid passing settings through a dozen methods.

One question I'm not sure we've discussed: will this timeout be set per running instance of the service, or should it be possible to set it per request, such as in a validation request CLI context? I don't think we need it per-request, so using FhirSettings for a global setting sounds fine to me. (Unless for some reason that won't work with the validator-wrapper, but I can't see why it wouldn't.)
When multiple validator sessions are starting up at the same time we've seen Exceptions that appear to be concurrency issues when reading from certain files in the package cache.
Similar to #1491 - this issue was initially observed using the org.hl7.fhir.validator-wrapper, but we were able to reproduce it with just the core validation library.
An example stack trace:
In practice with the validator-wrapper, I seem to recall other files causing issues but maybe it's only usage.ini. Other examples:
I suspect the issue is that `FilesystemPackageCacheManager.loadPackageInfo` writes to the file without locking it, so another thread can enter this at the same time and catch the file before the first thread is done with it. It may be as simple as wrapping the following line in a `FilesystemPackageCacheLock.doWriteWithLock` block, but my initial attempts to do that just got me different errors.

org.hl7.fhir.core/org.hl7.fhir.utilities/src/main/java/org/hl7/fhir/utilities/npm/FilesystemPackageCacheManager.java
Line 264 in ad31f71
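For illustration only, here's a self-contained sketch of the "wrap the write in a write lock" idea; none of these names come from the core library, and the real `doWriteWithLock` signature may differ.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockedIniWriter {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Analogous to a doWriteWithLock helper: writers are serialized, and readers
  // sharing the same lock can no longer observe a half-written usage.ini.
  public void writeUsageIni(File packageFolder, byte[] content) throws IOException {
    lock.writeLock().lock();
    try (OutputStream out = new FileOutputStream(new File(packageFolder, "usage.ini"))) {
      out.write(content);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```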
Sample code that (usually, not always unfortunately) reproduces the issue: