Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrency issues with package cache in validator #1641

Open
dehall opened this issue May 30, 2024 · 36 comments
Open

Concurrency issues with package cache in validator #1641

dehall opened this issue May 30, 2024 · 36 comments
Labels

Comments

@dehall
Copy link
Contributor

dehall commented May 30, 2024

When multiple validator sessions are starting up at the same time we've seen Exceptions that appear to be concurrency issues when reading from certain files in the package cache.
Similar to #1491 - this issue was initially observed using the org.hl7.fhir.validator-wrapper but able to reproduce just with the core validation library.
An example stack trace:

java.io.IOException: Error parsing JSON source: Unexpected content at start of JSON: Eof at Line 1 (path=[null]) from '/Users/dehall/.fhir/packages/hl7.fhir.uv.extensions.r4#1.0.0/usage.ini'
	at org.hl7.fhir.utilities.json.parser.JsonLexer.error(JsonLexer.java:120)
	at org.hl7.fhir.utilities.json.parser.JsonParser.parseSource(JsonParser.java:244)
	at org.hl7.fhir.utilities.json.parser.JsonParser.parseJsonObject(JsonParser.java:230)
	at org.hl7.fhir.utilities.json.parser.JsonParser.parseObject(JsonParser.java:71)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackageInfo(FilesystemPackageCacheManager.java:262)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackageFromCacheOnly(FilesystemPackageCacheManager.java:463)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackage(FilesystemPackageCacheManager.java:615)
	at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:108)
	at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:121)
	at org.hl7.fhir.validation.cli.services.ValidationService.loadIgsAndExtensions(ValidationService.java:549)
	at org.hl7.fhir.validation.cli.services.ValidationService.buildValidationEngine(ValidationService.java:495)
	at org.hl7.fhir.validation.cli.services.ValidationService.initializeValidator(ValidationService.java:476)

In practice with the validator-wrapper, I seem to recall other files causing issues but maybe it's only usage.ini. Other examples:

Error parsing JSON source: Unexpected content at start of JSON: Eof at Line 1 (path=[null]) from '/home/ktor/.fhir/packages/hl7.terminology#5.5.0/usage.ini'
Error parsing JSON source: Unexpected content at start of JSON: Eof at Line 1 (path=[null]) from '/home/ktor/.fhir/packages/hl7.fhir.uv.extensions.r4#1.0.0/usage.ini'

I suspect the issue is that FilesystemPackageCacheManager.loadPackageInfo writes to the file without locking it, so another thread can enter this at the same time and catch the file before the first thread is done with it. It may be as simple as wrapping the following line in a FilesystemPackageCacheLock.doWriteWithLock block but my initial attempts to do that just got me different errors.

Sample code that (usually, not always unfortunately) reproduces the issue:

  @Test
  public void multithreadingTest() throws IOException {
    ValidationService myService = new ValidationService();
    final AtomicInteger totalSuccessful = new AtomicInteger();    

    final String[] packages = {
        "hl7.fhir.us.core#3.1.1",
        "hl7.fhir.us.core#4.0.0",
        "hl7.fhir.us.core#5.0.1",
        "hl7.fhir.us.core#6.1.0",
        "hl7.fhir.us.core#7.0.0-ballot"
    };
    
    List<Thread> threads = new ArrayList<>();
    int i = 0;
    for (String pckage : packages) {
      final int index = i++;
      Thread t = new Thread(() -> {
        try {
          myService.initializeValidator(
              new CliContext().setTxServer(null).setIgs(List.of(pckage)),
              "hl7.fhir.r4.core", 
              new TimeTracker(), 
              null
          );
          
          totalSuccessful.incrementAndGet();
          System.out.println("Thread " + index + " completed");
        } catch (Throwable e) {
          e.printStackTrace();
          System.err.println("Thread " + index + " failed");
        }
      });
      t.start();
      threads.add(t);
    }
    threads.forEach(t -> {
      try {
        t.join();
      } catch (InterruptedException e) {

      }
    });
    assertEquals(packages.length, totalSuccessful.get());
  }
@grahamegrieve
Copy link
Collaborator

The information in this file is only used when displaying the cache to a user and advising when an entry in the cache was last used. There is a tool that does this, but not all tools set the usage date, so it's not reliable.

We should surround this code with a try..catch and ignore any errors from it; they simply don't matter.

@dehall
Copy link
Contributor Author

dehall commented Jun 10, 2024

Update: there is one other error we've seen multiple times, looks like it's related to one thread trying to read a file while another is trying to delete it. I haven't had time to try to replicate this yet but this is the log snippet.

Installing fhir.dicom#2022.4.202210062024-06-07 17:50:20.380 [ktor-jetty-3500-2] ERROR ktor.application - 1 exception(s): [org.apache.commons.io.IOIndexedException: IOException #0: Cannot delete file: /home/ktor/.fhir/packages/44eb56b7-82f4-4808-b3d8-6df033b81f72/package]
 done.
Clean up package /home/ktor/.fhir/packages/fhir.dicom#2022.4.20221006 because installation failed: File /home/ktor/.fhir/packages/fhir.dicom#2022.4.20221006/package/package.json not found
java.io.IOException: File /home/ktor/.fhir/packages/fhir.dicom#2022.4.20221006/package/package.json not found
2024-06-07 17:50:20.383 [ktor-jetty-3500-3] INFO  ktor.application - 500 Internal Server Error: POST - /validate
	at org.hl7.fhir.utilities.TextFile.fileToString(TextFile.java:117)
	at org.hl7.fhir.utilities.npm.NpmPackage.loadFiles(NpmPackage.java:401)
	at org.hl7.fhir.utilities.npm.NpmPackage.fromFolder(NpmPackage.java:360)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackageInfo(FilesystemPackageCacheManager.java:265)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.lambda$addPackageToCache$2(FilesystemPackageCacheManager.java:546)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheLock.doWriteWithLock(FilesystemPackageCacheLock.java:37)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.addPackageToCache(FilesystemPackageCacheManager.java:508)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackage(FilesystemPackageCacheManager.java:651)
	at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:107)
	at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:120)
	at org.hl7.fhir.validation.cli.services.ValidationService.loadIgsAndExtensions(ValidationService.java:553)
	at org.hl7.fhir.validation.cli.services.ValidationService.buildValidationEngine(ValidationService.java:494)
	at org.hl7.fhir.validation.cli.services.ValidationService.initializeValidator(ValidationService.java:475)
	at org.hl7.fhir.validation.cli.services.ValidationService.validateSources(ValidationService.java:95)
	at controller.validation.ValidationControllerImpl.validateRequest(ValidationControllerImpl.kt:18)
	at controller.validation.ValidationModuleKt$validationModule$1.invokeSuspend(ValidationModule.kt:43)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at io.ktor.util.pipeline.SuspendFunctionGun.resumeRootWith(SuspendFunctionGun.kt:138)
	at io.ktor.util.pipeline.SuspendFunctionGun.loop(SuspendFunctionGun.kt:112)
	at io.ktor.util.pipeline.SuspendFunctionGun.access$loop(SuspendFunctionGun.kt:14)
	at io.ktor.util.pipeline.SuspendFunctionGun$continuation$1.resumeWith(SuspendFunctionGun.kt:62)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
2024-06-07 17:50:20.393 [ktor-jetty-3500-4] ERROR ktor.application - File /home/ktor/.fhir/packages/fhir.dicom#2022.4.20221006/package/package.json not found
2024-06-07 17:50:20.394 [ktor-jetty-3500-1] INFO  ktor.application - Received Validation Request. FHIR Version: 4.0.1 IGs: [hl7.fhir.us.core#5.0.1] Memory (free/max): 4921594968/8283750400
No such cached session exists for session id 008b523b-bb4c-44ae-8a0e-44ac1bcd5599, re-instantiating validator.
  Load FHIR v4.0.1 from hl7.fhir.r4.core#4.0.12024-06-07 17:50:20.395 [ktor-jetty-3500-4] INFO  ktor.application - 500 Internal Server Error: POST - /validate

@grahamegrieve
Copy link
Collaborator

Yes this is. big concurrency issue. scheduled to look at next week

@dotasek
Copy link
Collaborator

dotasek commented Aug 1, 2024

@dehall @grahamegrieve

I've done some initial work with this. I created multithreadingTestMinimal() which uses ValidationEngine more directly; I hope to eventually strip that away and use the FileSystemPackageCacheManager directly instead. It originally displayed the same failure as Dylan's test.

I have the beginnings of a solution, which gets me repeated passes for multithreadingTestMinimal()

master...do-20240722-thread-safety

The issue is that there is no read lock... we check to make sure two engines aren't writing the same data, but never that one engine might be reading something that is still in the process of being written.

What I've done in the branch above is first ask for another write lock when we need to read. Then, if a write lock is permitted (no one is writing), the lock is downgraded to a read. There are multiple read threads allowed on these locks, but just one write, so this will give us a bit more efficiency, rather than only allowing a single thread to read at a time.

This is by no means a complete solution, just something to start with.

Thoughts?

@dotasek
Copy link
Collaborator

dotasek commented Aug 1, 2024

Hah, and I see that there's some DDOS protection on the server, because my repeated tests are displaying 'Too Many Requests' errors.

@dotasek
Copy link
Collaborator

dotasek commented Aug 1, 2024

Oh... and this is weird. Dylan's test STILL fails, but for a different reason; the txCache is getting repeatedly cleared because at some point packages.ini contains a null for the package cache version... which it shouldn't. Unless something is in the middle of rewriting it.

@dehall
Copy link
Contributor Author

dehall commented Aug 5, 2024

Thanks @dotasek , I haven't dug into the details of how files get read here, but conceptually this sounds good.

One thought though - it sounds like the issue is more that there currently is no read-locking being done, rather than an existing read-lock mechanism is broken. Is the always-get-a-write-lock-to-read-lock approach needed? Since the write lock is exclusive, seems like that will result in read locks essentially being made exclusive too

@dotasek
Copy link
Collaborator

dotasek commented Aug 5, 2024

@dehall I'm thinking along similar lines. I ran this several times this morning only to get different failures for other files, which have no locks on them at all.

The lock downgrade mechanism actually releases the write lock, so it doesn't make read locks exclusive. It gets a write lock, then gets a read lock, then releases the write lock, leaving only the read lock active while it does its work. The intent there was to prevent reads while writing is happening. This should work OK for the package manager, where it only writes a file once.

I do have to check if there's a case where certain files will be re-written; I'm thinking particularly packages.ini

And, looking at how the terminology cache works, this concurrency will likely be a big issue there, and a more complex one, because the cache appends to files instead of just creating and writing once.

@dotasek
Copy link
Collaborator

dotasek commented Aug 5, 2024

Some very odd errors cropping up today that I don't recall seeing before:

Load FHIR v4.0 from hl7.fhir.r4.coreError parsing Observation-HVS-item-example-88122-7.json: [SQLITE_READONLY_DBMOVED] The database file has been moved since it was opened (attempt to write a readonly database)
Error parsing Observation-HVS-item-example-88123-5.json: Cannot read the array length because "bytes" is null
Error parsing Observation-HVS-item-example-88124-3.json: Cannot read the array length because "bytes" is null
Error parsing Observation-HVS-panel-example-88121-9.json: statement is not executing

I don't know if this is related to concurrency at all. Even reverting to a state before my changes results in this behaviour.

@dotasek
Copy link
Collaborator

dotasek commented Aug 6, 2024

To get beyond the issue I was experiencing, I started running my test using local packages, so at least I'm not hammering the package server as much.

I made some changes that seem to get @dehall 's test to repeatedly pass (I set it to loop until failure in IntelliJ, and got to 4 before writing this). The main changes were ensuring that there's only one lock management system per package cache directory, and adding a special lock for changes to the whole package cache directory (and, I guess, the package.ini file).

This is NOT in any way complete. There's so many other places it could crap out, and I removed the .lock file mechanism during the refactor in order to reduce complexity.

Dylan, could you take a look at this and see if what I did makes any sense?

aa23ed8

@dehall
Copy link
Contributor Author

dehall commented Aug 6, 2024

Sure thing, will take a look tomorrow AM. Thanks so much for digging into this

@dotasek
Copy link
Collaborator

dotasek commented Aug 6, 2024

No problem. This is a good find, and the replicable test is great.

Staring at this now, I realize that maybe that bit of 'cleverness' with lock downgrading isn't so clever. I think it will wait for read locks to be released before allowing a write lock, so functionally, I might as well just have a write lock there. I'll need to revisit that.

@dehall
Copy link
Contributor Author

dehall commented Aug 7, 2024

Overall I think this makes sense but that's with the caveat that I haven't looked at the existing and surrounding code nearly as much as you and Grahame have, and I'm not an expert in concurrency.

In terms of the specifics of the locks, from reading the docs on ReadWriteLock https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReadWriteLock.html the write lock is completely exclusive but multiple read locks may be held at the same time - that sounds good. So in doReadWithLock I think the write lock downgrade would be unnecessary if it could be guaranteed that the function it calls is truly only a read; ie, if you're sure you only need to read, just get a read lock and ReadWriteLock will handle the rest. But I can tell making sure the function really is read-only is not as easy as it sounds. I'm not sure if there's a way to do that beyond manually reviewing every function you might call. loadPackageInfo sounds like it should be a pure read but no, turns out it also writes back to the usage.ini file. Having doReadWithLock always get a write lock first is, as far as I can tell, totally safe in terms of locking, but could potentially slow things down unnecessarily. (Not sure how "slow" it would be in practice given the lock is per-package and there shouldn't be that many concurrent readers anyway)

I'm wondering if this "Manager" approach, where a function is passed into the Manager to call after acquiring the lock, is the best way to do this. Another way to do it might be to pass the Lock around into the specific methods that do file operations and let them get the readLock or writeLock as appropriate. That way the caller doesn't have to know the specifics of what the method is going to do. But that may be a major refactoring and more trouble than it's worth

@dotasek
Copy link
Collaborator

dotasek commented Aug 7, 2024

I think I spoke entirely too soon. I'm looking at it again today, and I think yesterday I got the cache into a 'lucky' state. I'm getting all sorts of errors now.

usage.ini is a bit of a pain. From the git blame, this looks to have been a tracking mechanism for the validator website to see what packages are being validated against. But it hasn't been used at all, as I've never gone to look at it. I'm going to suggest to Grahame that we move that.

This is kind of the problem with the cache; it mixes reading and writing in numerous locations instead of clearly distinguishing between the two. I think sorting THAT out first might make concurrency much easier to solve.

What disturbs me about taking another approach (passing locks around) is precisely that read/write locking would be left up to individual methods. The moment we do that, we'll end up with n implementations of concurrency peppered throughout the cache instead of being in one encapsulated location. But I'm open to altering that opinion if you can expand on the approach, maybe with some pseudocode.

@dehall
Copy link
Contributor Author

dehall commented Aug 8, 2024

This is kind of the problem with the cache; it mixes reading and writing in numerous locations instead of clearly distinguishing between the two. I think sorting THAT out first might make concurrency much easier to solve.

Definitely agree. If the methods can be cleanly split out then the specifics of the locking approach don't matter as much.

@dotasek
Copy link
Collaborator

dotasek commented Aug 8, 2024

After talking with Grahame, we agreed on a few things to make this easier.

There are two tracking mechanisms which are unused:

usage.ini (ironically) is completely unused to date. It can be totally removed.

packages.ini was also slated for total removal, except that it contains the cache version. This is important; if we update the cache implementation, this is the mechanism that allows us to identify obsolete caches and initiate a cleanup. However, the tracking mechanism for package install timestamps is totally unused, and THAT was causing file rewrites mid package install. I will remove that.

I'm going to try to do two things:

  • Isolate package install and cache cleaning into a single test that only involves the FilesystemPackageCacheManager.
  • Clean up the cache init process first and see what that gets us.

@dotasek
Copy link
Collaborator

dotasek commented Aug 9, 2024

The last round of commits to this branch are pretty much what I outlined in my last comment: master...do-20240722-thread-safety

Still to be done:

  • Apply the locks to the rest of the FilesystemPackageCacheManager
  • Figure out a way to test not just the same method, but combinations of methods from different threads.
  • Re-introduce a file lock, and make sure it can poll (at present, it actually just failed when the lock was held, instead of waiting).
  • Create an issue to apply a similar approach to TerminologyCache too.

@dotasek

This comment was marked as outdated.

@dotasek
Copy link
Collaborator

dotasek commented Aug 16, 2024

I hid the last comment, as it was too soon. I have a consistently failing Windows test now that can't delete the lock files for some reason. Will try to fix...

@dehall
Copy link
Contributor Author

dehall commented Aug 16, 2024

Apologies, I didn't have time to take a close look at this today as I'm wrapping a lot of things up before I go on vacation (back 8/28). At a glance the tests look good. Maybe I'm missing something but I don't see much in terms of new Exceptions. "Error reading package" and "Timeout waiting for lock file deletion" I think are fine. More specific exceptions for situations like this are tough because if you could explain to the user exactly what broke you could potentially handle it anyway.

When you're ready for a stress test, let me know and we (the Inferno team) can try it out in the scenario we've seen this in. It doesn't happen 100% of the time but it happens often enough that we could tell pretty quickly if it's resolved

@dotasek
Copy link
Collaborator

dotasek commented Aug 16, 2024

@dehall I'm not ready for a stress test; unfortunately. The Ubuntu tests are failing, and I don't know why.

Windows and Mac both appear to be working just fine, but I think I may have to debug in WSL or something to find out what's going on.

@dotasek
Copy link
Collaborator

dotasek commented Aug 20, 2024

After lots of debugging it turns out the problem was universal, but somehow, Ubuntu caught it quite often. I think this is now solid. @grahamegrieve, the linked PR should have a review before I clean it up. @dehall, the branch in the PR would be ideal for stress tests now.

Please take a look at #1715.

In reviewing it would be good to verify the following:

  • The new tests in FilesystemPackageManagerLockTests and FilesystemPackageManagerTests should now cover what was originally found breaking in Dylan's additions to ValidationServiceTests. After this review, I intend to drop Dylans original test in favour of these.
  • There is a change in NpmPackage to method visibility. I know the functionality is correct for this, but I don't like that the indexing methods are exposed.
  • That the exceptions I've added in FilesystemPackageManagerLocks are clear. They really shouldn't be encountered unless something very, very odd is going on, like a stubborn lock file.

There's also lots of other System.out garbage I need to clean up; I'll do that after you've finished your reviews.

This DOES NOT ADDRESS TerminologyCache concurrency. I want to keep this PR limited in scope to just the package cache. TerminologyCache can wait until this PR is finalized.

@dehall
Copy link
Contributor Author

dehall commented Sep 3, 2024

Apologies I wasn't able to test that PR earlier as I was out most of the past two weeks, I see it's already been merged and a new release is up. Still I'll give it a shot tonight and hopefully confirm everything looks good

@dotasek
Copy link
Collaborator

dotasek commented Sep 3, 2024

@dehall no worries. The more thorough testing picked up a bunch of issues that needed repair. Hopefully the fixes for those also fixed the one you initially reported as well.

There are other problems with concurrency that those tests picked up, but those will be outside the scope of this issue, and I will work on them separately.

@dehall
Copy link
Contributor Author

dehall commented Sep 4, 2024

Ok I tested it using the validator wrapper docker image markiantorno/validator-wrapper:1.0.55-SNAPSHOT, it does seem to be an improvement in general but in one instance I did hit one of the new exceptions:

2024-09-04 00:30:41.746 [ktor-jetty-3500-10] ERROR ktor.application - Error reading package.
2024-09-03 20:30:41 java.io.IOException: Error reading package.
2024-09-03 20:30:41     at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManagerLocks$PackageLock.checkForLockFileWaitForDeleteIfExists(FilesystemPackageCacheManagerLocks.java:148)
2024-09-03 20:30:41     at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManagerLocks$PackageLock.doReadWithLock(FilesystemPackageCacheManagerLocks.java:157)
2024-09-03 20:30:41     at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackageFromCacheOnly(FilesystemPackageCacheManager.java:453)
2024-09-03 20:30:41     at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackage(FilesystemPackageCacheManager.java:649)
2024-09-03 20:30:41     at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:120)
2024-09-03 20:30:41     at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:133)
2024-09-03 20:30:41     at org.hl7.fhir.validation.cli.services.ValidationService.loadIgsAndExtensions(ValidationService.java:603)
2024-09-03 20:30:41     at org.hl7.fhir.validation.cli.services.ValidationService.buildValidationEngine(ValidationService.java:538)
2024-09-03 20:30:41     at org.hl7.fhir.validation.cli.services.ValidationService.getValidationEngineFromCliContext(ValidationService.java:522)
2024-09-03 20:30:41     at org.hl7.fhir.validation.cli.services.ValidationService.initializeValidator(ValidationService.java:504)
2024-09-03 20:30:41     at org.hl7.fhir.validation.cli.services.ValidationService.validateSources(ValidationService.java:112)
2024-09-03 20:30:41     at controller.validation.ValidationControllerImpl.validateRequest(ValidationControllerImpl.kt:16)
2024-09-03 20:30:41     at controller.validation.ValidationModuleKt$validationModule$1.invokeSuspend(ValidationModule.kt:43)
2024-09-03 20:30:41     at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
2024-09-03 20:30:41     at io.ktor.util.pipeline.SuspendFunctionGun.resumeRootWith(SuspendFunctionGun.kt:138)
2024-09-03 20:30:41     at io.ktor.util.pipeline.SuspendFunctionGun.loop(SuspendFunctionGun.kt:112)
2024-09-03 20:30:41     at io.ktor.util.pipeline.SuspendFunctionGun.access$loop(SuspendFunctionGun.kt:14)
2024-09-03 20:30:41     at io.ktor.util.pipeline.SuspendFunctionGun$continuation$1.resumeWith(SuspendFunctionGun.kt:62)
2024-09-03 20:30:41     at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
2024-09-03 20:30:41     at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
2024-09-03 20:30:41     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2024-09-03 20:30:41     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2024-09-03 20:30:41     at java.base/java.lang.Thread.run(Unknown Source)
2024-09-03 20:30:41 Caused by: java.util.concurrent.TimeoutException: Timeout waiting for lock file deletion: us.nlm.vsac#0.7.0.lock
2024-09-03 20:30:41     at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManagerLocks$PackageLock.checkForLockFileWaitForDeleteIfExists(FilesystemPackageCacheManagerLocks.java:130)
2024-09-03 20:30:41     ... 22 common frames omitted

It's tough to tell from the logs what number corresponds to what since so many threads are printing in parallel, but I see most packages load pretty quickly:

2024-09-03 20:28:50   Load hl7.fhir.uv.extensions.r5#5.1.0 - 1396 resources (00:03.594)
2024-09-03 20:28:52   Load hl7.terminology#6.0.2 - 4243 resources (00:02.225)
2024-09-03 20:28:53   Load hl7.terminology.r5#6.0.2 - 4243 resources (00:00.947)

some of them take 3+ minutes:

2024-09-03 18:06:53 Installing hl7.terminology.r4#3.1.0
2024-09-03 18:06:54  - 4117 resources (03:29.922)
2024-09-03 18:06:54 Installing hl7.fhir.uv.bulkdata#2.0.0 to the package cache
2024-09-03 18:06:54   Fetching:
2024-09-03 18:06:55   Load hl7.terminology.r4#3.1.0
2024-09-03 18:06:55 Installing hl7.fhir.uv.extensions.r4#1.0.0
2024-09-03 18:06:55   Load hl7.fhir.uv.extensions.r4#1.0.0 - 4117 resources (03:31.329)

Or worse:

2024-09-03 18:09:46 Installing us.nlm.vsac#0.11.0 to the package cache
2024-09-03 18:09:46   Fetching:
2024-09-03 18:09:46  - 4576 resources (06:21.977)

(maybe us.nlm.vsac is particularly large?)

Looks like the default timeout is 60 seconds but there is another constructor that takes a different timeout so I'll keep looking into that.

This might also just be a worst-case scenario, where it's trying to spin up 10 sessions concurrently - 2 each for 5 US Core versions, plus the presets are loading. This is also just a quick test so there may be a way we could take advantage of those presets, I'll need to spend more time with it.

@dotasek
Copy link
Collaborator

dotasek commented Sep 4, 2024

@dehall even as a worst case scenario, this should be addressed. The timeout may need to be extended, or have some kind of customization.

Is it practical for you to hack a longer timeout into your scenario by building core? If some are taking 3 6 minutes to install, maybe a 5 8-10 minute timeout would be more appropriate. Maybe we need to start asking what the worst case scenarios are for our default package servers.

I'll look into the size of the vsac package and see what else compares on the known packages (if the server APIs will let me).

@dehall
Copy link
Contributor Author

dehall commented Sep 4, 2024

Yes I can build locally with a longer timeout and give it a shot, I should have time for that next week.

@dotasek
Copy link
Collaborator

dotasek commented Sep 4, 2024

Quick comment: us.nlm.vsac is indeed quite large. (584MB), compared to the core spec for r4 (60MB)

I wonder if the vsac load is consistently that slow. More than 6 minutes is an awful long time to wait for a timeout in a performant application. I can put together some local code to test this.

Which other constructor uses a timeout other than 60 seconds? Maybe I used one in the tests, but in all other cases, it should be the default.

@dehall
Copy link
Contributor Author

dehall commented Sep 5, 2024

Ah I wasn't looking closely at the constructor the first time. Quickly scrolling through I saw this one but didn't notice it's private and the one method that calls it is protected.

private FilesystemPackageCacheManagerLocks(File cacheFolder, Long lockTimeoutTime, TimeUnit lockTimeoutTimeUnit) throws IOException {

@dotasek
Copy link
Collaborator

dotasek commented Sep 6, 2024

I went through some of the logs for the validator wrapper and found most package loads to be quite fast, including vsac, BUT with some pretty extreme outliers (one in days, which is bizarre).

Those times aren't necessarily representative of the run time of addPackageToCache(...), though; they are timed by another mechanism within the validation code. I've added some temporary code in my own local version to test with a more direct approach.

In FilesystemPackageCacheManagerLocks:

 long start = System.currentTimeMillis();
 result = f.get();
 long end = System.currentTimeMillis();
 System.out.println("Time taken for package write: " + (end - start) + "ms");

In FilesystemPackageManagerTests:

if (randomOperation == 0) {
            HTTPResult res = ManagedWebAccess.get("https://www.fhir.org/packages/us.nlm.vsac/0.19.0/package.tgz");
            res.checkThrowException();

            InputStream igStream = new ByteArrayInputStream(res.getContent());
            pcm.addPackageToCache("us.nlm.vsac", "0.19.0", igStream, "http://fhir.org/packages/us.nlm.vsac");
          }

I'm loading a package from the server and installing it into the cache, and even vsac takes ~12 seconds.

Mind you, this is one of those packages that is queried for in three places (primary server, secodary server, and then 'the old way', which is the approach I tested). Maybe between those three servers, response times add up, particularly if one of them might be rate limiting their requests. (see next comment)

@dotasek
Copy link
Collaborator

dotasek commented Sep 6, 2024

The full server approach doesn't add anything. Still taking 12 seconds for me, on my Mac, over WiFi.

@dehall maybe you got unlucky with a weird bit of network slowdown; either way, it's probably something we could handle more gracefully. I'll take a look at two things next week:

  • Configuring that timeout programattically.
  • Producing.a timeout-specific error for client code to consume. That way, you could catch timeouts and establish your own retry.

@dehall
Copy link
Contributor Author

dehall commented Sep 10, 2024

Ok update from my side, TL;DR: things look good!

My earlier test was on my personal machine, Windows running docker in WSL2. I re-ran that same test Saturday AM and saw similar results.

This morning I tried it on my work machine, Mac m3 pro running the validator-wrapper natively, no changes to the timeout yet, and everything seems to be working ok. I did see an exception the first time I ran it, but clearing the local package cache fixed that, and on one occasion I accidentally sent 30 simultaneous requests instead of 10 (no exceptions on the service but it did slow to a halt and the http requests timed out), but once I got back into realistic test scenarios it all worked great.
Then I tried it in docker here and it works fine too. A quick glance at the logs and nothing takes more than a few seconds to load. So I think we should be good! (One change to the wrapper unrelated to this might be an issue but I'll post a separate issue for that.) Maybe the timeout could be increased a little bit to be on the safe side but it doesn't seem like it will be an issue for us in practice with the default 1-minute timeout.

I didn't consider my personal machine to be "low performance" but maybe it is - it's some years old now but was well-specced when new. I'll keep poking at why it's so much slower, maybe it's a suboptimal WSL or docker setting.

Thanks so much for all your work on this!

@dotasek
Copy link
Collaborator

dotasek commented Sep 13, 2024

Thanks for the feedback and additional testing @dehall

This is coming close to a stopping point, but there are still two items:

Process Timeout

The timeout configuration is actually a tiny bit problematic. We want all locks on those directories to be across the whole JVM or else they won't be useful, and they must be essentially 'final' too, so changing that timeout would be ugly in tests. I think this leaves us with exposing timeout as a parameter to be passed to doWriteWithLock(...)/doReadWithLock(...) and have that set in instances of FilesystemPackageCacheManager. I don't like FPCM knowing any implementation details of the locking, but maybe this is unavoidable.

Getting this functionality is crucial, though, because the next item needs it for testing purposes.

Orphaned .lock files

The .lock files themselves are already causing some people headaches, too. People cancel long running builds, or they fail at some point, and then a .lock file is left in their cache that they're forced to manually delete (and they should be clearing the installed package too, because it could be incomplete). node.js projects had a bunch of similar problems early on.

A fix for this is performing 'cleanup' for locked package directories. If a .lock file exists, but is not locked by another OS process, delete both the .lock file and the installed package.

When should this be done, though? On initialization of the FPCM is one place. For validator CLI and ig-publisher, that's fine, and I'm working on it right now. But for long-running instances, that wouldn't help; any failures after startup would be persistent for that package. We could attempt to clean and re-install whenever an orphaned .lock file is encountered, but that's tricky because it switches between reading/writing. It's probably the best way to go, though.

@dotasek
Copy link
Collaborator

dotasek commented Sep 13, 2024

Regarding the above: #1745 is where I'm at now.

@dehall can you look at how the lock timeout is set there? I don't think it's exposed enough yet, so any comments would be helpful. I was thinking FhirSettings would be the most obvious place.

@dotasek
Copy link
Collaborator

dotasek commented Sep 19, 2024

I believe my branch is at a good review point now: #1745

It fixes the primary issue we were having with undeleted .lock files, so it would likely benefit users to have it merged and released.

@dehall I could still use feedback on how the lock timeout is set here, though something like FhirSettings could be added later.

@dehall
Copy link
Contributor Author

dehall commented Sep 20, 2024

Apologies @dotasek you've caught me on travel again. Meant to comment last Friday but too many other things were happening. I don't have a computer to run any of this on but I'm looking at the code.

At a glance I agree, I don't think the lock timeout is exposed enough yet for it to be used in the validator-wrapper where, if I recall correctly, the main entry point back into the core validator is the ValidationService. I don't remember seeing FhirSettings before but it looks like a reasonable spot to me to avoid passing settings though a dozen methods.

One question I'm not sure we've discussed is, will this timeout be set per running instance of the service, or should it be possible to set it per request, such as in a validation request cli context? I don't think we need it per-request so using the FhirSettings for a global setting sounds fine to me. (Unless for some reason that won't work with the validator-wrapper, but I can't see why it wouldn't)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants