Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed race condition bug in tables #292

Merged
merged 5 commits into from
Jul 9, 2021
Merged

Conversation

kkoz
Copy link
Contributor

@kkoz kkoz commented Jun 24, 2021

There is a bug in omero tables where multiple simultaneous queries on the same table can result in failures. The crux of the issue is that there is a gap in the locking between the acquisition of the HdfStorage in and the call to incr. E.g. if thread 1 makes a query against a table A, the locks is acquired, an HdfStorage is created and added to the HdfList ( in getOrCreate), and the lock is released. Then later when the TableI is created, the lock is acquired, incr is called (adding the table to the list of tables maintained by the HdfStorage), and the lock is released. Then the query is executed (a locked operation) and decr is called (locked). If thread 2 makes a query just after thread 1 though, it is possible that in between thread 2's call to getOrCreate and its call to incr, thread 1 will call decr, causing the HdfStorage to call cleanup on itself and making it unusable. This PR makes changes such that incr is called inside getOrCreate so there is no opportunity for decr and cleanup to be called in between storage acquisition and the call to incr.
To test this bug, I used siege (https://www.joedog.org/siege-home/) and ran the following on a server with omero and omero web running: siege "http://localhost/webgateway/table/<table id>/query/?query=<query>&col_names=<col_name_1>&col_names=<col_name_2>" -H "Cookie:sessionid=<session_id>" -c20 -r2
which will run 20 concurrent requests twice against the table query url. Usually at least 1 of these will fail with errors appearing in OMEROweb.log.

@kkoz kkoz force-pushed the table-lock-bug branch from 0887071 to cc5cf10 Compare June 24, 2021 17:53
@kkoz
Copy link
Contributor Author

kkoz commented Jun 25, 2021

Here is another simpler way to achieve the same result but with a coarser lock:

index c342909e..4329a425 100644
--- a/src/omero/tables.py
+++ b/src/omero/tables.py
@@ -531,11 +531,15 @@ class TablesI(omero.grid.Tables, omero.util.Servant):
         if not p.exists():
             p.makedirs()
 
-        storage = self._storage_factory.getOrCreate(file_path, self.read_only)
-        table = TableI(self.ctx, file_obj, factory, storage,
-                       uuid=Ice.generateUUID(),
-                       call_context=current.ctx,
-                       adapter=current.adapter)
+        self._storage_factory._lock.acquire()
+        try:
+            storage = self._storage_factory.getOrCreate(file_path, self.read_only)
+            table = TableI(self.ctx, file_obj, factory, storage,
+                           uuid=Ice.generateUUID(),
+                           call_context=current.ctx,
+                           adapter=current.adapter)
+        finally:
+            self._storage_factory._lock.release()
         self.resources.add(table)
         prx = current.adapter.add(table, table.id)
         return self._table_cast(prx)```

@chris-allan chris-allan requested a review from joshmoore June 25, 2021 17:13
@kkoz
Copy link
Contributor Author

kkoz commented Jun 25, 2021

@joshmoore It might be worth having a conversation about this at some point. Both solutions proposed here are less than elegant but it might take more serious refactoring to have something that works and feels good. Not sure if we want to open that can of worms here and now.

@sbesson
Copy link
Member

sbesson commented Jun 28, 2021

@kkoz this seems to have caused a regression in the OMERO integration test and the webgateway/table/<fileid>/metadata endpoint - see https://merge-ci.openmicroscopy.org/jenkins/job/OMERO-test-integration/lastCompletedBuild/testReport/OmeroWeb.test.integration.test_table/TestOmeroTables/test_table_metadata/.

I have amended the description to append an --exclude comment and get the CI builds back to green. Feel free to remove it as soon as you have a fix pushed.

@joshmoore
Copy link
Member

Hi @kkoz. Apologies, was in a coma. Happy to chat at your convenience. Looking at https://stackoverflow.com/questions/17682484/is-collections-defaultdict-thread-safe and https://github.com/ome/omero-py/blob/master/src/omero/hdfstorageV2.py#L159-L164, I wonder if a defaultdict(HDFStorage) with an initial hdfStorage._init(hdfpath, self._lock, read_only=read_only) that's idempotent wouldn't be an option.

@sbesson
Copy link
Member

sbesson commented Jun 28, 2021

Update: #292 (comment) was actually an incorrect judgement from my side. The failures seem to have been caused by ome/omero-web#301 instead. Re-including

@chris-allan
Copy link
Member

As I understand it, I don't think that would be sufficient @joshmoore.

The race is between the return of getOrCreate() (which is globally locked across all threads via HDFLIST._lock and the call to incr() (also globally locked via the aforementioned lock). As HdfStorage (__tables list) and TablesI (storage instance variable) maintain cyclical references to each other containment of all references to HdfStorage is currently impossible. Consequently, incr() and decr() are not 100% thread safe currently. There might be other concurrency issues with the HdfList.__paths collection but I don't think that's the fundamental issue here.

@chris-allan
Copy link
Member

chris-allan commented Jun 29, 2021

The more I think about this the more I think I prefer the explicit locking. Based on our discussion today, and following on from @joshmoore's thoughts on a Java example, if we were doing this in that language the relevant contents of TablesI#getTable() would be something like:

...
synchronized(this.storageFactory) {
    this.storage = this.storageFactory.getOrCreate(filePath, readOnly)
    table = TableI(...)
    this.storage.incr(table)
}
...

(The store.incr() would of course not be present on the TableI constructor.)

Now obviously this is Python not Java but effectively that's the Java version of the coarse lock example from earlier in the PR comments. Do we have a sense here about what's most "Pythonic"? Maybe it was a mistake from the beginning to be calling incr() in the constructor?

@kkoz
Copy link
Contributor Author

kkoz commented Jun 30, 2021

First, I made the changes I did mostly to simplify some of the unit testing (otherwise I needed to do mock locking).
Second, I think that encapsulation suggests that no, the TablesI constructor should not call incr. All it cares about is getting the data it needs out of the storage - whether that uses 1 shared file handle, 1 file handle per table, or a new file handle per operation should be implementation details of the HdfStorage. I feel like If we were writing a Java or C++ interface for HdfStorage, we would not expose incr.
It seems possible to me that we will never achieve something Pythonic because Python is largely expecting single-threaded workflows. In general file reading in Python is done using with, so we'd do something like

class TableI:
def getWhereList:
   with HdfStorage(file_path,...) as storage:
       logic

Then the storage would be "automatically" cleaned up/closed like any other file handle. But for various reasons, we can't do it this way.
Still thinking about how I'd refactor this.

@chris-allan
Copy link
Member

Agreed, @kkoz. I think what's here now is functional and certainly solves the initial problem. There is certainly elegance and clarity to consider here but given you are already looking at a large refactoring to support pure read-only access I think what's here is sufficient and we can try to tackle some of these issues in a follow up PR.

Good to merge from my perspective.

The race condition bug here is actively causing issues with the testing of ome/omero-web#300.

@joshmoore
Copy link
Member

If anyone comes out of the wood works worried about the API changes, we'll workaround. Thanks all!

@joshmoore
Copy link
Member

I assume this can go into a quick release, though rolling in #287 would likely make the most marketing sense.

cc: @sbesson

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants