Fixed race condition bug in tables #292
Conversation
Here is another simpler way to achieve the same result but with a coarser lock:

```diff
index c342909e..4329a425 100644
--- a/src/omero/tables.py
+++ b/src/omero/tables.py
@@ -531,11 +531,15 @@ class TablesI(omero.grid.Tables, omero.util.Servant):
         if not p.exists():
             p.makedirs()
 
-        storage = self._storage_factory.getOrCreate(file_path, self.read_only)
-        table = TableI(self.ctx, file_obj, factory, storage,
-                       uuid=Ice.generateUUID(),
-                       call_context=current.ctx,
-                       adapter=current.adapter)
+        self._storage_factory._lock.acquire()
+        try:
+            storage = self._storage_factory.getOrCreate(file_path, self.read_only)
+            table = TableI(self.ctx, file_obj, factory, storage,
+                           uuid=Ice.generateUUID(),
+                           call_context=current.ctx,
+                           adapter=current.adapter)
+        finally:
+            self._storage_factory._lock.release()
         self.resources.add(table)
         prx = current.adapter.add(table, table.id)
         return self._table_cast(prx)
```
@joshmoore It might be worth having a conversation about this at some point. Both solutions proposed here are less than elegant, but it might take more serious refactoring to have something that works and feels good. Not sure if we want to open that can of worms here and now.
@kkoz this seems to have caused a regression in the OMERO integration test and the […]. I have amended the description to append an […].
Hi @kkoz. Apologies, was in a coma. Happy to chat at your convenience. Looking at https://stackoverflow.com/questions/17682484/is-collections-defaultdict-thread-safe and https://github.com/ome/omero-py/blob/master/src/omero/hdfstorageV2.py#L159-L164, I wonder if a […]
Update: #292 (comment) was actually an incorrect judgement from my side. The failures seem to have been caused by ome/omero-web#301 instead. Re-including […]
As I understand it, I don't think that would be sufficient @joshmoore. The race is between the return of `getOrCreate` and the call to `incr`.
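To make that window concrete, here is a minimal standalone sketch of the pattern under discussion (toy names, not the actual omero-py classes): the factory's lookup is locked, but callers do the refcounting after the lock has been released.

```python
import threading

class Storage:
    """Toy stand-in for HdfStorage: refcounted, unusable after cleanup()."""
    def __init__(self):
        self.refs = 0
        self.closed = False

    def incr(self):
        self.refs += 1

    def decr(self):
        self.refs -= 1
        if self.refs == 0:
            self.cleanup()

    def cleanup(self):
        self.closed = True

class Factory:
    """Toy stand-in for HdfList: get_or_create is locked, incr is not."""
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def get_or_create(self, key):
        with self._lock:  # lock released before the caller calls incr()
            return self._cache.setdefault(key, Storage())

# The interleaving described above, written out sequentially:
factory = Factory()
s1 = factory.get_or_create("table-A")
s1.incr()                              # thread 1 registers itself
s2 = factory.get_or_create("table-A")  # thread 2 gets the same storage...
s1.decr()                              # ...but thread 1 finishes first:
                                       # refs -> 0, cleanup() runs
s2.incr()                              # thread 2 registers a storage that is
assert s2.closed                       # already dead -- the failure mode above
```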
The more I think about this, the more I think I prefer the explicit locking. Based on our discussion today, and following on from @joshmoore's thoughts on a Java example, if we were doing this in that language the relevant contents of […]

(The […].) Now obviously this is Python not Java, but effectively that's the Java version of the coarse lock example from earlier in the PR comments. Do we have a sense here about what's most "Pythonic"? Maybe it was a mistake from the beginning to be calling […]
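As a point of comparison for the "Pythonic" question: the coarse-lock diff from earlier in the thread could be written with a `with` block, since `threading.Lock` supports the context-manager protocol. A sketch against the same names used in that diff:

```python
# Sketch: the coarse-lock example rewritten with a "with" block.
# The lock is released automatically even if getOrCreate() or the
# TableI constructor raises, replacing the explicit try/finally.
with self._storage_factory._lock:
    storage = self._storage_factory.getOrCreate(file_path, self.read_only)
    table = TableI(self.ctx, file_obj, factory, storage,
                   uuid=Ice.generateUUID(),
                   call_context=current.ctx,
                   adapter=current.adapter)
```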
First, I made the changes I did mostly to simplify some of the unit testing (otherwise I needed to do mock locking). […]

Then the storage would be "automatically" cleaned up/closed like any other file handle. But for various reasons, we can't do it this way.
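For illustration only, the "like any other file handle" idea would amount to something like this hypothetical context-manager wrapper (names invented here; as the comment says, this isn't workable in practice):

```python
import contextlib

@contextlib.contextmanager
def open_storage(storage_factory, file_path, read_only):
    # Hypothetical helper pairing acquisition with release, the way
    # open() pairs a file handle with close(). Assumes getOrCreate()
    # has already incremented the refcount (as it does after this PR).
    storage = storage_factory.getOrCreate(file_path, read_only)
    try:
        yield storage
    finally:
        storage.decr()  # refcount drops (and cleanup may run) on exit

# Usage sketch:
# with open_storage(self._storage_factory, file_path, self.read_only) as storage:
#     ...  # the storage cannot be cleaned up out from under this block
```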
Agreed, @kkoz. I think what's here now is functional and certainly solves the initial problem. There is certainly elegance and clarity to consider here, but given you are already looking at a large refactoring to support pure read-only access, I think what's here is sufficient and we can try to tackle some of these issues in a follow-up PR. Good to merge from my perspective. The race condition bug here is actively causing issues with the testing of ome/omero-web#300.
If anyone comes out of the woodwork worried about the API changes, we'll work around it. Thanks all!
There is a bug in omero tables where multiple simultaneous queries on the same table can result in failures. The crux of the issue is that there is a gap in the locking between the acquisition of the `HdfStorage` in `getOrCreate` and the call to `incr`. E.g. if thread 1 makes a query against a table A, the lock is acquired, an `HdfStorage` is created and added to the `HdfList` (in `getOrCreate`), and the lock is released. Then later, when the `TableI` is created, the lock is acquired, `incr` is called (adding the table to the list of tables maintained by the `HdfStorage`), and the lock is released. Then the query is executed (a locked operation) and `decr` is called (locked). If thread 2 makes a query just after thread 1, though, it is possible that in between thread 2's call to `getOrCreate` and its call to `incr`, thread 1 will call `decr`, causing the `HdfStorage` to call `cleanup` on itself and making it unusable. This PR makes changes such that `incr` is called inside `getOrCreate`, so there is no opportunity for `decr` and `cleanup` to be called in between storage acquisition and the call to `incr`.
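In miniature, the shape of the fix is that the refcount increment moves inside the locked lookup (a toy sketch, not the actual omero-py code):

```python
import threading

class Storage:
    """Toy refcounted storage (as in the sketch earlier in the thread)."""
    def __init__(self):
        self.refs = 0
        self.closed = False

    def incr(self):
        self.refs += 1

    def decr(self):
        self.refs -= 1
        if self.refs == 0:
            self.closed = True  # stands in for cleanup()

class HdfListSketch:
    """Toy model of the fix: incr() now runs under the same lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._storages = {}

    def get_or_create(self, key):
        with self._lock:
            storage = self._storages.setdefault(key, Storage())
            storage.incr()  # registered before the lock is released, so a
                            # concurrent decr() can never drop refs to zero
                            # between acquisition and registration
            return storage
```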
To test this bug, I used siege (https://www.joedog.org/siege-home/) and ran the following on a server with omero and omero web running:
siege "http://localhost/webgateway/table/<table id>/query/?query=<query>&col_names=<col_name_1>&col_names=<col_name_2>" -H "Cookie:sessionid=<session_id>" -c20 -r2
which will run 20 concurrent requests twice against the table query URL. Usually at least one of these will fail, with errors appearing in `OMEROweb.log`.