Fixed race condition bug in tables #292
Conversation
Here is another simpler way to achieve the same result but with a coarser lock:

```diff
index c342909e..4329a425 100644
--- a/src/omero/tables.py
+++ b/src/omero/tables.py
@@ -531,11 +531,15 @@ class TablesI(omero.grid.Tables, omero.util.Servant):
         if not p.exists():
             p.makedirs()
 
-        storage = self._storage_factory.getOrCreate(file_path, self.read_only)
-        table = TableI(self.ctx, file_obj, factory, storage,
-                       uuid=Ice.generateUUID(),
-                       call_context=current.ctx,
-                       adapter=current.adapter)
+        self._storage_factory._lock.acquire()
+        try:
+            storage = self._storage_factory.getOrCreate(file_path, self.read_only)
+            table = TableI(self.ctx, file_obj, factory, storage,
+                           uuid=Ice.generateUUID(),
+                           call_context=current.ctx,
+                           adapter=current.adapter)
+        finally:
+            self._storage_factory._lock.release()
         self.resources.add(table)
         prx = current.adapter.add(table, table.id)
         return self._table_cast(prx)
```
@joshmoore It might be worth having a conversation about this at some point. Both solutions proposed here are less than elegant, but it might take more serious refactoring to have something that works and feels good. Not sure if we want to open that can of worms here and now.
@kkoz this seems to have caused a regression in the OMERO integration test and the […]. I have amended the description to append an […].
Hi @kkoz. Apologies, was in a coma. Happy to chat at your convenience. Looking at https://stackoverflow.com/questions/17682484/is-collections-defaultdict-thread-safe and https://github.com/ome/omero-py/blob/master/src/omero/hdfstorageV2.py#L159-L164, I wonder if a […]
Update: #292 (comment) was actually an incorrect judgement from my side. The failures seem to have been caused by ome/omero-web#301 instead. Re-including […]
As I understand it, I don't think that would be sufficient @joshmoore. The race is between the return of `getOrCreate` and the call to `incr`.
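To make that window concrete, here is a minimal standalone sketch of the pattern under discussion (toy names, not the actual omero-py classes): the factory's lookup is locked, but callers do the refcounting after the lock has been released.

```python
import threading

class Storage:
    """Toy stand-in for HdfStorage: refcounted, unusable after cleanup()."""
    def __init__(self):
        self.refs = 0
        self.closed = False

    def incr(self):
        self.refs += 1

    def decr(self):
        self.refs -= 1
        if self.refs == 0:
            self.cleanup()

    def cleanup(self):
        self.closed = True

class Factory:
    """Toy stand-in for HdfList: get_or_create is locked, incr is not."""
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def get_or_create(self, key):
        with self._lock:  # lock released before the caller calls incr()
            return self._cache.setdefault(key, Storage())

# The interleaving described above, written out sequentially:
factory = Factory()
s1 = factory.get_or_create("table-A")
s1.incr()                              # thread 1 registers itself
s2 = factory.get_or_create("table-A")  # thread 2 gets the same storage...
s1.decr()                              # ...but thread 1 finishes first:
                                       # refs -> 0, cleanup() runs
s2.incr()                              # thread 2 registers a storage that is
assert s2.closed                       # already dead -- the failure mode above
```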
The more I think about this, the more I think I prefer the explicit locking. Based on our discussion today, and following on from @joshmoore's thoughts on a Java example, if we were doing this in that language the relevant contents of […]

(The […].) Now obviously this is Python not Java, but effectively that's the Java version of the coarse lock example from earlier in the PR comments. Do we have a sense here about what's most "Pythonic"? Maybe it was a mistake from the beginning to be calling […]
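As a point of comparison for the "Pythonic" question: the coarse-lock diff from earlier in the thread could be written with a `with` block, since `threading.Lock` supports the context-manager protocol. A sketch against the same names used in that diff:

```python
# Sketch: the coarse-lock example rewritten with a "with" block.
# The lock is released automatically even if getOrCreate() or the
# TableI constructor raises, replacing the explicit try/finally.
with self._storage_factory._lock:
    storage = self._storage_factory.getOrCreate(file_path, self.read_only)
    table = TableI(self.ctx, file_obj, factory, storage,
                   uuid=Ice.generateUUID(),
                   call_context=current.ctx,
                   adapter=current.adapter)
```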
First, I made the changes I did mostly to simplify some of the unit testing (otherwise I needed to do mock locking). […]

Then the storage would be "automatically" cleaned up/closed like any other file handle. But for various reasons, we can't do it this way.
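For illustration only, the "like any other file handle" idea would amount to something like this hypothetical context-manager wrapper (names invented here; as the comment says, this isn't workable in practice):

```python
import contextlib

@contextlib.contextmanager
def open_storage(storage_factory, file_path, read_only):
    # Hypothetical helper pairing acquisition with release, the way
    # open() pairs a file handle with close(). Assumes getOrCreate()
    # has already incremented the refcount (as it does after this PR).
    storage = storage_factory.getOrCreate(file_path, read_only)
    try:
        yield storage
    finally:
        storage.decr()  # refcount drops (and cleanup may run) on exit

# Usage sketch:
# with open_storage(self._storage_factory, file_path, self.read_only) as storage:
#     ...  # the storage cannot be cleaned up out from under this block
```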
Agreed, @kkoz. I think what's here now is functional and certainly solves the initial problem. There is certainly elegance and clarity to consider here, but given you are already looking at a large refactoring to support pure read-only access, I think what's here is sufficient and we can try to tackle some of these issues in a follow-up PR. Good to merge from my perspective. The race condition bug here is actively causing issues with the testing of ome/omero-web#300.
If anyone comes out of the woodwork worried about the API changes, we'll work around it. Thanks all!
There is a bug in omero tables where multiple simultaneous queries on the same table can result in failures. The crux of the issue is that there is a gap in the locking between the acquisition of the `HdfStorage` in `getOrCreate` and the call to `incr`. E.g. if thread 1 makes a query against a table A, the lock is acquired, an `HdfStorage` is created and added to the `HdfList` (in `getOrCreate`), and the lock is released. Then later, when the `TableI` is created, the lock is acquired, `incr` is called (adding the table to the list of tables maintained by the `HdfStorage`), and the lock is released. Then the query is executed (a locked operation) and `decr` is called (locked). If thread 2 makes a query just after thread 1, though, it is possible that in between thread 2's call to `getOrCreate` and its call to `incr`, thread 1 will call `decr`, causing the `HdfStorage` to call `cleanup` on itself and making it unusable. This PR makes changes such that `incr` is called inside `getOrCreate`, so there is no opportunity for `decr` and `cleanup` to be called in between storage acquisition and the call to `incr`.
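In miniature, the shape of the fix is that the refcount increment moves inside the locked lookup (a toy sketch, not the actual omero-py code):

```python
import threading

class Storage:
    """Toy refcounted storage (as in the sketch earlier in the thread)."""
    def __init__(self):
        self.refs = 0
        self.closed = False

    def incr(self):
        self.refs += 1

    def decr(self):
        self.refs -= 1
        if self.refs == 0:
            self.closed = True  # stands in for cleanup()

class HdfListSketch:
    """Toy model of the fix: incr() now runs under the same lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._storages = {}

    def get_or_create(self, key):
        with self._lock:
            storage = self._storages.setdefault(key, Storage())
            storage.incr()  # registered before the lock is released, so a
                            # concurrent decr() can never drop refs to zero
                            # between acquisition and registration
            return storage
```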
To test this bug, I used siege (https://www.joedog.org/siege-home/) and ran the following on a server with omero and omero web running:
siege "http://localhost/webgateway/table/<table id>/query/?query=<query>&col_names=<col_name_1>&col_names=<col_name_2>" -H "Cookie:sessionid=<session_id>" -c20 -r2
which will run 20 concurrent requests twice against the table query URL. Usually at least one of these will fail, with errors appearing in `OMEROweb.log`.