Add timeout parameter to wait(::Condition)
#56974
Conversation
# Confirm that the waiting task is still in the wait queue and remove it. If
# the task is not in the wait queue, it must have been notified already so we
# don't do anything here.
This appears to introduce a data race though, so we cannot merge this
How's that? We're locking the condition variable here.
This `Timer` runs concurrently with the return from `wait`, so by the time this code runs, you might have just corrupted some arbitrary subsequent `wait` on the same condition; or, by the time you schedule the `TimeoutError`, it could blow up some completely unrelated `wait`.
Ah, okay. There's an ABA problem. Let me see if I can find a solution for that.

But the waiting task is only scheduled with a `TimeoutError` if it was in this condition's wait queue, so I'm not sure I understand your "or" case here -- the only subsequent `wait` that could get blown up is a `wait` on the same condition, which is the same ABA problem?
It could have been in the waitq, then removed before you got around to scheduling it, or vice versa, with some other thread scheduling it before it got around to removing it from the queue. That code runs on other threads, so it could be concurrent. There is potentially no guarantee that you can safely mutate this data structure concurrently on two threads (#55542).
Pushed a fix for the ABA problem that relies on happens-before: if the waiter was scheduled, it sets `waiter_left` before returning. It can only re-enter the condition's wait queue through another call to `wait`, for which it must acquire the lock.

We acquire the condition's lock before checking `waiter_left` and checking for the task's presence in the wait queue. If the task is present, it can only be because it has not been scheduled: if it had been scheduled, it would have set `waiter_left` before re-entering the wait queue.

I think the combination of the lock and the atomic ensures there is no ABA problem.
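The locking discipline described above can be sketched in Python (the PR itself is Julia; `MiniCondition`, `TimedWaiter`, and `fire_timeout` here are invented names for illustration, not the PR's actual code). The key invariant: the timeout path may schedule the waiter only after confirming, under the condition's lock, that the waiter is still queued and has not left.

```python
# Hypothetical sketch of the waiter_left pattern described above.
import threading

class TimedWaiter:
    """Models one task waiting on a condition-like wait queue."""
    def __init__(self):
        self.waiter_left = False  # set (under the lock) when the waiter exits wait()
        self.timed_out = False

class MiniCondition:
    def __init__(self):
        self.lock = threading.Lock()
        self.waitq = []           # queue of TimedWaiter objects

    def notify_one(self):
        # notify removes the waiter from the queue *before* scheduling it,
        # all while holding the lock -- so the timeout path below can never
        # resurrect a waiter that was already notified.
        with self.lock:
            if self.waitq:
                w = self.waitq.pop(0)
                w.waiter_left = True
                return w
        return None

    def fire_timeout(self, w):
        # The timer callback: deliver the timeout only if the waiter has not
        # left and is still queued. Both checks happen under the lock.
        with self.lock:
            if not w.waiter_left and w in self.waitq:
                self.waitq.remove(w)
                w.timed_out = True
                return True   # timeout won the race
        return False          # already notified (or already timed out)

cond = MiniCondition()

# Case 1: notify wins; the later timer callback does nothing.
w = TimedWaiter()
cond.waitq.append(w)
cond.notify_one()
assert cond.fire_timeout(w) is False

# Case 2: nothing notifies; the timer delivers the timeout exactly once.
w2 = TimedWaiter()
cond.waitq.append(w2)
assert cond.fire_timeout(w2) is True
assert cond.fire_timeout(w2) is False
```

Because both paths pop the waiter under the same lock, at most one of them ever schedules it.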
> It could have been in the waitq, then removed before you got around to scheduling it

We acquire the lock, confirm that the waiter did not leave, and remove it from the wait queue before scheduling it. If it was not in the wait queue, we do not schedule it, and this decision is made while holding the lock.
> some other thread scheduling before it got around to removing it from the queue

If the task is scheduled by `notify`, then it is removed from the condition's wait queue before it is scheduled, which is done while holding the condition's lock. If it is not in the wait queue, then we do not schedule it.
How difficult would it be to re-use this implementation for waiting on other objects like …

Both …
(force-pushed 53a5d21 to 0d2c642)
A timed-out … Adjusted the tests as well.
end
unlock(c.lock)
# send the waiting task a timeout
dosched && schedule(ct, :timed_out)
What would you think about throwing an instance of a custom struct here instead?

struct TimeOutEvent end

and returning

dosched && schedule(ct, TimeOutEvent())

With the current code, I worry about someone having a typo and missing the timeout for that reason :(

if wait(cond) === :time_out # oops, this will never be reached
    # handle timeout
end
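The typo hazard raised here generalizes beyond Julia. A rough Python analogue (names invented for this sketch; strings stand in for Julia's Symbols): comparing against a misspelled string silently never matches, while a misspelled sentinel type fails loudly at name-lookup time.

```python
# Illustration of symbol/string return values vs. a dedicated sentinel type.

class TimeOutEvent:
    """Sentinel returned on timeout; a typo in its name is a NameError."""
    def __eq__(self, other):
        return isinstance(other, TimeOutEvent)

def wait_with_symbol():
    return "timed_out"        # stands in for the Symbol :timed_out

def wait_with_sentinel():
    return TimeOutEvent()

# With a string/symbol, a typo silently never matches:
assert (wait_with_symbol() == "time_out") is False  # oops -- typo goes unnoticed

# With a sentinel type, the same typo fails loudly:
caught = False
try:
    wait_with_sentinel() == TimeOutEvnt()           # NameError: misspelled type
except NameError:
    caught = True
assert caught
```

This is the trade-off under discussion: the symbol keeps the interface consistent with `Base.timedwait`, while the struct makes the typo a hard error.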
Oh, wow, I just read your commit comment:

> Return :timed_out instead of :timeout like Base.timedwait

.... :/ I don't love that decision for Base..... But I also don't love the idea of doing something different than timedwait....

I think my preference is still to use a custom struct here, but I feel a bit less confident about it now.
I don't love the special symbol either, but I didn't want to have two different timed waits in `Base` with different interfaces.
Would be good to get some more reviews on this PR.
This LGTM! Thanks for addressing the test flakiness.
I like the return API better than throwing an exception 👍
LGTM except for my suggestion to change to a type instead of a symbol for the return value
Would be good to get some additional reviews. Attn: @vchuravy, @JeffBezanson, @gbaraldi, @topolarity.
dosched && schedule(ct, :timed_out)
end
t.sticky = false
Threads._spawn_set_thrpool(t, :interactive)
What happens if we have no threads in the interactive threadpool?
`Threads._spawn_set_thrpool` will set the threadpool to `:default` if there are no interactive threads.
# Confirm that the waiting task is still in the wait queue and remove it. If
# the task is not in the wait queue, it must have been notified already so we
# don't do anything here.
if !waiter_left[] && ct.queue == c.waitq
Might be useful to note the orderings here in a little diagram?
Added a comment describing the typical flows and some possible interleavings.
(force-pushed ee1a4cb to 11ef85a)
LGTM
I have to admit, I'm not keen that this kind of API was merged. I would have much preferred to a) see a generic cancellation framework designed that handles timeouts as a special case of cancelling operations, and b) an implementation in a package to test things out until a good design is achieved. Is the plan now to add the … Would it not be possible to add an implementation for …
I agree that a more general cancellation mechanism would be better. However, there are many significant challenges to implementing such a thing, especially around managing multi-threaded interactions. Given the time and resources available to our team, having this is better than having nothing.
We have a need for this capability. I believe this closes JuliaLang#36217. The implementation is straightforward and there are a couple of tests.
It may be worth knowing that all of our blocking event-based objects (Channels, IO objects, etc.) already support cancellation (that is how this PR is implemented "under the hood"), while there is only sporadic and unreliable support for timeouts.
Should we unwind this for 1.12, and give it some more time to bake?
I don't see any ability to cancel a blocking …
We can run with this patch in our build, so this can be unwound if there's a better solution planned. Is there such a plan, however?
All cancellation is implemented with …
This approach pushes the complexity of managing synchronization issues to user code. I need a separate task to …
AFAICT from a quick scan of the literature, cancellation tokens are still being researched. There is a recent (2022) survey of problems with cancellations and timeouts; it seems clear that there is no ideal solution at this time.
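For context on the "timeouts as a special case of cancellation" framing discussed above, here is a minimal cancellation-token toy in Python. It is an assumption-laden sketch (the `CancellationToken` API is invented here), not a proposal for Base: a timeout is modelled as a cancel scheduled for later, and the worker checks the token cooperatively between steps.

```python
# Toy cancellation token: timeouts are just deferred cancellation.
import threading
import time

class CancellationToken:
    def __init__(self):
        self._cancelled = threading.Event()

    def cancel(self):
        self._cancelled.set()

    def cancel_after(self, seconds):
        # A timeout is a cancel scheduled for later -- the "special case"
        # of cancellation mentioned in the discussion above.
        threading.Timer(seconds, self.cancel).start()

    def run(self, work, poll=0.005):
        # Run `work` (a generator yielding between steps) to completion,
        # or stop between steps once the token is cancelled (cooperative).
        for _ in work:
            if self._cancelled.is_set():
                return "cancelled"
            time.sleep(poll)
        return "done"

def steps(n):
    for i in range(n):
        yield i

tok = CancellationToken()
assert tok.run(steps(3)) == "done"       # never cancelled: runs to completion

tok2 = CancellationToken()
tok2.cancel()                            # cancelled up front
assert tok2.run(steps(1000)) == "cancelled"
```

The survey's point stands, though: cooperative checking like this pushes responsibility onto the worker, which is exactly the complexity-in-user-code problem raised in this thread.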
See #57148.