Fix deadlock in P2P restarts #8091
Conversation
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
20 files +1   20 suites +1   11h 22m 11s ⏱️ +18m 44s
For more details on these failures, see this check.
Results for commit 65fdcc4. ± Comparison against base commit 9469b91.
This pull request removes 2 and adds 1 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
return {}, {}, {}
recommendations: Recs = {}
self._propagate_released(ts, recommendations)
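For context, a hedged sketch of the method these lines appear to live in (the method name is assumed from the test discussed below, and the class and bodies are simplified stand-ins, not distributed's actual code): scheduler transition methods return a (recommendations, client_msgs, worker_msgs) triple, and the fix builds recommendations instead of returning empty dicts.

Recs = dict[str, str]  # simplified stand-in: task key -> recommended next state

class SchedulerSketch:
    # illustrative skeleton only; not distributed's Scheduler
    def __init__(self) -> None:
        self.tasks: dict[str, object] = {}

    def _propagate_released(self, ts: object, recommendations: Recs) -> None:
        ...  # fill recommendations with follow-up transitions for ts

    def transition_no_worker_released(self, key: str) -> tuple[Recs, dict, dict]:
        ts = self.tasks[key]
        recommendations: Recs = {}
        # instead of `return {}, {}, {}`, propagate the release so the
        # transition engine can recommend the follow-ups
        self._propagate_released(ts, recommendations)
        return recommendations, {}, {}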
Writing a dedicated test for this is harder than anticipated. I think we might catch most (all?) non-P2P edge cases in other places. FWIW, removing this from transition_queued_released only fails P2P as well.
Well, this kind of test is triggering this transition:
import asyncio

from distributed.utils_test import gen_cluster, inc


@gen_cluster(nthreads=[("", 1)] * 2, client=True)
async def test_no_worker_released(c, s, a, b):
    f1 = c.submit(inc, 1, workers=[a.address], allow_other_workers=True, key="f1")
    # f2 requires a resource neither worker offers, so it parks in no-worker
    f2 = c.submit(inc, f1, resources={"C": 1}, key="f2")
    while f2.key not in s.tasks or s.tasks[f2.key].state != "no-worker":
        await asyncio.sleep(0.01)
    assert f1.key in a.data and f1.key not in b.data
    await a.close()
    ...
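(Here, closing a drops the only replica of f1, which is presumably what then drives f2's no-worker->released transition discussed below; the resources={"C": 1} annotation just keeps f2 unschedulable in the meantime.)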
In this test, f2 is in no-worker and then transitioned to released. The reason this hasn't caused any issues so far is that, in this example, the transition is no-worker->released->waiting.
The scheduler extension, however, expects two transitions:
- whatever->released
- released->waiting->processing
i.e. the scheduler extension issues a transition to released and relies on the recommendation system to trigger the appropriate follow-ups. All other ordinary transitions instead go immediately to waiting (a toy model of this loop is sketched below).
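To make the follow-up mechanism concrete, here is a minimal, self-contained toy of a recommendation-driven transition loop; the states and follow-up rules are illustrative, not distributed's real engine. Transitioning to released emits a follow-up recommendation, and the loop keeps applying recommendations until none remain.

# Toy sketch of a recommendation-driven transition engine; illustrative only.
def follow_ups(key: str, finish: str) -> dict[str, str]:
    # a released task that is still desired is recommended onward
    if finish == "released":
        return {key: "waiting"}
    if finish == "waiting":
        return {key: "processing"}
    return {}

def run_transitions(states: dict[str, str], recommendations: dict[str, str]) -> None:
    # keep applying recommendations until the system reaches a fixed point
    while recommendations:
        key, finish = recommendations.popitem()
        states[key] = finish
        recommendations.update(follow_ups(key, finish))

states = {"f2": "no-worker"}
run_transitions(states, {"f2": "released"})
assert states["f2"] == "processing"  # released -> waiting -> processing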
So, another fix could be to make the shuffle extension smarter and more aware of what the intended target state is. However, I believe that defeats the purpose of the transition engine, and this fix is fine. (This ambiguity was/is also a common problem in the worker.)
> In this test, f2 is in no-worker and then transitioned to released. The reason this hasn't caused any issues so far is that, in this example, the transition is no-worker->released->waiting.
That's what I meant, sorry. It's hard to write a dedicated test that fails on main but isn't using P2P, for the reason you described above.
> So, another fix could be to make the shuffle extension smarter and more aware of what the intended target state is. However, I believe that defeats the purpose of the transition engine, and this fix is fine. (This ambiguity was/is also a common problem in the worker.)
+1, I think it's better to have the reconciliation logic within the state machine instead of forcing all users of the state machine to handle those edge cases.
- Closes #8088
- pre-commit run --all-files