Fix deadlock in P2P restarts #8091

hendrikmakait · 2023-08-09T17:37:34Z

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2023-08-09T18:38:48Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

20 files +1, 20 suites +1, 11h 22m 11s ⏱️ +18m 44s
3 764 tests -1: 3 656 ✔️ +7, 106 💤 -2, 2 ❌ -6
36 412 runs +3 628: 34 658 ✔️ +3 534, 1 751 💤 +101, 3 ❌ -7

For more details on these failures, see this check.

Results for commit 65fdcc4. ± Comparison against base commit 9469b91.

This pull request removes 2 and adds 1 tests. Note that renamed tests count towards both.

distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---nanny]
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---no-nanny]

distributed.shuffle.tests.test_shuffle ‑ test_restarting_does_not_deadlock

♻️ This comment has been updated with latest results.

hendrikmakait · 2023-08-10T09:29:16Z

distributed/scheduler.py


-        return {}, {}, {}
+        recommendations: Recs = {}
+        self._propagate_released(ts, recommendations)


Writing a dedicated test for this is harder than anticipated. I think we might catch most (all?) non-P2P edge cases in other places. FWIW, removing this from transition_queued_released also only fails P2P tests.

Well, this kind of test is triggering this transition

    @gen_cluster(nthreads=[("", 1)] * 2, client=True)
    async def test_no_worker_released(c, s, a, b):
        f1 = c.submit(inc, 1, workers=[a.address], allow_other_workers=True, key='f1')
        f2 = c.submit(inc, f1, resources={"C": 1}, key='f2')
        while not f2.key in s.tasks or s.tasks[f2.key].state != "no-worker":
            await asyncio.sleep(0.01)
        assert f1.key in a.data and not f1.key in b.data
        await a.close()
        ...

In this test, f2 is in no-worker and is then transitioned to released. This hasn't caused any issues so far because, in this example, the transition is no-worker->released->waiting.

The scheduler extension, however, expects two transitions:

whatever->released

released->waiting->processing

i.e. the scheduler extension issues a transition to released and relies on the recommendation system to trigger the appropriate follow-ups. All other, ordinary transitions instead go to waiting immediately.

So, another fix could be to make the shuffle extension smarter and more aware of what the intended target state is. However, I believe that defeats the purpose of the transition engine, and this fix is fine. (This ambiguity was/is also a common problem in the worker.)

In this test, f2 is in no-worker and is then transitioned to released. This hasn't caused any issues so far because, in this example, the transition is no-worker->released->waiting.

That's what I meant, sorry. It's hard to write a dedicated test that fails on main but isn't using P2P for the reason you described above.

So, another fix could be to make the shuffle extension smarter and more aware of what the intended target state is. However, I believe that defeats the purpose of the transition engine, and this fix is fine. (This ambiguity was/is also a common problem in the worker.)

+1, I think it's better to have the reconciliation logic within the state machine instead of forcing all users of the state machine to handle those edge cases.
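To illustrate the point about keeping the reconciliation logic inside the state machine, here is a minimal sketch of a toy transition engine. It is not distributed's scheduler: `ToyEngine`, `propagate_on_release`, `wanted`, `history`, and `run` are hypothetical names made up for this example, and the toy `_propagate_released` only borrows its name from the diff and applies an assumed rule (recommend `waiting` if the task is still wanted). The caller plays the role of the extension: it requests only `released` and relies on the engine's recommendations for the follow-up.

```python
from typing import Callable, Dict, List, Set, Tuple

Recs = Dict[str, str]  # task key -> recommended next state


class ToyEngine:
    """Toy transition engine; a stand-in, not distributed's scheduler."""

    def __init__(self, propagate_on_release: bool) -> None:
        # Hypothetical switch: False mimics a handler that returns no
        # recommendations, True mimics one that propagates the release.
        self.propagate_on_release = propagate_on_release
        self.states: Dict[str, str] = {}
        self.wanted: Set[str] = set()  # keys that someone still wants
        self.history: Dict[str, List[str]] = {}
        self._table: Dict[Tuple[str, str], Callable[[str], Recs]] = {
            ("no-worker", "released"): self._no_worker_released,
            ("released", "waiting"): self._released_waiting,
        }

    def add_task(self, key: str, state: str, wanted: bool = True) -> None:
        self.states[key] = state
        self.history[key] = [state]
        if wanted:
            self.wanted.add(key)

    def _propagate_released(self, key: str, recommendations: Recs) -> None:
        # Assumed toy rule: a released task that is still wanted
        # should be scheduled again.
        if key in self.wanted:
            recommendations[key] = "waiting"

    def _no_worker_released(self, key: str) -> Recs:
        recommendations: Recs = {}
        if self.propagate_on_release:
            self._propagate_released(key, recommendations)
        return recommendations

    def _released_waiting(self, key: str) -> Recs:
        return {}

    def transitions(self, recommendations: Recs) -> None:
        # Apply recommendations until none are left.
        pending = dict(recommendations)
        while pending:
            key, finish = pending.popitem()
            start = self.states[key]
            if start == finish:
                continue
            follow_ups = self._table[(start, finish)](key)
            self.states[key] = finish
            self.history[key].append(finish)
            pending.update(follow_ups)


def run(propagate_on_release: bool) -> List[str]:
    engine = ToyEngine(propagate_on_release)
    engine.add_task("f2", "no-worker")
    # The caller asks only for "released", like the extension in this thread.
    engine.transitions({"f2": "released"})
    return engine.history["f2"]


if __name__ == "__main__":
    print(run(False))  # ['no-worker', 'released']
    print(run(True))   # ['no-worker', 'released', 'waiting']
```

With the handler returning no recommendations, the toy task stays in `released` unless the caller knows to request `waiting` itself. With propagation in the engine, the same request is carried through to `waiting` without the caller knowing the intended target state.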

fjetter · 2023-08-10T09:49:02Z

distributed/scheduler.py


-        return {}, {}, {}
+        recommendations: Recs = {}
+        self._propagate_released(ts, recommendations)


Well, this kind of test is triggering this transition

    @gen_cluster(nthreads=[("", 1)] * 2, client=True)
    async def test_no_worker_released(c, s, a, b):
        f1 = c.submit(inc, 1, workers=[a.address], allow_other_workers=True, key='f1')
        f2 = c.submit(inc, f1, resources={"C": 1}, key='f2')
        while not f2.key in s.tasks or s.tasks[f2.key].state != "no-worker":
            await asyncio.sleep(0.01)
        assert f1.key in a.data and not f1.key in b.data
        await a.close()
        ...

In this test, f2 is in no-worker and is then transitioned to released. This hasn't caused any issues so far because, in this example, the transition is no-worker->released->waiting.

The scheduler extension, however, expects two transitions:

whatever->released

released->waiting->processing

i.e. the scheduler extension issues a transition to released and relies on the recommendation system to trigger the appropriate follow-ups. All other, ordinary transitions instead go to waiting immediately.

So, another fix could be to make the shuffle extension smarter and more aware of what the intended target state is. However, I believe that defeats the purpose of the transition engine, and this fix is fine. (This ambiguity was/is also a common problem in the worker.)

Fix deadlock in P2P restarts

62ff851

Docs

65fdcc4

hendrikmakait commented Aug 10, 2023


hendrikmakait marked this pull request as ready for review August 10, 2023 09:29

hendrikmakait requested a review from fjetter as a code owner August 10, 2023 09:29

fjetter approved these changes Aug 10, 2023


fjetter merged commit 1f8a11c into dask:main Aug 10, 2023
21 of 25 checks passed

This was referenced Aug 10, 2023

Fix additional race condition that can cause P2P restart to deadlock #8094

Merged

[DNM] Fix race condition for P2P restarts #8051

Closed
