Use queued tasks in adaptive target #8037
Conversation
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

19 files -1, 19 suites -1, 10h 16m 5s ⏱️ +1h 6m 50s

For more details on these failures, see this check. Results for commit 2da9df9. ± Comparison against base commit 145c13a.

This pull request removes 2 and adds 4 tests. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
In case this helps Matt or reviewers: I tried out this PR and it makes my own go-to adaptive example go from scaling smaller than I expect to larger than I expect (because my workers have 4 threads each):
My back-of-the-envelope says this should give me about 42 workers. Before this PR, it gave me 17. After this PR, it gives me 167, which is roughly 42*4, because adaptive doesn't account for worker threads:
But I realize that's a separate problem that shouldn't block this PR. Just adding the context/example in case it helps.
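To make the arithmetic in the comment above concrete, here is a rough sketch with assumed numbers; the occupancy figure, target duration, and formula are illustrative stand-ins chosen to reproduce the reported blow-up, not the actual `Adaptive.target` implementation:

```python
import math

# Assumed figures, chosen so the naive estimate lands near the thread-count
# blow-up described above (42 desired workers x 4 threads each)
total_occupancy = 33_600.0   # seconds of pending work across all queued tasks
target_duration = 200.0      # seconds we would like the backlog to clear in
threads_per_worker = 4

# Naive target: occupancy is accumulated per task *slot*, but the result is
# interpreted as a number of *workers*
naive_target = math.ceil(total_occupancy / target_duration)

# Thread-aware target: divide by the slots each worker contributes
thread_aware_target = math.ceil(
    total_occupancy / (target_duration * threads_per_worker)
)

print(naive_target)         # 168 (roughly 42 * 4, the over-scaled estimate)
print(thread_aware_target)  # 42
```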
It looks like
Thanks. Fixed.
distributed/scheduler.py (Outdated)

```python
if len(self.queued) < 100:
    queued_occupancy = 0
    for ts in self.queued:
        if ts.prefix.duration_average == -1:
```
Out of scope for this PR: I think this is problematic. Durations within the same TaskPrefix can vary wildly; I would much rather use a metric that is TaskGroup-specific.
I don't disagree. I also suspect that this will be fine in most cases.
distributed/scheduler.py (Outdated)

```python
            queued_occupancy += self.UNKNOWN_TASK_DURATION
        else:
            queued_occupancy += ts.prefix.duration_average
```
Out of scope nit: this screams for encapsulation in a smart `duration_average` property.
I don't disagree
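A minimal sketch of the encapsulation suggested above: a TaskPrefix-like class whose `duration_average` property folds the "no measurement yet" sentinel (-1) into a default, so callers no longer branch on it. The class name, method names, default value, and the moving-average weighting are all assumptions for illustration, not the actual distributed implementation:

```python
UNKNOWN_TASK_DURATION = 0.5  # assumed default duration, in seconds


class TaskPrefixSketch:
    def __init__(self) -> None:
        self._duration_average = -1.0  # -1 means "no measurement yet"

    def record_duration(self, duration: float) -> None:
        # Simple exponential moving average of observed durations
        if self._duration_average < 0:
            self._duration_average = duration
        else:
            self._duration_average = 0.5 * self._duration_average + 0.5 * duration

    @property
    def duration_average(self) -> float:
        # Callers get a usable estimate even before any task has finished,
        # so the occupancy loop collapses to a single +=
        if self._duration_average < 0:
            return UNKNOWN_TASK_DURATION
        return self._duration_average
```

With something like this, the loop in the diff hunks above would reduce to `queued_occupancy += ts.prefix.duration_average` with no sentinel check at the call site.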
There is `get_task_duration`, which also handles the case of user-provided estimates.
Both this PR and main fail (as the tests demonstrate) in the use case where:

- there are 0 workers, and
- queuing is disabled (`distributed.scheduler.worker-saturation: .inf`)

In this use case, all tasks will end up in `Scheduler.unrunnable` instead of `Scheduler.queued`.

Even when queueing is enabled, this fails when:

- there are no workers, and
- the tasks require resources (I understand that the expectation in an adaptive cluster with resources is that all dynamically-started workers provide the resource, e.g. `{"GPU": 1}`).

Again, tasks will end up in `Scheduler.unrunnable`.

Please add a test for this use case.
```python
while not s.tasks:
    await asyncio.sleep(0.001)
```
Suggested change:

```diff
-while not s.tasks:
-    await asyncio.sleep(0.001)
+await async_poll_for(lambda: s.tasks, timeout=5)
```
I prefer not to use these. I find that this removes one line, but at the cost of adding a new abstraction (`async_poll_for`). The tradeoff here doesn't seem positive to me.
We are using these ubiquitously. I think this is not a design choice that should be left to the whim and taste of individual developers; if you don't like them, we should have a team discussion that results in either using them everywhere or removing them completely.
```python
while len(s.tasks) != 200:
    await asyncio.sleep(0.001)
```
Suggested change:

```diff
-while len(s.tasks) != 200:
-    await asyncio.sleep(0.001)
+await async_poll_for(lambda: len(s.tasks) == 200, timeout=5)
```
In short:

```python
from itertools import chain

from toolz import peekn

# Note: this relies on HeapSet.__iter__ and set.__iter__ to yield elements
# in pseudo-random order
queued, _ = peekn(100, chain(self.queued, self.unrunnable))
queued_occupancy = 0
for ts in queued:
    if ts.prefix.duration_average == -1:
        queued_occupancy += self.UNKNOWN_TASK_DURATION
    else:
        queued_occupancy += ts.prefix.duration_average
```
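If avoiding the `toolz` dependency were a concern, the same capped scan can be sketched with `itertools.islice` alone. The stand-in `Task`/`Prefix` classes, the sample collections, and the default-duration value below are illustrative, not scheduler code:

```python
from itertools import chain, islice

UNKNOWN_TASK_DURATION = 0.5  # assumed default duration, in seconds


# Stand-in task objects: duration_average of -1 means "not yet measured"
class Prefix:
    def __init__(self, duration_average):
        self.duration_average = duration_average


class Task:
    def __init__(self, duration_average):
        self.prefix = Prefix(duration_average)


queued = [Task(2.0), Task(-1)]
unrunnable = [Task(3.0)]

# Look at no more than 100 tasks across both collections, mirroring the
# peekn(100, chain(...)) suggestion above
queued_occupancy = 0.0
for ts in islice(chain(queued, unrunnable), 100):
    if ts.prefix.duration_average == -1:
        queued_occupancy += UNKNOWN_TASK_DURATION
    else:
        queued_occupancy += ts.prefix.duration_average

print(queued_occupancy)  # 5.5
```

Unlike `peekn`, `islice` does not hand back the unconsumed remainder, but that doesn't matter here since the loop only needs the capped prefix.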
As a heads-up I'm unlikely to spend a bunch of time on this. It's more likely that I ask folks like @fjetter to ask people around him (maybe even @crusaderky) to pick this up. I'm hopeful that this can be a small fix. I would be mildly sad/surprised if it required a large effort (not that that's what you're saying).
Never mind. The use of `take` removes much of the concern about cost here (thanks for the suggestion). I do think that the request around supporting
Fixes #8035