[Autotuner] Improve unit test reliability 1 #2538

luarss · 2024-10-31T18:36:21Z

No description provided.

oharboe

Have you considered Python mocking to avoid processes that you have to start and stop?

tools/AutoTuner/Makefile

luarss · 2024-11-01T02:55:36Z

> Have you considered Python mocking to avoid processes that you have to start and stop?

That is a good idea. I originally intended it to be as close to the real runtime environment as possible, but if tests continue to fail this might be the move.
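A minimal sketch of the mocking approach discussed above, using the standard library's unittest.mock. The helper launch_tuner and the command name my-tuner are hypothetical placeholders, not AutoTuner's actual entry points; the point is only that patching subprocess.run lets a test check how a process would be launched without starting or stopping anything.

```python
import subprocess
import unittest
from unittest import mock


def launch_tuner(args):
    """Hypothetical helper: run an external command and return its exit code."""
    completed = subprocess.run(["my-tuner", *args], check=False)
    return completed.returncode


class LaunchTunerTest(unittest.TestCase):
    @mock.patch("subprocess.run")
    def test_returns_exit_code_without_spawning(self, fake_run):
        # The mock stands in for the real process, so nothing is started.
        fake_run.return_value = subprocess.CompletedProcess(args=[], returncode=0)

        self.assertEqual(launch_tuner(["--config", "x.json"]), 0)

        # Verify the command line that would have been executed.
        fake_run.assert_called_once_with(
            ["my-tuner", "--config", "x.json"], check=False
        )
```

Saved to a file, this can be run with python3 -m unittest. The trade-off is the one raised in the reply: a mocked test is fast and deterministic, but it no longer exercises the real runtime environment.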

oharboe · 2024-11-01T06:11:01Z

> > Have you considered Python mocking to avoid processes that you have to start and stop?

> That is a good idea. I originally intended it to be as close to the real runtime environment as possible, but if tests continue to fail this might be the move.

I'm thinking it is going to be both. I've been using bazel-orfs in an autotuner-like capacity, and the biggest problems I currently have are error handling and resource management. @jeffng-or and I launched exploration runs of, for instance, MAX_UNGROUP_SIZE, and we also wanted to do runs through grt.

The MAX_UNGROUP_SIZE exploration never really completed, but I was able to plot the progress from the results I did get, and the conclusion was trivial: there is no correct value of MAX_UNGROUP_SIZE. Instead, we have to first create a macro placement with SYNTH_HIERARCHICAL=1, then throw away the rest of that run and reuse the macro placement in a run with SYNTH_HIERARCHICAL=0.

For the grt runs, the problem is that this part of the flow can't run in parallel with other runs, since the servers then run out of memory and crash. I plan to fix bazel-orfs so that it has a rudimentary knowledge of which steps can run in parallel and which cannot. I think grt, route and macro placement have to run alone on a server, whereas the other stages can run in parallel. I need instrumentation in bazel-orfs to track the resident set size (RSS) to see what can run in parallel. Possibly I have to do a trial run from start to end, track the memory requirements and CPU usage, and then come up with some sort of provisioning plan.
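The measure-then-provision idea above could be sketched roughly as follows. This is not bazel-orfs code: run_and_measure, plan_batches, the stage names and the memory figures in the usage note are all hypothetical. The sketch assumes a POSIX system (os.wait4) and Python 3.9+ (os.waitstatus_to_exitcode), and uses a simple greedy packing rather than any real scheduler.

```python
import os
import subprocess


def run_and_measure(cmd):
    """Run cmd to completion and return (exit_code, peak_rss) for that child.

    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    proc = subprocess.Popen(cmd)
    # wait4 reaps this specific child and returns its resource usage.
    _, status, usage = os.wait4(proc.pid, 0)
    proc.returncode = os.waitstatus_to_exitcode(status)
    return proc.returncode, usage.ru_maxrss


def plan_batches(jobs, peak_rss, mem_budget, max_parallel):
    """Greedily pack jobs, in order, into batches that may run concurrently.

    jobs: list of (run_name, stage) tuples.
    peak_rss: dict mapping stage -> peak memory measured in a trial run.
    A batch is closed when adding the next job would exceed mem_budget or
    max_parallel. A job that alone exceeds the budget still gets a batch
    of its own.
    """
    batches, current, used = [], [], 0
    for job in jobs:
        need = peak_rss[job[1]]
        if current and (used + need > mem_budget or len(current) >= max_parallel):
            batches.append(current)
            current, used = [], 0
        current.append(job)
        used += need
    if current:
        batches.append(current)
    return batches
```

With made-up figures such as peak_rss = {"floorplan": 2, "place": 3, "grt": 10} and mem_budget=10, a grt job ends up in a batch by itself while the lighter stages share one, which matches the intuition that memory-heavy stages need a server to themselves.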

tools/AutoTuner/Makefile

vvbandeira · 2025-01-15T16:46:47Z

tools/AutoTuner/installer.sh

+success=false
+
+while [[ $retry_count -lt $max_retries ]]; do
+    if pip3 cache purge && pip3 install --no-cache-dir -U -r "$script_dir/requirements.txt"; then


Is cache purge required with the --no-cache-dir option or are they redundant?

luarss · 2025-01-21

They are not redundant. pip cache purge might be needed if the system already had a cache, and --no-cache-dir just ensures no future caching is done.
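The retry loop in the hunk above is written in Bash and is only partially shown. As an illustration of the same idea (repeat a flaky install step a bounded number of times), here is a small Python sketch. The function name run_with_retries and its parameters are hypothetical and are not part of installer.sh.

```python
import subprocess
import time


def run_with_retries(cmd, max_retries=3, delay_s=1.0, runner=subprocess.run):
    """Run cmd until it exits with status 0 or max_retries attempts are used.

    Returns the number of attempts taken; raises RuntimeError if all fail.
    The runner parameter exists so tests can substitute a fake.
    """
    for attempt in range(1, max_retries + 1):
        if runner(cmd).returncode == 0:
            return attempt
        if attempt < max_retries:
            time.sleep(delay_s)  # brief pause before the next attempt
    raise RuntimeError(f"command failed after {max_retries} attempts: {cmd}")
```

A call mirroring the install line in the diff might look like run_with_retries(["pip3", "install", "--no-cache-dir", "-U", "-r", "requirements.txt"]). Bounding the number of attempts keeps a persistent failure from hanging CI indefinitely.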

oharboe suggested changes Oct 31, 2024

tools/AutoTuner/Makefile (resolved)

luarss changed the title ~~[Autotuner] Improve reliability 1~~ [Autotuner] Improve unit test reliability 1 Nov 1, 2024

luarss force-pushed the topic/at-reliable branch from 437ab5b to 76d7344 December 24, 2024 04:56

luarss requested a review from vvbandeira December 25, 2024 06:06

luarss marked this pull request as ready for review December 25, 2024 06:06

luarss added the autotuner (Flow autotuner) label Dec 25, 2024

luarss mentioned this pull request Jan 9, 2025

ci: enable AutoTuner CI #2659

Merged

vvbandeira reviewed Jan 9, 2025

tools/AutoTuner/Makefile (outdated, resolved)

luarss force-pushed the topic/at-reliable branch from 03516f0 to 7a48e12 January 10, 2025 14:22

luarss marked this pull request as draft January 10, 2025 17:36

vvbandeira reviewed Jan 15, 2025

luarss and others added 4 commits January 21, 2025 12:58

fix pip cache, pip-compile lock reqs

84c7602

Signed-off-by: Jack Luar <[email protected]>

add ensure ray stop before starting any tests

805d277

Signed-off-by: Jack Luar <[email protected]>

add retry mechanism and makefile phony

9158965

* context: pip install fails on large files due to network instability

Signed-off-by: Jack Luar <[email protected]>

Apply suggestions from code review

9b0671c

Signed-off-by: Vitor Bandeira <[email protected]>

luarss force-pushed the topic/at-reliable branch from 109679e to 9b0671c January 21, 2025 13:00

ensure ray stop for all tests

1792b5a

Signed-off-by: Jack Luar <[email protected]>
