Add scheduler lookahead to elide buffer resizes #298

fknorr · 2024-11-03T11:08:00Z

Buffer resizes are slow and will cause OOM conditions when a required allocation size exceeds half the device memory capacity, as we saw with weak scaling experiments on wave_sim.
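To make the "half the device memory" threshold concrete, here is a minimal sketch of the arithmetic. It assumes a resize is carried out as allocate-new, copy, free-old, so both allocations are live at the same time. The helper names are hypothetical and not part of Celerity.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: peak device memory while growing an allocation,
// ASSUMING the resize is done as allocate-new, copy, free-old.
constexpr std::size_t peak_bytes_during_resize(std::size_t old_bytes, std::size_t new_bytes) {
    return old_bytes + new_bytes; // both allocations coexist during the copy
}

// Hypothetical helper: does the resize fit into a device of the given capacity?
constexpr bool resize_fits(std::size_t old_bytes, std::size_t new_bytes, std::size_t capacity_bytes) {
    assert(new_bytes >= old_bytes); // this sketch only models growing resizes
    return peak_bytes_during_resize(old_bytes, new_bytes) <= capacity_bytes;
}
```

Under this assumption, growing an 8 GiB allocation to 9 GiB on a 16 GiB device fails, whereas allocating the 9 GiB up front (old size 0) fits comfortably.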

The current workaround is to have an initial "resize dummy kernel" which touches the full area later accessed by the program. This is very ugly since it requires internal knowledge about Celerity's work assignment and cannot easily be communicated to a user.

As a transparent solution to this issue, this PR refactors the scheduler to operate on separate task and command queues, which allows commands to be buffered so that their allocation requirements can be merged with those of future CG submissions.

The default lookahead heuristic will keep delaying instruction generation until the last allocation request has passed behind two horizons. This is enough to eliminate resizes entirely in wave_sim and RSim. Experimental APIs on celerity::queue allow changing this behavior to either always or never flush commands, and an explicit flush operation is exposed as well.
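The heuristic can be illustrated with a small toy model. This is only a sketch under simplifying assumptions (a single buffer, allocation requirements reduced to an element count, horizons as plain marker commands); every name in it is hypothetical and none of it is the actual scheduler code or the `celerity::queue` API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// All names are hypothetical; this models the idea, not Celerity's implementation.
enum class lookahead_policy { none, automatic, infinite };

struct command {
    bool is_horizon = false;
    std::size_t required_elems = 0; // buffer size this command needs (0 for horizons)
};

class lookahead_model {
  public:
    explicit lookahead_model(lookahead_policy policy) : m_policy(policy) {}

    void submit(const command& cmd) {
        m_queued.push_back(cmd);
        if(cmd.is_horizon) {
            if(m_alloc_pending) { ++m_horizons_since_request; }
        } else if(cmd.required_elems > std::max(m_allocated_elems, m_queued_max_elems)) {
            // A new allocation request: remember it and restart the horizon count.
            m_queued_max_elems = cmd.required_elems;
            m_alloc_pending = true;
            m_horizons_since_request = 0;
        }
        // "Settled" means the last allocation request has passed behind two horizons.
        const bool settled = !m_alloc_pending || m_horizons_since_request >= 2;
        if(m_policy == lookahead_policy::none || (m_policy == lookahead_policy::automatic && settled)) { flush(); }
    }

    // Compiles all queued commands, allocating once for their merged requirement.
    void flush() {
        if(m_queued_max_elems > m_allocated_elems) {
            if(m_allocated_elems > 0) { ++m_resizes; } // growing an existing allocation
            m_allocated_elems = m_queued_max_elems;
        }
        m_queued.clear();
        m_queued_max_elems = 0;
        m_alloc_pending = false;
        m_horizons_since_request = 0;
    }

    std::size_t resizes() const { return m_resizes; }
    std::size_t queued() const { return m_queued.size(); }
    std::size_t allocated_elems() const { return m_allocated_elems; }

  private:
    lookahead_policy m_policy;
    std::vector<command> m_queued;
    std::size_t m_queued_max_elems = 0;
    std::size_t m_allocated_elems = 0;
    std::size_t m_resizes = 0;
    bool m_alloc_pending = false;
    int m_horizons_since_request = 0;
};

// Usage example: a growing access pattern followed by two horizons.
inline std::size_t count_resizes_for_growing_pattern(lookahead_policy policy) {
    lookahead_model model(policy);
    const std::size_t sizes[] = {100, 200, 300, 400};
    for(const std::size_t elems : sizes) { model.submit({false, elems}); }
    model.submit({true, 0});
    model.submit({false, 400});
    model.submit({true, 0});
    model.flush(); // explicit flush at the end of the program
    return model.resizes();
}
```

In this model, flushing after every command resizes the buffer on each growth step, while the automatic and infinite policies see the full requirement before the first allocation is made.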

New testing infrastructure is added to inspect scheduler output based on commands and instructions generated at any point in time.

PeterTh

Awesome to see a general fix for these types of patterns (which have been plaguing us for years at this point)!

One general remark: I wonder if we should offer the ability to set the maximum lookahead distance via an environment variable. I think this could allow for much faster experimentation/perf evaluation for new applications or HW platforms.
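A rough sketch of what such a setting could look like follows. The variable name `EXAMPLE_MAX_LOOKAHEAD_HORIZONS` and both function names are made up for illustration; nothing here reflects an existing Celerity option.

```cpp
#include <charconv>
#include <cstdlib>
#include <cstring>
#include <optional>
#include <system_error>

// Hypothetical parser: accepts a plain non-negative decimal integer, rejects anything else.
inline std::optional<unsigned> parse_max_lookahead(const char* text) {
    if(text == nullptr) { return std::nullopt; } // variable not set
    unsigned value = 0;
    const char* const end = text + std::strlen(text);
    const auto [ptr, ec] = std::from_chars(text, end, value);
    if(ec != std::errc{} || ptr != end) { return std::nullopt; } // empty, malformed, or trailing characters
    return value;
}

// Hypothetical environment lookup; the variable name is an assumption for this sketch.
inline std::optional<unsigned> max_lookahead_from_env() {
    return parse_max_lookahead(std::getenv("EXAMPLE_MAX_LOOKAHEAD_HORIZONS"));
}
```

Returning an empty optional for unset or malformed values would let the runtime fall back to the default heuristic instead of failing at startup.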

include/scheduler.h

include/queue.h

test/scheduler_tests.cc

test/runtime_tests.cc

github-actions · 2024-11-20T11:52:53Z

Check-perf-impact results: (e05743d17fba63a8011ddd314448d04e)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

fknorr · 2024-11-20T12:47:40Z

The instruction-graph generator now emits a warning if it detects a high number of allocations or resizes in a single buffer. This doesn't catch all resizes though. Should we have a stricter test that warns on any (also the first) resize?
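For discussion, the two variants differ only in the threshold. The following is a simplified sketch with hypothetical names, not the code in the instruction-graph generator:

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical per-buffer resize counter with a configurable warning threshold.
class resize_monitor {
  public:
    explicit resize_monitor(std::size_t warn_threshold) : m_warn_threshold(warn_threshold) {}

    // Records one resize of the given buffer. Returns true exactly once per buffer,
    // at the moment its resize count reaches the threshold, so a warning is not repeated.
    bool record_resize(std::size_t buffer_id) {
        const std::size_t count = ++m_resize_counts[buffer_id];
        return count == m_warn_threshold;
    }

  private:
    std::size_t m_warn_threshold;
    std::unordered_map<std::size_t, std::size_t> m_resize_counts;
};
```

A threshold of 1 corresponds to the stricter check that warns on the very first resize; a larger value only flags buffers that are resized repeatedly.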

psalz

Very cool! I particularly like that the whole thing is mostly self-contained inside the new scheduler implementation!

include/scheduler.h

src/scheduler.cc

src/instruction_graph_generator.cc

github-actions

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

src/platform_specific/affinity.unix.cc

src/affinity.cc

coveralls · 2024-11-21T14:17:35Z

Pull Request Test Coverage Report for Build 11970970296

Details

233 of 234 (99.57%) changed or added relevant lines in 6 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+0.1%) to 94.911%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/config.cc | 11 | 12 | 91.67% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| src/scheduler.cc | 1 | 98.23% |

Totals
Change from base Build 11954307236:	0.1%
Covered Lines:	7049
Relevant Lines:	7161


github-actions · 2024-11-21T14:48:00Z

Check-perf-impact results: (f49f87b45e71230f6a4fd4037130429f)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command- and instruction graphs in a dedicated scheduler thread for N nodes - 1 > throttled submission to a scheduler thread at 10 us per task / expanding tree topology
🚀 Significant speedup (<0.80x) in some microbenchmark results: 4 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

command-graph : 1.06x
graph-nodes : 1.03x
grid : 1.01x
instruction-graph : 1.02x
scheduler : 1.02x
system : 1.01x
task-graph : 1.06x

github-actions

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

src/config.cc

github-actions · 2024-11-21T15:21:37Z

Check-perf-impact results: (32e33781a20e2a3976ece8f7dc81ae2b)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command- and instruction graphs in a dedicated scheduler thread for N nodes - 1 > throttled submission to a scheduler thread at 10 us per task / expanding tree topology
🚀 Significant speedup (<0.80x) in some microbenchmark results: 6 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

command-graph : 1.01x
graph-nodes : 1.01x
grid : 1.02x
instruction-graph : 1.03x
scheduler : 1.03x
system : 0.95x
task-graph : 1.05x

Buffer resizes are slow and will cause OOM conditions when a required allocation size exceeds half the device memory capacity. This commit refactors the scheduler to operate on separate task and command queues, which allows commands to be buffered so that their allocation requirements can be merged with those of future CG submissions. The default lookahead heuristic will keep delaying instruction generation until the last allocation request has passed behind two horizons. This is enough to eliminate resizes entirely in wave_sim and RSim. Experimental APIs on celerity::queue allow changing this behavior to either always or never flush commands, and an explicit flush operation is exposed as well. New testing infrastructure is added to inspect scheduler output based on commands and instructions generated at any point in time.

fknorr added this to the 0.7.0 milestone Nov 3, 2024

fknorr requested a review from PeterTh November 3, 2024 11:08

fknorr self-assigned this Nov 3, 2024

fknorr requested a review from psalz November 3, 2024 11:14