Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scheduler lookahead to elide buffer resizes #298

Merged
merged 3 commits into from
Nov 22, 2024
Merged

Add scheduler lookahead to elide buffer resizes #298

merged 3 commits into from
Nov 22, 2024

Conversation

fknorr
Copy link
Contributor

@fknorr fknorr commented Nov 3, 2024

Buffer resizes are slow and will cause OOM conditions when a required allocation size exceeds half the device memory capacity, as we saw with weak scaling experiments on wave_sim.

The current workaround is to have an initial "resize dummy kernel" which touches the full area later accessed by the program. This is very ugly since it requires internal knowledge about Celerity's work assignment and cannot easily be communicated to a user.

As a transparent solution to this issue, this PR refactors the scheduler to operate on separate task- and command queues, which allows commands to be buffered in order to merge their allocation requirements with future CG submisssions.

The default lookahead heuristic will keep delaying instruction generation until the last allocation request has passed behind two horizons. This is enough to eliminate resizes entirely in wave_sim and RSim. Experimental APIs on celerity::queue allow changing this behavior to either always or never flush commands, and an explicit flush operation is exposed as well.

New testing infrastructure is added to inspect scheduler output based on commands and instructions generated at any point in time.

@fknorr fknorr added this to the 0.7.0 milestone Nov 3, 2024
@fknorr fknorr requested a review from PeterTh November 3, 2024 11:08
@fknorr fknorr self-assigned this Nov 3, 2024
@fknorr fknorr requested a review from psalz November 3, 2024 11:14
github-actions[bot]

This comment was marked as outdated.

github-actions[bot]

This comment was marked as outdated.

github-actions[bot]

This comment was marked as outdated.

github-actions[bot]

This comment was marked as outdated.

@celerity celerity deleted a comment from github-actions bot Nov 3, 2024

This comment was marked as outdated.

This comment was marked as outdated.

github-actions[bot]

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome to see a general fix for these types of patterns (which have been plaguing us for years at this point)!

One general remark: I wonder if we should offer the ability to set the maximum lookahead distance via an environment variable. I think this could allow for much faster experimentation/perf evaluation for new applications or HW platforms.

include/scheduler.h Outdated Show resolved Hide resolved
include/queue.h Show resolved Hide resolved
test/scheduler_tests.cc Show resolved Hide resolved
test/runtime_tests.cc Show resolved Hide resolved
Copy link

Check-perf-impact results: (e05743d17fba63a8011ddd314448d04e)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot]

This comment was marked as outdated.

@fknorr
Copy link
Contributor Author

fknorr commented Nov 20, 2024

The instruction-graph generator now emits a warning if it detects a high number of allocations or resizes in a single buffer. This doesn't catch all resizes though. Should we have a stricter test that warns on any (also the first) resize?

@fknorr fknorr closed this Nov 20, 2024
@fknorr fknorr reopened this Nov 20, 2024
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I particularly like that the whole thing is mostly self-contained inside the new scheduler implementation!

include/scheduler.h Outdated Show resolved Hide resolved
src/scheduler.cc Outdated Show resolved Hide resolved
src/scheduler.cc Outdated Show resolved Hide resolved
src/scheduler.cc Show resolved Hide resolved
src/instruction_graph_generator.cc Outdated Show resolved Hide resolved
src/instruction_graph_generator.cc Outdated Show resolved Hide resolved
src/instruction_graph_generator.cc Outdated Show resolved Hide resolved
@fknorr fknorr force-pushed the lookahead branch 2 times, most recently from 2a485f6 to 85a5778 Compare November 21, 2024 13:07
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

src/platform_specific/affinity.unix.cc Outdated Show resolved Hide resolved
src/platform_specific/affinity.unix.cc Outdated Show resolved Hide resolved
src/platform_specific/affinity.unix.cc Outdated Show resolved Hide resolved
src/platform_specific/affinity.unix.cc Outdated Show resolved Hide resolved
src/platform_specific/affinity.unix.cc Outdated Show resolved Hide resolved
src/platform_specific/affinity.unix.cc Outdated Show resolved Hide resolved
src/affinity.cc Outdated Show resolved Hide resolved
src/affinity.cc Outdated Show resolved Hide resolved
src/affinity.cc Outdated Show resolved Hide resolved
src/affinity.cc Outdated Show resolved Hide resolved
@celerity celerity deleted a comment from github-actions bot Nov 21, 2024
@fknorr fknorr force-pushed the lookahead branch 2 times, most recently from b8ed7f9 to 5344348 Compare November 21, 2024 13:39
@coveralls
Copy link

coveralls commented Nov 21, 2024

Pull Request Test Coverage Report for Build 11970970296

Details

  • 233 of 234 (99.57%) changed or added relevant lines in 6 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 94.911%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/config.cc 11 12 91.67%
Files with Coverage Reduction New Missed Lines %
src/scheduler.cc 1 98.23%
Totals Coverage Status
Change from base Build 11954307236: 0.1%
Covered Lines: 7049
Relevant Lines: 7161

💛 - Coveralls

Copy link

Check-perf-impact results: (f49f87b45e71230f6a4fd4037130429f)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command- and instruction graphs in a dedicated scheduler thread for N nodes - 1 > throttled submission to a scheduler thread at 10 us per task / expanding tree topology
🚀 Significant speedup (<0.80x) in some microbenchmark results: 4 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.06x
  • graph-nodes : 1.03x
  • grid : 1.01x
  • instruction-graph : 1.02x
  • scheduler : 1.02x
  • system : 1.01x
  • task-graph : 1.06x

@celerity celerity deleted a comment from github-actions bot Nov 21, 2024
@celerity celerity deleted a comment from github-actions bot Nov 21, 2024
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

src/config.cc Show resolved Hide resolved
Copy link

Check-perf-impact results: (32e33781a20e2a3976ece8f7dc81ae2b)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command- and instruction graphs in a dedicated scheduler thread for N nodes - 1 > throttled submission to a scheduler thread at 10 us per task / expanding tree topology
🚀 Significant speedup (<0.80x) in some microbenchmark results: 6 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.01x
  • graph-nodes : 1.01x
  • grid : 1.02x
  • instruction-graph : 1.03x
  • scheduler : 1.03x
  • system : 0.95x
  • task-graph : 1.05x

fknorr and others added 3 commits November 22, 2024 11:35
Buffer resizes are slow and will cause OOM conditions when a required
allocation size exceeds half the device memory capacity.

This commit refactors the scheduler to operate on separate task- and
command queues, which allows commands to be buffered in order to merge
their allocation requirements with future CG submisssions.

The default lookahead heuristic will keep delaying instruction
generation until the last allocation request has passed behind two
horizons. This is enough to eliminate resizes entirely in wave_sim and
RSim. Experimental APIs on celerity::queue allow changing this behavior
to either always or never flush commands, and an explicit flush
operation is exposed as well.

New testing infrastructure is added to inspect scheduler output based on
commands and instructions generated at any point in time.
@fknorr fknorr merged commit 2cfcb69 into master Nov 22, 2024
17 checks passed
@fknorr fknorr deleted the lookahead branch November 22, 2024 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants