-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add scheduler lookahead to elide buffer resizes #298
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome to see a general fix for these types of patterns (which have been plaguing us for years at this point)!
One general remark: I wonder if we should offer the ability to set the maximum lookahead distance via an environment variable. I think this could allow for much faster experimentation/perf evaluation for new applications or HW platforms.
Check-perf-impact results: (e05743d17fba63a8011ddd314448d04e) ❓ No new benchmark data submitted. ❓ |
The instruction-graph generator now emits a warning if it detects a high number of allocations or resizes in a single buffer. This doesn't catch all resizes though. Should we have a stricter test that warns on any (also the first) resize? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool! I particularly like that the whole thing is mostly self-contained inside the new scheduler implementation!
2a485f6
to
85a5778
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy
found issue(s) with the introduced code (1/1)
b8ed7f9
to
5344348
Compare
Pull Request Test Coverage Report for Build 11970970296Details
💛 - Coveralls |
Check-perf-impact results: (f49f87b45e71230f6a4fd4037130429f)
Relative execution time per category: (mean of relative medians)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy
found issue(s) with the introduced code (1/1)
Check-perf-impact results: (32e33781a20e2a3976ece8f7dc81ae2b)
Relative execution time per category: (mean of relative medians)
|
Buffer resizes are slow and will cause OOM conditions when a required allocation size exceeds half the device memory capacity. This commit refactors the scheduler to operate on separate task- and command queues, which allows commands to be buffered in order to merge their allocation requirements with future CG submisssions. The default lookahead heuristic will keep delaying instruction generation until the last allocation request has passed behind two horizons. This is enough to eliminate resizes entirely in wave_sim and RSim. Experimental APIs on celerity::queue allow changing this behavior to either always or never flush commands, and an explicit flush operation is exposed as well. New testing infrastructure is added to inspect scheduler output based on commands and instructions generated at any point in time.
Buffer resizes are slow and will cause OOM conditions when a required allocation size exceeds half the device memory capacity, as we saw with weak scaling experiments on
wave_sim
.The current workaround is to have an initial "resize dummy kernel" which touches the full area later accessed by the program. This is very ugly since it requires internal knowledge about Celerity's work assignment and cannot easily be communicated to a user.
As a transparent solution to this issue, this PR refactors the scheduler to operate on separate task- and command queues, which allows commands to be buffered in order to merge their allocation requirements with future CG submisssions.
The default lookahead heuristic will keep delaying instruction generation until the last allocation request has passed behind two horizons. This is enough to eliminate resizes entirely in wave_sim and RSim. Experimental APIs on celerity::queue allow changing this behavior to either always or never flush commands, and an explicit flush operation is exposed as well.
New testing infrastructure is added to inspect scheduler output based on commands and instructions generated at any point in time.