-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add stream operations to accelerator components #12356
Conversation
Could you please add stub functions for the ZE component? |
3d87780
to
ead2c40
Compare
@hppritcha I added stubs returning |
df07a9c
to
236af1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a few minor comments/requests, nothing major, looks good overall.
1c4d11b
to
529a68d
Compare
ebad01d
to
97dbb09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @devreal . Some high-level questions:
- I don't know the motivation behind the need for stream-ordered allocations for reductions. Can you explain the high-level picture?
- How does MPI user pass streams to the MPI implementation? Without that ability, I'm not sure how stream-ordered memmove operations in accelerator component is taken advantage of.
0765a9e
to
7de3f68
Compare
7de3f68
to
f3133b4
Compare
f3133b4
to
26185d6
Compare
@devreal could you rebase on top of head of main to pull in the CI ZE compiler sanity check? |
26185d6
to
c13779f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Hi @devreal No more comments on the PR changes but I didn't get an answer for the above questions. Apologies if I've missed the response somewhere. |
@Akshay-Venkatesh Sorry, I only replied to your comments inline.
For operations, we may need to allocate temporary device buffers. Maybe we can work around that eventually but it probably doesn't hurt to have that functionality available.
There is no infrastructure for that yet (aside from some research projects). We can take advantage of it in the collective reduction operations though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the content of this PR I cannot figure out why this functionality is needed. What is the grand scheme in which we need to :
- expose the memory bandwidth of a device ?
- have an asynchronous memmove ?
- have an explicit synchronization routine instead of relying on existing mechanisms (create / record / and wait for an event to complete) ?
The context is in the referenced PR (#12318) but that's a monster so it might not have been clear.
That is needed to decide whether to fetch the data to the host and perform the operator there or to submit a kernel. This will be part of a separate PR that has yet to come. I was hoping to get the basics in first but I can post all PRs and we synchronize across.
See #12570. We could do without (i.e., use blocking memmove instead) and I'm not even sure how often memmove is actually used there but for the purpose of completeness I thought I'd be the right thing.
See #12570. Not sure how useful events would be there. Sounds like more overhead to create an event than to simply synchronize the stream. That's used to submit multiple data transfers and synchronizing once instead copying piecemeal. |
it would be good if we could have this PR merged in the near future, since this would simplify evaluating/testing the subsequent PRs |
Will merge after resolve |
a3c8443
to
1d23d06
Compare
- Stream-based alloc and free - Stream-based memmove - Wait for stream to complete Also, enable querying for number of devices and memory bandwidth. These operations are needed for operation device offloading. Co-authored-by: Phuong Nguyen <[email protected]> Signed-off-by: Joseph Schuchart <[email protected]>
1d23d06
to
684ff97
Compare
Rebased and fixed conflicts. |
First batch of changes from #12318 (offloading of reductions to devices).
This PR adds stream operations to the accelerator components:
Default streamAlso, enable querying for number of devices and memory bandwidth.
This PR is missing implementations for the
ze
component because I haven't had time to dig into that. Maybe someone familiar with that API can contribute the implementation? Otherwise I will need to find some time in the coming week(s) to implement them myself (theze
component didn't exist when I made these changes).