[SYCL][Graph][UR] Propagate graph update list to UR #17019
base: sycl
Conversation
Update the `urCommandBufferUpdateKernelLaunchExp` API for updating commands in a command-buffer to take a list of commands. The current API operates on a single command, which means that the SYCL-Graph `update(std::vector<nodes>)` API needs to serialize the list into N calls to the UR API. Given that both OpenCL `clUpdateMutableCommandsKHR` and Level-Zero `zeCommandListUpdateMutableCommandsExp` can operate on a list of commands, this serialization at the UR layer of the stack introduces extra host API calls.

This PR improves the `urCommandBufferUpdateKernelLaunchExp` API so that a list of commands is passed all the way from SYCL to the native backend API.

As highlighted in oneapi-src/unified-runtime#2671, this patch requires the auto-generated handle-translation code to be able to handle a list of structs, which is not currently the case. This PR includes an API-specific temporary workaround in the mako file which will unblock this PR until a more permanent solution is completed.

Co-authored-by: Ross Brunton <[email protected]>
// Skip if device doesn't support out-of-order queues, we need
// to create one for both instantiations of the test.
Did this comment get pasted from somewhere else? I'm not sure I understand it in this context.
for (const auto &Partition : MPartitions) {
  std::vector<std::shared_ptr<node_impl>> NodesForPartition;
  const auto PartitionBegin = Partition->MSchedule.begin();
  const auto PartitionEnd = Partition->MSchedule.end();
  for (auto &Node : Nodes) {
    auto ExecNode = MIDCache.find(Node->MID);
    assert(ExecNode != MIDCache.end() && "Node ID was not found in ID cache");

    if (std::find_if(PartitionBegin, PartitionEnd,
                     [ExecNode](const auto &PartitionNode) {
                       return PartitionNode->MID == ExecNode->second->MID;
                     }) != PartitionEnd) {
      NodesForPartition.push_back(Node);
    }
  }
  if (!NodesForPartition.empty()) {
    PartitionedNodes.insert({Partition, NodesForPartition});
  }
}
This seems potentially quite expensive to be doing during the update call (having to search through the partition nodes repeatedly). Could we cache the partition that a node belongs to (since this will be known when we do the partitioning during finalize), such that we can easily skip nodes that are not part of this partition? Or restructure it such that we only loop over `Nodes` and just add each one to the appropriate partition map.
In the latter case I guess we'd have to insert empty vectors before the loop for each partition and then retrieve them inside the loop over the nodes, but it feels like that should probably still be faster, not sure.
// Update ExecNode with new values from Node, in case we ever need to
// rebuild the command buffers
ExecNode->second->updateFromOtherNode(Node);
This comment doesn't make as much sense for host-task nodes specifically, because `ExecNode` is the definitive state for host-tasks when they are launched.
auto Device = MQueue->get_device();
for (auto It = PartitionedNodes.begin(); It != PartitionedNodes.end(); It++) {
  auto CommandBuffer = It->first->MCommandBuffers[Device];
  MGraph->updateKernelsImpl(CommandBuffer, It->second);
We've gone from a generically named update call to a kernel specific one. How would you envision this looking if we supported more node types than just kernels here?
    ur_exp_command_buffer_update_kernel_launch_desc_t &UpdateDesc);

/// Updates host-task nodes in the graph
/// @param CommandBuffer The UR command-buffer to update commands in.
Function doesn't have this parameter.
  // Rebuild cached requirements and accessor storage for this graph with
  // updated nodes
  MRequirements.clear();
  MAccessors.clear();
  for (auto &Node : MNodeStorage) {
    if (!Node->MCommandGroup)
      continue;
    MRequirements.insert(MRequirements.end(),
                         Node->MCommandGroup->getRequirements().begin(),
                         Node->MCommandGroup->getRequirements().end());
    MAccessors.insert(MAccessors.end(),
                      Node->MCommandGroup->getAccStorage().begin(),
                      Node->MCommandGroup->getAccStorage().end());
  }
}
I know this was only moved, not changed, here, but looking at it I'm wondering if it should be done both here (but only for the non-scheduler path) and also after the scheduler path has called `updateKernelsImpl()` for all partitions. In the scheduler path the nodes in `MNodeStorage` have not yet been updated.
Host-tasks make this slightly more complicated because they have already been updated, so I think something needs to happen here to keep the accessors alive for them. Perhaps here (or during host-task node update) we insert new requirements/accessors without clearing first, so that they are kept alive for host-tasks. Then after all updates have been completed for both paths we can fully rebuild them, which would take updated requirements from other node types into account.
@@ -1435,94 +1491,17 @@ void exec_graph_impl::update(
  // ensure it is ordered correctly.
  NeedScheduledUpdate |= MExecutionEvents.size() > 0;
Can't comment on the line directly, but we're leaving execution event cleanup in `needsScheduledUpdate()`, which is a little strange to me since it does not affect the bool returned from this function. I would prefer to keep that in the main update function so it's clear what is happening.