Enabling L2+ Optimizations for EPs #23517
base: main
Conversation
…Capability, selection function and ComputeCapability
dq_node_index_set.insert(index);
}

ConstantFoldingDQ* transformer = static_cast<ConstantFoldingDQ*>(graph_transformer_mgr.GetTransformerByName(optimizer_name));
Just a thought to potentially get rid of GraphTransformerManager: would it make sense to instantiate the ConstantFoldingDQ transformer directly instead of querying the graph_transformer_manager? For example, we have code in InferenceSession::TransformGraph that directly instantiates an instance of EnsureUniqueDQForNodeUnit and just calls .Apply() on it:
onnxruntime/onnxruntime/core/session/inference_session.cc, lines 1231 to 1232 in a770a8d:

EnsureUniqueDQForNodeUnit ensure_unique_dq_for_node_unit{};
ORT_RETURN_IF_ERROR_SESSIONID_(apply_transformer_once(ensure_unique_dq_for_node_unit, *session_logger_, graph));
So in this case:

// Would need to pass the EP name to GetEPOptimizerByName()?
std::unique_ptr<ConstantFoldingDQ> const_fold_dq_transformer = std::make_unique<ConstantFoldingDQ>(/*...*/);
const_fold_dq_transformer->Apply(/*...*/);
This may also allow the EP to provide key/value strings to configure an optimizer. This way we can avoid having to create a completely new derived class every time an EP needs a slightly different variation of an optimizer.
Ah, this wouldn't work because we need to pass in the CPU EP.
I think it makes sense to instantiate/run directly. May need some structure to facilitate that. e.g. if we need things like the CPU allocator to be available to create an optimizer.
In this case I think we need the derived class as we're augmenting the behavior of the constant folding optimizer, but I expect this to be an exception rather than typical.
For other optimizers I expect we'll want something like key/value strings to configure instead of creating derived classes. Need to figure out what sort of configuration is required. Existing examples: an EP may support only a subset of data types for a fusion, or only a subset of operators (e.g. a subset of activation ops can be fused into Conv to create a FusedConv node).
GraphPartitioner(KernelRegistryManager& kernel_registry_mgr, const GraphTransformerManager& graph_transformer_mgr, const ExecutionProviders& providers)
    : kernel_registry_mgr_(kernel_registry_mgr),
      graph_transformer_mgr_(graph_transformer_mgr),
Instead of wiring in the GraphTransformerManager would it be better to add a new standalone class that provides lookup-based access to a set of L2 optimizers that are directly usable by an EP? That would decouple the GraphPartitioner from the optimizer registration/lookup a bit more. I don't think GraphTransformerManager is providing value in this case as I don't think we need to loop.
The registration/lookup class for re-usable optimizers could be a singleton with a static 'create' method that calls GenerateTransformersForEP and saves the returned list. We could have InferenceSession call the 'create' method to provide any other required things like the CPU allocator, and potentially apply the optimization level when doing so if we want to do that on the ORT side. GraphPartitioner could call a static 'get' method to get the instance so we don't need to wire it through from the inference session.
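A minimal sketch of that registry idea, under assumptions: `EPOptimizerRegistry` and the simplified `GraphTransformer` below are hypothetical stand-ins, and the real wiring (GenerateTransformersForEP, the CPU allocator, optimization levels) is omitted.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for onnxruntime's GraphTransformer; only the name is modeled.
class GraphTransformer {
 public:
  explicit GraphTransformer(std::string name) : name_(std::move(name)) {}
  virtual ~GraphTransformer() = default;
  const std::string& Name() const { return name_; }
 private:
  std::string name_;
};

// Lookup-based registry of L2 optimizers usable by EPs. InferenceSession would
// call Create() once (e.g. with the output of GenerateTransformersForEP);
// GraphPartitioner calls Get() via the static accessor, so the registry does
// not need to be wired through the partitioner's constructor.
class EPOptimizerRegistry {
 public:
  static void Create(std::vector<std::unique_ptr<GraphTransformer>> transformers) {
    auto& inst = Instance();
    for (auto& t : transformers) {
      inst.lookup_.emplace(t->Name(), std::move(t));
    }
  }

  // Returns nullptr if no optimizer with that name was registered.
  static const GraphTransformer* Get(const std::string& name) {
    const auto& lookup = Instance().lookup_;
    auto it = lookup.find(name);
    return it == lookup.end() ? nullptr : it->second.get();
  }

 private:
  static EPOptimizerRegistry& Instance() {
    static EPOptimizerRegistry instance;  // Meyers singleton
    return instance;
  }
  std::unordered_map<std::string, std::unique_ptr<GraphTransformer>> lookup_;
};
```

The static-local singleton keeps construction lazy and thread-safe without requiring the session to thread the registry through GraphPartitioner.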
@@ -28,6 +28,17 @@ class ConstantFolding : public GraphTransformer {
                  const InlinedHashSet<std::string_view>& compatible_execution_providers = {},
                  const InlinedHashSet<std::string>& excluded_initializers = {}) noexcept;

  /* Same as above but with a name provided by derived class.
Suggested change (add a protected access specifier before this constructor):

- /* Same as above but with a name provided by derived class.
+ protected:
+ /* Same as above but with a name provided by derived class.
                  const InlinedHashSet<std::string_view>& compatible_execution_providers = {},
                  const InlinedHashSet<std::string>& excluded_initializers = {}) noexcept;

  virtual bool AllowConstantFolding(const Node& node) const { return true; }
nit: Add a comment with a quick explanation of how this is used.
                  const InlinedHashSet<std::string>& excluded_initializers = {}) noexcept;

  bool AllowConstantFolding(const Node& node) const;
  Status UpdateNodeIndexSet(InlinedHashSet<onnxruntime::NodeIndex>& node_index_set);
Can we create a new instance of the optimizer for each node_index_set instead of having an update method in order to keep it a little cleaner?
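A per-partition instance could look roughly like this. A minimal sketch with stand-in types: `ConstantFoldingDQSketch` and the `NodeIndex` alias are illustrative only, not the PR's actual class.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <utility>

using NodeIndex = std::size_t;  // stand-in for onnxruntime::NodeIndex

// Hypothetical sketch: the node index set is fixed at construction, so each
// partition gets its own immutable optimizer instance instead of a shared
// instance mutated through an UpdateNodeIndexSet() method.
class ConstantFoldingDQSketch {
 public:
  explicit ConstantFoldingDQSketch(std::unordered_set<NodeIndex> node_index_set)
      : node_index_set_(std::move(node_index_set)) {}

  // Mirrors AllowConstantFolding: only fold nodes selected for this partition.
  bool AllowConstantFolding(NodeIndex index) const {
    return node_index_set_.count(index) > 0;
  }

 private:
  const std::unordered_set<NodeIndex> node_index_set_;
};
```

Making the set const removes the need to reason about the optimizer's state between partitions.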
@@ -142,6 +142,10 @@ struct Node__EdgeIterator {
struct ProviderHost {
  virtual const OrtApiBase* OrtGetApiBase() = 0;

  virtual Status GetEPOptimizerByName(const std::string& name,
nit: Sounds like this optimizes an EP. Could we call it GetOptimizerByName?
static const std::string kEP_GRAPH_TRANSFORMER_CONSTANT_FOLDING_DQ = "ConstantFoldingDQ";

// ConstantFoldingDQ optimization function
auto constant_folding_dq_optimization = [&](Graph& graph, const ComputeCapability& optimization_cc, ComputeCapability& cc_to_update) -> Status {
Can we put this in a separate class/file, maybe in the optimizer library?
}
cc_to_update.sub_graph->nodes = updated_nodes;

auto original_meta_def = cc_to_update.sub_graph->GetMetaDef();
Instead of creating a new MetaDef, now that we have a use case where updating one makes sense, could we add a method to get a mutable MetaDef from sub_graph?
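A mutable accessor might look like this. A minimal sketch with stand-in types: `IndexedSubGraphSketch` and the one-field `MetaDef` are hypothetical simplifications, not the real onnxruntime classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in for onnxruntime's MetaDef; only a name field is modeled.
struct MetaDef { std::string name; };

class IndexedSubGraphSketch {
 public:
  explicit IndexedSubGraphSketch(std::unique_ptr<MetaDef> meta_def)
      : meta_def_(std::move(meta_def)) {}

  // Existing read-only accessor.
  const MetaDef* GetMetaDef() const { return meta_def_.get(); }

  // Proposed addition: mutable access so the ComputeCapability update can edit
  // the existing MetaDef in place instead of constructing a replacement.
  MetaDef* GetMutableMetaDef() { return meta_def_.get(); }

 private:
  std::unique_ptr<MetaDef> meta_def_;
};
```

Callers updating the node list after ConstantFoldingDQ would then mutate through `GetMutableMetaDef()` rather than rebuilding the MetaDef from `original_meta_def`.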
*/

std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)> selection_func;
auto status = g_host->GetEPOptimizerByName("ConstantFoldingDQ", graph_transformer_mgr, selection_func);
Should this check status?
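An ignored Status drops failures silently. A minimal self-contained sketch of propagating it: the `Status` struct and `RETURN_IF_ERROR` macro below are simplified stand-ins modeled on ORT's `ORT_RETURN_IF_ERROR`, and `LookupAndUse` is a hypothetical caller.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Simplified stand-in for onnxruntime's Status.
struct Status {
  Status() = default;
  Status(bool ok_in, std::string msg_in) : ok(ok_in), msg(std::move(msg_in)) {}
  bool IsOK() const { return ok; }
  static Status OK() { return Status(); }
  bool ok = true;
  std::string msg;
};

// Modeled on ORT_RETURN_IF_ERROR: bail out of the caller on failure.
#define RETURN_IF_ERROR(expr)     \
  do {                            \
    Status _s = (expr);           \
    if (!_s.IsOK()) return _s;    \
  } while (0)

// Hypothetical lookup that fails when the named optimizer is not registered.
Status GetEPOptimizerByName(const std::string& name) {
  if (name != "ConstantFoldingDQ") return Status(false, "optimizer not found: " + name);
  return Status::OK();
}

// Caller propagates the failure instead of ignoring the returned status.
Status LookupAndUse(const std::string& name) {
  RETURN_IF_ERROR(GetEPOptimizerByName(name));
  return Status::OK();
}
```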
There are some requirements to modify the graph that are specific to the hardware (GPU/NPU). ORT has a hardcoded EP list for optimizations, but that can't scale and is hard to extend to enable EP-specific custom optimizations.
Here is the prototype for enabling L2+ optimizations for EPs (the original overview was provided by @skottmckay), along with the TRT EP implementation of the ConstantFoldingDQ optimization.
Signatures for selection/optimization functions:
GetCapability
- Call the (new) provider bridge API to look up a pre-defined optimizer by name and get its selection function.
- The EP has to update the returned ComputeCapability to include the optimization ComputeCapability in nodes_to_optimize, so that ORT can later perform the optimization/transformation accordingly.
GraphPartitioner