Graphs with single StaticReshape or StaticTranspose fail #1858
Comments
Hi @richard-lemurian, thank you for the question and for looking into the implementation details. Your observation and analysis look correct to me. Currently there are a bunch of operations defined in the spec and library (https://oneapi-src.github.io/oneDNN/graph_supported_operations.html).
Thanks for your response. It is unfortunate that the graph API cannot handle all of its advertised operators in arbitrary circumstances. If some cannot be fused, it would be good if they were automatically supported using the corresponding non-graph oneDNN API. As it is, using the graph API gives me a collection of partitions, some supported and others not, and there are data dependencies between them. So if I have to provide alternate implementations of the unsupported partitions, is there a way to integrate those alternate implementations with the ones that are supported? Do oneDNN streams do scheduling based on data dependencies, waiting to execute some partitions until the partitions they depend on have been executed? I noticed the following:
So I'm guessing that in the dnnl_stream_in_order mode, the partitions are executed one after another. But if dnnl_stream_out_of_order is selected, does that mean that oneDNN would start some partitions in parallel, while scheduling them to respect data dependencies? I'm just trying to figure out how to handle the case of mixed supported and unsupported partitions.
Let's say you have [...]. The output of the first supported partition would be the input to your op corresponding to the first unsupported partition.
On CPU, [...]
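To make that chaining concrete, here is a minimal sketch of how a caller could execute the partitions of a finalized graph in the order returned by get_partitions() (which follows the data dependencies), mixing library kernels with its own fallbacks. This assumes the oneDNN v3.x C++ API; run_graph, run_fallback, and the buffers map are hypothetical names, and buffer allocation and opaque-layout handling are omitted.

```cpp
#include <unordered_map>
#include <vector>

#include "oneapi/dnnl/dnnl.hpp"
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

// Hypothetical user-side implementation for the ops in an unsupported partition.
void run_fallback(const partition &p, std::unordered_map<size_t, void *> &buffers);

// Execute every partition in the order returned by get_partitions(), so a
// partition's inputs are produced before the partition that consumes them runs.
// `buffers` maps a logical tensor id to its backing memory.
void run_graph(graph &g, dnnl::engine &eng, dnnl::stream &strm,
        std::unordered_map<size_t, void *> &buffers) {
    for (auto &p : g.get_partitions()) {
        auto in_ports = p.get_input_ports();
        auto out_ports = p.get_output_ports();

        if (p.is_supported()) {
            auto cp = p.compile(in_ports, out_ports, eng);

            std::vector<tensor> ins, outs;
            for (auto &lt : in_ports)
                ins.emplace_back(lt, eng, buffers.at(lt.get_id()));
            for (auto &lt : out_ports)
                outs.emplace_back(lt, eng, buffers.at(lt.get_id()));

            cp.execute(strm, ins, outs); // library kernel for this partition
        } else {
            // Our own code reads and writes the same buffers, so the next
            // supported partition sees the data it expects.
            run_fallback(p, buffers);
        }
    }
    strm.wait();
}
```

With an in-order stream this serializes the partitions; overlapping independent partitions would be up to the caller.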
Thanks for your response. So it sounds like I will need to do the scheduling of the partitions myself. Or I could identify collections of partitions that could be run together and schedule them as a group. Sounds doable. But if only the graph API handled all of the operators it documents, it would be so much easier! Any chance that is something the oneDNN graph API team would undertake? I had originally thought of the graph API as a replacement for the non-graph oneDNN API, but it is not unless all the operators are unconditionally supported.
So, could this issue be turned into an enhancement request?
Hi @richard-lemurian, another thing is about performance expectation. oneDNN is a performance library targeting optimizations for key operations/layers in deep learning applications. That's also one of the reasons why it does not support all operations under arbitrary circumstances. Take StaticReshape/StaticTranspose as an example: we provide optimizations when they can be fused with other operations, but we don't see much benefit (and have not received any requests) in handling a single StaticReshape/StaticTranspose in the library vs. calling the existing implementations in the frameworks. So when you ask for implementations of all the operations, are you asking for both functionality and performance? Do you see a performance issue with the "unsupported" partitions? Thank you!
@TaoLv, thank you for your comment and question. A framework provides building blocks that an application can use to implement their application. Recently we decided to try the oneDNN framework. Originally we were using the "primitive" oneDNN API, but then it seemed that the graph API would give better performance, so we decided to try it. We were going to use the primitive API for operators not supported by the graph API, and the graph API for those it supports. It then surprised us that the graph API did not accept some of the operators advertised as being supported by it.

It seems that the situation with the graph API is one of trial and error: build a model using the operators that are documented as being supported; try compiling and running it on the graph API, but receive errors of unsupported partitions; then implement those unsupported partitions (though they contain only operators that were documented as supported) outside the graph API. The confusing thing is that a given operator might be accepted in some contexts in the graph API but not in others, so one cannot really say whether an operator is supported by the graph API or not.

If the alternative to having the graph API support single operators is to use the primitive API, then it seems there should be some semi-automatic way to glue the graph API to the primitive API such that, to graph API users, it appears that the graph API accepts all of its documented operators in all circumstances. I suppose the situation might be somewhat better if the documentation stated under what circumstances an operator could be used, but then that documentation would have to change whenever new fusion passes were added. Of course, we can always read the oneDNN source code, but one would hope users don't have to do that. The other alternative is the trial-and-error approach mentioned above.

I hope you can see that the situation would be much less confusing and easier for the user if all of the documented operators were accepted all of the time. A user tuning for performance would still want to look at the actual partitions produced, to see if the expected fusions were happening, but at least they would have a functional model. If the single operators were at least implemented with the performance of a "primitive" oneDNN API call, that would be as good as the user is likely to get anyway, unless they implemented the kernel themselves.
@TaoLv, does having this issue assigned to you mean that you will work on an implementation of the enhancement, or just that you were assigned to triage it? It would be exciting if you were going to work on it. Do you think there is some semi-automatic way that the "primitive" operator implementations could be glued into the graph API implementation?
Hi @richard-lemurian, sorry for the late reply. What you described looks like a new usage scenario for us. I will need to discuss with my colleagues how to improve the usability of the graph API to cover different usages, but so far I don't have a concrete plan yet. Besides that, I still have a few things to clarify:
Not sure if this is a terminology issue, but as clearly mentioned in the README, oneDNN is a performance library, not a framework for end-to-end model enabling. Deep learning practitioners are advised to use frameworks and toolkits (PyTorch, TensorFlow, OpenVINO, etc.) with oneDNN enabled. The main differences are:
The decision to support or not support a partition happens at the graph partitioning stage, which is relatively early in frameworks. They still have plenty of chances to handle the unsupported partitions and fall back to other implementations. From this perspective, an "unsupported partition" is not an error; it's a designed communication mechanism with the callers. Btw, even when the library returns a partition as supported, I see that some framework integrations still keep the flexibility to decide not to use the library and to call other implementations.
Currently we don't have a guarantee that each graph operation can be implemented with the existing primitive API. If we commit to supporting the usage requested here, there will be quite a bit of development work in front of us, and more in the future whenever we add new operations to the opset.
We can try to see if we can provide a patch if StaticReshape/StaticTranspose are the only interest here. Lastly, as oneDNN only supports a small subset of deep learning and was designed with deep learning frameworks in mind, we assumed that users would always have reference implementations available when an operator is not supported by the library. But now I agree that it may not be that easy if one wants to create an application from scratch without reference implementations like those in frameworks; they will want the library to provide as much functionality as possible. Thank you again for the detailed explanation and patience! CC @igorsafo
@TaoLv, thanks for your further explanations. Can you point me to how and where existing frameworks integrate the oneDNN graph API? As a model for using the graph API, I was following the example code in oneDNN/examples/graph. I suppose you would say that we are building a framework, so it would be helpful to know how other frameworks interact with the oneDNN graph API.
Hi @richard-lemurian, here are a few links for your information:
@TaoLv, Great! Thanks so much. I will study these references. |
StaticReshape and StaticTranspose have been supported since commit 6e75504.
Summary
StaticReshape or StaticTranspose not supported as documented
Version
v3.4.1, github hash f5ff0a6
Environment
oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5950X 16-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU max MHz: 5083.3979
CPU min MHz: 2200.0000
BogoMIPS: 6786.93
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
OS version (uname -a):
Linux samson 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Compiler version (gcc --version):
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CMake version (cmake --version):
cmake version 3.28.3
git hash (git log -1 --format=%H):
f5ff0a6
Steps to reproduce
The following program demonstrates the problem for StaticTranspose.
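The original program is not reproduced here; below is a minimal sketch of what such a reproducer could look like with the oneDNN v3.x C++ graph API. The tensor shapes, IDs, and the transpose order {2, 1, 0} are illustrative assumptions, not taken from the report.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    // Build a graph that contains only one StaticTranspose op.
    graph g(dnnl::engine::kind::cpu);

    logical_tensor src {0, logical_tensor::data_type::f32, {2, 3, 4},
            logical_tensor::layout_type::strided};
    logical_tensor dst {1, logical_tensor::data_type::f32, {4, 3, 2},
            logical_tensor::layout_type::strided};

    op transpose(2, op::kind::StaticTranspose, {src}, {dst}, "transpose");
    transpose.set_attr<std::vector<int64_t>>(op::attr::order, {2, 1, 0});

    g.add_op(transpose);
    g.finalize();

    // With v3.4.1 this single-op graph ends up in an unsupported partition.
    for (auto &p : g.get_partitions())
        std::cout << "is_supported: " << std::boolalpha << p.is_supported()
                  << "\n";
    return 0;
}
```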
Observed behavior
The above program output indicates that a graph with a single StaticTranspose operator is not supported.
Expected behavior
Nothing in the oneAPI documentation indicates that the documented operators must be used in combination with other operators.
I am trying to use the oneDNN Graph API for a machine learning model. We are doing an external partitioning of our operator graph into parts that are supported by the oneDNN Graph API (according to its documentation) and parts that are not. Then we send the subgraphs that are supposedly supported by the oneDNN Graph API to it.
For the purpose of unit testing, we want to be able to create graphs with single operators, for each of the graph API operators that we are using.
However, when I make a graph that has only a single StaticReshape or a single StaticTranspose, the call to partition.is_supported() returns false.
In studying the logic of partitioning, it appears that partitions are chosen using passes.
Each pass has a chance to look at the remaining unpartitioned operators and decide if it can handle them. If so, then it creates a partition, moving the matching operators to the partition.
This continues until there are no unmatched operators.
However, it appears that if the passes of the dnnl backend (or of other backends, if there are others) leave some operators unmatched, then a "fake backend" will match the remainder.
The check for whether a partition is supported then simply tests whether the partition is using the fake backend.
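Restating my understanding of that flow as a small self-contained toy (the types and names below are invented for illustration and do not correspond to the oneDNN source):

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy model of the partitioning flow described above: each backend pass claims
// the ops it recognizes; leftovers are swept into "fake backend" partitions,
// and is_supported() just reports which backend owns the partition.
struct ToyOp { std::string kind; bool claimed = false; };

struct ToyPartition {
    std::vector<ToyOp *> ops;
    bool fake_backend = false;
    bool is_supported() const { return !fake_backend; }
};

using ToyPass = std::function<void(std::vector<ToyOp> &, std::vector<ToyPartition> &)>;

std::vector<ToyPartition> partition_ops(std::vector<ToyOp> &ops,
        const std::vector<ToyPass> &passes) {
    std::vector<ToyPartition> parts;
    for (const auto &pass : passes)
        pass(ops, parts); // real backend passes claim ops they can fuse/execute
    for (auto &op : ops)  // catch-all: anything unclaimed goes to the fake backend
        if (!op.claimed) {
            op.claimed = true;
            parts.push_back({{&op}, /*fake_backend=*/true});
        }
    return parts;
}
```

In this picture, a lone StaticTranspose that no pass claims inevitably lands in a fake-backend (unsupported) partition.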
In looking at the existing patterns in the dnnl backend, it does not appear that the patterns are exhaustive. This is inconvenient because it appears that operators are usable in some contexts, but not in others.
I don't know if I am missing something, but it appears that there needs to be a catch-all pattern that runs after all of the others have run. This pattern would, at a minimum, match single remaining operators, make a partition for them, and then use a kernel that could execute that single operator.
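For what it's worth, a single StaticTranspose can already be expressed with the existing primitive API, which is roughly what such a catch-all kernel (or a caller-side fallback) would have to do. A hedged sketch, where the shapes, the {2, 1, 0} order, and the function name are illustrative assumptions: the reorder copies from a plain "abc" layout into a "cba" layout, so the destination buffer ends up holding the dense transposed tensor.

```cpp
#include "oneapi/dnnl/dnnl.hpp"

// Caller-side fallback for a standalone 3D transpose with order {2, 1, 0}.
// After the reorder, dst_buf holds the data of a dense {4, 3, 2} tensor.
void transpose_fallback(dnnl::engine &eng, dnnl::stream &strm,
        void *src_buf, void *dst_buf) {
    using dt = dnnl::memory::data_type;
    using tag = dnnl::memory::format_tag;

    const dnnl::memory::dims shape = {2, 3, 4};

    // Same logical dims for src and dst; only the physical layout differs.
    dnnl::memory src_mem({shape, dt::f32, tag::abc}, eng, src_buf);
    dnnl::memory dst_mem({shape, dt::f32, tag::cba}, eng, dst_buf);

    // The reorder primitive performs the physical data movement.
    dnnl::reorder(src_mem, dst_mem).execute(strm, src_mem, dst_mem);
    strm.wait();
}
```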