
Graphs with single StaticReshape or StaticTranspose fail #1858

Closed
richard-lemurian opened this issue Apr 4, 2024 · 13 comments

@richard-lemurian

Summary

StaticReshape or StaticTranspose not supported as documented

Version

v3.4.1, github hash f5ff0a6

Environment

oneDNN includes hardware-specific optimizations and may behave
differently depending on the compiler and build environment. Include
the following information to help reproduce the issue:

  • Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Address sizes: 48 bits physical, 48 bits virtual
    Byte Order: Little Endian
    CPU(s): 32
    On-line CPU(s) list: 0-31
    Vendor ID: AuthenticAMD
    Model name: AMD Ryzen 9 5950X 16-Core Processor
    CPU family: 25
    Model: 33
    Thread(s) per core: 2
    Core(s) per socket: 16
    Socket(s): 1
    Stepping: 0
    Frequency boost: enabled
    CPU max MHz: 5083.3979
    CPU min MHz: 2200.0000
    BogoMIPS: 6786.93
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

  • OS version (uname -a)

uname -a
Linux samson 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

  • Compiler version (gcc --version)

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

  • CMake version (cmake --version)

cmake version 3.28.3

  • CMake output log
  • git hash (git log -1 --format=%H)

f5ff0a6

Steps to reproduce

The following program demonstrates the problem for StaticTranspose.

#include <iostream>
#include <memory>
#include <vector>
#include <assert.h>

#include "oneapi/dnnl/dnnl_graph.hpp"
using namespace dnnl;
using data_type = dnnl::graph::logical_tensor::data_type;
using dnnl_status = dnnl::graph::status;
using op = dnnl::graph::op;
using partition = dnnl::graph::partition;
using logical_tensor = dnnl::graph::logical_tensor;

int main() {
  auto g = std::make_unique<dnnl::graph::graph>(dnnl::engine::kind::cpu);
  logical_tensor input_lts {0, data_type::f32};
  logical_tensor output_logical_tensor {1, data_type::f32};
  std::vector<int64_t> transpose_order {0, 2, 1, 3};
  op transpose_op(1, op::kind::StaticTranspose, {input_lts}, {output_logical_tensor}, "permute");
  transpose_op.set_attr(op::attr::order, transpose_order);
  auto add_status = g->add_op(transpose_op);
  assert(add_status == dnnl_status::success);
  g->finalize();
  auto partitions = g->get_partitions();
  auto num_partitions = partitions.size();
  std::cout << "Number of partitions: " << num_partitions << std::endl;
  int which_partition = -1;
  for (const auto &partition : partitions) {
    which_partition++;
    if (!partition.is_supported()) {
      std::cout << "Partition " << which_partition << " is not supported" << std::endl;
      continue;
    }
  }
  std::cout << "Finished" << std::endl;
  return 0;
}

Observed behavior

The above program output indicates that a graph with a single StaticTranspose operator is not supported.

Expected behavior

The partition containing the single StaticTranspose should be reported as supported; nothing in the oneAPI documentation indicates that the documented operators must be used in combination with other operators.

I am trying to use the oneDNN Graph API for a machine learning model. We partition our operator graph externally into parts that the oneDNN Graph API supports (according to its documentation) and parts that it does not, and then hand the supposedly supported subgraphs to the Graph API.

For the purpose of unit testing, we want to be able to create graphs with single operators, for each of the graph API operators that we are using.

However, when I build a graph that contains only a single StaticReshape or a single StaticTranspose, partition.is_supported() returns false for the resulting partition.

In studying the logic of partitioning, it appears that partitions are chosen using passes.

Each pass has a chance to look at the remaining unpartitioned operators and decide if it can handle them. If so, then it creates a partition, moving the matching operators to the partition.

This continues until there are no unmatched operators.

However, it appears that if the passes of the dnnl backend (or of other backends, if there are any) leave some operators unmatched, a "fake backend" claims the remainder.

The check for whether a partition is supported then amounts to checking whether the partition was claimed by the fake backend.

Looking at the existing patterns in the dnnl backend, they do not appear to be exhaustive. This is inconvenient, because an operator can end up being usable in some contexts but not in others.

I don't know if I am missing something, but it appears that there needs to be a catch-all pattern that runs after all of the others. At a minimum, this pattern would match each remaining single operator, create a partition for it, and use a kernel that can execute that single operator.

@richard-lemurian added the "sighting" (Suspicious library behavior. Should be promoted to a bug when confirmed) label on Apr 4, 2024
@TaoLv (Contributor) commented Apr 4, 2024

Hi @richard-lemurian, thank you for the question and for looking into the implementation details. Your observation and analysis look correct to me. Currently there are a number of operations defined in the spec and library (https://oneapi-src.github.io/oneDNN/graph_supported_operations.html).
From the backend implementation perspective, it's possible that an operation is passed to the library but not supported by the backend (not fused with other operations and not implemented as a single-op partition). Those operations will still be returned to users, but as "unsupported" partitions. That's the purpose of the API partition::is_supported(). Users need to check the support status of a partition via this API before compiling it. If a partition is not supported by the library, users will have to handle it by themselves. This is documented in the API documentation.
As for the StaticReshape and StaticTranspose mentioned in this issue, they are supported by the backend via fusions (e.g., 1, 2, 3, etc.), but there is no single-op implementation if a fusion pattern does not match.
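
As a minimal sketch of this check-before-compile workflow (illustrative names only: g is the finalized graph, eng an engine, and in_lt/out_lt logical tensors that already carry concrete shapes):

dnnl::engine eng(dnnl::engine::kind::cpu, 0);
for (const auto &p : g.get_partitions()) {
  if (!p.is_supported()) {
    // This partition comes back unimplemented; the caller must execute its
    // operations itself (e.g., with the primitive API or a reference kernel).
    continue;
  }
  // Compile the supported partition; input/output shapes must be known here.
  auto cp = p.compile({in_lt}, {out_lt}, eng);
  // Later: cp.execute(strm, {in_tensor}, {out_tensor});
}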

@richard-lemurian (Author)

Those operations will still be returned to users but as "unsupported" partitions. That's the purpose of the API partition::is_supported(). [...] If a partition is not supported by the library, users will have to handle it by themselves.

Thanks for your response.

It is unfortunate that the graph API cannot handle all of its advertised operators in arbitrary circumstances. If some cannot be fused, it would be good if they were automatically supported using the corresponding non-graph oneDNN API.

As it is, using the graph API gives me a collection of partitions, some supported and others not, with data dependencies between them. So if I have to provide alternate implementations of the unsupported partitions, is there a way to integrate those alternate implementations with the supported ones? Do oneDNN streams schedule based on data dependencies, waiting to execute a partition until the partitions it depends on have executed? I noticed the following:

/// @brief Stream flags.
typedef enum {
    /// In-order execution.
    dnnl_stream_in_order = 0x1U,
    /// Out-of-order execution.
    dnnl_stream_out_of_order = 0x2U,
    /// Default stream configuration.
    dnnl_stream_default_flags = dnnl_stream_in_order,
   ...
} dnnl_stream_flags_t;

So I'm guessing that in the dnnl_stream_in_order mode, the partitions are executed one after another. But if dnnl_stream_out_of_order is selected, does that mean that oneDNN would start some partitions in parallel, but schedule them to respect data dependencies?

I'm just trying to figure out how to handle the case of mixed supported and unsupported partitions.

@sanchitintel

So if I have to provide alternate implementations of the unsupported partitions, is there a way to integrate those alternate implementations with those that are supported

Let's say you have supported partition -> unsupported partition -> supported partition.

The output of the first supported partition would be input to your op corresponding to the first unsupported partition.
The output of that operator (whose implementation would be handled by you) will be input to the second supported partition.
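
A hedged sketch of that handoff, assuming eng, strm, compiled partitions cp_first/cp_second, shaped logical tensors lt0..lt3, and a buffer length n already exist; my_fallback_op stands for your own implementation of the unsupported op:

std::vector<float> buf0(n), buf1(n), buf2(n), buf3(n);
dnnl::graph::tensor t0(lt0, eng, buf0.data());
dnnl::graph::tensor t1(lt1, eng, buf1.data());
cp_first.execute(strm, {t0}, {t1});         // supported partition #1
strm.wait();
my_fallback_op(buf1.data(), buf2.data());   // your code for the unsupported partition
dnnl::graph::tensor t2(lt2, eng, buf2.data());
dnnl::graph::tensor t3(lt3, eng, buf3.data());
cp_second.execute(strm, {t2}, {t3});        // supported partition #2
strm.wait();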

So I'm guessing that in the dnnl_stream_in_order mode, the partitions are executed one after another. But if dnnl_stream_out_of_order is selected, does that mean that oneDNN would start some partitions in parallel, but schedule them to respect data dependencies

On CPU, dnnl::stream only supports out-of-order execution with SYCL runtime, but the default runtime is OpenMP.
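
So, as a minimal sketch under that default: a stream created on a CPU engine with default flags is in-order, and compiled partitions submitted to it execute in submission order; ordering the submissions to respect data dependencies is the caller's job.

dnnl::engine eng(dnnl::engine::kind::cpu, 0);
// dnnl_stream_default_flags maps to dnnl_stream_in_order, so submissions run sequentially.
dnnl::stream strm(eng, dnnl::stream::flags::in_order);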

@richard-lemurian (Author) commented Apr 4, 2024

Let's say you have supported partition -> unsupported partition -> supported partition.

The output of the first supported partition would be input to your op corresponding to the first unsupported partition. The output of that operator (whose implementation would be handled by you) will be input to the second supported partition.

Thanks for your response.

So it sounds like I will need to do the scheduling of the partitions myself, or identify collections of partitions that can be run together and schedule them as a group. Sounds doable.

But if only the graph API handled all of the operators it documents, it would be so much easier!

Any chance that could be something that would be undertaken by the oneDNN graph API team? I had originally thought of the graph API as a replacement for the non-graph oneDNN API, but it is not unless all the operators are unconditionally supported.

@richard-lemurian (Author)

So, could this issue be turned into an enhancement request?

@vpirogov added the "enhancement" (A feature or an optimization request) label and removed the "sighting" (Suspicious library behavior. Should be promoted to a bug when confirmed) label on Apr 4, 2024
@TaoLv (Contributor) commented Apr 6, 2024

Hi @richard-lemurian,
oneDNN has fewer operations than frameworks (e.g., PyTorch or TensorFlow), whether in the graph API or the non-graph API (a.k.a. the primitive API). This means frameworks or applications have to handle the unsupported operations themselves while calling oneDNN for the supported and optimized ones. Could you please explain a bit more why handling all documented operations would make things easier for you, especially when you still have many undocumented operations to handle?

Another thing is about performance expectation. oneDNN is a performance library targeting optimizations for key operations/layers in deep learning applications. That's also one of the reasons why it does not support all operations under arbitrary circumstances. Take StaticReshape/StaticTranspose as an example: we provide optimizations when they can be fused with other operations, but we did not see much benefit (and did not receive any request) in handling a single StaticReshape/StaticTranspose in the library versus calling the existing implementations in the frameworks. So when you request implementations for all the operations, are you asking for both functionality and performance? Do you see a performance issue with the "unsupported" partitions?

Thank you!

@richard-lemurian (Author) commented Apr 7, 2024

@TaoLv, thank you for your comment and question.

A framework provides building blocks that an application can use to implement its functionality. Recently we decided to try the oneDNN framework. Originally we were using the "primitive" oneDNN API, but it seemed that the graph API would give better performance, so we decided to try it. We planned to use the primitive API for operators not supported by the graph API and the graph API for those it supports. It then surprised us that the graph API did not accept some of the operators advertised as being supported by the graph API.

It seems that the situation with the graph API is one of trial and error. Build a model using the operators that are documented as being supported. Then try compiling and running it on the graph API, but receive errors of unsupported partitions. Then implement those unsupported partitions (though they only contain operators that were documented as supported) outside the graph API. The thing that is confusing is that a given operator might be accepted in some contexts in the graph API, but not others. So one cannot really say whether an operator is supported by the graph API or not.

If the alternative to having the graph API support single operators is to use the primitive API, then it seems there should be some semi-automatic way to glue the graph API to the primitive API, such that to graph API users it appears that the graph API accepts all of its documented operators in all circumstances.

I suppose the situation might be somewhat better if the documentation stated under what circumstances an operator can be used. But then that documentation would have to change whenever new fusion passes were added. Of course, we can always read the oneDNN source code, but one would hope users don't have to do that. The other alternative is the trial-and-error approach mentioned above.

I hope you can see that the situation would be much less confusing and easier for the user if all of the documented operators were accepted all of the time. The user tuning for performance would still want to look at the actual partitions produced, to see if the expected fusions were happening, but at least they would have a functional model. If the single operators were at least implemented with the performance of a "primitive" oneDNN API call, then that would be as good as the user is likely to get anyway, unless they implemented the kernel themselves.

@richard-lemurian (Author)

@TaoLv, does having this issue assigned to you mean that you will work on an implementation of the enhancement, or just that you were assigned to triage it? It would be exciting if you were going to work on it. Do you think there is some semi-automatic way that the "primitive" operator implementations could be glued into the graph API implementation?

@TaoLv (Contributor) commented Apr 11, 2024

Hi @richard-lemurian, sorry for the late reply. What you described looks like a new usage scenario for us. I will need to discuss with my colleagues how to improve the usability of the graph API to cover different usages, but so far I don't have a concrete plan yet. Besides that, I still have a few things to clarify:

Recently we decided to try the oneDNN framework.

Not sure if this is a terminology issue, but as clearly mentioned in the README, oneDNN is a performance library, not a framework for end-to-end model enablement. Deep learning practitioners are encouraged to use frameworks and toolkits (PyTorch, TensorFlow, OpenVINO, etc.) with oneDNN enabled. The main differences are:

  • oneDNN focuses on performance optimizations.
  • oneDNN focuses on key operations and layers, which are far fewer than what frameworks and toolkits cover.
  • oneDNN does not provide a graph executor, which is an essential part of frameworks for handling operation execution and dependencies.

Then try compiling and running it on the graph API, but receive errors of unsupported partitions.

The decision of supported vs. unsupported happens at the graph partitioning stage, which is relatively early in frameworks. They still have plenty of opportunity to handle the unsupported partitions and fall back to other implementations. From this perspective, an "unsupported partition" is not an error; it's a designed communication mechanism with the callers. By the way, even when the library returns a partition as supported, I see that some framework integrations still keep the flexibility to decide not to use the library and to call other implementations instead.

If the single operators were at least implemented with the performance of a "primitive" oneDNN API call

Currently we don't have a guarantee that each graph operation can be implemented with the existing primitive API. If we commit to supporting the usage requested here, there will be quite a bit of development work in front of us now, and more in the future as we add new operations to the opset.

Do you think there is some semi-automatic way that the "primitive" operator implementations could be glued into the graph API implementation?

We can look into providing a patch if StaticReshape/StaticTranspose are the only operations of interest here.

Lastly, since oneDNN only supports a small subset of deep learning operations and was designed with deep learning frameworks in mind, we assumed that users would always have reference implementations available when an operator is not supported by the library. But now I agree that it may not be that easy for someone creating an application from scratch without framework-style reference implementations; such users will want the library to provide as much functionality as possible.

Thank you again for the detailed explanation and patience!

CC @igorsafo

@richard-lemurian (Author)

@TaoLv, thanks for your further explanations. Can you point me to how and where existing frameworks integrate the oneDNN graph API? I was using the example code in oneDNN/examples/graph as a model for using the graph API. I suppose you would say that we are building a framework, so it would be helpful to know how other frameworks interact with the oneDNN graph API.

@TaoLv (Contributor) commented Apr 12, 2024

@richard-lemurian (Author)

@TaoLv, Great! Thanks so much. I will study these references.

@TaoLv (Contributor) commented Jul 21, 2024

StaticReshape and StaticTranspose have been supported since commit 6e75504.
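
For reference, a minimal sketch of compiling and executing the single StaticTranspose partition from the reproducer with a new enough build; shapes are illustrative, the type aliases come from the reproducer, and the logical tensors must be recreated with concrete dims before compilation:

dnnl::engine eng(dnnl::engine::kind::cpu, 0);
dnnl::stream strm(eng);
logical_tensor in_lt {0, data_type::f32, {2, 3, 4, 5}, logical_tensor::layout_type::strided};
logical_tensor out_lt {1, data_type::f32, {2, 4, 3, 5}, logical_tensor::layout_type::strided};
// ... rebuild the graph with these shaped tensors, finalize it, and take the
// single returned partition `p`, which should now report is_supported() == true ...
auto cp = p.compile({in_lt}, {out_lt}, eng);
std::vector<float> src(2 * 3 * 4 * 5), dst(src.size());
dnnl::graph::tensor src_t(in_lt, eng, src.data());
dnnl::graph::tensor dst_t(out_lt, eng, dst.data());
cp.execute(strm, {src_t}, {dst_t});
strm.wait();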

@vpirogov added this to the v3.6 milestone on Aug 15, 2024
@vpirogov closed this as completed on Nov 1, 2024