Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

running destructors before completion of a primitive #1814

Closed
rscohn2 opened this issue Feb 29, 2024 · 8 comments
Closed

running destructors before completion of a primitive #1814

rscohn2 opened this issue Feb 29, 2024 · 8 comments
Assignees
Labels
enhancement A feature or an optimization request
Milestone

Comments

@rscohn2
Copy link
Member

rscohn2 commented Feb 29, 2024

I want to launch a primitive in a function and return immediately. I am getting a segv that looks like a memory corruption. If I wait for completion before returning, it works. I think the problem is that destructors cannot run before completion. In the code below what instances cannot be destroyed before the primitive completes?

struct onednn_engine {
  onednn_engine(dnnl::engine::kind engine_kind) :
    engine_(engine_kind, 0),
    stream_(engine_)
  {}

  void softmax(const memory::dims &dims, const int axis, float *data) {
    auto src_md = memory::desc(dims, dt::f32, tag::nc);
    auto dst_md = memory::desc(dims, dt::f32, tag::nc);
    auto src_mem = memory(src_md, engine_, data);

    // This can be expensive, but dnn will cache it                                                                                      
    auto softmax_pd = softmax_forward::primitive_desc(engine_,
            prop_kind::forward_training, algorithm::softmax_accurate, src_md,
            dst_md, axis);

    auto softmax_prim = softmax_forward(softmax_pd);

    // Set up in-place execution by assigning src as DST.                                                                                
    std::unordered_map<int, memory> softmax_args;
    softmax_args.insert({DNNL_ARG_SRC, src_mem});
    softmax_args.insert({DNNL_ARG_DST, src_mem});

    softmax_prim.execute(stream_, softmax_args);
    // Fails with segv if we do not wait                                                                                                
    stream_.wait();
  }

  void wait() {
    stream_.wait();
  }

  dnnl::engine engine_;
  dnnl::stream stream_;
};
@dzarukin
Copy link
Contributor

Hi @rscohn2, thanks for bringing this one.
Could you, please, supply ONEDNN_VERBOSE=all output with an example?

@rscohn2
Copy link
Member Author

rscohn2 commented Feb 29, 2024

It does not fail with logging:

(cs149) rscohn1@gkdse-fcp-19:examples$ ./primitives-gpt149-softmax-cpp
Segmentation fault (core dumped)
(cs149) rscohn1@gkdse-fcp-19:examples$ ONEDNN_VERBOSE=all ./primitives-gpt149-softmax-cpp
onednn_verbose,info,oneDNN v3.3.0 (commit a96e9b1a6769171e74b0b8e031489303438906e5)
onednn_verbose,info,cpu,runtime:DPC++,nthr:128
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit in\teger support
onednn_verbose,info,gpu,runtime:DPC++
onednn_verbose,info,cpu,engine,0,backend:OpenCL,name:Intel(R) Xeon(R) Gold 6438M,driver_version:2023.16.12,binary_kernels:disabled
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,p\roblem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,ba\ckend,exec_time
onednn_verbose,primitive,create:cache_miss,cpu,softmax,jit:avx512_core_fp16,forward_training,src_f32::blocked:ab::f0 dst_f32::blocke\d:ab::f0 diff_dst_undef::undef:::,,alg:softmax_accurate axis:1,3x1000,0.176025
onednn_verbose,primitive,exec,cpu,softmax,jit:avx512_core_fp16,forward_training,src_f32::blocked:ab::f0 dst_f32::blocked:ab::f0 diff\_dst_undef::undef:::,,alg:softmax_accurate axis:1,3x1000,0.677979
Example passed on CPU.
(cs149) rscohn1@gkdse-fcp-19:examples$

It fails in host_ptr:

Thread 129 "primitives-gpt1" received signal SIGSEGV, Segmentation fault.
[Switching to thread 129 (Thread 0x7ffdbbfff640 (LWP 2215990))]
0x00007ffff374bff0 in dnnl::impl::exec_ctx_t::host_ptr(dnnl::impl::memory_storage_t const*) const ()
   from /localdisk/opt/intel/oneapi/dnnl/2024.0/lib/libdnnl.so.3
(gdb)x/i $pc
=> 0x7ffff374bff0 <_ZNK4dnnl4impl10exec_ctx_t8host_ptrEPKNS0_16memory_storage_tE+32>:   mov    (%rsi),%rax
(gdb) x/i $rsi
   0x5550035effaa:      Cannot access memory at address 0x5550035effaa
(gdb) p/x $rsi
$1 = 0x5550035effaa
(gdb)

Source files are in the attached zip
gpt149.zip

@densamoilov
Copy link
Contributor

@rscohn2

In the code below what instances cannot be destroyed before the primitive completes?

For the engine kind CPU and CPU runtime SYCL the memory objects are currently required to stay alive until the primitive is complete.

@densamoilov
Copy link
Contributor

@rscohn2,

I want to launch a primitive in a function and return immediately

The non-blocking execution model is typically used and beneficial for GPU, is there any reason why you want to use it for CPU?

@rscohn2
Copy link
Member Author

rscohn2 commented Feb 29, 2024

I was looking at making the CPU async to overlap framework time with kernel time. This was a suggestion from @ddkalamk. I am trying to do a prototype to see if it helps.

I guess there is some reference counting for the memory object? I am passing in a copy in the softmax_args The memory object constructor takes a pointer, and does not own the data. It is no problem to keep a memory object live, I am just curious why I can't destruct the original after passing it by value.

@rscohn2
Copy link
Member Author

rscohn2 commented Feb 29, 2024

Storing the memory objects associated with pending operations in a std::vector and clearing the vector after the wait() fixed the problem. Thanks!

@densamoilov
Copy link
Contributor

@rscohn2,

I guess there is some reference counting for the memory object?

Actually, there isn't but there should be. For the GPU kind we don't need it because the memory objects themselves are not used by the GPU kernels so the user only needs to keep the underlying buffer alive if it's USM and doesn't have to do anything if it's SYCL buffer. For the CPU kind, we re-use the classic CPU kernels written in C++ or JIT Assembly, which additionally requires the memory objects to stay alive. We plan to add the reference count mechanism in the future though the interest in SYCL CPU is quite low and therefore the task is not a high priority at this point.

@vpirogov vpirogov added enhancement A feature or an optimization request and removed question labels Feb 29, 2024
@vpirogov vpirogov added this to the v3.6 milestone Jul 16, 2024
@vpirogov vpirogov assigned spalicki and unassigned densamoilov Jul 16, 2024
@vpirogov vpirogov modified the milestones: v3.6, v3.7 Sep 12, 2024
@vpirogov
Copy link
Member

Fixed by 28ec30d and c4b2695.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A feature or an optimization request
Projects
None yet
Development

No branches or pull requests

5 participants