[SYCL][HIP] Memory access fault by GPU on address (nil) #4688
Comments
Hello, I encountered a similar issue not long ago. A PR to fix this was merged a few days ago after this issue was posted. Many Thanks,
@AidanBeltonS, which PR fixed this?
PR #4604 resolved a similar error.
Hi Aidan, thank you for your work. I ran 'git pull' in the llvm directory and then built the compiler with HIP support from scratch. The message is: Memory access fault by GPU node-2 (Agent handle: 0x519c50) on address (nil). Reason: Page not present or supervisor privilege. Thread 7 "ced" received signal SIGSEGV, Segmentation fault.
Okay, thank you for testing that out.
You are welcome.
The gdb error was reported to AMD. ROCm/ROCgdb#9
Hello @zjin-lcf are you still seeing this issue? I'm not able to reproduce the crash with the latest dpc++ and benchmark on MI100.
Could you please explain which changes you made to fix the related issues? Thanks.
Oooh, never mind, I can actually reproduce it. I didn't realize the binary had changed name and is
Okay, so I can now confirm that the following patch fixes
Basically, this is an issue if you have multiple kernels and one has fewer arguments than the previous one. At the moment the global offset value is loaded from an address right after the kernel arguments. That memory is initialized to 0, so it works for a single kernel, but when there are multiple kernels it may hold something different, which leads to random global offset values. That causes crashes like the one you were seeing in this sample. With the patch it should run fine.
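
To make the failure mode described above concrete, here is a minimal host-only C++ sketch. It is a toy model, not the HIP plugin code: `ArgBuffer`, `launch_buggy`, and `launch_fixed` are hypothetical names, and the "fix" shown (explicitly writing the offset slot on every launch) is only one possible remedy in this model, not a description of what the actual patch does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of an argument buffer that is reused across launches.
// It is zero-initialized once, like the memory described in the comment above.
struct ArgBuffer {
    std::array<std::uint64_t, 8> slots{};

    // Copy the kernel arguments to the front of the buffer and return the
    // index of the slot right after them.
    std::size_t pack(const std::vector<std::uint64_t>& args) {
        assert(args.size() < slots.size());
        for (std::size_t i = 0; i < args.size(); ++i) {
            slots[i] = args[i];
        }
        return args.size();
    }
};

// Buggy model: the global offset is read from whatever happens to sit
// right after the arguments of the current kernel.
std::uint64_t launch_buggy(ArgBuffer& buf, const std::vector<std::uint64_t>& args) {
    const std::size_t offset_slot = buf.pack(args);
    return buf.slots[offset_slot];
}

// One possible remedy in this toy model (an assumption, not the real patch):
// always write the offset slot before it is read.
std::uint64_t launch_fixed(ArgBuffer& buf, const std::vector<std::uint64_t>& args,
                           std::uint64_t global_offset = 0) {
    const std::size_t offset_slot = buf.pack(args);
    buf.slots[offset_slot] = global_offset;
    return buf.slots[offset_slot];
}
```

In this model, launching a three-argument kernel and then a one-argument kernel through `launch_buggy` makes the second launch read the first kernel's second argument as its global offset, while a single launch on a fresh buffer reads 0. That matches the observation that a single kernel works and a sequence of kernels with a shrinking argument count does not.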
I typed 'git pull' in my local llvm directory, built the repo from scratch, and then found that your change is not in the sycl branch. Thanks
You might need to do a clean build after making the changes, the CMake doesn't pick up changes to
Yes.
Yes, it can.
Do you know who is an expert in CMake?
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/ced-sycl
./ced -a 0
Running the program shows the following error on an AMD GPU. Could you reproduce the error? Thanks.
Memory access fault by GPU node-2 (Agent handle: 0x51b550) on address (nil). Reason: Page not present or supervisor privilege.
bt
gdb message:
Thread 2 "ced" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff2ec5700 (LWP 2852315)]
0x00007ffff766c18b in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff766c18b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff764b859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff3344a7f in rocr::core::Runtime::VMFaultHandler(long, void*) () from /opt/rocm/hip/lib/../../lib/libhsa-runtime64.so.1
#3 0x00007ffff334753b in rocr::core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/hip/lib/../../lib/libhsa-runtime64.so.1
#4 0x00007ffff32ef497 in rocr::os::ThreadTrampoline(void*) () from /opt/rocm/hip/lib/../../lib/libhsa-runtime64.so.1
#5 0x00007ffff7fa1609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007ffff7748293 in clone () from /lib/x86_64-linux-gnu/libc.so.6