[SYCL][HIP] Memory access fault by GPU on address (nil) #4688

Closed
zjin-lcf opened this issue Oct 2, 2021 · 16 comments · Fixed by #4905
Labels
bug Something isn't working hip Issues related to execution on HIP backend. runtime Runtime library related issue

Comments

@zjin-lcf
Contributor

zjin-lcf commented Oct 2, 2021

https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/ced-sycl

./ced -a 0

Running the program produces the following error on an AMD GPU. Could you reproduce it? Thanks.

Memory access fault by GPU node-2 (Agent handle: 0x51b550) on address (nil). Reason: Page not present or supervisor privilege.
bt

gdb message:
Thread 2 "ced" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff2ec5700 (LWP 2852315)]
0x00007ffff766c18b in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff766c18b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff764b859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff3344a7f in rocr::core::Runtime::VMFaultHandler(long, void*) () from /opt/rocm/hip/lib/../../lib/libhsa-runtime64.so.1
#3 0x00007ffff334753b in rocr::core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/hip/lib/../../lib/libhsa-runtime64.so.1
#4 0x00007ffff32ef497 in rocr::os::ThreadTrampoline(void*) () from /opt/rocm/hip/lib/../../lib/libhsa-runtime64.so.1
#5 0x00007ffff7fa1609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007ffff7748293 in clone () from /lib/x86_64-linux-gnu/libc.so.6

@zjin-lcf zjin-lcf added the bug Something isn't working label Oct 2, 2021
@alexbatashev alexbatashev added the hip Issues related to execution on HIP backend. label Oct 2, 2021
@AerialMantis AerialMantis added the runtime Runtime library related issue label Oct 3, 2021
@AidanBeltonS
Contributor

Hello,

I encountered a similar issue not long ago. A PR fixing it was merged a few days ago, after this issue was posted.
Could you check whether this still happens with a tip build of llvm, so I know whether it is the same issue or not?

Many Thanks,
Aidan

@bader
Contributor

bader commented Oct 8, 2021

@AidanBeltonS, which PR fixed this?

@AidanBeltonS
Contributor

PR #4604 resolved a similar error.
The memory access errors are quite generic though, so it could be a separate issue.

@zjin-lcf
Contributor Author

zjin-lcf commented Oct 8, 2021

Hi Aidan,

Thank you for your work. I typed 'git pull' in the llvm directory, and then built the compiler with HIP support from scratch. The message is:

Memory access fault by GPU node-2 (Agent handle: 0x519c50) on address (nil). Reason: Page not present or supervisor privilege.

Thread 7 "ced" received signal SIGSEGV, Segmentation fault.
[Switching to AMDGPU Thread 2:4:1:1 (0,0,0)/0]
main::{lambda()#1}::operator()() const::{lambda(cl::sycl::handler&)#5}::operator()(cl::sycl::handler&) const::{lambda(cl::sycl::nd_item<2>)#1}::operator()(cl::sycl::nd_item<2>) const (
/home/release/git/aomp13/ROCgdb/gdb/dwarf2/frame.c:1029: internal-error: Unknown CFA rule.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

@AidanBeltonS
Contributor

Okay, thank you for testing that out.
Then this is a separate issue, which helps narrow it down.

@zjin-lcf
Contributor Author

You are welcome.

@zjin-lcf
Contributor Author

The gdb error was reported to AMD: ROCm/ROCgdb#9

@npmiller
Contributor

Hello @zjin-lcf, are you still seeing this issue? I'm not able to reproduce the crash with the latest DPC++ and the latest version of the benchmark on an MI100.

@zjin-lcf
Contributor Author

Could you please explain which changes you made to fix the related issues? Thanks.

@npmiller
Contributor

Oh, never mind, I can actually reproduce it. I didn't realize the binary had changed name and is now called main. It seems it was working with the old one I had, though.

@npmiller
Contributor

npmiller commented Nov 12, 2021

Okay, so I can now confirm that the following patch fixes ced:

Basically, this is an issue if you have multiple kernels and one has fewer arguments than the previous one.

At the moment, the global offset value is loaded from an address right after the kernel arguments. That memory is initialized to 0, so it works for a single kernel. With multiple kernels, however, that location may hold leftover data from an earlier launch, which leads to random global offset values and causes crashes like the one you were seeing in this sample.

But with the patch it should run fine.

@bader bader linked a pull request Nov 14, 2021 that will close this issue
@zjin-lcf
Contributor Author

I typed 'git pull' in my local llvm directory, built the repo from scratch, and then found that your change is not in the sycl branch.
I manually changed the two source files listed in your PR, and then rebuilt the HIP plugin incrementally.
I still observe the same error message. Am I right?

Thanks

@npmiller
Contributor

> I typed 'git pull' in my local llvm directory, built the repo from scratch, and then found that your change is not in the sycl branch. I manually changed the two source files listed in your PR, and then built the hip plugin incrementally. I observed the same error message. Am I right?
>
> Thanks

You might need to do a clean build after making the changes. The CMake build doesn't pick up changes to libclc very well, so it's likely your current build is still using the old code for libclc/amdgcn/libspirv/workitem/get_global_offset.cl.

@zjin-lcf
Contributor Author

Yes.
@bader Do you know if CMake can include .cl files? Doing a clean build for every change to a .cl file takes too much time.

@bader
Contributor

bader commented Nov 17, 2021

> Do you know if CMake can include .cl files?

Yes, it can.
If a clean build is required to pick up source file updates, it looks like a bug in the CMake script (missing dependencies in particular).

@zjin-lcf
Contributor Author

Do you know who is an expert in CMake?

6 participants