This repository was archived by the owner on Nov 18, 2020. It is now read-only.

Add OpenCL runtime support #24

Draft
wants to merge 7 commits into
base: master

Conversation

jpsamaroo
Member

@jpsamaroo jpsamaroo commented Dec 25, 2019

Abstract runtime functionality by HSA (TODO: OCL)
Use Requires to load OpenCL bindings
Allow choosing runtime via environment variables

Closes #20, closes #23
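As a rough illustration of the third bullet, runtime selection from an environment variable could look something like the sketch below. The variable name `EXAMPLE_GPU_RUNTIME`, the `select_runtime` function, and the `:HSA`/`:OCL` symbols are all hypothetical and are not taken from this PR; loading the OpenCL bindings through Requires is not shown.

```julia
# Hypothetical names throughout: the variable name and the runtime symbols
# are illustrative only and do not come from the PR.
const DEFAULT_RUNTIME = :HSA

# Pick a runtime from an environment-like dictionary (defaults to ENV).
function select_runtime(env::AbstractDict=ENV; var::AbstractString="EXAMPLE_GPU_RUNTIME")
    # Fall back to the default when the variable is unset.
    name = uppercase(strip(get(env, var, String(DEFAULT_RUNTIME))))
    if name == "HSA"
        return :HSA
    elseif name == "OCL" || name == "OPENCL"
        return :OCL
    else
        error("unknown runtime $(repr(name)); expected HSA or OCL")
    end
end

runtime = select_runtime()
println("selected runtime: ", runtime)
```

Passing the environment as an argument keeps the selection logic testable without mutating the real `ENV`.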

@vchuravy
Member

Whoooo!

@jpsamaroo
Member Author

jpsamaroo commented Dec 27, 2019

So far I've gotten kernels to launch through OpenCL, but they currently segfault on the GPU. As I understand it, we aren't extracting the correct device-side pointer from cl.Buffer when we convert it to ROCDeviceArray (and then to FakeDeviceArray), so this shouldn't be expected to work right now. I'll probably need to figure out how to allocate buffers from OpenCL that mirror what we do with hsa_memory_alloc in fine-grained mode; then we should be able to extract (somehow) a pointer that works from both host and device and pass that in.

Key note for reviewers: we (and LLVM) expect our array arguments to be of type ROCDeviceArray during compilation, so our kernels extract the actual buffer pointer from that struct. OpenCL apparently just passes pointers to raw buffers (as in C) instead of using nested structs, so we need to trick OpenCL into writing our ROCDeviceArray structs directly into the kernarg buffer. This part is working thanks to some code in OpenCL.jl that automatically handles isbits structs, so it's now on us to ensure that the right device-accessible pointer is embedded in the struct.
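To make the struct-in-kernarg idea concrete, here is a minimal pure-Julia sketch of how an isbits struct carrying a pointer turns into raw argument bytes. `ExampleDeviceArray` and `argument_bytes` are hypothetical stand-ins; the real ROCDeviceArray layout and the OpenCL.jl argument handling may differ, and the address used is made up.

```julia
# Hypothetical stand-in for a device array descriptor: a shape plus a raw
# pointer. The real ROCDeviceArray layout may differ.
struct ExampleDeviceArray{T,N}
    shape::NTuple{N,Int}
    ptr::Ptr{T}
end

# Copy the raw bytes of an isbits value into a byte buffer, roughly what
# passing the struct by value as a kernel argument amounts to.
function argument_bytes(x::T) where {T}
    isbitstype(T) || error("only isbits values can be passed by value")
    buf = Vector{UInt8}(undef, sizeof(T))
    GC.@preserve buf unsafe_store!(convert(Ptr{T}, pointer(buf)), x)
    return buf
end

# A made-up address standing in for a device-accessible pointer.
arr = ExampleDeviceArray{Float32,1}((16,), reinterpret(Ptr{Float32}, UInt(0x1000)))
bytes = argument_bytes(arr)

# The pointer field sits at a fixed byte offset inside the struct.
off = Int(fieldoffset(typeof(arr), 2))
embedded = reinterpret(UInt, bytes[off+1:off+sizeof(UInt)])[1]
println("struct size: ", sizeof(arr), " bytes, pointer field at offset ", off)
println("embedded pointer: 0x", string(embedded, base=16))
```

Whatever address is stored in the `ptr` field is copied verbatim, so a host-only address would reach the kernel unchanged; that matches the concern above about embedding a device-accessible pointer.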

@jpsamaroo
Member Author

Note to self: If we do implement a hacky (slow) workaround for getting the device pointer, we should also provide a shortcut via clSVMAlloc, which supposedly does exactly what we do with HSARuntime. This of course requires OpenCL 2.0, but that's reasonable to expect if one wants the best performance.

@jpsamaroo
Member Author

Now I've got kernels running without segfaults (see the new test/opencl.jl test script), but it appears that the C array never gets written to. If anyone has an idea for why this is happening, I'm all ears!

@vchuravy
Member

> If anyone has an idea for why this is happening, I'm all ears!

Do you need to synchronize the memory?

@jpsamaroo
Member Author

It doesn't seem like that's the issue since we wait on the kernel's event, and even adding in a sync_workgroup() call to the kernel doesn't seem to do anything.

@jpsamaroo
Member Author

If anyone has a working ROCm debugger setup, it would be great if we could see what instructions the GPU is actually executing (including memory addresses). I suspect we aren't writing to the correct location.

@jpsamaroo jpsamaroo changed the title from "[WIP] Add OpenCL runtime support" to "Add OpenCL runtime support" May 12, 2020
@jpsamaroo jpsamaroo marked this pull request as draft May 12, 2020 12:19
Development

Successfully merging this pull request may close these issues.

Support OpenCL.jl as device runtime
Agent and queue are ignored in calls to at-roc
2 participants