Shared object support #15

Merged
merged 83 commits into master from share-object-support on Jul 19, 2023
Conversation

@n-eiling (Member) commented Feb 17, 2023

Adds support for launching kernels from shared objects loaded at runtime using dlopen. As this is how pytorch uses CUDA, this should enable pytorch support in Cricket.
This involved adding support for decoding the fatbinary metadata that precedes the embedded cubin ELF in binaries compiled by nvcc. Cricket can now extract the cubin from a binary and send it via RPC to the server, where it is loaded using the driver API's cuModuleLoadData.
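For context, the server-side loading step reduces to the standard driver-API pattern below (a minimal sketch; the RPC plumbing is omitted and the names are placeholders, not the actual Cricket code):

```c
#include <stddef.h>
#include <cuda.h>

/* Sketch of the server-side loading step: the cubin bytes received via
 * RPC are handed straight to the driver API. Names are placeholders. */
CUmodule load_cubin(const void *cubin_image, const char *kernel_name,
                    CUfunction *func_out)
{
    CUmodule module;
    /* cuModuleLoadData accepts an in-memory image (cubin or fatbin). */
    if (cuModuleLoadData(&module, cubin_image) != CUDA_SUCCESS)
        return NULL;
    /* Resolve the kernel by its mangled name for later cuLaunchKernel calls. */
    if (cuModuleGetFunction(func_out, module, kernel_name) != CUDA_SUCCESS) {
        cuModuleUnload(module);
        return NULL;
    }
    return module;
}
```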

  • decode fatbinary
  • extract cubin
  • send cubin to server
  • add a registry for transferred cubins and kernel functions so Cricket is able to identify them when launching kernels (see the registry sketch after this list)
  • switch the old kernel launching functionality over to always use the new registry instead of relying on kernel locations being the same on client and server
  • use libelf to read kernel info instead of relying on cuobjdump, which does not support in-memory ELFs
  • read parameter info using libelf
  • enable reading CUDA ELFs with debugging info and compressed ELFs
  • Test with minimal pytorch (deactivated some features, no kernel compression)
  • Test with default pytorch (no kernel compression)
  • Fix compression and test with default pytorch (with kernel compression)
  • fix CI
  • large-scale test / verification
    • YOLOv5
  • cuDNN implementation
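The registry item above is the heart of the design: conceptually, a map from the host-side function handle that the application passes to cudaLaunchKernel to the kernel's name and owning cubin, so a launch can be translated into a (module, function name) pair on the server. A minimal sketch of the idea, with all identifiers hypothetical rather than the actual Cricket names:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the kernel registry idea; none of these names
 * are the real Cricket identifiers. */
struct kernel_entry {
    const void *host_func;  /* address the app passes to cudaLaunchKernel */
    char       *name;       /* mangled kernel name from the cubin ELF */
    int         module_id;  /* which transferred cubin it lives in */
    struct kernel_entry *next;
};

static struct kernel_entry *registry = NULL;

void registry_add(const void *host_func, const char *name, int module_id)
{
    struct kernel_entry *e = malloc(sizeof(*e));
    e->host_func = host_func;
    e->name = strdup(name);
    e->module_id = module_id;
    e->next = registry;
    registry = e;
}

/* On launch: look up by host function pointer instead of assuming the
 * kernel sits at the same address on client and server. */
const struct kernel_entry *registry_find(const void *host_func)
{
    for (struct kernel_entry *e = registry; e != NULL; e = e->next)
        if (e->host_func == host_func)
            return e;
    return NULL;
}
```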

This also makes LD_PRELOADing on the server side unnecessary, as we now extract and send cubins for normal applications as well.

This is work in progress. Addresses #6

@n-eiling n-eiling added enhancement New feature or request doing labels Feb 17, 2023
@n-eiling n-eiling self-assigned this Feb 17, 2023
@n-eiling n-eiling changed the title WIP: Share object support WIP: Shared object support Feb 17, 2023
@nravic commented Feb 20, 2023

Just saw this! Thanks for taking it on haha, was about to start this weekend. I'd love to help with this effort, let me know if there's anything I can do.

@jin-zhengnan commented Mar 23, 2023

@n-eiling When will the entire test be completed? I am looking forward to this!

@n-eiling (Member, Author) commented:
I updated my todo list. There are still some open issues that need addressing. CUDA relies on the .nv.info section for information about kernel parameter sizes and offsets. I used to parse it with cuobjdump, but cuobjdump only supports files, not in-memory ELFs.
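For reference, libelf can open an in-memory image directly via elf_memory, which is exactly what cuobjdump cannot do. A minimal sketch of locating the .nv.info section this way (error handling trimmed; not the actual Cricket code):

```c
#include <stddef.h>
#include <string.h>
#include <libelf.h>
#include <gelf.h>

/* Sketch: locate the .nv.info section in an in-memory cubin with libelf.
 * Assumes elf_version(EV_CURRENT) has been called once at startup. */
Elf_Data *find_nv_info(char *image, size_t size)
{
    Elf *elf = elf_memory(image, size);  /* no file needed */
    if (elf == NULL)
        return NULL;

    size_t shstrndx;
    if (elf_getshdrstrndx(elf, &shstrndx) != 0)
        return NULL;

    Elf_Scn *scn = NULL;
    while ((scn = elf_nextscn(elf, scn)) != NULL) {
        GElf_Shdr shdr;
        if (gelf_getshdr(scn, &shdr) == NULL)
            continue;
        const char *name = elf_strptr(elf, shstrndx, shdr.sh_name);
        if (name != NULL && strcmp(name, ".nv.info") == 0)
            return elf_getdata(scn, NULL);  /* raw contents: parameter sizes/offsets */
    }
    return NULL;
}
```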

@n-eiling n-eiling force-pushed the share-object-support branch 5 times, most recently from 78444ec to 36498a8 on July 18, 2023 15:01
@n-eiling n-eiling force-pushed the share-object-support branch from 36498a8 to 088b6fc on July 18, 2023 15:07
@n-eiling n-eiling marked this pull request as ready for review July 19, 2023 07:33
@n-eiling (Member, Author) commented:
I will merge this because the branch has diverged quite a bit and the original PR feature is working well. For pytorch, I still have some issues with the cudnnBackend API, which I will work on in a different branch.

@n-eiling n-eiling merged commit bcc5c93 into master Jul 19, 2023
@n-eiling n-eiling changed the title WIP: Shared object support Shared object support Jul 19, 2023
@mkroening mkroening deleted the share-object-support branch November 13, 2023 22:08
@KangYingjie0 commented:

padding = ((8 - (size_t)(input + input_read)) % 8); — maybe change this to padding = (8 - (size_t)(input + input_read) % 8);? @n-eiling And if the address is already exactly divisible by 8, do you still need to add padding?

@n-eiling: Isn't this what the original code achieves? Your suggestion doesn't work when the address is exactly divisible by 8, while the original code correctly adds no padding in that case.
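A standalone way to compare the two expressions (only the alignment arithmetic from the discussion above, with the pointer replaced by a plain address; only the low three bits matter):

```c
#include <stdio.h>
#include <stdint.h>

/* Compare the two padding expressions for each possible alignment. */
int main(void)
{
    for (uintptr_t addr = 0; addr < 8; addr++) {
        size_t original  = (8 - addr) % 8;  /* pads 0 when already aligned */
        size_t suggested = 8 - addr % 8;    /* pads 8 when already aligned */
        printf("addr %% 8 == %zu: original = %zu, suggested = %zu\n",
               (size_t)addr, original, suggested);
    }
    return 0;
}
```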

@KangYingjie0: LOGE(LOG_ERROR, "cannot find kernel %s kernel_info_t") — is this log missing a parameter? @n-eiling Also, I'd like to know: does cuGetExportTable work normally in this project? I see you have done a lot of work on it.

@n-eiling: Thanks for catching the error! I fixed it. cuGetExportTable is part of the interface between the runtime and driver APIs. I experimented a lot with getting the runtime API working while implementing only the driver API in Cricket. cuGetExportTable exchanges some pretty deep data structures between the APIs, and I did not manage to figure out all the memory I need to copy. So it currently does not work correctly.

@KangYingjie0: I tested the functionality of cuGetExportTable: it takes a char[16] as exportTableId and returns the matching hidden function info (in libcuda.so) through ppExportTable. If ppExportTable is only a function pointer, the hidden function can be called through it. I'm not entirely sure, though; ppExportTable could also contain more complex information.
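For reference, the entry point declared in cuda.h has the shape below (a sketch only; the zeroed table ID is a placeholder, since the real IDs are undocumented):

```c
#include <string.h>
#include <cuda.h>

/* Sketch of the call shape. CUuuid wraps the 16-byte exportTableId
 * mentioned above; the zeroed ID here is a placeholder. */
int query_export_table(void)
{
    CUuuid table_id;
    memset(&table_id, 0, sizeof(table_id));

    const void *table = NULL;
    if (cuGetExportTable(&table, &table_id) != CUDA_SUCCESS)
        return -1;

    /* The result is opaque: reportedly a struct of function pointers into
     * libcuda.so, sometimes preceded by a size field, so forwarding it
     * over RPC would require knowing each table's exact layout. */
    return 0;
}
```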
