
[EPIC]: Make CUB device-side algorithms work with NVRTC/Jitify #403

Closed
jrhemstad opened this issue Sep 5, 2023 · 4 comments · Fixed by #1081

Labels
feature request New feature or request.
@jrhemstad (Collaborator)

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

As a user of CUB, I would like to be able to use device-side algorithms like cub::BlockReduce in kernels that are compiled at runtime with NVRTC/Jitify.

However, this is not an explicitly supported use case nor does CUB have any testing that verifies this works.

Describe the solution you'd like

All CUB warp/block headers should support runtime compilation with NVRTC and/or Jitify.

Furthermore, CUB should expand its testing infrastructure to enable testing device-side algorithm headers.
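
To make the target use case concrete, here is a minimal host-side sketch of compiling a CUB block-level kernel at runtime with the NVRTC C API. The include paths and the compute_80 architecture are illustrative assumptions, and error handling is elided:

```cuda
// Sketch: runtime-compile a cub::BlockReduce kernel with NVRTC.
// Paths and architecture below are placeholders, not a tested setup.
#include <nvrtc.h>

static const char* kKernel = R"(
#include <cub/block/block_reduce.cuh>

extern "C" __global__ void block_sum(const int* in, int* out) {
  using BlockReduce = cub::BlockReduce<int, 256>;
  __shared__ typename BlockReduce::TempStorage temp;
  int sum = BlockReduce(temp).Sum(in[threadIdx.x]);
  if (threadIdx.x == 0) *out = sum;
}
)";

int main() {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kKernel, "block_sum.cu", 0, nullptr, nullptr);
  const char* opts[] = {
      "--std=c++17",
      "--gpu-architecture=compute_80",      // illustrative
      "-I/path/to/cccl/cub",                // illustrative include paths
      "-I/path/to/cccl/libcudacxx/include",
  };
  nvrtcResult res = nvrtcCompileProgram(prog, 4, opts);
  // ... on success, fetch PTX with nvrtcGetPTXSize/nvrtcGetPTX, then load
  // with cuModuleLoadData and launch with cuLaunchKernel.
  nvrtcDestroyProgram(&prog);
  return res == NVRTC_SUCCESS ? 0 : 1;
}
```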


Describe alternatives you've considered

No response

Additional context

No response

@jrhemstad jrhemstad added the feature request New feature or request. label Sep 5, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Sep 5, 2023
@jrhemstad (Collaborator, Author)

Maybe we can kill two birds with one stone here and tackle #318 as part of this as well.

@leofang (Member) commented Sep 19, 2023

I have a few thoughts after working intensively on integrating CCCL + Jitify in CuPy (Jitify 1, to be precise). xref: cupy/cupy#7851, cupy/cupy#7869

This is the anatomy of Jitify from my perspective. For any user-provided CUDA C++ kernel string, it:

  1. Aggressively searches for std includes not available to (or not usable by) NVRTC
  2. Applies custom std patches to the found std includes
  3. Abstracts out CUDA (NVRTC & driver) API calls for kernel instantiation, compilation, and launch

Item 1 is essential for compiling any C++ header with NVRTC because, unlike std::move, std::forward, and std::initializer_list, the majority of the C++ standard library is not built into NVRTC.

Item 3 is a nice-to-have feature that is probably not needed by libraries like libcudacxx (certainly true for CuPy) that have their own infrastructure (at least for testing purposes).

Item 2 is the problematic one (especially after NVIDIA/jitify#118 was merged). One way or another, Jitify's custom std patches would conflict with libcudacxx (if available and included). Item 2 exists for historical reasons (NVRTC offered no builtin std functionality and libcudacxx was not yet a thing), but now that libcudacxx is getting mature (especially with its full-fledged type_traits) I argue that Item 2, the custom std patches, should be eliminated completely in favor of always using libcudacxx. In particular, one should never mix and match C++ std code from Jitify and libcudacxx.
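
As an illustration of the "always use libcudacxx" argument, NVRTC-compiled code can take type traits from libcudacxx's self-contained headers instead of patched host std headers. A minimal sketch (function name is illustrative):

```cuda
// Sketch: with libcudacxx on the include path, NVRTC code gets working
// type_traits without any Jitify std patches.
#include <cuda/std/type_traits>

template <typename T>
__device__ T twice(T x) {
  static_assert(cuda::std::is_arithmetic<T>::value,
                "twice() requires an arithmetic type");
  return x + x;
}
```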

Ultimately, my wishlist item is that libcudacxx be hard-wired into NVRTC, just like std::initializer_list, so that we can also completely eliminate Item 1 (and arguably Jitify too 😅), but that's just my wishful thinking.

cc: @maddyscientist @benbarsdell for vis

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 10, 2023
@gevtushenko gevtushenko self-assigned this Nov 10, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Nov 15, 2023
@m-schuetz commented Apr 18, 2024

Hi, pretty awesome to have more of these things running with NVRTC! Should this also work with device-wide sorts, or is that not supported yet?

I've been loosely following this example, and I was able to compile with #include <cub/warp/warp_reduce.cuh>, but things fell apart when I tried to include #include <cub/device/device_radix_sort.cuh>. There were errors about stdio and other includes that could not be found, and after adding more include paths I ultimately ended up with the following error:

[...]/libs/cccl-main/cub/cub/detail/choose_offset.cuh(64): error: namespace "std" has no member "uint32_t"
    using type = typename ::cuda::std::conditional<sizeof(NumItemsT) <= 4, std::uint32_t, unsigned long long>::type;
                                                                                ^

[...]/libs/cccl-main/cub/cub/detail/choose_offset.cuh(87): error: namespace "std" has no member "int32_t"
    using type = typename ::cuda::std::conditional<sizeof(NumItemsT) < 4, std::int32_t, NumItemsT>::type;
                                                                               ^

[...]/libs/cccl-main/libcudacxx/include/cuda/std/detail/libcxx/include/stdint.h(129): catastrophic error: cannot open source file "stdint.h"
  #include_next <stdint.h>
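
For context (my reading of the error, not a confirmed fix): NVRTC has no host <cstdint>, so an unqualified std::uint32_t cannot resolve there. An NVRTC-safe spelling draws the fixed-width types from libcudacxx instead, roughly:

```cuda
// Sketch mirroring the failing line in cub's choose_offset.cuh; the alias
// name here is illustrative. Under NVRTC, fixed-width integer types must
// come from libcudacxx, not the host standard library.
#include <cuda/std/cstdint>
#include <cuda/std/type_traits>

template <typename NumItemsT>
using promoted_offset_t =
    typename ::cuda::std::conditional<sizeof(NumItemsT) <= 4,
                                      ::cuda::std::uint32_t,
                                      unsigned long long>::type;
```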

Are the device-wide sort algorithms (callable from within a kernel) not ready for nvrtc yet, or am I doing something wrong?

nvrtcCompileProgram arguments:

--gpu-architecture=compute_89
--use_fast_math
--extra-device-vectorization
-lineinfo
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/cub
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include/cuda/std/detail/libcxx/include/
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include/cuda/std
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/thrust
-I C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4/include
--relocatable-device-code=true
-default-device
-dlto
--std=c++17

I was wondering if eventually I could use a device-wide sort via something like:

```cuda
void kernel() {
    auto grid = cg::this_grid();

    // ... do stuff

    // now radix-sort an array of integers
    grid.sync();
    cub_radix_sort(...);
    grid.sync();

    // now do something with the sorted list of integers
}
```

Thanks!
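
For reference, a block-scope sort is among the warp/block algorithms this epic targets, and a minimal sketch of it looks like this (block size and items per thread are illustrative, assuming a blocked data layout):

```cuda
// Sketch: block-level radix sort with cub::BlockRadixSort.
#include <cub/block/block_radix_sort.cuh>

__global__ void sort_block(int* data) {
  constexpr int BLOCK_THREADS = 128;
  constexpr int ITEMS_PER_THREAD = 4;
  using BlockRadixSort =
      cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
  __shared__ typename BlockRadixSort::TempStorage temp;

  // Each thread owns a blocked segment of keys.
  int keys[ITEMS_PER_THREAD];
  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    keys[i] = data[threadIdx.x * ITEMS_PER_THREAD + i];

  BlockRadixSort(temp).Sort(keys);

  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    data[threadIdx.x * ITEMS_PER_THREAD + i] = keys[i];
}
```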

@jrhemstad (Collaborator, Author)

Hey @m-schuetz, you reminded me I never responded to the discussion you'd opened. I just responded there :)
