
[EPIC]: Make CUB device-side algorithms work with NVRTC/Jitify #403

Closed
jrhemstad opened this issue Sep 5, 2023 · 4 comments · Fixed by #1081

Labels
feature request New feature or request.
@jrhemstad (Collaborator)

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

As a user of CUB, I would like to be able to use device-side algorithms like cub::BlockReduce in kernels that are compiled at runtime with NVRTC/Jitify.

However, this is not an explicitly supported use case nor does CUB have any testing that verifies this works.

Describe the solution you'd like

All CUB warp/block headers should support runtime compilation with NVRTC and/or Jitify.

Furthermore, CUB should expand its testing infrastructure to enable testing device-side algorithm headers.
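
To make the target use case concrete, here is a minimal host-side sketch of compiling a CUB block-level kernel at runtime with the NVRTC C API. The include paths and the compute_80 architecture are illustrative assumptions, and error handling is elided:

```cuda
// Sketch: runtime-compile a cub::BlockReduce kernel with NVRTC.
// Paths and architecture below are placeholders, not a tested setup.
#include <nvrtc.h>

static const char* kKernel = R"(
#include <cub/block/block_reduce.cuh>

extern "C" __global__ void block_sum(const int* in, int* out) {
  using BlockReduce = cub::BlockReduce<int, 256>;
  __shared__ typename BlockReduce::TempStorage temp;
  int sum = BlockReduce(temp).Sum(in[threadIdx.x]);
  if (threadIdx.x == 0) *out = sum;
}
)";

int main() {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kKernel, "block_sum.cu", 0, nullptr, nullptr);
  const char* opts[] = {
      "--std=c++17",
      "--gpu-architecture=compute_80",      // illustrative
      "-I/path/to/cccl/cub",                // illustrative include paths
      "-I/path/to/cccl/libcudacxx/include",
  };
  nvrtcResult res = nvrtcCompileProgram(prog, 4, opts);
  // ... on success, fetch PTX with nvrtcGetPTXSize/nvrtcGetPTX, then load
  // with cuModuleLoadData and launch with cuLaunchKernel.
  nvrtcDestroyProgram(&prog);
  return res == NVRTC_SUCCESS ? 0 : 1;
}
```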


Describe alternatives you've considered

No response

Additional context

No response

@jrhemstad jrhemstad added the feature request New feature or request. label Sep 5, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Sep 5, 2023
@jrhemstad (Collaborator, Author)

Maybe we can kill two birds with one stone here and tackle #318 as part of this as well.

@leofang (Member) commented Sep 19, 2023

I have a few thoughts after working intensively on integrating CCCL + Jitify in CuPy (Jitify 1, to be precise). xref: cupy/cupy#7851, cupy/cupy#7869

This is the anatomy of Jitify from my perspective. For any user-provided CUDA C++ kernel string, it:

  1. Aggressively searches for std includes not available to (or not usable by) NVRTC
  2. Applies custom std patches to the found std includes
  3. Abstracts out CUDA (NVRTC & driver) API calls for kernel instantiation, compilation, and launch

Item 1 is essential for compiling any C++ header with NVRTC because, unlike std::move, std::forward, and std::initializer_list, the majority of the C++ standard library is not built into NVRTC.

Item 3 is a nice-to-have feature that is probably not needed by libraries like libcudacxx (certainly true for CuPy) that have their own infrastructure (at least for testing purposes).

Item 2 is the problematic one (especially after NVIDIA/jitify#118 was merged). One way or another, Jitify's custom std patches would conflict with libcudacxx (if available and included). Item 2 exists for historical reasons (NVRTC offered no builtin std functionality and libcudacxx was not yet a thing), but now that libcudacxx is getting mature (especially with its full-fledged type_traits) I argue that Item 2, the custom std patches, should be eliminated completely in favor of always using libcudacxx. In particular, one should never mix and match C++ std code from Jitify and libcudacxx.
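
As an illustration of the "always use libcudacxx" argument, NVRTC-compiled code can take type traits from libcudacxx's self-contained headers instead of patched host std headers. A minimal sketch (function name is illustrative):

```cuda
// Sketch: with libcudacxx on the include path, NVRTC code gets working
// type_traits without any Jitify std patches.
#include <cuda/std/type_traits>

template <typename T>
__device__ T twice(T x) {
  static_assert(cuda::std::is_arithmetic<T>::value,
                "twice() requires an arithmetic type");
  return x + x;
}
```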

Ultimately, my wishlist item is that libcudacxx be hard-wired into NVRTC, just like std::initializer_list, so that we can also completely eliminate Item 1 (and arguably Jitify too 😅), but that's just my wishful thinking.

cc: @maddyscientist @benbarsdell for vis

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 10, 2023
@gevtushenko gevtushenko self-assigned this Nov 10, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Nov 15, 2023
@m-schuetz commented Apr 18, 2024

Hi, pretty awesome to have more of these things running with NVRTC! Should this also work with device-wide sorts, or is that not supported yet?

I've been loosely following this example, and I was able to compile with #include <cub/warp/warp_reduce.cuh>, but things fell apart when I tried to include #include <cub/device/device_radix_sort.cuh>. There were errors about stdio and other includes that could not be found, and after adding more include paths I ultimately ended up with the following error:

[...]/libs/cccl-main/cub/cub/detail/choose_offset.cuh(64): error: namespace "std" has no member "uint32_t"
    using type = typename ::cuda::std::conditional<sizeof(NumItemsT) <= 4, std::uint32_t, unsigned long long>::type;
                                                                                ^

[...]/libs/cccl-main/cub/cub/detail/choose_offset.cuh(87): error: namespace "std" has no member "int32_t"
    using type = typename ::cuda::std::conditional<sizeof(NumItemsT) < 4, std::int32_t, NumItemsT>::type;
                                                                               ^

[...]/libs/cccl-main/libcudacxx/include/cuda/std/detail/libcxx/include/stdint.h(129): catastrophic error: cannot open source file "stdint.h"
  #include_next <stdint.h>
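
For context (my reading of the error, not a confirmed fix): NVRTC has no host <cstdint>, so an unqualified std::uint32_t cannot resolve there. An NVRTC-safe spelling draws the fixed-width types from libcudacxx instead, roughly:

```cuda
// Sketch mirroring the failing line in cub's choose_offset.cuh; the alias
// name here is illustrative. Under NVRTC, fixed-width integer types must
// come from libcudacxx, not the host standard library.
#include <cuda/std/cstdint>
#include <cuda/std/type_traits>

template <typename NumItemsT>
using promoted_offset_t =
    typename ::cuda::std::conditional<sizeof(NumItemsT) <= 4,
                                      ::cuda::std::uint32_t,
                                      unsigned long long>::type;
```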

Are the device-wide sort algorithms (callable from within a kernel) not ready for nvrtc yet, or am I doing something wrong?

nvrtcCompileProgram arguments:

--gpu-architecture=compute_89
--use_fast_math
--extra-device-vectorization
-lineinfo
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/cub
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include/cuda/std/detail/libcxx/include/
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include/cuda/std
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/thrust
-I C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4/include
--relocatable-device-code=true
-default-device
-dlto
--std=c++17

I was wondering if eventually I could use a device-wide sort via something like:

```cuda
void kernel() {
    auto grid = cg::this_grid();

    // ... do stuff

    // now radix-sort an array of integers
    grid.sync();
    cub_radix_sort(...);
    grid.sync();

    // now do something with the sorted list of integers
}
```

Thanks!
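
For reference, a block-scope sort is among the warp/block algorithms this epic targets, and a minimal sketch of it looks like this (block size and items per thread are illustrative, assuming a blocked data layout):

```cuda
// Sketch: block-level radix sort with cub::BlockRadixSort.
#include <cub/block/block_radix_sort.cuh>

__global__ void sort_block(int* data) {
  constexpr int BLOCK_THREADS = 128;
  constexpr int ITEMS_PER_THREAD = 4;
  using BlockRadixSort =
      cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
  __shared__ typename BlockRadixSort::TempStorage temp;

  // Each thread owns a blocked segment of keys.
  int keys[ITEMS_PER_THREAD];
  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    keys[i] = data[threadIdx.x * ITEMS_PER_THREAD + i];

  BlockRadixSort(temp).Sort(keys);

  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    data[threadIdx.x * ITEMS_PER_THREAD + i] = keys[i];
}
```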

@jrhemstad (Collaborator, Author)

Hey @m-schuetz, you reminded me I never responded to the discussion you'd opened. I just responded there :)
