
Fix the vectorized loading of BlockLoad #3517

Merged
merged 7 commits into from
Feb 4, 2025

@ChristinaZ ChristinaZ commented Jan 24, 2025

Description

Partially addresses #431
What is remaining from #431 after this PR is applying the same logic to WARP_LOAD_VECTORIZE.

Hi @elstehle elias and @bernhardmgruber,

Background:

We (Elias and I) are working on adding a parallel top-k algorithm to CUB. In this work, we load data with BlockLoad using BLOCK_LOAD_VECTORIZE. However, we found that the loads are compiled to the LDG instruction instead of LDG128. This PR is intended to fix this issue.

Description

To demonstrate this bug, I wrote a standalone unit test in this repo.

  1. To run this unit test, first set CCCL_DIR in the Makefile to your cccl directory, then run make to compile.

  2. Then dump the relevant PTX instructions with cuobjdump --dump-ptx ./benchmark | grep ld

christinaz@****:/****/unitTestVectorizedLoading$ cuobjdump --dump-ptx ./benchmark | grep ld.global
ld.global.f32 %f1, [%rd8];
ld.global.f32 %f2, [%rd8+4];
ld.global.f32 %f3, [%rd8+8];
ld.global.f32 %f4, [%rd8+12];

The output shows that the load is not vectorized. In this repo, the input is passed through the template parameter InputIterator, whose actual type is float*, so vectorized loading should be possible.

  3. We provide this fix, which checks the iterator type manually with the following code:
    template <typename RandomAccessIterator>
    _CCCL_DEVICE _CCCL_FORCEINLINE void Load(RandomAccessIterator block_src_it, T (&dst_items)[ITEMS_PER_THREAD])
    {
      constexpr bool is_pointer_to_type =
        ::cuda::std::is_pointer_v<RandomAccessIterator>
        && ::cuda::std::is_same_v<std::remove_cv_t<::cuda::std::remove_pointer_t<RandomAccessIterator>>, T>;

      if (is_pointer_to_type)
      {
        InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_src_it, dst_items);
      }
      else
      {
        LoadDirectBlocked(linear_tid, block_src_it, dst_items);
      }
    }
  4. After this modification, rerunning the command cuobjdump --dump-ptx ./benchmark | grep ld gives the following result:
christinaz@***:/****/unitTestVectorizedLoading$ cuobjdump --dump-ptx ./benchmark | grep ld.global
ld.global.v4.f32 {%f1, %f2, %f3, %f4}, [%rd7];

After the modification, the load is vectorized: the four scalar loads are replaced by a single ld.global.v4.f32.
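For illustration, here is a minimal host-only sketch of the compile-time check used in the proposed fix above. It uses the host standard library instead of ::cuda::std so it can be compiled without nvcc. The names is_pointer_to_type_v and counting_iterator are hypothetical and exist only for this example.

```cpp
#include <type_traits>

// Hypothetical helper mirroring the is_pointer_to_type condition from the
// snippet above: true only for (possibly cv-qualified) pointers to T.
template <typename T, typename Iterator>
constexpr bool is_pointer_to_type_v =
  std::is_pointer_v<Iterator>
  && std::is_same_v<std::remove_cv_t<std::remove_pointer_t<Iterator>>, T>;

// Hypothetical stand-in for a non-pointer iterator type.
struct counting_iterator
{};

// Pointers to T qualify, with or without const on the pointee.
static_assert(is_pointer_to_type_v<float, float*>, "float* should qualify");
static_assert(is_pointer_to_type_v<float, const float*>, "const float* should qualify");

// Pointers to other types and non-pointer iterators do not.
static_assert(!is_pointer_to_type_v<float, double*>, "double* should not qualify");
static_assert(!is_pointer_to_type_v<float, counting_iterator>, "iterators should not qualify");
```

The check only inspects the type. It says nothing about whether a given pointer value is suitably aligned for a vectorized load.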

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Related issue

#431

@ChristinaZ ChristinaZ requested a review from a team as a code owner January 24, 2025 03:13
@ChristinaZ ChristinaZ requested a review from elstehle January 24, 2025 03:13

copy-pr-bot bot commented Jan 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elstehle
Collaborator

elstehle commented Jan 24, 2025

Thank you @ChristinaZ for looking into this.

It seems that the root cause is a superfluous template parameter that prevents the compiler from choosing the overload that implements the vectorized load:

    // attempts vectorization (pointer)
    template <typename>
    _CCCL_DEVICE _CCCL_FORCEINLINE void Load(const T* block_ptr, T (&dst_items)[ITEMS_PER_THREAD])
    {
      InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, dst_items);
    }
    // any other iterator, no vectorization
    template <typename RandomAccessIterator>
    _CCCL_DEVICE _CCCL_FORCEINLINE void Load(RandomAccessIterator block_src_it, T (&dst_items)[ITEMS_PER_THREAD])
    {
      //<non-vectorized code path>
    }

#431 describes this limitation. However, as stated by Georgii in a comment on the issue, the actual shortcoming is that we need to safeguard against un-aligned pointers:
#431 (comment)

I think the right path forward is, as Georgii suggested in that comment:

[...] check input pointer alignment and fallback to direct load and mark this issue as blocked. Since adding extra checks might affect performance, I suggest we also consider providing cuda::aligned_size_t overload.

Is this something you could take on?

@ChristinaZ
Contributor Author

Is this something you could take on?

Yes, I think so. We can use a similar check within function Load(RandomAccessIterator block_src_it, T (&dst_items)[ITEMS_PER_THREAD]) for InternalLoadDirectBlockedVectorized().
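A minimal host-only sketch of such an alignment check with a fallback might look like the following. It assumes, for simplicity, that all items of one thread are loaded as a single vector of sizeof(T) * ItemsPerThread bytes. The names is_aligned, LoadPath, and choose_load_path are hypothetical and are not part of CUB.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if ptr is a multiple of `alignment` bytes.
inline bool is_aligned(const void* ptr, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Which code path a load would take in this sketch.
enum class LoadPath
{
  vectorized,
  direct
};

// Assumption: one thread loads ItemsPerThread items as a single vector, so the
// pointer must be aligned to the size of that vector. Otherwise fall back to
// the element-wise (direct) path.
template <typename T, int ItemsPerThread>
LoadPath choose_load_path(const T* block_ptr)
{
  constexpr std::size_t vector_bytes = sizeof(T) * ItemsPerThread;
  return is_aligned(block_ptr, vector_bytes) ? LoadPath::vectorized : LoadPath::direct;
}
```

With a 16-byte-aligned float buffer and four items per thread, the base pointer would take the vectorized path in this sketch, while the same pointer offset by one element would fall back to the direct path.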

@bernhardmgruber
Contributor

I can reproduce the issue on godbolt: https://godbolt.org/z/dfTa9oe1v

@elstehle
Collaborator

elstehle commented Jan 24, 2025

Just to summarize the findings from the offline discussion with @miscco and @ChristinaZ:

We have this overload that is supposed to be chosen when the iterator type qualifies for vectorization:

template <typename>
_CCCL_DEVICE _CCCL_FORCEINLINE void Load(const T* block_ptr, T (&dst_items)[ITEMS_PER_THREAD])
{
  InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, dst_items);
}

There seem to be two root causes why that overload isn't chosen:

  1. We have an unused template parameter
  2. For a given pointer to load from, the pointer-to-const overload would require conversion, which makes overload resolution choose the non-pointer overload (@miscco filled me in on that one).

Since the implementation of the overload is a one liner, my suggestion was to add another overload taking a pointer-to-non-const.
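Both root causes, and the effect of the suggested extra overload, can be reproduced on the host with plain C++. The sketch below is an illustration only: the three loader structs and FancyIterator are hypothetical names, and the functions return a string naming the chosen path instead of loading data. It assumes T is not const-qualified.

```cpp
#include <string>

// Hypothetical stand-in for a non-pointer iterator.
struct FancyIterator
{};

// Original shape: the pointer overload has an unnamed template parameter that
// never appears in the signature, so it cannot be deduced for Load(ptr) and
// the overload is never viable.
template <typename T>
struct OriginalLoader
{
  template <typename>
  std::string Load(const T*) { return "vectorized"; }

  template <typename Iterator>
  std::string Load(Iterator) { return "direct"; }
};

// Unused template parameter removed, pointer-to-const overload only. A T*
// argument still needs a qualification conversion, so the generic template
// (an identity match) wins for pointer-to-non-const.
template <typename T>
struct ConstOnlyLoader
{
  std::string Load(const T*) { return "vectorized"; }

  template <typename Iterator>
  std::string Load(Iterator) { return "direct"; }
};

// Additional pointer-to-non-const overload that forwards to the const one.
template <typename T>
struct FixedLoader
{
  std::string Load(const T*) { return "vectorized"; }
  std::string Load(T* ptr) { return Load(static_cast<const T*>(ptr)); }

  template <typename Iterator>
  std::string Load(Iterator) { return "direct"; }
};
```

In this sketch, OriginalLoader picks the direct path for both float* and const float*, ConstOnlyLoader picks the vectorized path only for const float*, and FixedLoader picks it for both pointer kinds while other iterators still take the direct path.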

As mentioned before, we should also address the case where the pointer is not aligned to the vectorized type (see #431 (comment)).

Collaborator

@elstehle elstehle left a comment

Could you please also add a small test to test/catch2_test_block_load.cu that loads from an {aligned,unaligned} x {ptr-to-const,ptr-to-non-const}?

Regarding performance concerns: I also checked our tuning policies for the use of BLOCK_LOAD_VECTORIZE and it seems it is only considered for DeviceSpmv (which is about to be dropped) and in DeviceRadixSort for SM 35 (which is also about to be dropped).

Given that it's not used in any Device* algorithm today, I'd be happy to have it merged. It finally makes the BLOCK_LOAD_VECTORIZE algorithm do what it promises to; it provides upside for the vectorized case (i.e., we are seeing considerable improvements in our top-k algorithm) and only minor downside for the case where we now have the extra alignment checks but need to default to non-vectorized loading.

@ChristinaZ
Contributor Author

Could you please also add a small test to test/catch2_test_block_load.cu that loads from an {aligned,unaligned} x {ptr-to-const,ptr-to-non-const}?

No problem. Let me add this test.

it provides upside for the vectorized case (i.e., we are seeing considerable improvements in our top-k algorithm).

Yes, I just reran the benchmark with the latest updates to BlockLoad. Adding the address alignment check doesn't hurt the performance of our top-k, and the vectorized loading clearly improves performance compared with the default direct loading.

@elstehle
Collaborator

/ok to test

@elstehle
Collaborator

Note, we have the following line in the block load test:

// %PARAM% IPT it 1:11

This means the source file is compiled twice: (1) once with #define IPT 1 and (2) once with #define IPT 11. We do this so we can split the tests into two separate translation units and compile test cases in parallel. In this case, your tests, which are independent of the number of items per thread, also get compiled twice, so we end up with two identical tests, which is superfluous. You can wrap your tests in:

#if IPT == 1
C2H_TEST("Vectorized block load with const and non-const datatype and different alignment cases",
         "[load][block]",
         is_const_or_not,
         offsets_for_elements)
{
...
}
#endif

Contributor

🟨 CI finished in 1h 47m: Pass: 95%/90 | Total: 2d 14h | Avg: 41m 20s | Max: 1h 08m | Hits: 216%/12772
  • 🟨 cub: Pass: 93%/44 | Total: 1d 13h | Avg: 50m 59s | Max: 1h 08m | Hits: 220%/3552

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  92%/42  | Total:  1d 11h | Avg: 50m 34s | Max:  1h 08m | Hits: 220%/3552  
      🟩 arm64              Pass: 100%/2   | Total:  1h 59m | Avg: 59m 31s | Max:  1h 00m
    🔍 ctk: 12.6 🔍
      🟩 12.0               Pass: 100%/5   | Total:  4h 54m | Avg: 58m 52s | Max:  1h 01m | Hits: 220%/888   
      🟩 12.5               Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m
      🔍 12.6               Pass:  91%/37  | Total:  1d 06h | Avg: 49m 06s | Max:  1h 08m | Hits: 219%/2664  
    🔍 cudacxx: nvcc12.6 🔍
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 03m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 54m | Avg: 58m 52s | Max:  1h 01m | Hits: 220%/888   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m
      🔍 nvcc12.6           Pass:  91%/35  | Total:  1d 04h | Avg: 48m 22s | Max:  1h 08m | Hits: 219%/2664  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 03m
      🔍 nvcc               Pass:  92%/42  | Total:  1d 11h | Avg: 50m 28s | Max:  1h 08m | Hits: 220%/3552  
    🔍 gpu: v100 🔍
      🟩 h100               Pass: 100%/2   | Total: 45m 56s | Avg: 22m 58s | Max: 26m 56s
      🔍 v100               Pass:  92%/42  | Total:  1d 12h | Avg: 52m 19s | Max:  1h 08m | Hits: 220%/3552  
    🔍 std: 20 🔍
      🟩 17                 Pass: 100%/20  | Total: 19h 48m | Avg: 59m 24s | Max:  1h 06m | Hits: 220%/2664  
      🔍 20                 Pass:  87%/24  | Total: 17h 35m | Avg: 43m 58s | Max:  1h 08m | Hits: 219%/888   
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 53m | Avg: 58m 21s | Max:  1h 01m
      🟩 Clang15            Pass: 100%/2   | Total:  1h 56m | Avg: 58m 11s | Max: 59m 06s
      🟩 Clang16            Pass: 100%/2   | Total:  1h 55m | Avg: 57m 40s | Max:  1h 00m
      🟩 Clang17            Pass: 100%/2   | Total:  1h 50m | Avg: 55m 21s | Max: 55m 40s
      🟨 Clang18            Pass:  85%/7   | Total:  5h 33m | Avg: 47m 37s | Max:  1h 03m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 59m | Avg: 59m 39s | Max:  1h 02m
      🟩 GCC8               Pass: 100%/1   | Total: 57m 04s | Avg: 57m 04s | Max: 57m 04s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 56m | Avg: 58m 16s | Max:  1h 00m
      🟩 GCC10              Pass: 100%/2   | Total:  1h 52m | Avg: 56m 23s | Max: 56m 39s
      🟩 GCC11              Pass: 100%/2   | Total:  1h 52m | Avg: 56m 15s | Max: 57m 50s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 44m | Avg: 41m 02s | Max:  1h 02m
      🟨 GCC13              Pass:  75%/8   | Total:  4h 17m | Avg: 32m 08s | Max:  1h 04m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m | Hits: 220%/1776  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 15m | Avg:  1h 07m | Max:  1h 08m | Hits: 219%/1776  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m
    🟨 cxx_family
      🟨 Clang              Pass:  94%/17  | Total: 15h 09m | Avg: 53m 29s | Max:  1h 03m
      🟨 GCC                Pass:  90%/21  | Total: 15h 39m | Avg: 44m 44s | Max:  1h 04m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 22m | Avg:  1h 05m | Max:  1h 08m | Hits: 220%/3552  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m
    🟨 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 11h | Avg: 58m 00s | Max:  1h 08m | Hits: 220%/3552  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 56s | Avg: 19m 56s | Max: 19m 56s
      🟥 GraphCapture       Pass:   0%/1   | Total:  3m 33s | Avg:  3m 33s | Max:  3m 33s
      🟨 HostLaunch         Pass:  66%/3   | Total: 47m 08s | Avg: 15m 42s | Max: 23m 18s
      🟨 TestGPU            Pass:  50%/2   | Total: 26m 33s | Avg: 13m 16s | Max: 20m 48s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 45m 56s | Avg: 22m 58s | Max: 26m 56s
      🟩 90a                Pass: 100%/1   | Total: 26m 21s | Avg: 26m 21s | Max: 26m 21s
    
  • 🟥 python: Pass: 0%/1 | Total: 4m 50s | Avg: 4m 50s | Max: 4m 50s

    🟥 cpu
      🟥 amd64              Pass:   0%/1   | Total:  4m 50s | Avg:  4m 50s | Max:  4m 50s
    🟥 ctk
      🟥 12.6               Pass:   0%/1   | Total:  4m 50s | Avg:  4m 50s | Max:  4m 50s
    🟥 cudacxx
      🟥 nvcc12.6           Pass:   0%/1   | Total:  4m 50s | Avg:  4m 50s | Max:  4m 50s
    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/1   | Total:  4m 50s | Avg:  4m 50s | Max:  4m 50s
    🟥 cxx
      🟥 GCC13              Pass:   0%/1   | Total:  4m 50s | Avg:  4m 50s | Max:  4m 50s
    🟥 cxx_family
      🟥 GCC                Pass:   0%/1   | Total:  4m 50s | Avg:  4m 50s | Max:  4m 50s
    🟥 gpu
      🟥 v100               Pass:   0%/1   | Total:  4m 50s | Avg:  4m 50s | Max:  4m 50s
    🟥 jobs
      🟥 Test               Pass:   0%/1   | Total:  4m 50s | Avg:  4m 50s | Max:  4m 50s
    
  • 🟩 thrust: Pass: 100%/43 | Total: 1d 00h | Avg: 34m 01s | Max: 1h 06m | Hits: 215%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 39m 50s | Avg: 19m 55s | Max: 29m 04s
    🟩 cpu
      🟩 amd64              Pass: 100%/41  | Total: 23h 24m | Avg: 34m 15s | Max:  1h 06m | Hits: 215%/9220  
      🟩 arm64              Pass: 100%/2   | Total: 58m 38s | Avg: 29m 19s | Max: 30m 44s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 05m | Avg: 37m 01s | Max: 53m 17s | Hits: 177%/1844  
      🟩 12.5               Pass: 100%/2   | Total:  1h 49m | Avg: 54m 46s | Max: 57m 36s
      🟩 12.6               Pass: 100%/36  | Total: 19h 28m | Avg: 32m 27s | Max:  1h 06m | Hits: 224%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 51m 58s | Avg: 25m 59s | Max: 27m 01s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 05m | Avg: 37m 01s | Max: 53m 17s | Hits: 177%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 49m | Avg: 54m 46s | Max: 57m 36s
      🟩 nvcc12.6           Pass: 100%/34  | Total: 18h 36m | Avg: 32m 50s | Max:  1h 06m | Hits: 224%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 51m 58s | Avg: 25m 59s | Max: 27m 01s
      🟩 nvcc               Pass: 100%/41  | Total: 23h 31m | Avg: 34m 25s | Max:  1h 06m | Hits: 215%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 09m | Avg: 32m 22s | Max: 33m 28s
      🟩 Clang15            Pass: 100%/2   | Total:  1h 00m | Avg: 30m 22s | Max: 30m 38s
      🟩 Clang16            Pass: 100%/2   | Total:  1h 07m | Avg: 33m 44s | Max: 34m 09s
      🟩 Clang17            Pass: 100%/2   | Total:  1h 03m | Avg: 31m 36s | Max: 33m 37s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 47m | Avg: 23m 52s | Max: 32m 05s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 05m | Avg: 32m 55s | Max: 33m 42s
      🟩 GCC8               Pass: 100%/1   | Total: 34m 18s | Avg: 34m 18s | Max: 34m 18s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 01s | Max: 33m 42s
      🟩 GCC10              Pass: 100%/2   | Total:  1h 04m | Avg: 32m 20s | Max: 33m 00s
      🟩 GCC11              Pass: 100%/2   | Total:  1h 06m | Avg: 33m 01s | Max: 33m 19s
      🟩 GCC12              Pass: 100%/2   | Total:  1h 10m | Avg: 35m 07s | Max: 36m 46s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 48m | Avg: 28m 33s | Max: 37m 51s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 48m | Avg: 54m 12s | Max: 55m 07s | Hits: 177%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 41m | Avg: 53m 49s | Max:  1h 06m | Hits: 240%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 49m | Avg: 54m 46s | Max: 57m 36s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  8h 08m | Avg: 28m 42s | Max: 34m 09s
      🟩 GCC                Pass: 100%/19  | Total:  9h 55m | Avg: 31m 20s | Max: 37m 51s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 29m | Avg: 53m 58s | Max:  1h 06m | Hits: 215%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 49m | Avg: 54m 46s | Max: 57m 36s
    🟩 gpu
      🟩 v100               Pass: 100%/43  | Total:  1d 00h | Avg: 34m 01s | Max:  1h 06m | Hits: 215%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total: 22h 13m | Avg: 36m 01s | Max:  1h 06m | Hits: 177%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total:  1h 07m | Avg: 22m 21s | Max: 31m 46s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 02m | Avg: 20m 58s | Max: 36m 42s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 20m 05s | Avg: 20m 05s | Max: 20m 05s
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 12h 24m | Avg: 37m 14s | Max:  1h 03m | Hits: 177%/5532  
      🟩 20                 Pass: 100%/21  | Total: 11h 18m | Avg: 32m 18s | Max:  1h 06m | Hits: 271%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 47s | Avg: 4m 53s | Max: 7m 19s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  7m 19s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  7m 19s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  7m 19s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  7m 19s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  7m 19s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  7m 19s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  7m 19s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 28s | Avg:  2m 28s | Max:  2m 28s
      🟩 Test               Pass: 100%/1   | Total:  7m 19s | Avg:  7m 19s | Max:  7m 19s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 90)

# Runner
65 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@ChristinaZ ChristinaZ force-pushed the christinaz/fix_vectorized_loading branch from 0f36910 to 18d3da9 Compare February 3, 2025 08:18
@ChristinaZ
Contributor Author

/ok to test

@elstehle
Collaborator

elstehle commented Feb 3, 2025

/ok to test

Collaborator

@elstehle elstehle left a comment

That's great. Thanks a lot for your contribution!

Contributor

@bernhardmgruber bernhardmgruber left a comment

@elstehle Do we need a before/after benchmark? AFAIK, we don't have one for block load. I expect the SASS to change (for good!).

@elstehle
Collaborator

elstehle commented Feb 3, 2025

@elstehle Do we need a before/after benchmark? AFAIK, we don't have one for block load. I expect the SASS to change (for good!).

My take on the performance would be this:

I also checked our tuning policies for the use of BLOCK_LOAD_VECTORIZE and it seems it is only considered for DeviceSpmv (which is about to be dropped) and in DeviceRadixSort for SM 35 (which is also about to be dropped).

Given that it's not used in any Device* algorithm today, I'd be happy to have it merged. It finally makes the BLOCK_LOAD_VECTORIZE algorithm do what it promises to; it provides upside for the vectorized case (i.e., we are seeing considerable improvements in our top-k algorithm) and only minor downside for the case where we now have the extra alignment checks but need to default to non-vectorized loading.

Christina has some very positive performance data from her top-k work that confirms the performance upside we get from her fix.

@bernhardmgruber
Contributor

@elstehle Do we need a before/after benchmark? AFAIK, we don't have one for block load. I expect the SASS to change (for good!).

My take on the performance would be this:

[...]

I should learn how to read.

Contributor

github-actions bot commented Feb 3, 2025

🟩 CI finished in 1h 45m: Pass: 100%/90 | Total: 2d 15h | Avg: 42m 31s | Max: 1h 14m | Hits: 216%/12730
  • 🟩 cub: Pass: 100%/44 | Total: 1d 15h | Avg: 53m 30s | Max: 1h 14m | Hits: 220%/3500

    🟩 cpu
      🟩 amd64              Pass: 100%/42  | Total:  1d 13h | Avg: 53m 11s | Max:  1h 14m | Hits: 220%/3500  
      🟩 arm64              Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 02m
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 52m | Avg: 58m 34s | Max:  1h 02m | Hits: 221%/875   
      🟩 12.5               Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m
      🟩 12.8               Pass: 100%/37  | Total:  1d 08h | Avg: 51m 56s | Max:  1h 14m | Hits: 220%/2625  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 03m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 52m | Avg: 58m 34s | Max:  1h 02m | Hits: 221%/875   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m
      🟩 nvcc12.8           Pass: 100%/35  | Total:  1d 06h | Avg: 51m 26s | Max:  1h 14m | Hits: 220%/2625  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 03m
      🟩 nvcc               Pass: 100%/42  | Total:  1d 13h | Avg: 53m 10s | Max:  1h 14m | Hits: 220%/3500  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 47m | Avg: 56m 59s | Max:  1h 02m
      🟩 Clang15            Pass: 100%/2   | Total:  1h 58m | Avg: 59m 25s | Max:  1h 00m
      🟩 Clang16            Pass: 100%/2   | Total:  1h 57m | Avg: 58m 36s | Max:  1h 03m
      🟩 Clang17            Pass: 100%/2   | Total:  1h 53m | Avg: 56m 33s | Max: 58m 30s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 36m | Avg: 48m 07s | Max:  1h 03m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 52m | Avg: 56m 14s | Max: 57m 18s
      🟩 GCC8               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 GCC9               Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 04m
      🟩 GCC10              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 48s | Max:  1h 01m
      🟩 GCC11              Pass: 100%/2   | Total:  1h 52m | Avg: 56m 19s | Max: 57m 04s
      🟩 GCC12              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 36s | Max: 57m 44s
      🟩 GCC13              Pass: 100%/10  | Total:  6h 23m | Avg: 38m 19s | Max:  1h 09m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 13m | Hits: 221%/1750  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 23m | Avg:  1h 11m | Max:  1h 14m | Hits: 220%/1750  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total: 15h 13m | Avg: 53m 45s | Max:  1h 03m
      🟩 GCC                Pass: 100%/21  | Total: 17h 02m | Avg: 48m 42s | Max:  1h 09m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 37m | Avg:  1h 09m | Max:  1h 14m | Hits: 220%/3500  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 47m 47s | Avg: 23m 53s | Max: 24m 13s
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 10h | Avg:  1h 00m | Max:  1h 14m | Hits: 220%/3500  
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 06m | Avg: 30m 49s | Max:  1h 04m
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 12h | Avg: 59m 38s | Max:  1h 14m | Hits: 220%/3500  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 20s | Avg: 19m 20s | Max: 19m 20s
      🟩 GraphCapture       Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 12m | Avg: 24m 07s | Max: 24m 59s
      🟩 TestGPU            Pass: 100%/2   | Total: 39m 43s | Avg: 19m 51s | Max: 20m 33s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 47m 47s | Avg: 23m 53s | Max: 24m 13s
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 09m | Avg:  1h 09m | Max:  1h 09m
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 19h 51m | Avg: 59m 33s | Max:  1h 13m | Hits: 221%/2625  
      🟩 20                 Pass: 100%/24  | Total: 19h 23m | Avg: 48m 28s | Max:  1h 14m | Hits: 219%/875   
    
  • 🟩 thrust: Pass: 100%/43 | Total: 23h 58m | Avg: 33m 27s | Max: 1h 01m | Hits: 215%/9230

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 38m 03s | Avg: 19m 01s | Max: 26m 45s
    🟩 cpu
      🟩 amd64              Pass: 100%/41  | Total: 22h 59m | Avg: 33m 39s | Max:  1h 01m | Hits: 215%/9230  
      🟩 arm64              Pass: 100%/2   | Total: 59m 09s | Avg: 29m 34s | Max: 31m 24s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 09m | Avg: 37m 58s | Max: 55m 54s | Hits: 177%/1846  
      🟩 12.5               Pass: 100%/2   | Total:  1h 57m | Avg: 58m 40s | Max:  1h 00m
      🟩 12.8               Pass: 100%/36  | Total: 18h 51m | Avg: 31m 26s | Max:  1h 01m | Hits: 224%/7384  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 56m 44s | Avg: 28m 22s | Max: 30m 28s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 09m | Avg: 37m 58s | Max: 55m 54s | Hits: 177%/1846  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 57m | Avg: 58m 40s | Max:  1h 00m
      🟩 nvcc12.8           Pass: 100%/34  | Total: 17h 54m | Avg: 31m 36s | Max:  1h 01m | Hits: 224%/7384  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 56m 44s | Avg: 28m 22s | Max: 30m 28s
      🟩 nvcc               Pass: 100%/41  | Total: 23h 02m | Avg: 33m 42s | Max:  1h 01m | Hits: 215%/9230  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 05m | Avg: 31m 27s | Max: 32m 19s
      🟩 Clang15            Pass: 100%/2   | Total:  1h 02m | Avg: 31m 16s | Max: 31m 33s
      🟩 Clang16            Pass: 100%/2   | Total:  1h 05m | Avg: 32m 47s | Max: 34m 10s
      🟩 Clang17            Pass: 100%/2   | Total:  1h 07m | Avg: 33m 49s | Max: 34m 51s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 47m | Avg: 23m 58s | Max: 33m 29s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 05m | Avg: 32m 31s | Max: 33m 57s
      🟩 GCC8               Pass: 100%/1   | Total: 32m 22s | Avg: 32m 22s | Max: 32m 22s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 08m | Avg: 34m 07s | Max: 36m 19s
      🟩 GCC10              Pass: 100%/2   | Total:  1h 04m | Avg: 32m 25s | Max: 33m 30s
      🟩 GCC11              Pass: 100%/2   | Total:  1h 09m | Avg: 34m 30s | Max: 34m 42s
      🟩 GCC12              Pass: 100%/2   | Total:  1h 10m | Avg: 35m 25s | Max: 37m 20s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 12m | Avg: 24m 06s | Max: 37m 06s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 54m | Avg: 57m 13s | Max: 58m 32s | Hits: 177%/3692  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 34m | Avg: 51m 30s | Max:  1h 01m | Hits: 240%/5538  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 40s | Max:  1h 00m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  8h 09m | Avg: 28m 47s | Max: 34m 51s
      🟩 GCC                Pass: 100%/19  | Total:  9h 23m | Avg: 29m 38s | Max: 37m 20s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 28m | Avg: 53m 47s | Max:  1h 01m | Hits: 215%/9230  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 57m | Avg: 58m 40s | Max:  1h 00m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/33  | Total: 20h 02m | Avg: 36m 26s | Max:  1h 01m | Hits: 177%/5538  
      🟩 rtx4090            Pass: 100%/10  | Total:  3h 56m | Avg: 23m 38s | Max:  1h 01m | Hits: 271%/3692  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total: 22h 36m | Avg: 36m 40s | Max:  1h 01m | Hits: 177%/7384  
      🟩 TestCPU            Pass: 100%/3   | Total: 47m 57s | Avg: 15m 59s | Max: 31m 59s | Hits: 365%/1846  
      🟩 TestGPU            Pass: 100%/3   | Total: 34m 10s | Avg: 11m 23s | Max: 12m 09s
    🟩 sm
      🟩 90;90a;100         Pass: 100%/1   | Total: 37m 06s | Avg: 37m 06s | Max: 37m 06s
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 12h 31m | Avg: 37m 35s | Max:  1h 01m | Hits: 177%/5538  
      🟩 20                 Pass: 100%/21  | Total: 10h 49m | Avg: 30m 54s | Max:  1h 01m | Hits: 271%/3692  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 7m 11s | Avg: 3m 35s | Max: 4m 58s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  4m 58s
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  4m 58s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  4m 58s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  4m 58s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  4m 58s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  4m 58s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  4m 58s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 13s | Avg:  2m 13s | Max:  2m 13s
      🟩 Test               Pass: 100%/1   | Total:  4m 58s | Avg:  4m 58s | Max:  4m 58s
    
  • 🟩 python: Pass: 100%/1 | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 26m 03s | Avg: 26m 03s | Max: 26m 03s
    


🏃‍ Runner counts (total jobs: 90)

# Runner
65 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1
1 linux-amd64-gpu-h100-latest-1

@ChristinaZ ChristinaZ force-pushed the christinaz/fix_vectorized_loading branch from 5cab148 to 4ae725b Compare February 4, 2025 07:40
@elstehle
Collaborator

elstehle commented Feb 4, 2025

/ok to test

Contributor

github-actions bot commented Feb 4, 2025

🟩 CI finished in 2h 49m: Pass: 100%/90 | Total: 2d 17h | Avg: 43m 35s | Max: 1h 45m | Hits: 216%/12730
  • 🟩 cub: Pass: 100%/44 | Total: 1d 17h | Avg: 56m 04s | Max: 1h 45m | Hits: 220%/3500

    🟩 cpu
      🟩 amd64              Pass: 100%/42  | Total:  1d 15h | Avg: 55m 47s | Max:  1h 45m | Hits: 220%/3500  
      🟩 arm64              Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 03m
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 59m | Avg: 59m 55s | Max:  1h 05m | Hits: 221%/875   
      🟩 12.5               Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m
      🟩 12.8               Pass: 100%/37  | Total:  1d 09h | Avg: 55m 01s | Max:  1h 45m | Hits: 220%/2625  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 59m | Avg: 59m 55s | Max:  1h 05m | Hits: 221%/875   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m
      🟩 nvcc12.8           Pass: 100%/35  | Total:  1d 07h | Avg: 54m 32s | Max:  1h 45m | Hits: 220%/2625  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m
      🟩 nvcc               Pass: 100%/42  | Total:  1d 15h | Avg: 55m 43s | Max:  1h 45m | Hits: 220%/3500  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 52m | Avg: 58m 01s | Max:  1h 00m
      🟩 Clang15            Pass: 100%/2   | Total:  1h 58m | Avg: 59m 08s | Max:  1h 02m
      🟩 Clang16            Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 02m
      🟩 Clang17            Pass: 100%/2   | Total:  1h 54m | Avg: 57m 04s | Max: 58m 11s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 48m | Avg: 49m 50s | Max:  1h 04m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 57m | Avg: 58m 59s | Max: 59m 03s
      🟩 GCC8               Pass: 100%/1   | Total: 58m 49s | Avg: 58m 49s | Max: 58m 49s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 53m | Avg: 56m 37s | Max: 57m 36s
      🟩 GCC10              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 35s | Max: 58m 12s
      🟩 GCC11              Pass: 100%/2   | Total:  1h 49m | Avg: 54m 50s | Max: 54m 57s
      🟩 GCC12              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 03m
      🟩 GCC13              Pass: 100%/10  | Total:  8h 00m | Avg: 48m 04s | Max:  1h 45m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 08m | Hits: 221%/1750  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 23m | Avg:  1h 11m | Max:  1h 12m | Hits: 220%/1750  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total: 15h 36m | Avg: 55m 03s | Max:  1h 04m
      🟩 GCC                Pass: 100%/21  | Total: 18h 42m | Avg: 53m 25s | Max:  1h 45m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 37m | Avg:  1h 09m | Max:  1h 12m | Hits: 220%/3500  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 50m 27s | Avg: 25m 13s | Max: 26m 35s
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 10h | Avg:  1h 01m | Max:  1h 13m | Hits: 220%/3500  
      🟩 rtxa6000           Pass: 100%/8   | Total:  5h 38m | Avg: 42m 15s | Max:  1h 45m
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 13h | Avg:  1h 00m | Max:  1h 13m | Hits: 220%/3500  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 14s | Avg: 22m 14s | Max: 22m 14s
      🟩 GraphCapture       Pass: 100%/1   | Total:  1h 45m | Avg:  1h 45m | Max:  1h 45m
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 11m | Avg: 23m 57s | Max: 24m 11s
      🟩 TestGPU            Pass: 100%/2   | Total: 41m 43s | Avg: 20m 51s | Max: 21m 01s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 50m 27s | Avg: 25m 13s | Max: 26m 35s
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 13m | Avg:  1h 13m | Max:  1h 13m
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 20h 06m | Avg:  1h 00m | Max:  1h 11m | Hits: 221%/2625  
      🟩 20                 Pass: 100%/24  | Total: 21h 01m | Avg: 52m 33s | Max:  1h 45m | Hits: 219%/875   
    
  • 🟩 thrust: Pass: 100%/43 | Total: 23h 42m | Avg: 33m 05s | Max: 1h 02m | Hits: 215%/9230

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 36m 38s | Avg: 18m 19s | Max: 25m 27s
    🟩 cpu
      🟩 amd64              Pass: 100%/41  | Total: 22h 43m | Avg: 33m 15s | Max:  1h 02m | Hits: 215%/9230  
      🟩 arm64              Pass: 100%/2   | Total: 59m 17s | Avg: 29m 38s | Max: 31m 30s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 04m | Avg: 36m 48s | Max: 55m 02s | Hits: 177%/1846  
      🟩 12.5               Pass: 100%/2   | Total:  1h 59m | Avg: 59m 53s | Max:  1h 00m
      🟩 12.8               Pass: 100%/36  | Total: 18h 38m | Avg: 31m 04s | Max:  1h 02m | Hits: 224%/7384  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 55m 17s | Avg: 27m 38s | Max: 29m 24s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 04m | Avg: 36m 48s | Max: 55m 02s | Hits: 177%/1846  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 59m | Avg: 59m 53s | Max:  1h 00m
      🟩 nvcc12.8           Pass: 100%/34  | Total: 17h 43m | Avg: 31m 16s | Max:  1h 02m | Hits: 224%/7384  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 55m 17s | Avg: 27m 38s | Max: 29m 24s
      🟩 nvcc               Pass: 100%/41  | Total: 22h 47m | Avg: 33m 21s | Max:  1h 02m | Hits: 215%/9230  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 11m | Avg: 32m 53s | Max: 34m 07s
      🟩 Clang15            Pass: 100%/2   | Total:  1h 00m | Avg: 30m 17s | Max: 30m 36s
      🟩 Clang16            Pass: 100%/2   | Total:  1h 04m | Avg: 32m 00s | Max: 34m 00s
      🟩 Clang17            Pass: 100%/2   | Total:  1h 00m | Avg: 30m 27s | Max: 30m 37s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 46m | Avg: 23m 43s | Max: 32m 50s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 05m | Avg: 32m 52s | Max: 33m 35s
      🟩 GCC8               Pass: 100%/1   | Total: 32m 44s | Avg: 32m 44s | Max: 32m 44s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 24s | Max: 34m 53s
      🟩 GCC10              Pass: 100%/2   | Total:  1h 04m | Avg: 32m 10s | Max: 33m 48s
      🟩 GCC11              Pass: 100%/2   | Total:  1h 10m | Avg: 35m 13s | Max: 35m 24s
      🟩 GCC12              Pass: 100%/2   | Total:  1h 04m | Avg: 32m 07s | Max: 32m 15s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 09m | Avg: 23m 39s | Max: 34m 33s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 50m | Avg: 55m 14s | Max: 55m 27s | Hits: 177%/3692  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 35m | Avg: 51m 54s | Max:  1h 02m | Hits: 240%/5538  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 59m | Avg: 59m 53s | Max:  1h 00m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  8h 03m | Avg: 28m 25s | Max: 34m 07s
      🟩 GCC                Pass: 100%/19  | Total:  9h 13m | Avg: 29m 08s | Max: 35m 24s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 26m | Avg: 53m 14s | Max:  1h 02m | Hits: 215%/9230  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 59m | Avg: 59m 53s | Max:  1h 00m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/33  | Total: 19h 45m | Avg: 35m 55s | Max:  1h 00m | Hits: 177%/5538  
      🟩 rtx4090            Pass: 100%/10  | Total:  3h 56m | Avg: 23m 41s | Max:  1h 02m | Hits: 271%/3692  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total: 22h 19m | Avg: 36m 12s | Max:  1h 02m | Hits: 177%/7384  
      🟩 TestCPU            Pass: 100%/3   | Total: 49m 22s | Avg: 16m 27s | Max: 32m 49s | Hits: 365%/1846  
      🟩 TestGPU            Pass: 100%/3   | Total: 33m 51s | Avg: 11m 17s | Max: 12m 10s
    🟩 sm
      🟩 90;90a;100         Pass: 100%/1   | Total: 32m 47s | Avg: 32m 47s | Max: 32m 47s
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 12h 22m | Avg: 37m 08s | Max:  1h 00m | Hits: 177%/5538  
      🟩 20                 Pass: 100%/21  | Total: 10h 43m | Avg: 30m 37s | Max:  1h 02m | Hits: 271%/3692  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 7m 22s | Avg: 3m 41s | Max: 5m 00s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  5m 00s
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  5m 00s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  5m 00s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  5m 00s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  5m 00s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  5m 00s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  5m 00s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 22s | Avg:  2m 22s | Max:  2m 22s
      🟩 Test               Pass: 100%/1   | Total:  5m 00s | Avg:  5m 00s | Max:  5m 00s
    
  • 🟩 python: Pass: 100%/1 | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 90)

# Runner
65 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1
1 linux-amd64-gpu-h100-latest-1

@elstehle elstehle merged commit dd6e66c into NVIDIA:main Feb 4, 2025
101 of 104 checks passed
@elstehle
Collaborator

elstehle commented Feb 4, 2025

Thanks a lot for your contribution, @ChristinaZ

@ChristinaZ
Contributor Author

> Thanks a lot for your contribution, @ChristinaZ

Thanks Elias! It's my honor to contribute to CUB!
