Minor fix for a regressing tuning in reduce.by_key #3723

Merged (2 commits) on Feb 6, 2025

gonidelis
Member

This workload regresses with the new tuning, so it needs to be defaulted back to the default tuning.
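
For readers unfamiliar with what "defaulting back" a tuning means, here is a minimal sketch. It assumes tunings are selected by template specialization on workload parameters, with the primary template inheriting a default policy. All names (`default_policy`, `tuning`, the size parameters) and all numeric values are hypothetical; they are not taken from CUB's sources or from this PR's diff.

```cpp
#include <cstddef>

// Hypothetical default policy; the values are placeholders, not CUB's.
struct default_policy
{
  static constexpr int threads_per_block = 128;
  static constexpr int items_per_thread  = 6;
};

// Primary template: any workload without a dedicated tuning inherits the default.
template <std::size_t KeySize, std::size_t ValueSize>
struct tuning : default_policy
{};

// A tuned workload gets its own specialization (placeholder values).
template <>
struct tuning<4, 4>
{
  static constexpr int threads_per_block = 256;
  static constexpr int items_per_thread  = 11;
};

// A workload whose tuning regressed has its specialization removed, so
// tuning<2, 8> resolves to the primary template and thus the default policy.
static_assert(tuning<2, 8>::threads_per_block == default_policy::threads_per_block,
              "regressing workload falls back to the default");
static_assert(tuning<4, 4>::threads_per_block == 256,
              "tuned workload keeps its specialization");
```

Under this assumption, removing one specialization is a small and low-risk change: every other tuned workload keeps its parameters, and only the regressing one reverts.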

@gonidelis gonidelis requested a review from a team as a code owner February 6, 2025 18:59
@gonidelis gonidelis requested a review from elstehle February 6, 2025 18:59
@gonidelis
Member Author

Otherwise, the perf results after tuning are amazing for the other workloads:

#3610 (comment)

Contributor

github-actions bot commented Feb 6, 2025

🟩 CI finished in 1h 37m: Pass: 100%/90 | Total: 2d 15h | Avg: 42m 34s | Max: 1h 17m | Hits: 74%/132225
  • 🟩 cub: Pass: 100%/44 | Total: 1d 15h | Avg: 53m 56s | Max: 1h 17m | Hits: 68%/52320

    🟩 cpu
      🟩 amd64              Pass: 100%/42  | Total:  1d 13h | Avg: 53m 41s | Max:  1h 17m | Hits:  69%/49888 
      🟩 arm64              Pass: 100%/2   | Total:  1h 58m | Avg: 59m 10s | Max:  1h 00m | Hits:  67%/2432  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 02m | Avg:  1h 00m | Max:  1h 02m | Hits:  58%/5914  
      🟩 12.5               Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 10m | Hits:  67%/2250  
      🟩 12.8               Pass: 100%/37  | Total:  1d 08h | Avg: 52m 19s | Max:  1h 17m | Hits:  70%/44156 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 58m | Avg: 59m 18s | Max:  1h 00m | Hits:  73%/2104  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 02m | Avg:  1h 00m | Max:  1h 02m | Hits:  58%/5914  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 10m | Hits:  67%/2250  
      🟩 nvcc12.8           Pass: 100%/35  | Total:  1d 06h | Avg: 51m 56s | Max:  1h 17m | Hits:  70%/42052 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 58m | Avg: 59m 18s | Max:  1h 00m | Hits:  73%/2104  
      🟩 nvcc               Pass: 100%/42  | Total:  1d 13h | Avg: 53m 40s | Max:  1h 17m | Hits:  68%/50216 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 47m | Avg: 56m 53s | Max: 59m 12s | Hits:  68%/4872  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 56m | Avg: 58m 15s | Max:  1h 01m | Hits:  68%/2432  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 51m | Avg: 55m 30s | Max: 56m 08s | Hits:  68%/2432  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 56m | Avg: 58m 21s | Max: 59m 27s | Hits:  68%/2432  
      🟩 Clang18            Pass: 100%/7   | Total:  5h 41m | Avg: 48m 45s | Max:  1h 01m | Hits:  79%/8184  
      🟩 GCC7               Pass: 100%/2   | Total:  1h 58m | Avg: 59m 14s | Max:  1h 00m | Hits:  67%/2436  
      🟩 GCC8               Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  67%/1218  
      🟩 GCC9               Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 02m | Hits:  67%/2436  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 53s | Max: 59m 54s | Hits:  67%/2436  
      🟩 GCC11              Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 01m | Hits:  67%/2432  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 57m | Avg: 58m 38s | Max: 59m 51s | Hits:  67%/2432  
      🟩 GCC13              Pass: 100%/10  | Total:  6h 25m | Avg: 38m 32s | Max:  1h 09m | Hits:  83%/12160 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 17m | Hits:  14%/2084  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 23m | Avg:  1h 11m | Max:  1h 13m | Hits:  14%/2084  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 10m | Hits:  67%/2250  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total: 15h 13m | Avg: 53m 42s | Max:  1h 01m | Hits:  72%/20352 
      🟩 GCC                Pass: 100%/21  | Total: 17h 22m | Avg: 49m 37s | Max:  1h 09m | Hits:  75%/25550 
      🟩 MSVC               Pass: 100%/4   | Total:  4h 43m | Avg:  1h 10m | Max:  1h 17m | Hits:  14%/4168  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 10m | Hits:  67%/2250  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 49m 57s | Avg: 24m 58s | Max: 25m 39s | Hits:  83%/2432  
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 10h | Avg:  1h 00m | Max:  1h 17m | Hits:  62%/40160 
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 12m | Avg: 31m 31s | Max:  1h 01m | Hits:  91%/9728  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 12h | Avg: 59m 56s | Max:  1h 17m | Hits:  63%/43808 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 20m 23s | Avg: 20m 23s | Max: 20m 23s | Hits:  99%/1216  
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 55s | Avg: 15m 55s | Max: 15m 55s | Hits:  99%/1216  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 14m | Avg: 24m 48s | Max: 25m 12s | Hits:  99%/3648  
      🟩 TestGPU            Pass: 100%/2   | Total: 44m 33s | Avg: 22m 16s | Max: 22m 45s | Hits:  99%/2432  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 49m 57s | Avg: 24m 58s | Max: 25m 39s | Hits:  83%/2432  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 09m | Avg:  1h 09m | Max:  1h 09m | Hits:  67%/1216  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 20h 18m | Avg:  1h 00m | Max:  1h 17m | Hits:  60%/23559 
      🟩 20                 Pass: 100%/24  | Total: 19h 14m | Avg: 48m 06s | Max:  1h 13m | Hits:  75%/28761 
    
  • 🟩 thrust: Pass: 100%/43 | Total: 23h 36m | Avg: 32m 56s | Max: 1h 01m | Hits: 78%/79625

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 39m 50s | Avg: 19m 55s | Max: 28m 20s | Hits:  89%/3706  
    🟩 cpu
      🟩 amd64              Pass: 100%/41  | Total: 22h 37m | Avg: 33m 06s | Max:  1h 01m | Hits:  78%/75920 
      🟩 arm64              Pass: 100%/2   | Total: 58m 44s | Avg: 29m 22s | Max: 31m 52s | Hits:  78%/3705  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 00m | Avg: 36m 04s | Max: 53m 15s | Hits:  73%/9256  
      🟩 12.5               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 46s | Max: 55m 17s | Hits:  73%/3704  
      🟩 12.8               Pass: 100%/36  | Total: 18h 48m | Avg: 31m 20s | Max:  1h 01m | Hits:  79%/66665 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 55m 43s | Avg: 27m 51s | Max: 30m 22s | Hits:  78%/3704  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 00m | Avg: 36m 04s | Max: 53m 15s | Hits:  73%/9256  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 47m | Avg: 53m 46s | Max: 55m 17s | Hits:  73%/3704  
      🟩 nvcc12.8           Pass: 100%/34  | Total: 17h 52m | Avg: 31m 33s | Max:  1h 01m | Hits:  79%/62961 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 55m 43s | Avg: 27m 51s | Max: 30m 22s | Hits:  78%/3704  
      🟩 nvcc               Pass: 100%/41  | Total: 22h 40m | Avg: 33m 11s | Max:  1h 01m | Hits:  78%/75921 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 07m | Avg: 31m 58s | Max: 32m 50s | Hits:  78%/7408  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 03m | Avg: 31m 33s | Max: 32m 05s | Hits:  78%/3704  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 06m | Avg: 33m 19s | Max: 33m 33s | Hits:  78%/3704  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 03m | Avg: 31m 51s | Max: 33m 29s | Hits:  78%/3704  
      🟩 Clang18            Pass: 100%/7   | Total:  2h 46m | Avg: 23m 47s | Max: 32m 57s | Hits:  84%/12964 
      🟩 GCC7               Pass: 100%/2   | Total:  1h 03m | Avg: 31m 38s | Max: 32m 23s | Hits:  78%/3706  
      🟩 GCC8               Pass: 100%/1   | Total: 32m 02s | Avg: 32m 02s | Max: 32m 02s | Hits:  78%/1853  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 04m | Avg: 32m 06s | Max: 32m 56s | Hits:  78%/3706  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 09m | Avg: 34m 49s | Max: 35m 35s | Hits:  78%/3706  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 07m | Avg: 33m 54s | Max: 35m 55s | Hits:  78%/3706  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 08m | Avg: 34m 17s | Max: 35m 22s | Hits:  78%/3706  
      🟩 GCC13              Pass: 100%/8   | Total:  3h 12m | Avg: 24m 06s | Max: 34m 31s | Hits:  86%/14824 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 49m | Avg: 54m 44s | Max: 56m 13s | Hits:  53%/3692  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 33m | Avg: 51m 02s | Max:  1h 01m | Hits:  58%/5538  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 47m | Avg: 53m 46s | Max: 55m 17s | Hits:  73%/3704  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  8h 07m | Avg: 28m 41s | Max: 33m 33s | Hits:  81%/31484 
      🟩 GCC                Pass: 100%/19  | Total:  9h 18m | Avg: 29m 23s | Max: 35m 55s | Hits:  82%/35207 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 22m | Avg: 52m 31s | Max:  1h 01m | Hits:  56%/9230  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 47m | Avg: 53m 46s | Max: 55m 17s | Hits:  73%/3704  
    🟩 gpu
      🟩 rtx2080            Pass: 100%/33  | Total: 19h 37m | Avg: 35m 41s | Max:  1h 00m | Hits:  76%/61112 
      🟩 rtx4090            Pass: 100%/10  | Total:  3h 58m | Avg: 23m 52s | Max:  1h 01m | Hits:  85%/18513 
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total: 22h 14m | Avg: 36m 03s | Max:  1h 01m | Hits:  75%/68516 
      🟩 TestCPU            Pass: 100%/3   | Total: 48m 15s | Avg: 16m 05s | Max: 31m 14s | Hits:  89%/5551  
      🟩 TestGPU            Pass: 100%/3   | Total: 33m 45s | Avg: 11m 15s | Max: 11m 43s | Hits:  99%/5558  
    🟩 sm
      🟩 90;90a;100         Pass: 100%/1   | Total: 34m 31s | Avg: 34m 31s | Max: 34m 31s | Hits:  78%/1853  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 12h 14m | Avg: 36m 44s | Max:  1h 00m | Hits:  74%/37031 
      🟩 20                 Pass: 100%/21  | Total: 10h 41m | Avg: 30m 33s | Max:  1h 01m | Hits:  80%/38888 
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 06s | Avg: 4m 03s | Max: 5m 40s | Hits: 98%/280

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  5m 40s | Hits:  98%/280   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  5m 40s | Hits:  98%/280   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  5m 40s | Hits:  98%/280   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  5m 40s | Hits:  98%/280   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  5m 40s | Hits:  98%/280   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  5m 40s | Hits:  98%/280   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  5m 40s | Hits:  98%/280   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 26s | Avg:  2m 26s | Max:  2m 26s | Hits:  97%/140   
      🟩 Test               Pass: 100%/1   | Total:  5m 40s | Avg:  5m 40s | Max:  5m 40s | Hits:  98%/140   
    
  • 🟩 python: Pass: 100%/1 | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 34m 33s | Avg: 34m 33s | Max: 34m 33s
    

👃 Inspect Changes

Modifications in project?

      Project
      CCCL Infrastructure
      libcu++
  +/- CUB
      Thrust
      CUDA Experimental
      python
      CCCL C Parallel Library
      Catch2Helper

Modifications in project or dependencies?

      Project
      CCCL Infrastructure
      libcu++
  +/- CUB
  +/- Thrust
      CUDA Experimental
  +/- python
  +/- CCCL C Parallel Library
  +/- Catch2Helper

🏃‍ Runner counts (total jobs: 90)

 #  Runner
65  linux-amd64-cpu16
 9  windows-amd64-cpu16
 6  linux-amd64-gpu-rtxa6000-latest-1
 4  linux-arm64-cpu16
 3  linux-amd64-gpu-rtx4090-latest-1
 2  linux-amd64-gpu-rtx2080-latest-1
 1  linux-amd64-gpu-h100-latest-1

@bernhardmgruber bernhardmgruber merged commit 3795966 into NVIDIA:main Feb 6, 2025
104 of 106 checks passed
Contributor

github-actions bot commented Feb 6, 2025

Backport failed for branch/2.8.x, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin branch/2.8.x
git worktree add -d .worktree/backport-3723-to-branch/2.8.x origin/branch/2.8.x
cd .worktree/backport-3723-to-branch/2.8.x
git switch --create backport-3723-to-branch/2.8.x
git cherry-pick -x 37959663dd5a663e1db587d319ab785e78f99bf4

bernhardmgruber pushed a commit to bernhardmgruber/cccl that referenced this pull request Feb 6, 2025
bernhardmgruber added a commit that referenced this pull request Feb 7, 2025
* Add b200 tunings for reduce.by_key (#3610)
Co-authored-by: Giannis Gonidelis

* Minor fix for a regressing tuning in reduce.by_key (#3723)
Co-authored-by: Giannis Gonidelis
Projects
Status: Done
2 participants