
Move strided batch pointer conversion to GPU #2608

Merged: 4 commits merged into JuliaGPU:master on Jan 8, 2025

Conversation

THargreaves (Contributor)
Replacement of #2601 due to a botched rebase.

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 58.82353% with 7 lines in your changes missing coverage. Please review.

Project coverage is 73.51%. Comparing base (792aec5) to head (8bac8cc).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
lib/cublas/wrappers.jl 58.82% 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2608       +/-   ##
===========================================
+ Coverage    9.27%   73.51%   +64.23%     
===========================================
  Files         157      157               
  Lines       15025    15220      +195     
===========================================
+ Hits         1394    11189     +9795     
+ Misses      13631     4031     -9600     


github-actions bot (Contributor) commented

CUDA.jl Benchmarks

Benchmark suite Current: 8bac8cc Previous: a0c2f4b Ratio
latency/precompile 45289278120.5 ns 45297385295 ns 1.00
latency/ttfp 6403157132.5 ns 6375596178 ns 1.00
latency/import 3034963126.5 ns 3036561495 ns 1.00
integration/volumerhs 9566982 ns 9567419 ns 1.00
integration/byval/slices=1 146872 ns 146746 ns 1.00
integration/byval/slices=3 425466 ns 425517.5 ns 1.00
integration/byval/reference 144701 ns 145010 ns 1.00
integration/byval/slices=2 286086 ns 286216 ns 1.00
integration/cudadevrt 103540.5 ns 103513 ns 1.00
kernel/indexing 14046 ns 14419 ns 0.97
kernel/indexing_checked 15243 ns 15499 ns 0.98
kernel/occupancy 683.41 ns 748.27 ns 0.91
kernel/launch 2197.56 ns 2194.67 ns 1.00
kernel/rand 15096 ns 17335 ns 0.87
array/reverse/1d 19407.5 ns 19412 ns 1.00
array/reverse/2d 24582 ns 24576 ns 1.00
array/reverse/1d_inplace 10291 ns 11029 ns 0.93
array/reverse/2d_inplace 12294 ns 13223 ns 0.93
array/copy 20667 ns 20740 ns 1.00
array/iteration/findall/int 158515 ns 158179 ns 1.00
array/iteration/findall/bool 138431.5 ns 138583 ns 1.00
array/iteration/findfirst/int 153794 ns 153423 ns 1.00
array/iteration/findfirst/bool 154565 ns 154821 ns 1.00
array/iteration/scalar 76025.5 ns 77451 ns 0.98
array/iteration/logical 212805 ns 216735 ns 0.98
array/iteration/findmin/1d 40734.5 ns 41556.5 ns 0.98
array/iteration/findmin/2d 93523.5 ns 94128 ns 0.99
array/reductions/reduce/1d 40736.5 ns 42013 ns 0.97
array/reductions/reduce/2d 45975.5 ns 51911 ns 0.89
array/reductions/mapreduce/1d 36066.5 ns 39275 ns 0.92
array/reductions/mapreduce/2d 51367.5 ns 49505.5 ns 1.04
array/broadcast 21420 ns 21668 ns 0.99
array/copyto!/gpu_to_gpu 13403 ns 11569 ns 1.16
array/copyto!/cpu_to_gpu 211626 ns 211873 ns 1.00
array/copyto!/gpu_to_cpu 244722.5 ns 245423 ns 1.00
array/accumulate/1d 109041.5 ns 108388.5 ns 1.01
array/accumulate/2d 79870 ns 79823 ns 1.00
array/construct 1190.05 ns 1208.35 ns 0.98
array/random/randn/Float32 47714 ns 43873.5 ns 1.09
array/random/randn!/Float32 26376 ns 25937 ns 1.02
array/random/rand!/Int64 26999 ns 27271 ns 0.99
array/random/rand!/Float32 8518.33 ns 8766.67 ns 0.97
array/random/rand/Int64 37559 ns 29637 ns 1.27
array/random/rand/Float32 12986 ns 12723 ns 1.02
array/permutedims/4d 66577 ns 66923 ns 0.99
array/permutedims/2d 56402 ns 56439 ns 1.00
array/permutedims/3d 59002 ns 58867 ns 1.00
array/sorting/1d 2920806 ns 2933352 ns 1.00
array/sorting/by 3485108.5 ns 3500830 ns 1.00
array/sorting/2d 1084755 ns 1085059 ns 1.00
cuda/synchronization/stream/auto 1013.5 ns 1038.4 ns 0.98
cuda/synchronization/stream/nonblocking 6452 ns 6432 ns 1.00
cuda/synchronization/stream/blocking 789.05 ns 807.59 ns 0.98
cuda/synchronization/context/auto 1173.7 ns 1194.1 ns 0.98
cuda/synchronization/context/nonblocking 6578.2 ns 6649.8 ns 0.99
cuda/synchronization/context/blocking 917.61 ns 886.64 ns 1.03

This comment was automatically generated by workflow using github-action-benchmark.

maleadt (Member) commented Jan 8, 2025

Thanks! A couple of clean-ups and micro-optimizations:

  • I inlined the kernel definition to keep everything close together, using captures instead of arguments (personal preference)
  • since you were already using a grid-stride kernel, I clamped the number of blocks to what the launch configuration suggests
  • avoiding step ranges removes some potential exceptions
  • added @inbounds

This improves kernel execution time from 5 µs to 3.5 µs on your example.
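The clean-ups above can be sketched roughly as follows. This is an illustrative reconstruction, not the merged code (which lives in lib/cublas/wrappers.jl); the function and variable names here are hypothetical. It shows the pattern the comment describes: an inlined kernel that captures its inputs instead of taking arguments, a grid-stride loop written as a while loop rather than a step range, @inbounds on the store, and a block count clamped to the launch configuration.

```julia
using CUDA

# Hypothetical sketch: build the array of per-batch pointers for a
# strided-batched CuArray on the GPU, instead of on the CPU.
function batch_pointers(A::CuArray{T,3}) where {T}
    batches = size(A, 3)
    elems = size(A, 1) * size(A, 2)   # elements per matrix slice
    ptrs = CuArray{CuPtr{T}}(undef, batches)
    base = pointer(A)

    # inlined kernel: captures ptrs/base/elems/batches rather than
    # taking them as arguments
    function kernel()
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        # grid-stride loop without a step range
        while i <= batches
            @inbounds ptrs[i] = base + (i - 1) * elems * sizeof(T)
            i += blockDim().x * gridDim().x
        end
        return
    end

    k = @cuda launch=false kernel()
    config = launch_configuration(k.fun)
    threads = min(batches, config.threads)
    # clamp the number of blocks to what the launch configuration suggests
    blocks = min(cld(batches, threads), config.blocks)
    k(; threads, blocks)
    return ptrs
end
```

The resulting pointer array is what CUBLAS batched routines (e.g. gemmBatched) consume; computing it on-device avoids a host-side loop and a host-to-device copy per call.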

@maleadt maleadt linked an issue Jan 8, 2025 that may be closed by this pull request
@maleadt maleadt added the enhancement (New feature or request), cuda array (Stuff about CuArray.), and performance (How fast can we go?) labels Jan 8, 2025
@maleadt maleadt merged commit 74b8eff into JuliaGPU:master Jan 8, 2025
1 of 2 checks passed
maleadt referenced this pull request Jan 8, 2025
avik-pal pushed a commit to avik-pal/CUDA.jl that referenced this pull request Jan 11, 2025
Labels
cuda array (Stuff about CuArray.), enhancement (New feature or request), performance (How fast can we go?)
Development

Successfully merging this pull request may close these issues.

Faster strided-batched to batched wrapper