
Move strided batch pointer conversion to GPU #2608

Merged: 4 commits merged into JuliaGPU:master on Jan 8, 2025

Conversation

THargreaves (Contributor)
Replacement of #2601 due to a botched rebase.

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 58.82353% with 7 lines in your changes missing coverage. Please review.

Project coverage is 73.51%. Comparing base (792aec5) to head (8bac8cc).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
lib/cublas/wrappers.jl 58.82% 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2608       +/-   ##
===========================================
+ Coverage    9.27%   73.51%   +64.23%     
===========================================
  Files         157      157               
  Lines       15025    15220      +195     
===========================================
+ Hits         1394    11189     +9795     
+ Misses      13631     4031     -9600     


github-actions bot (Contributor) commented

CUDA.jl Benchmarks

Benchmark suite Current: 8bac8cc Previous: a0c2f4b Ratio
latency/precompile 45289278120.5 ns 45297385295 ns 1.00
latency/ttfp 6403157132.5 ns 6375596178 ns 1.00
latency/import 3034963126.5 ns 3036561495 ns 1.00
integration/volumerhs 9566982 ns 9567419 ns 1.00
integration/byval/slices=1 146872 ns 146746 ns 1.00
integration/byval/slices=3 425466 ns 425517.5 ns 1.00
integration/byval/reference 144701 ns 145010 ns 1.00
integration/byval/slices=2 286086 ns 286216 ns 1.00
integration/cudadevrt 103540.5 ns 103513 ns 1.00
kernel/indexing 14046 ns 14419 ns 0.97
kernel/indexing_checked 15243 ns 15499 ns 0.98
kernel/occupancy 683.41 ns 748.27 ns 0.91
kernel/launch 2197.56 ns 2194.67 ns 1.00
kernel/rand 15096 ns 17335 ns 0.87
array/reverse/1d 19407.5 ns 19412 ns 1.00
array/reverse/2d 24582 ns 24576 ns 1.00
array/reverse/1d_inplace 10291 ns 11029 ns 0.93
array/reverse/2d_inplace 12294 ns 13223 ns 0.93
array/copy 20667 ns 20740 ns 1.00
array/iteration/findall/int 158515 ns 158179 ns 1.00
array/iteration/findall/bool 138431.5 ns 138583 ns 1.00
array/iteration/findfirst/int 153794 ns 153423 ns 1.00
array/iteration/findfirst/bool 154565 ns 154821 ns 1.00
array/iteration/scalar 76025.5 ns 77451 ns 0.98
array/iteration/logical 212805 ns 216735 ns 0.98
array/iteration/findmin/1d 40734.5 ns 41556.5 ns 0.98
array/iteration/findmin/2d 93523.5 ns 94128 ns 0.99
array/reductions/reduce/1d 40736.5 ns 42013 ns 0.97
array/reductions/reduce/2d 45975.5 ns 51911 ns 0.89
array/reductions/mapreduce/1d 36066.5 ns 39275 ns 0.92
array/reductions/mapreduce/2d 51367.5 ns 49505.5 ns 1.04
array/broadcast 21420 ns 21668 ns 0.99
array/copyto!/gpu_to_gpu 13403 ns 11569 ns 1.16
array/copyto!/cpu_to_gpu 211626 ns 211873 ns 1.00
array/copyto!/gpu_to_cpu 244722.5 ns 245423 ns 1.00
array/accumulate/1d 109041.5 ns 108388.5 ns 1.01
array/accumulate/2d 79870 ns 79823 ns 1.00
array/construct 1190.05 ns 1208.35 ns 0.98
array/random/randn/Float32 47714 ns 43873.5 ns 1.09
array/random/randn!/Float32 26376 ns 25937 ns 1.02
array/random/rand!/Int64 26999 ns 27271 ns 0.99
array/random/rand!/Float32 8518.33 ns 8766.67 ns 0.97
array/random/rand/Int64 37559 ns 29637 ns 1.27
array/random/rand/Float32 12986 ns 12723 ns 1.02
array/permutedims/4d 66577 ns 66923 ns 0.99
array/permutedims/2d 56402 ns 56439 ns 1.00
array/permutedims/3d 59002 ns 58867 ns 1.00
array/sorting/1d 2920806 ns 2933352 ns 1.00
array/sorting/by 3485108.5 ns 3500830 ns 1.00
array/sorting/2d 1084755 ns 1085059 ns 1.00
cuda/synchronization/stream/auto 1013.5 ns 1038.4 ns 0.98
cuda/synchronization/stream/nonblocking 6452 ns 6432 ns 1.00
cuda/synchronization/stream/blocking 789.05 ns 807.59 ns 0.98
cuda/synchronization/context/auto 1173.7 ns 1194.1 ns 0.98
cuda/synchronization/context/nonblocking 6578.2 ns 6649.8 ns 0.99
cuda/synchronization/context/blocking 917.61 ns 886.64 ns 1.03

This comment was automatically generated by workflow using github-action-benchmark.

maleadt (Member) commented Jan 8, 2025

Thanks! A couple of clean-ups and micro-optimizations:

  • I inlined the kernel definition to keep everything close together, using captures instead of arguments (personal preference)
  • since you were already using a grid-stride kernel, I clamped the number of blocks to what the launch configuration suggests
  • avoiding step ranges removes some potential exceptions
  • added @inbounds

This improves kernel execution time from 5 µs to 3.5 µs on your example.
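The clean-ups above can be sketched roughly as follows. This is an illustrative reconstruction, not the merged code (which lives in lib/cublas/wrappers.jl); the function and variable names here are hypothetical. It shows the pattern the comment describes: an inlined kernel that captures its inputs instead of taking arguments, a grid-stride loop written as a while loop rather than a step range, @inbounds on the store, and a block count clamped to the launch configuration.

```julia
using CUDA

# Hypothetical sketch: build the array of per-batch pointers for a
# strided-batched CuArray on the GPU, instead of on the CPU.
function batch_pointers(A::CuArray{T,3}) where {T}
    batches = size(A, 3)
    elems = size(A, 1) * size(A, 2)   # elements per matrix slice
    ptrs = CuArray{CuPtr{T}}(undef, batches)
    base = pointer(A)

    # inlined kernel: captures ptrs/base/elems/batches rather than
    # taking them as arguments
    function kernel()
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        # grid-stride loop without a step range
        while i <= batches
            @inbounds ptrs[i] = base + (i - 1) * elems * sizeof(T)
            i += blockDim().x * gridDim().x
        end
        return
    end

    k = @cuda launch=false kernel()
    config = launch_configuration(k.fun)
    threads = min(batches, config.threads)
    # clamp the number of blocks to what the launch configuration suggests
    blocks = min(cld(batches, threads), config.blocks)
    k(; threads, blocks)
    return ptrs
end
```

The resulting pointer array is what CUBLAS batched routines (e.g. gemmBatched) consume; computing it on-device avoids a host-side loop and a host-to-device copy per call.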

@maleadt maleadt linked an issue Jan 8, 2025 that may be closed by this pull request
@maleadt maleadt added the enhancement (New feature or request), cuda array (Stuff about CuArray.), and performance (How fast can we go?) labels Jan 8, 2025
@maleadt maleadt merged commit 74b8eff into JuliaGPU:master Jan 8, 2025
1 of 2 checks passed
maleadt referenced this pull request Jan 8, 2025
avik-pal pushed a commit to avik-pal/CUDA.jl that referenced this pull request Jan 11, 2025
Labels
cuda array (Stuff about CuArray.), enhancement (New feature or request), performance (How fast can we go?)
Development

Successfully merging this pull request may close these issues.

Faster strided-batched to batched wrapper