RFC: Use non-blocking device side pointer mode in CUBLAS, with fallbacks #2616

Open · wants to merge 1 commit into base: master
Conversation

@kshyatt (Contributor) commented on Jan 10, 2025

Attempting to address #2571

I've set the pointer mode to device-side during handle creation. Since gemmGroupedBatched doesn't support the device-side pointer mode, it won't be usable with this change. One workaround would be to add a new function that creates a handle in host-side mode, or to accept the pointer mode as an optional keyword argument to handle() (see the sketch below). I'm very open to feedback on this.
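
For illustration only, here is a minimal sketch of a scoped-switch flavour of that workaround. The helper name with_host_pointer_mode is hypothetical (not part of this PR), and it assumes the generated low-level binding cublasSetPointerMode_v2 and the CUBLAS_POINTER_MODE_* enum values are exposed in the CUBLAS submodule, mirroring the cuBLAS C API:

using CUDA, CUDA.CUBLAS

# Hypothetical helper: run `f` with the task-local handle temporarily switched
# back to host-side pointer mode (e.g. around a gemmGroupedBatched call), then
# restore device-side mode afterwards.
function with_host_pointer_mode(f)
    h = CUBLAS.handle()
    CUBLAS.cublasSetPointerMode_v2(h, CUBLAS.CUBLAS_POINTER_MODE_HOST)
    try
        return f()
    finally
        CUBLAS.cublasSetPointerMode_v2(h, CUBLAS.CUBLAS_POINTER_MODE_DEVICE)
    end
end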

I've set this up so that users can supply a CuRef of the appropriate result type to the level 1 functions to receive the result. If that's not provided, the functions execute as they do today (synchronously). Similarly, for functions taking alpha or beta scalar arguments, if the user provides a CuRef (actually a CuRefArray), the functions execute asynchronously and return immediately; if the user provides a Number, the behaviour is unchanged from today. I'm not married to this design and it can certainly be changed. A short usage sketch follows.
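
To make the two call styles concrete, here is a minimal usage sketch based on the API proposed in this PR, using nrm2 as in the timing example further down (the other level 1 functions would follow the same pattern):

using CUDA, CUDA.CUBLAS

X = CUDA.rand(Float64, 2^20)
n = length(X)

# synchronous path, unchanged from today: blocks and returns a host Float64
r = CUBLAS.nrm2(n, X)

# asynchronous path proposed here: the result lands in the device-side CuRef
res = CuRef{Float64}(0.0)
CUBLAS.nrm2(n, X, res)   # returns immediately
CUDA.synchronize()       # res holds the norm once the stream has synchronized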

cc @Jutho

@kshyatt requested a review from @maleadt on Jan 10, 2025 21:03
@kshyatt added the "cuda libraries" label (Stuff about CUDA library wrappers) on Jan 10, 2025
@kshyatt (Contributor, Author) commented on Jan 10, 2025

I can also add some more @eval blocks to cut down on the repetitive fallback logic; a rough sketch of what that could look like is below.
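
Purely illustrative, not the actual wrappers: the generated function names are placeholders, and it assumes asum gains the same CuRef-result method as nrm2 in this PR.

using CUDA, CUDA.CUBLAS

# stamp out synchronous fallbacks that allocate the CuRef, invoke the
# asynchronous device-pointer-mode method, and wait for the stream
for (fname, T) in ((:nrm2, Float64), (:asum, Float64))
    @eval function $(Symbol(:sync_, fname))(n::Integer, x::CuVector{$T})
        res = CuRef{$T}(zero($T))
        CUBLAS.$fname(n, x, res)   # asynchronous call proposed in this PR
        CUDA.synchronize()
        return res                 # reading the scalar back to the host is elided
    end
end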

@kshyatt (Contributor, Author) commented on Jan 10, 2025

Sample speedup:

julia> using CUDA, CUDA.CUBLAS, LinearAlgebra;

julia> n = Int(2^26);

julia> X = CUDA.rand(Float64, n);

julia> res = CuRef{Float64}(0.0);

# do some precompilation runs first

julia> @time CUBLAS.nrm2(n, X, res);
  0.000104 seconds (18 allocations: 288 bytes)

julia> @time CUBLAS.nrm2(n, X);
  0.001564 seconds (73 allocations: 3.094 KiB)

@github-actions (bot) left a comment
CUDA.jl Benchmarks

Benchmark suite Current: c7ed0bd Previous: 774abc6 Ratio
latency/precompile 45365614662.5 ns 45532671418 ns 1.00
latency/ttfp 6441318413 ns 6382276443.5 ns 1.01
latency/import 3060099711.5 ns 3039078540.5 ns 1.01
integration/volumerhs 9568628 ns 9567627 ns 1.00
integration/byval/slices=1 146553 ns 146713 ns 1.00
integration/byval/slices=3 425324 ns 425286 ns 1.00
integration/byval/reference 144769 ns 144622 ns 1.00
integration/byval/slices=2 286055 ns 286077 ns 1.00
integration/cudadevrt 103383 ns 103283 ns 1.00
kernel/indexing 14082 ns 14073 ns 1.00
kernel/indexing_checked 15133.5 ns 15126 ns 1.00
kernel/occupancy 715.1357142857142 ns 710.5460992907801 ns 1.01
kernel/launch 2093.6 ns 2120.3 ns 0.99
kernel/rand 14814 ns 14743 ns 1.00
array/reverse/1d 19313 ns 19325.5 ns 1.00
array/reverse/2d 24637 ns 24669 ns 1.00
array/reverse/1d_inplace 10851.333333333334 ns 10913.666666666666 ns 0.99
array/reverse/2d_inplace 11223 ns 11253 ns 1.00
array/copy 20155 ns 20229 ns 1.00
array/iteration/findall/int 157082.5 ns 157863.5 ns 1.00
array/iteration/findall/bool 138070 ns 138404.5 ns 1.00
array/iteration/findfirst/int 153486 ns 153375 ns 1.00
array/iteration/findfirst/bool 153939.5 ns 154273 ns 1.00
array/iteration/scalar 74727 ns 75697 ns 0.99
array/iteration/logical 209900 ns 212853.5 ns 0.99
array/iteration/findmin/1d 40777 ns 41543 ns 0.98
array/iteration/findmin/2d 93643 ns 93933.5 ns 1.00
array/reductions/reduce/1d 39626.5 ns 35999 ns 1.10
array/reductions/reduce/2d 51087 ns 41907.5 ns 1.22
array/reductions/mapreduce/1d 36925.5 ns 33891.5 ns 1.09
array/reductions/mapreduce/2d 44381 ns 41528 ns 1.07
array/broadcast 21247 ns 21376 ns 0.99
array/copyto!/gpu_to_gpu 11467 ns 11516 ns 1.00
array/copyto!/cpu_to_gpu 209723 ns 210665 ns 1.00
array/copyto!/gpu_to_cpu 244978 ns 243223.5 ns 1.01
array/accumulate/1d 107843 ns 108164 ns 1.00
array/accumulate/2d 79671 ns 79823.5 ns 1.00
array/construct 1204.15 ns 1284.3 ns 0.94
array/random/randn/Float32 43404.5 ns 49740 ns 0.87
array/random/randn!/Float32 26187 ns 26117 ns 1.00
array/random/rand!/Int64 26977 ns 27030 ns 1.00
array/random/rand!/Float32 8639.666666666666 ns 8836.333333333334 ns 0.98
array/random/rand/Int64 29605 ns 37762.5 ns 0.78
array/random/rand/Float32 12774 ns 13046 ns 0.98
array/permutedims/4d 66602 ns 66810 ns 1.00
array/permutedims/2d 56416 ns 56518 ns 1.00
array/permutedims/3d 58786 ns 59273.5 ns 0.99
array/sorting/1d 2919467 ns 2933200.5 ns 1.00
array/sorting/by 3483160 ns 3500043 ns 1.00
array/sorting/2d 1084247.5 ns 1084935 ns 1.00
cuda/synchronization/stream/auto 1013.1666666666666 ns 1035.9 ns 0.98
cuda/synchronization/stream/nonblocking 6415.4 ns 6536.8 ns 0.98
cuda/synchronization/stream/blocking 788.5754716981132 ns 791.2244897959183 ns 1.00
cuda/synchronization/context/auto 1180 ns 1182.9 ns 1.00
cuda/synchronization/context/nonblocking 6570.4 ns 6769.6 ns 0.97
cuda/synchronization/context/blocking 888 ns 915.2666666666667 ns 0.97

This comment was automatically generated by a workflow using github-action-benchmark.
